Re: [lmi] Replacing the check for Latin-1 with check for UTF-8

From:	Vadim Zeitlin
Subject:	Re: [lmi] Replacing the check for Latin-1 with check for UTF-8
Date:	Mon, 31 May 2021 23:35:30 +0200

On Mon, 31 May 2021 20:12:53 +0000 Greg Chicares <gchicares@sbcglobal.net> 
wrote:

GC> On 5/31/21 6:41 PM, Vadim Zeitlin wrote:
GC> > 
GC> >  Rewriting the test in assay_non_latin() in test_coding_rules.cpp using
GC> > std::regex without any further changes doesn't work because char is signed
GC> > by default making the range \x7f-\x9f used in the regex used there
GC> > invalid,
GC> 
GC> Wow. Thanks for finding that. I'm especially surprised because,
GC> IIRC, that test has detected non-Latin characters successfully.
GC> (Perhaps boost::regex made it work somehow?)

 I think it did but, to be honest, I didn't check it. Worse, I'm almost
sure that the 3-year-old version of std::regex handled this too, as I
didn't have this problem when I tested the changes back then, but this
is no longer the case for the libstdc++ used by gcc 11.
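
 To make the signedness problem concrete, here is a minimal sketch of
the arithmetic and of a byte-range check that avoids regular expressions
altogether. The helper names are hypothetical (they are not taken from
test_coding_rules.cpp), and the value -97 assumes the usual 8-bit
two's complement char.

```cpp
#include <string>

// With a signed 8-bit char, 0x9f is stored as 159 - 256 = -97, so a
// range written as \x7f-\x9f runs from 127 "down" to -97.
constexpr int upper_as_signed = static_cast<signed char>(0x9f);

// Hypothetical helper: true if 'c' is in the byte range 0x7f-0x9f,
// i.e. DEL or one of the C1 control codes.
inline bool is_del_or_c1_control(char c)
{
    // Convert to unsigned char first, so that bytes above 0x7f compare
    // as 128..255 instead of as negative values.
    auto const u = static_cast<unsigned char>(c);
    return 0x7f <= u && u <= 0x9f;
}

// Hypothetical helper: true if any byte of 's' is in that range.
inline bool contains_del_or_c1_control(std::string const& s)
{
    for(char c : s)
        {
        if(is_del_or_c1_control(c))
            {
            return true;
            }
        }
    return false;
}
```

 Comparing the bytes as unsigned char gives the same result whether plain
char is signed or unsigned on the platform.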

GC> > as it's specified in the wrong direction (\x7f is 127, but 0x9f is -97).
GC> > I'm not immediately sure whether we can use basic_regex<unsigned char> or
GC> > maybe basic_regex<char8_t> and while I'm sure some solution to this could
GC> > be found, I wonder if we could just remove this function entirely and,
GC> 
GC> I was planning to remove it entirely. I hadn't intended to replace
GC> it with anything, but that sounds like a good idea now that you
GC> mention it...
GC> 
GC> > perhaps, replace it by another one, not using regular expressions, that
GC> > verifies that a file is properly UTF-8 encoded (which includes all files
GC> > containing ASCII characters only, i.e. all the current sources).

[BTW, let me correct another wrong thing I wrote: some of the current
 sources do use non-ASCII characters (in comments) but they're still all
 correctly encoded in UTF-8, which is what really counts]

GC> How would we verify that a file is properly UTF-8 encoded?

 Oh, indeed, I forgot that the situation with this is now even _worse_
than it used to be in standard C++. Until C++17 we could use
std::codecvt_utf8, which was unwieldy but at least could get the job done
(I think; I've hardly ever used it in practice), but it has been
deprecated since C++17 and no replacement is provided.

GC> My first thought would be to use a one-liner such as:
GC>   for z in *; do iconv -f UTF-8 -t UTF-8 "$z" &>/dev/null || echo "$z"; done
GC> Given that 'test_coding_rules.cpp' already reads the file in its
GC> entirety, it would be nice to have a function that would verify
GC> that it's valid UTF-8 (instead of using a shell command), but
GC> wouldn't that be rather non-trivial?

 It's not hard: my UTF8Decoder class, which naively implements the
algorithm from the Unicode Standard, fits into ~200 lines of code
including comments (not counting the tests), but it does seem a bit stupid
to have to do it again and again and again. It's also surely not the most
efficient implementation, as there are libraries using SIMD that can be
orders of magnitude faster.
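
 For illustration only, a validation-only check could look roughly like
the sketch below. It is not the UTF8Decoder class mentioned above and not
code from lmi: is_valid_utf8() is a hypothetical name, and the byte ranges
are the well-formed sequences listed in the Unicode Standard (Table 3-7,
as far as I recall the numbering). It is a plain scalar loop, with no
attempt at the SIMD tricks.

```cpp
#include <cstddef>
#include <string>

// Hypothetical function: true if 's' consists solely of well-formed
// UTF-8 byte sequences. Overlong forms, surrogates and code points
// above U+10FFFF are rejected.
inline bool is_valid_utf8(std::string const& s)
{
    std::size_t const n = s.size();
    std::size_t i = 0;
    while(i < n)
        {
        auto const lead = static_cast<unsigned char>(s[i]);
        std::size_t trail = 0;   // number of continuation bytes expected
        unsigned char lo = 0x80; // allowed range of the first
        unsigned char hi = 0xbf; // continuation byte

        if(lead <= 0x7f)                      {++i; continue;}
        else if(0xc2 <= lead && lead <= 0xdf) {trail = 1;}
        else if(lead == 0xe0)                 {trail = 2; lo = 0xa0;}
        else if(lead == 0xed)                 {trail = 2; hi = 0x9f;}
        else if(0xe1 <= lead && lead <= 0xef) {trail = 2;}
        else if(lead == 0xf0)                 {trail = 3; lo = 0x90;}
        else if(0xf1 <= lead && lead <= 0xf3) {trail = 3;}
        else if(lead == 0xf4)                 {trail = 3; hi = 0x8f;}
        else                                  {return false;}

        // The sequence must not be cut off by the end of the input.
        if(n - i <= trail)
            {
            return false;
            }

        for(std::size_t k = 1; k <= trail; ++k)
            {
            auto const c = static_cast<unsigned char>(s[i + k]);
            if(c < lo || hi < c)
                {
                return false;
                }
            // Only the first continuation byte has a restricted range.
            lo = 0x80;
            hi = 0xbf;
            }
        i += trail + 1;
        }
    return true;
}
```

 Since the file contents are already in memory, such a function could be
called on them directly. A file containing only ASCII passes, while a
lone Latin-1 byte such as 0xe9 fails.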

 So we definitely could do it, but I don't know if we should any more.

GC> >  Could we do this before replacing Boost.Regex?
GC> 
GC> Certainly.
GC> 
GC> Furthermore, we could simply remove the Latin test now, then
GC> replace boost::regex, and then, afterwards, add a UTF-8 test.

 I'm going to do it like this, thanks.

GC> Every April we import some external data that has in the past
GC> contained weird msw characters that we've had to convert, so
GC> we'd want to get that last step done before 2022-04.

 Thanks for setting a realistic challenge for me. I'll be slightly more
ambitious and will try to do it during the month starting in about 25
minutes in the local time zone. However, I no longer promise to do it by
tomorrow, as I've already run into a weird problem with (std) regex not
matching when it should, so I'm going to have to debug and fix it first.

VZ

