Re: [lmi] Replacing the check for Latin-1 with check for UTF-8


From:	Greg Chicares
Subject:	Re: [lmi] Replacing the check for Latin-1 with check for UTF-8
Date:	Mon, 31 May 2021 20:12:53 +0000
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.9.0

On 5/31/21 6:41 PM, Vadim Zeitlin wrote:
> 
>  Rewriting the test in assay_non_latin() in test_coding_rules.cpp using
> std::regex without any further changes doesn't work because char is signed
> by default, which makes the range \x7f-\x9f used in that regex invalid,

Wow. Thanks for finding that. I'm especially surprised because,
IIRC, that test has detected non-Latin characters successfully.
(Perhaps boost::regex made it work somehow?)

> as it's specified in the wrong direction (\x7f is 127, but \x9f is -97).
> I'm not immediately sure whether we can use basic_regex<unsigned char> or
> maybe basic_regex<char8_t> and while I'm sure some solution to this could
> be found, I wonder if we could just remove this function entirely and,

I was planning to remove it entirely. I hadn't intended to replace
it with anything, but that sounds like a good idea now that you
mention it...
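
The sign problem described in the quoted text can be reproduced in a few lines. This is only a sketch: the helper names as_plain_char() and is_ordered_as_char() are hypothetical, not part of lmi, and the code merely models comparing the two range bounds as plain chars; it does not call std::regex or boost::regex, so it says nothing about how either library actually treats such a range.

```cpp
#include <climits>

// True if plain char is a signed type on this platform.
constexpr bool char_is_signed = CHAR_MIN < 0;

// Value of 'byte' after storing it in a plain char, widened to int.
// Where char is signed and 8 bits wide, bytes above 0x7f come out negative.
constexpr int as_plain_char(unsigned char byte)
{
    return static_cast<int>(static_cast<char>(byte));
}

// True if [lo, hi] is a well-ordered range when its bounds are compared
// as plain chars (an assumption about how a char-based regex sees them).
constexpr bool is_ordered_as_char(unsigned char lo, unsigned char hi)
{
    return as_plain_char(lo) <= as_plain_char(hi);
}
```

Where char is signed, as_plain_char(0x9f) is -97, so is_ordered_as_char(0x7f, 0x9f) is false and the range reads as running backwards. Where char is unsigned, the same call yields true.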

> perhaps, replace it by another one, not using regular expressions, that
> verifies that a file is properly UTF-8 encoded (which includes all files
> containing ASCII characters only, i.e. all the current sources).

How would we verify that a file is properly UTF-8 encoded?
My first thought would be to use a one-liner such as:
  for z in *; do iconv -f UTF-8 -t UTF-8 "$z" >/dev/null 2>&1 || echo "$z"; done
Given that 'test_coding_rules.cpp' already reads the file in its
entirety, it would be nice to have a function that would verify
that it's valid UTF-8 (instead of using a shell command), but
wouldn't that be rather non-trivial?
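
For a sense of how much code such a check might take, here is a minimal sketch of a byte-level validator. It is not lmi code: the name is_valid_utf8() is hypothetical, and it assumes that the file's contents are already available as a contiguous sequence of chars. It follows the well-formed byte ranges of RFC 3629, so it rejects overlong encodings, UTF-16 surrogates, and code points above U+10FFFF.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of lmi): true if 's' consists entirely
// of well-formed UTF-8 sequences. Pure ASCII input is trivially valid.
constexpr bool is_valid_utf8(std::string_view s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        // Work with unsigned char so that bytes >= 0x80 compare as
        // expected even where plain char is signed.
        unsigned char const lead = static_cast<unsigned char>(s[i]);
        std::size_t n = 0;        // number of continuation bytes expected
        unsigned char lo = 0x80;  // allowed range of the first continuation
        unsigned char hi = 0xBF;
        if (lead <= 0x7F) { ++i; continue; }              // ASCII
        else if (0xC2 <= lead && lead <= 0xDF) { n = 1; }
        else if (lead == 0xE0) { n = 2; lo = 0xA0; }      // no overlong forms
        else if (lead == 0xED) { n = 2; hi = 0x9F; }      // no surrogates
        else if (0xE1 <= lead && lead <= 0xEF) { n = 2; }
        else if (lead == 0xF0) { n = 3; lo = 0x90; }      // no overlong forms
        else if (0xF1 <= lead && lead <= 0xF3) { n = 3; }
        else if (lead == 0xF4) { n = 3; hi = 0x8F; }      // at most U+10FFFF
        else { return false; }    // 0x80..0xC1 and 0xF5..0xFF never lead
        if (s.size() - i <= n) { return false; }          // truncated sequence
        for (std::size_t k = 1; k <= n; ++k) {
            unsigned char const c = static_cast<unsigned char>(s[i + k]);
            if (c < lo || hi < c) { return false; }
            lo = 0x80;            // later continuations use the full range
            hi = 0xBF;
        }
        i += n + 1;
    }
    return true;
}
```

A file containing a lone Latin-1 byte such as 0xE9, or Windows-1252 "smart quotes" such as 0x93 and 0x94, fails this check, while the two-byte sequence 0xC3 0xA9 passes. Reporting the offset of the first bad byte would be a natural extension, but is left out to keep the sketch short.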

>  Could we do this before replacing Boost.Regex?

Certainly.

Furthermore, we could simply remove the Latin test now, then
replace boost::regex, and then, afterwards, add a UTF-8 test.

Every April we import some external data that has in the past
contained weird msw characters that we've had to convert, so
we'd want to get that last step done before 2022-04.
