Re: Error UTF-8 strings
From: Akim Demaille
Subject: Re: Error UTF-8 strings
Date: Mon, 22 Jun 2020 07:59:18 +0200
> On 21 Jun 2020, at 15:24, Hans Åberg <haberg-1@telia.com> wrote:
>
>
>> On 21 Jun 2020, at 14:25, Hans Åberg <haberg-1@telia.com> wrote:
>>
>>> On 21 Jun 2020, at 11:45, Akim Demaille <akim@lrde.epita.fr> wrote:
>>>
>>> What locale are you using?
>>
>> LC_CTYPE=UTF-8
>
> The error goes away if setting LC_CTYPE=en_US.UTF-8 before recompiling the
> .yy file.
>
> UTF-8 is language independent, so MacOS uses LC_CTYPE=UTF-8, but there are
> software that require a prefix.
Hans,
This double-escaping of UTF-8 characters is a well-known problem
of parse.error=verbose, which led to the introduction of the "detailed"
parse.error. It was discussed extensively on Bison's lists, and is
documented in the NEWS of 3.6:
*** Improved syntax error messages
Two new values for the %define parse.error variable offer more control to
the user. Available in all the skeletons (C, C++, Java).
**** %define parse.error detailed
The behavior of "%define parse.error detailed" is closely resembling that
of "%define parse.error verbose" with a few exceptions. First, it is safe
to use non-ASCII characters in token aliases (with 'verbose', the result
depends on the locale with which bison was run). Second, a yysymbol_name
function is exposed to the user, instead of the yytnamerr function and the
yytname table. Third, token internationalization is supported (see
below).
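To make the "ciphering" concrete, here is a minimal sketch in C of what happens to a UTF-8 alias when every byte outside printable ASCII is written as a backslash escape. The helper escape_non_ascii is hypothetical and is not Bison's code; it simply assumes a locale in which no non-ASCII byte counts as printable, and it leaves quotes and backslashes alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not Bison's implementation: copy IN to OUT,
   writing every byte outside printable ASCII as a three-digit octal
   escape.  This assumes a locale where no non-ASCII byte is
   considered printable.  Output is truncated if OUT is too small.  */
static void
escape_non_ascii (const char *in, char *out, size_t cap)
{
  size_t n = 0;
  assert (cap > 0);
  for (const unsigned char *p = (const unsigned char *) in; *p; ++p)
    {
      if (0x20 <= *p && *p < 0x7f)
        {
          /* Printable ASCII: copied as is (room for it and the NUL).  */
          if (n + 1 >= cap)
            break;
          out[n++] = (char) *p;
        }
      else
        {
          /* Anything else: backslash plus three octal digits.  */
          if (n + 4 >= cap)
            break;
          n += (size_t) snprintf (out + n, cap - n, "\\%03o", (unsigned) *p);
        }
    }
  out[n] = '\0';
}
```

Applied to the two UTF-8 bytes of "Å" (0xC3 0x85), this yields the eight characters \303\205. If that text is then emitted into a string table with its backslashes escaped once more, the parser prints the escapes literally at run time instead of the character.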
Besides, I recently posted that Bison 3.7 will take a further step:
*** String aliases are faithfully propagated
Bison used to interpret user strings (i.e., decoding backslash escapes)
when reading them, and to escape them (i.e., issue non-printable
characters as backslash escapes, taking the locale into account) when
outputting them. As a consequence non-ASCII strings (say in UTF-8) ended
up "ciphered" as sequences of backslash escapes. This happened not only
in the generated sources (where the compiler will reinterpret them), but
also in all the generated reports (text, xml, html, dot, etc.). Reports
were therefore not readable when string aliases were not pure ASCII.
Worse yet: the output depended on the user's locale.
Now Bison faithfully treats the string aliases exactly the way the user
spelled them. This fixes all the aforementioned problems. However, now,
string aliases semantically equivalent but syntactically different (e.g.,
"A", "\x41", "\101") are considered to be different.
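The last point can be illustrated with a small C sketch. It compares the three aliases once as they are spelled in the grammar and once after a C compiler has decoded the escapes. The arrays and the functions same_spelling and same_meaning are names made up for this example, and the sketch assumes an ASCII-based execution character set, so that 'A' is 0x41.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The aliases as the user spelled them, i.e., the source text
   between the double quotes.  */
static const char *const spelling[] = { "A", "\\x41", "\\101" };

/* The same aliases once the C compiler has decoded the escapes
   (assumes an ASCII-based execution character set).  */
static const char *const decoded[] = { "A", "\x41", "\101" };

/* Compare two aliases by their spelling.  */
static int
same_spelling (size_t i, size_t j)
{
  assert (i < 3 && j < 3);
  return strcmp (spelling[i], spelling[j]) == 0;
}

/* Compare two aliases by the bytes they denote.  */
static int
same_meaning (size_t i, size_t j)
{
  assert (i < 3 && j < 3);
  return strcmp (decoded[i], decoded[j]) == 0;
}
```

Here same_meaning holds for every pair, while same_spelling holds only when an alias is compared with itself. Per the NEWS entry above, 3.7 goes by the spelling.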
So there is no new bug in 3.6 here, just something that has been well known
for ages, and that you and I have already discussed.