bug-bison

Re: Error UTF-8 strings


From: Akim Demaille
Subject: Re: Error UTF-8 strings
Date: Mon, 22 Jun 2020 07:59:18 +0200


> On 21 Jun 2020, at 15:24, Hans Åberg <haberg-1@telia.com> wrote:
> 
> 
>> On 21 Jun 2020, at 14:25, Hans Åberg <haberg-1@telia.com> wrote:
>> 
>>> On 21 Jun 2020, at 11:45, Akim Demaille <akim@lrde.epita.fr> wrote:
>>> 
>>> What locale are you using?
>> 
>> LC_CTYPE=UTF-8
> 
> The error goes away if setting LC_CTYPE=en_US.UTF-8 before recompiling the 
> .yy file.
> 
> UTF-8 is language independent, so MacOS uses LC_CTYPE=UTF-8, but some 
> software requires a language prefix.

Hans,

This double-escaping of UTF-8 characters is a well-known problem
of parse.error=verbose, and it is what led to the introduction of the
"detailed" value of parse.error.  It was discussed extensively on Bison's
lists, and is documented in the NEWS of 3.6:



*** Improved syntax error messages

  Two new values for the %define parse.error variable offer more control to
  the user.  Available in all the skeletons (C, C++, Java).

**** %define parse.error detailed

  The behavior of "%define parse.error detailed" is closely resembling that
  of "%define parse.error verbose" with a few exceptions.  First, it is safe
  to use non-ASCII characters in token aliases (with 'verbose', the result
  depends on the locale with which bison was run).  Second, a yysymbol_name
  function is exposed to the user, instead of the yytnamerr function and the
  yytname table.  Third, token internationalization is supported (see
  below).



Besides, I have recently posted that Bison 3.7 will also make another step:



*** String aliases are faithfully propagated

  Bison used to interpret user strings (i.e., decoding backslash escapes)
  when reading them, and to escape them (i.e., issue non-printable
  characters as backslash escapes, taking the locale into account) when
  outputting them.  As a consequence non-ASCII strings (say in UTF-8) ended
  up "ciphered" as sequences of backslash escapes.  This happened not only
  in the generated sources (where the compiler will reinterpret them), but
  also in all the generated reports (text, xml, html, dot, etc.).  Reports
  were therefore not readable when string aliases were not pure ASCII.
  Worse yet: the output depended on the user's locale.
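
To make the "ciphering" described above concrete, here is a minimal sketch in Python. It is not Bison's code: `escape_bytes` is a hypothetical helper that assumes a locale in which no non-ASCII byte counts as printable, and it shows what happens when an already escaped string is escaped a second time.

```python
def escape_bytes(data: bytes) -> bytes:
    """Hypothetical helper (not Bison's implementation): emit backslashes and
    bytes outside printable ASCII as backslash escapes."""
    out = bytearray()
    for b in data:
        if b == 0x5C:                 # a backslash is itself escaped
            out += b"\\\\"
        elif b < 0x20 or b > 0x7E:    # assumed non-printable in this locale
            out += b"\\%03o" % b      # three-digit octal escape
        else:
            out.append(b)
    return bytes(out)


alias = "Åberg".encode("utf-8")       # the bytes C3 85 followed by "berg"
once = escape_bytes(alias)            # the form a compiler would decode back
twice = escape_bytes(once)            # escaping the escaped form again

print(once.decode("ascii"))
print(twice.decode("ascii"))
```

After one pass the compiler can still recover the original bytes from the octal escapes; after a second pass the backslashes themselves are escaped, so the user ends up seeing the escape sequences instead of the character.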

  Now Bison faithfully treats the string aliases exactly the way the user
  spelled them.  This fixes all the aforementioned problems.  However, now,
  string aliases semantically equivalent but syntactically different (e.g.,
  "A", "\x41", "\101") are considered to be different.
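
The consequence in the last sentence can be sketched as follows, again in Python and only as an illustration. The three spellings are the ones from the NEWS entry; using `ast.literal_eval` to decode them is an assumption that works here because Python and C agree on these particular escapes.

```python
import ast

# The aliases exactly as a user would spell them in a grammar file.
spellings = ['"A"', r'"\x41"', r'"\101"']

# Decoding the backslash escapes (what Bison used to do when reading).
decoded = [ast.literal_eval(s) for s in spellings]

same_when_decoded = len(set(decoded)) == 1       # all three denote "A"
distinct_as_spelled = len(set(spellings)) == 3   # three different texts

print(decoded, same_when_decoded, distinct_as_spelled)
```

Comparing the decoded values treats the three as one alias, while comparing the text as written, which is what faithful propagation amounts to, keeps them apart.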



So there is no new bug in 3.6 here, just something that has been well known
for ages, and that you and I have already discussed.

