Re: From wchar_t to char32_t

bug-gnulib

From:	Bruno Haible
Subject:	Re: From wchar_t to char32_t
Date:	Mon, 10 Jul 2023 17:10:34 +0200

Regarding my proposed 'dfa' module patch:
Paul Eggert wrote on 2023-07-04:
> > -      wchar_t wch;
> > -      size_t nbytes = mbrtowc (&wch, s, n, &d->mbs);
> > +      char32_t wch;
> > +      size_t nbytes = mbrtoc32 (&wch, s, n, &d->mbs);
> >        if (0 < nbytes && nbytes < (size_t) -2)
> >          {
> >            *pwc = wch;
> > +          if (nbytes == (size_t) -3)
> > +            nbytes = 0;
> >            return nbytes;
> 
> That last change doesn't match the comment for the mbs_to_wchar 
> function, which says that the function always returns a positive int. 
> Callers depend on this.

Indeed, the function fetch_wc and its callers expect that the 'wctok'
field contains an integer that represents the multibyte sequence as a whole.

Fundamentally the problem is that in a character range in a regex
  [ MB1 MB2 ... ]
the multibyte sequence boundaries are also the parse boundaries. If MB1
gets transformed to two Unicode characters U1 U2, the character range
  [ U1 U2 MB2 ... ]
is something very different.
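
To make the boundary problem concrete, here is a minimal sketch in C. It is
not dfa.c code: toy_decode and bracket_items are hypothetical names, and the
"encoding" is a made-up one in which the two bytes 0x88 0x62 stand for a
single character that decodes to two code points. The sketch only counts how
many items a bracket would contain under each interpretation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder for a toy encoding (an assumption for illustration,
   not a real locale): the two-byte sequence 0x88 0x62 decodes to TWO code
   points; every other byte is a one-byte character.
   Stores the code points in OUT and their count in *N_OUT.
   Returns the number of bytes consumed (1 or 2).  */
static size_t
toy_decode (const unsigned char *s, size_t avail,
            uint32_t out[2], size_t *n_out)
{
  if (avail >= 2 && s[0] == 0x88 && s[1] == 0x62)
    {
      out[0] = 0x00CA;
      out[1] = 0x0304;
      *n_out = 2;
      return 2;
    }
  out[0] = s[0];
  *n_out = 1;
  return 1;
}

/* Count the items of a bracket body S of LEN bytes.
   If SPLIT is zero, each multibyte sequence is one item: [ MB1 MB2 ... ].
   If SPLIT is nonzero, each code point is one item: [ U1 U2 MB2 ... ].  */
size_t
bracket_items (const unsigned char *s, size_t len, int split)
{
  size_t items = 0;
  size_t i = 0;
  while (i < len)
    {
      uint32_t cp[2];
      size_t n_cp;
      i += toy_decode (s + i, len - i, cp, &n_cp);
      items += split ? n_cp : 1;
    }
  return items;
}
```

For the body 0x88 0x62 'a', the first interpretation sees two items and the
second sees three, so the two brackets do not match the same set of inputs.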

So, in locales where the locale encoding is BIG5-HKSCS we would have a
problem. We would need to distinguish application uses where it is OK
to split a character into several Unicode characters (such as for
computing the total width, as in mbswidth.c) and application uses where
the multibyte character must be kept together, with a Unicode-side
representation of several Unicode characters.
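
As an illustration of the first kind of use, where splitting is harmless,
the following sketch walks a NUL-terminated string with the standard C11
mbrtoc32 and simply tallies every char32_t it produces, including those
announced by a (size_t) -3 return. The function name count_char32 is
hypothetical, and this is a sketch under the assumption of a C11 <uchar.h>,
not code from gnulib.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Count the char32_t values that mbrtoc32 produces for the NUL-terminated
   string S in the current locale.  A multibyte character that maps to
   several char32_t values contributes several to the count.
   Returns (size_t) -1 if S contains an invalid or incomplete sequence.  */
size_t
count_char32 (const char *s)
{
  size_t n = strlen (s) + 1;    /* include the terminating NUL */
  size_t count = 0;
  mbstate_t state;
  memset (&state, 0, sizeof state);

  for (;;)
    {
      char32_t c;
      size_t ret = mbrtoc32 (&c, s, n, &state);
      if (ret == 0)
        return count;           /* reached the terminating NUL */
      if (ret == (size_t) -1 || ret == (size_t) -2)
        return (size_t) -1;     /* invalid or incomplete sequence */
      count++;
      if (ret != (size_t) -3)
        {
          /* RET bytes were consumed.  For (size_t) -3, no bytes were
             consumed: C is a further value from the previous sequence.  */
          s += ret;
          n -= ret;
        }
    }
}
```

A caller that needs the multibyte character kept together cannot use such a
loop as is, because the (size_t) -3 results carry no byte boundary of their
own.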

Bruno
