Re: From wchar_t to char32_t
From: Bruno Haible
Subject: Re: From wchar_t to char32_t
Date: Mon, 10 Jul 2023 17:10:34 +0200
Regarding my proposed 'dfa' module patch:
Paul Eggert wrote on 2023-07-04:
> > - wchar_t wch;
> > - size_t nbytes = mbrtowc (&wch, s, n, &d->mbs);
> > + char32_t wch;
> > + size_t nbytes = mbrtoc32 (&wch, s, n, &d->mbs);
> > if (0 < nbytes && nbytes < (size_t) -2)
> > {
> > *pwc = wch;
> > + if (nbytes == (size_t) -3)
> > + nbytes = 0;
> > return nbytes;
>
> That last change doesn't match the comment for the mbs_to_wchar
> function, which says that the function always returns a positive int.
> Callers depend on this.
Indeed, the function fetch_wc and its callers expect that the 'wctok'
field contains an integer that represents the multibyte sequence as a whole.
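To make the conflict concrete, here is a minimal sketch of the mbrtoc32
return values as C11 specifies them, together with the range test from the
quoted patch. The names c32_result, classify_mbrtoc32_result and
in_patch_range are hypothetical and exist only for this illustration; they
are not part of the 'dfa' module.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only; not part of dfa.c.  */
enum c32_result
{
  C32_INVALID,     /* (size_t) -1: encoding error */
  C32_INCOMPLETE,  /* (size_t) -2: incomplete multibyte character */
  C32_PENDING,     /* (size_t) -3: a further char32_t from the previous
                      multibyte character; no input bytes consumed */
  C32_NUL,         /* 0: the null character was decoded */
  C32_CONSUMED     /* anything else: that many bytes were consumed */
};

static enum c32_result
classify_mbrtoc32_result (size_t nbytes)
{
  if (nbytes == (size_t) -1)
    return C32_INVALID;
  if (nbytes == (size_t) -2)
    return C32_INCOMPLETE;
  if (nbytes == (size_t) -3)
    return C32_PENDING;
  if (nbytes == 0)
    return C32_NUL;
  return C32_CONSUMED;
}

/* The range test used in the quoted patch.  (size_t) -3 is smaller than
   (size_t) -2, so it passes this test.  */
static int
in_patch_range (size_t nbytes)
{
  return 0 < nbytes && nbytes < (size_t) -2;
}
```

Since (size_t) -3 passes the range test and consumes no input, the patch
has to map it to 0, and a return value of 0 is exactly what the "always
positive" contract rules out.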
Fundamentally the problem is that in a character range in a regex
[ MB1 MB2 ... ]
the multibyte sequence boundaries are also the parse boundaries. If MB1
gets transformed to two Unicode characters U1 U2, the character range
[ U1 U2 MB2 ... ]
is something very different.
So, in locales whose encoding is BIG5-HKSCS we would have a problem.
We would need to distinguish between application uses where it is OK
to split a character into several Unicode characters (such as
computing the total width, as in mbswidth.c) and application uses where
the multibyte character must be kept together, with a Unicode-side
representation consisting of several Unicode characters.
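As a rough sketch of the second kind of use, the following keeps one
multibyte character together as a single unit and records every char32_t
it expands to. Everything named here (struct mb_unit, MB_UNIT_MAX,
decode_mb_unit) is hypothetical and not taken from gnulib. The sketch
assumes that at most MB_UNIT_MAX Unicode characters result from one
multibyte character and that the null character occupies one byte.
Whether (size_t) -3 is ever returned depends on the C library and the
locale.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Assumed upper bound on char32_t values per multibyte character.  */
enum { MB_UNIT_MAX = 4 };

/* Hypothetical record: one multibyte character kept as one parse unit.  */
struct mb_unit
{
  size_t nbytes;                 /* bytes consumed, always positive */
  size_t nchars;                 /* number of entries used in chars[] */
  char32_t chars[MB_UNIT_MAX];   /* the Unicode-side representation */
};

/* Decode exactly one multibyte character from S, with N bytes available.
   Return 1 on success, 0 on an invalid or incomplete sequence.  */
static int
decode_mb_unit (struct mb_unit *u, const char *s, size_t n, mbstate_t *ps)
{
  char32_t ch;
  size_t nbytes = mbrtoc32 (&ch, s, n, ps);

  /* -1 and -2 are failures.  -3 on the first call would mean that the
     caller passed a state with output still pending; reject that too.  */
  if (nbytes >= (size_t) -3)
    {
      memset (ps, 0, sizeof *ps);
      return 0;
    }
  if (nbytes == 0)
    nbytes = 1;                  /* assumption: the null character is 1 byte */

  u->nbytes = nbytes;
  u->nchars = 0;
  u->chars[u->nchars++] = ch;

  /* Collect further char32_t values that belong to the same multibyte
     character.  They are announced by a return value of (size_t) -3.  */
  while (u->nchars < MB_UNIT_MAX)
    {
      mbstate_t saved = *ps;
      if (mbrtoc32 (&ch, s + nbytes, n - nbytes, ps) != (size_t) -3)
        {
          *ps = saved;           /* not a continuation: undo the probe */
          break;
        }
      u->chars[u->nchars++] = ch;
    }
  return 1;
}
```

A caller such as a regex parser could then treat each struct mb_unit as
one element of a character range, while a width computation could simply
iterate over chars[] and ignore the grouping.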
Bruno