Re: From wchar_t to char32_t
From: Bruno Haible
Subject: Re: From wchar_t to char32_t
Date: Mon, 10 Jul 2023 16:58:43 +0200
Paul Eggert wrote on 2023-07-06:
> in reviewing it found a minor
> glitch or two and some opportunities for simplification. I installed the
> attached further patch which I hope fixes glitches without breaking
> anything else.
Comments:
- Typo: s/mbrtoc23/mbrtoc32/
- The rationale for defining and initializing the mbstate_t at function
  scope was that on BSD and macOS systems an mbstate_t is 128 bytes in
  size, so the time to zero-initialize it is not negligible. The code with
  a minimal-scope mbstate_t is clearer, but slower on BSD systems (assuming
  a string with many switches between ASCII and non-ASCII characters).
  OTOH, on a purely ASCII string, it's obviously faster not to initialize
  an mbstate_t at all than to initialize it once.
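
To make the trade-off concrete, here is a minimal sketch of the two placements. It is not the code from the patch: the names decode_one, count_function_scope and count_minimal_scope are hypothetical, and the ASCII fast path assumes a locale encoding in which a byte below 0x80 at a character boundary is a complete character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: decode the character that starts at S (N > 0 bytes
   available) with *PS, which must be in the initial state on entry.
   Adds the number of char32_t values produced to *COUNT and returns the
   number of bytes consumed (at least 1).  */
static size_t
decode_one (const char *s, size_t n, mbstate_t *ps, size_t *count)
{
  char32_t c;
  size_t r = mbrtoc32 (&c, s, n, ps);
  ++*count;
  if (r == 0 || r == (size_t) -1 || r == (size_t) -2 || r == (size_t) -3)
    {
      /* Invalid, incomplete or unexpected result: treat the byte as one
         unit, skip it, and return to the initial state.  */
      memset (ps, 0, sizeof *ps);
      return 1;
    }
  /* One multibyte sequence may map to several char32_t values; the extra
     ones are reported by (size_t) -3 without consuming further bytes.  */
  while (!mbsinit (ps))
    {
      if (mbrtoc32 (&c, s + r, n - r, ps) != (size_t) -3)
        {
          memset (ps, 0, sizeof *ps);
          break;
        }
      ++*count;
    }
  return r;
}

/* Function-scope state: zeroed once per call, even for pure ASCII.  */
size_t
count_function_scope (const char *s)
{
  size_t n = strlen (s), count = 0;
  mbstate_t mbs;
  memset (&mbs, 0, sizeof mbs);
  for (size_t i = 0; i < n; )
    {
      if ((unsigned char) s[i] < 0x80)
        { i++; count++; }               /* ASCII fast path, no state */
      else
        i += decode_one (s + i, n - i, &mbs, &count);
    }
  return count;
}

/* Minimal-scope state: zeroed once per non-ASCII character.  */
size_t
count_minimal_scope (const char *s)
{
  size_t n = strlen (s), count = 0;
  for (size_t i = 0; i < n; )
    {
      if ((unsigned char) s[i] < 0x80)
        { i++; count++; }               /* ASCII fast path, no state */
      else
        {
          mbstate_t mbs;
          memset (&mbs, 0, sizeof mbs);
          i += decode_one (s + i, n - i, &mbs, &count);
        }
    }
  return count;
}
```

Both functions return the same count; they differ only in how often the mbstate_t is zeroed. The first pays for one memset per call, the second for one memset per non-ASCII character and none at all on a purely ASCII string.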
> One other thing I discovered in my review. POSIX says that 'diff' need
> not support locking-shift sequences[1], and this business of mbrtoc23
> returning (size_t) -3 is in a murky area as it would appear to fall into
> the locking-shift sequence category (at any rate, it doesn't appear to
> be a single-shift encoding which is POSIX's only other option for
> state-dependent encodings). Or maybe the next version of POSIX will have
> to change in this area?
> [1]:
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_02
I think this wording regarding "single-shift" sequences and "locking-shift"
sequences is more than 20 years old:
- "single-shift" encodings are encodings such as EUC-JP. Before 2000, some
people viewed them as encodings with "shift". This paragraph is merely
a clarification that it's better to view these encodings as normal
multibyte encodings without shift.
- "locking-shift" encodings are things like ISO-2022-JP-2. Around 1999,
some people were experimenting with a hacked Linux libc that used
this encoding as a locale encoding. Of course, the resulting system
was full of bugs, because even simple operations such as concatenating
two directory names sometimes produced wrong results. And I'm not
even talking about the missing normalization of file names...
So, since 2000, there has been general agreement that "locking-shift"
encodings are not usable as locale encodings. They are usable only
through the 'iconv' facility.
POSIX does not have a term for the type of encoding that BIG5-HKSCS is,
where an (indivisible) multibyte sequence maps to a sequence of 2 Unicode
characters.
Bruno