Re: From wchar_t to char32_t

bug-gnulib

From:	Bruno Haible
Subject:	Re: From wchar_t to char32_t
Date:	Mon, 10 Jul 2023 16:58:43 +0200

Paul Eggert wrote on 2023-07-06:
> in reviewing it found a minor 
> glitch or two and some opportunities for simplification. I installed the 
> attached further patch which I hope fixes glitches without breaking 
> anything else.

Comments:

  - Typo: s/mbrtoc23/mbrtoc32/

  - The rationale for defining and initializing the mbstate_t at function
    scope was that on BSD and macOS systems an mbstate_t is 128 bytes in
    size, so the time to zero-initialize it is not negligible. The code with
    a minimal-scope mbstate_t is clearer, but slower on BSD systems (assuming
    a string with many switches between ASCII and non-ASCII characters).
    OTOH, on a purely ASCII string, it's obviously faster not to initialize
    an mbstate_t at all than to initialize one.
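
To make the trade-off concrete, here is a minimal sketch of the two
placements. It is not the code from the patch: the function names and the
is_plain_ascii helper are hypothetical, and it assumes an encoding in which
bytes below 0x80 always stand for themselves and in which mbrtoc32 never
returns (size_t) -3 (all three special return values are simply treated as
failure).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Assumption for this sketch: bytes below 0x80 always stand for
   themselves, so they can bypass mbrtoc32 entirely.  */
static int
is_plain_ascii (char c)
{
  return (unsigned char) c < 0x80;
}

/* Variant A (hypothetical name): the mbstate_t lives at function scope.
   It is zeroed exactly once per call, even if S is pure ASCII.
   Returns the number of characters in S, or (size_t) -1 on failure.  */
size_t
count_chars_function_scope (const char *s)
{
  assert (s != NULL);
  const char *end = s + strlen (s);
  size_t count = 0;
  mbstate_t state;
  memset (&state, 0, sizeof state);
  while (s < end)
    {
      if (is_plain_ascii (*s))
        {
          s++;
          count++;
          continue;
        }
      char32_t ch;
      size_t n = mbrtoc32 (&ch, s, end - s, &state);
      /* 0, (size_t) -1, -2 and -3 are all treated as failure here.  */
      if (n == 0 || n >= (size_t) -3)
        return (size_t) -1;
      s += n;
      count++;
    }
  return count;
}

/* Variant B (hypothetical name): the mbstate_t has minimal scope.
   It is never zeroed for pure ASCII input, but it is zeroed again for
   every non-ASCII character that is encountered.  */
size_t
count_chars_minimal_scope (const char *s)
{
  assert (s != NULL);
  const char *end = s + strlen (s);
  size_t count = 0;
  while (s < end)
    {
      if (is_plain_ascii (*s))
        {
          s++;
          count++;
          continue;
        }
      mbstate_t state;
      memset (&state, 0, sizeof state);
      char32_t ch;
      size_t n = mbrtoc32 (&ch, s, end - s, &state);
      if (n == 0 || n >= (size_t) -3)
        return (size_t) -1;
      s += n;
      count++;
    }
  return count;
}
```

Both variants return the same count; they differ only in how often the
memset runs: once per call in variant A, once per non-ASCII character in
variant B.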

> One other thing I discovered in my review. POSIX says that 'diff' need 
> not support locking-shift sequences[1], and this business of mbrtoc23 
> returning (size_t) -3 is in a murky area as it would appear to fall into 
> the locking-shift sequence category (at any rate, it doesn't appear to 
> be a single-shift encoding which is POSIX's only other option for 
> state-dependent encodings). Or maybe the next version of POSIX will have 
> to change in this area?
> [1]: 
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_02

I think this wording regarding "single-shift" sequences and "locking-shift"
sequences is more than 20 years old:

  - "single-shift" encodings are encodings such as EUC-JP. Before 2000, some
    people viewed them as encodings with "shift". This paragraph is merely
    a clarification that it's better to view these encodings as normal
    multibyte encodings without shift.

  - "locking-shift" encodings are things like ISO-2022-JP-2. Around 1999,
    some people were experimenting with a hacked Linux libc that used
    this encoding as a locale encoding. Of course, the resulting system
    was full of bugs, because even simple operations such as concatenating
    two directory names sometimes produced wrong results. And I'm not
    even talking about the missing normalization of file names...
    So, since 2000, there has been general agreement that "locking-shift"
    encodings are not usable as locale encodings. They are merely usable
    with the 'iconv' facility.
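
As an illustration of the 'iconv' route, a minimal sketch that converts a
buffer from ISO-2022-JP-2 to UTF-8 in one go. The function name is
hypothetical, and whether the "ISO-2022-JP-2" converter is available
depends on the iconv implementation and on which conversion modules are
installed; the sketch reports failure in that case.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: converts the LEN bytes at IN from ISO-2022-JP-2
   to UTF-8, writing at most OUT_SIZE bytes to OUT.
   Returns the number of bytes written, or (size_t) -1 on failure
   (invalid or incomplete input, output buffer too small, or the
   converter not being available on this system).  */
size_t
iso2022jp2_to_utf8 (const char *in, size_t len, char *out, size_t out_size)
{
  assert (in != NULL && out != NULL);
  iconv_t cd = iconv_open ("UTF-8", "ISO-2022-JP-2");
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  char *inp = (char *) in;      /* iconv's prototype is not const-correct */
  char *outp = out;
  size_t in_left = len;
  size_t out_left = out_size;

  /* The shift state is tracked inside CD, not in the caller's data.  */
  size_t r = iconv (cd, &inp, &in_left, &outp, &out_left);
  if (r != (size_t) -1)
    /* Flush any pending output and reset the conversion state.  */
    r = iconv (cd, NULL, NULL, &outp, &out_left);
  iconv_close (cd);

  if (r == (size_t) -1)
    return (size_t) -1;
  return out_size - out_left;
}
```

In such input, ESC $ B selects JIS X 0208 and ESC ( B returns to ASCII.
The converter has to remember that across bytes, which is exactly why
naive concatenation of two such strings can go wrong.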

POSIX does not have a term for the type of encoding that BIG5-HKSCS is,
where an (indivisible) multibyte-sequence maps to a sequence of 2 Unicode
characters.
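
For reference, a minimal sketch of a decoding loop that copes with this
case, i.e. with mbrtoc32 returning (size_t) -3 to deliver a second
char32_t without consuming any further bytes. The function name is
hypothetical; the return-value handling follows the ISO C description of
mbrtoc32, and whether a given libc actually returns (size_t) -3 for
BIG5-HKSCS is platform-dependent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: decodes the NUL-terminated multibyte string S
   (in the current locale's encoding) into OUT, which has room for
   OUT_MAX code points.  Returns the number of code points stored, or
   (size_t) -1 on invalid input, incomplete input, or lack of room.  */
size_t
decode_to_char32 (const char *s, char32_t *out, size_t out_max)
{
  assert (s != NULL && out != NULL);
  const char *end = s + strlen (s);
  size_t produced = 0;
  mbstate_t state;
  memset (&state, 0, sizeof state);

  for (;;)
    {
      char32_t ch;
      /* Include the terminating NUL, so that the end of the string is
         reported by a return value of 0.  */
      size_t n = mbrtoc32 (&ch, s, (size_t) (end - s) + 1, &state);

      if (n == (size_t) -1 || n == (size_t) -2)
        return (size_t) -1;     /* invalid or incomplete sequence */
      if (n == 0)
        break;                  /* reached the terminating NUL */
      if (produced == out_max)
        return (size_t) -1;     /* no room left in OUT */

      out[produced++] = ch;

      /* (size_t) -3: CH is a further code point from the previous
         multibyte sequence; no input bytes were consumed.  */
      if (n != (size_t) -3)
        s += n;
    }
  return produced;
}
```

A single indivisible multibyte sequence can thus contribute two entries to
OUT, so byte offsets and code point counts cannot be mapped one-to-one.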

Bruno
