bug-gnulib

Re: From wchar_t to char32_t


From: Paul Eggert
Subject: Re: From wchar_t to char32_t
Date: Mon, 3 Jul 2023 16:30:04 -0700

On 2023-07-03 15:00, Bruno Haible wrote:
   Level 3: Behave correctly. Don't split a 2-Unicode-character sequence.
            This is what code that uses mbrtoc32() does, when it has the
            lines
                 if (bytes == (size_t) -3)
                   bytes = 0;
            and uses !mbsinit (&state) in the loop termination condition.

With diffutils even level 3 would not suffice. diffutils truncates at input byte boundaries, so merely treating (size_t) -3 as zero is not enough, even if one also checks mbsinit. Instead, one would have to treat all the characters in the sequence ABBB... (where A is an ordinary multibyte character and the Bs all return (size_t) -3) as a single unit, because one cannot truncate in the middle of that sequence. Or wait a minute: in theory I suppose it could even be an arbitrary sequence of As and Bs, so long as the total of the "sizes" of the As equals the number of bytes in the original byte sequence that stands for the series of characters.
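
To make the "single unit" idea concrete, here is a rough sketch (not diffutils or gnulib code) of a helper that finds the largest byte offset at or below a limit where a buffer could be cut without landing inside such a unit. The name safe_cut and its handling of invalid and incomplete input are assumptions for illustration only. It relies on mbsinit reporting the initial state only after the last B of a unit, and on the caller having selected the locale already.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: return the largest offset <= LIMIT at which
   BUF[0..LEN) can be truncated without splitting a unit, where a unit
   is one call that consumes bytes followed by any calls that return
   (size_t) -3.  Decoding uses the current locale.  */
size_t
safe_cut (const char *buf, size_t len, size_t limit)
{
  assert (limit <= len);

  mbstate_t state;
  memset (&state, 0, sizeof state);
  size_t pos = 0;   /* bytes consumed so far */
  size_t safe = 0;  /* last unit boundary that is <= limit */

  while (pos < len || !mbsinit (&state))
    {
      char32_t ch;
      size_t bytes = mbrtoc32 (&ch, buf + pos, len - pos, &state);

      if (bytes == (size_t) -1)
        {
          /* Invalid byte: assume it stands alone, and start afresh.  */
          memset (&state, 0, sizeof state);
          bytes = 1;
        }
      else if (bytes == (size_t) -2)
        {
          /* Incomplete character at the end: the rest is one unit.  */
          memset (&state, 0, sizeof state);
          bytes = len - pos;
        }
      else if (bytes == (size_t) -3)
        bytes = 0;  /* another character from bytes already consumed */
      else if (bytes == 0)
        bytes = 1;  /* a null character occupies one byte */

      pos += bytes;

      /* Only an initial state marks the end of a unit.  */
      if (mbsinit (&state))
        {
          if (pos > limit)
            break;
          safe = pos;
        }
    }
  return safe;
}
```

On a platform where mbrtoc32 never returns (size_t) -3 and the encoding has no shift state, this degenerates to finding the last character boundary at or before the limit.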

The diffutils truncation approach also has problems with coding systems that have shift state, but that's OK: nobody uses these coding systems with GNU apps as they're not practical. Similarly, any platform where mbrtoc32 returns (size_t) -3 won't be practical with GNU apps, so it should be OK for diffutils to not worry about this possibility either, given that it would be a hassle to support it. We don't have time to support every oddball coding system that POSIX allows.


