From: | Paul Eggert |
Subject: | Re: mbcel module for Gnulib?, incomplete multibyte sequences |
Date: | Mon, 24 Jul 2023 16:26:05 -0700 |
User-agent: | Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0 |
On 2023-07-24 15:58, Bruno Haible wrote:
> Paul Eggert wrote:
>> in UTF-8 the byte sequence E0 80 is not an incomplete character (in the sense that additional bytes may lead to a complete character), because every byte you append to E0 80 causes glibc mbrtoc32 to return (size_t) -1. Yet glibc mbrtoc32 returns (size_t) -2 for E0 80.
>
> And gnulib/lib/unistr/u8-mbtouc-aux.c does it wrong as well! The return value for E0 {80..9F} should be (size_t) -1, because U+0800 is E0 A0 80. I'll fix the gnulib part soon.
>
> Very good point. It looks like few people understood the implications of https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf page 125, table 3-7.
I hope we don't need to replace mbrtoc32 merely because of this obscure issue. Or at least if mbrtoc32 can be replaced, I hope an application can disable replacement merely because of this issue, assuming the application doesn't need to worry about the issue and there is a performance benefit to not replacing.
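For reference, the distinction discussed above (a prefix that can still be completed versus one that no further byte can rescue) can be sketched in a few lines of C. This is only an illustration of table 3-7, not the code in glibc or in gnulib's u8-mbtouc-aux.c; the names u8_status and u8_classify are hypothetical and were made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, for illustration only.  */
enum u8_status { U8_COMPLETE, U8_INCOMPLETE, U8_INVALID };

/* Classify the first character in the N bytes at S according to
   Unicode table 3-7 (well-formed UTF-8 byte sequences).
   U8_INCOMPLETE means more bytes could still yield a character;
   U8_INVALID means no continuation can make the prefix well-formed.  */
enum u8_status
u8_classify (const unsigned char *s, size_t n)
{
  assert (s != NULL || n == 0);
  if (n == 0)
    return U8_INCOMPLETE;

  unsigned char b0 = s[0];
  size_t len;
  /* Allowed range of the second byte; narrowed for E0, ED, F0, F4.  */
  unsigned char lo = 0x80, hi = 0xBF;

  if (b0 <= 0x7F)
    return U8_COMPLETE;
  else if (0xC2 <= b0 && b0 <= 0xDF)
    len = 2;
  else if (0xE0 <= b0 && b0 <= 0xEF)
    {
      len = 3;
      if (b0 == 0xE0)
        lo = 0xA0;              /* excludes overlong forms: U+0800 is E0 A0 80 */
      if (b0 == 0xED)
        hi = 0x9F;              /* excludes surrogates */
    }
  else if (0xF0 <= b0 && b0 <= 0xF4)
    {
      len = 4;
      if (b0 == 0xF0)
        lo = 0x90;              /* excludes overlong forms */
      if (b0 == 0xF4)
        hi = 0x8F;              /* excludes values above U+10FFFF */
    }
  else
    return U8_INVALID;          /* 80..C1 and F5..FF never start a character */

  /* Check the trailing bytes that are present so far.  */
  for (size_t i = 1; i < len && i < n; i++)
    {
      unsigned char l = i == 1 ? lo : 0x80;
      unsigned char h = i == 1 ? hi : 0xBF;
      if (s[i] < l || h < s[i])
        return U8_INVALID;
    }
  return n < len ? U8_INCOMPLETE : U8_COMPLETE;
}
```

Under this sketch, E0 80 is classified as invalid after two bytes, whereas E0 A0 is incomplete and E0 A0 80 is complete. An mbrtoc32 that follows the same table would return (size_t) -1 rather than (size_t) -2 for E0 80.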