Re: From wchar_t to char32_t


From: Bruno Haible
Subject: Re: From wchar_t to char32_t
Date: Tue, 04 Jul 2023 00:00:21 +0200

Paul Eggert wrote:
> The complication would be needed because diffutils is trying to count 
> columns as it goes, and in some cases it needs to stop when a column 
> count has reached a maximum. It's not two lines of code.

Indeed. I need to check the mbiter and mbuiter modules, since they do
something similar...

In the big picture, we are talking about the levels of correctness that
code can reach in the described situation:

  Level 1: Behave incorrectly but don't crash. This is what code that
           uses mbrtowc() does.
           See my glibc bug report
           https://sourceware.org/bugzilla/show_bug.cgi?id=30611

  Level 2: Behave correctly, except that a 2-Unicode-character sequence
           may be split although it shouldn't.
           This is what code that uses mbrtoc32() does, when it has the
           lines
                if (bytes == (size_t) -3)
                  bytes = 0;
           Without these lines, the string pointer could be decremented
           by 3, thus accessing invalid memory or running into an endless
           loop.
           This level is also what
             printf ("%.*s", nbytes, string);
           does: it truncates strings at a position where they should not
           be truncated. So, it's not terribly uncommon.

  Level 3: Behave correctly. Don't split a 2-Unicode-character sequence.
           This is what code that uses mbrtoc32() does, when it has the
           lines
                if (bytes == (size_t) -3)
                  bytes = 0;
           and uses !mbsinit (&state) in the loop termination condition.
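
As an illustration of Level 3, here is a minimal sketch of such a loop.
The function name count_char32 is hypothetical and not part of gnulib.
It assumes that an implementation which has further char32_t values
pending leaves the state non-initial and returns (size_t) -3 on the next
call, even when no input bytes remain; whether a given libc actually
behaves that way is an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: count the char32_t values obtained by decoding
   the first N bytes of S in the current locale's encoding.
   Returns (size_t) -1 on an invalid or incomplete sequence.  */
static size_t
count_char32 (const char *s, size_t n)
{
  assert (s != NULL);

  mbstate_t state;
  memset (&state, 0, sizeof state);

  const char *p = s;
  const char *end = s + n;
  size_t count = 0;

  /* Level 3: keep looping while the state is not initial, so that a
     char32_t still pending after the last input byte is not dropped.  */
  while (p < end || !mbsinit (&state))
    {
      char32_t c;
      size_t bytes = mbrtoc32 (&c, p, (size_t) (end - p), &state);

      if (bytes == (size_t) -1 || bytes == (size_t) -2)
        return (size_t) -1;     /* invalid or incomplete input */

      if (bytes == (size_t) -3)
        bytes = 0;              /* c came from bytes consumed earlier */
      else if (bytes == 0)
        bytes = 1;              /* a NUL character occupies one byte */

      p += bytes;
      count++;
    }

  return count;
}
```

Dropping the !mbsinit (&state) part of the loop condition gives the
Level 2 behaviour: the loop can then stop between two char32_t values
that belong to the same multibyte sequence.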

> > ?? We are talking about 2 lines of code
> 
> Not for diffutils we aren't. If I understand things correctly, diffutils 
> would have to look ahead to the next mbrtoc32 call that returns a 
> nonnegative value before deciding what to do about the previous N calls 
> where the first returned a positive value and the remaining calls 
> returned (size_t) -3. This sort of lookahead would be doable but painful 
> with significant performance implications.

You're right that more than 2 lines of code are needed. But I think,
with the help of an mbsinit (&state) test, the added code and performance
implications can be kept small.
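
A rough sketch of what I have in mind, not the actual diffutils code:
the names prefix_within_columns and char_width are hypothetical, and
char_width simply counts every character as one column. The sketch
also assumes a non-stateful encoding, so that a non-initial state after
a completed call means more char32_t values are pending from the same
bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical width function: every character counts as one column.
   Real code would use a proper width lookup instead.  */
static int
char_width (char32_t c)
{
  (void) c;
  return 1;
}

/* Return the length in bytes of the longest prefix of S (N bytes) that
   fits in MAX_COLUMNS columns, without cutting between two char32_t
   values produced by the same multibyte sequence.  */
static size_t
prefix_within_columns (const char *s, size_t n, int max_columns)
{
  assert (s != NULL);

  mbstate_t state;
  memset (&state, 0, sizeof state);

  const char *p = s;
  const char *end = s + n;
  int columns = 0;

  while (p < end)
    {
      const char *group_start = p;
      int group_width = 0;

      /* Decode one multibyte sequence together with every char32_t
         it yields; the mbsinit test replaces explicit lookahead.  */
      do
        {
          char32_t c;
          size_t bytes = mbrtoc32 (&c, p, (size_t) (end - p), &state);

          if (bytes == (size_t) -1 || bytes == (size_t) -2)
            return (size_t) (group_start - s);  /* stop before bad input */

          if (bytes == (size_t) -3)
            bytes = 0;          /* c came from bytes consumed earlier */
          else if (bytes == 0)
            bytes = 1;          /* a NUL character occupies one byte */

          p += bytes;
          group_width += char_width (c);
        }
      while (!mbsinit (&state));

      if (columns + group_width > max_columns)
        return (size_t) (group_start - s);
      columns += group_width;
    }

  return n;
}
```

The extra cost in the common case is one mbsinit call per character;
the inner loop iterates more than once only for sequences that produce
several char32_t values.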

> Given your explanation, it doesn't sound like it's worth the effort.

I agree, and I explained in the glibc bug report that I would like the
zh_HK.BIG5-HKSCS locale to go away.

> I had been worried that one of these platforms would return (size_t) -3 
> and in that case I supposed we would need to switch diffutils back to 
> wchar_t for portability to these platforms without worrying about -3. 
> I'm glad to hear this is not the case.

No need to worry in this direction. Cygwin and native Windows don't
support as many encodings as glibc does.

Bruno


2023-07-03  Bruno Haible  <bruno@clisp.org>

        mbrtoc32: Document another glibc bug.
        * doc/posix-functions/mbrtoc32.texi: Reference the glibc bug in
        BIG5-HKSCS locales.

diff --git a/doc/posix-functions/mbrtoc32.texi b/doc/posix-functions/mbrtoc32.texi
index 93a7aa64ff..3528114bec 100644
--- a/doc/posix-functions/mbrtoc32.texi
+++ b/doc/posix-functions/mbrtoc32.texi
@@ -38,6 +38,11 @@
 Portability problems not fixed by Gnulib:
 @itemize
 @item
+This function behaves incorrectly when converting precomposed characters
+from the BIG5-HKSCS encoding:
+@c https://sourceware.org/bugzilla/show_bug.cgi?id=30611
+glibc 2.36.
+@item
 Although ISO C says this function can return @code{(size_t) -3},
 no known implementation behaves that way,
 and if it were to happen it would break common uses.





