[bug-gnu-libiconv] Byte-order learned from BOM is forgotten on reset


From: Tomas Kalibera
Subject: [bug-gnu-libiconv] Byte-order learned from BOM is forgotten on reset
Date: Thu, 12 Dec 2024 14:12:37 +0100
User-agent: Mozilla Thunderbird

Dear developers,

libiconv forgets the byte order it has learned from the BOM when a reset is issued. The attached program demonstrates the problem, which is still present in version 1.17. Perhaps this is a bug?

The iconv implementation in glibc doesn't do this (tested on Ubuntu 24.04).

Ulrich Drepper writes [https://bugzilla.redhat.com/show_bug.cgi?id=165368]:

"Flushing using iconv() only resets the shift state.  This is needed for
stateful encodings with states where the caller wants a converted string to
end in the initial state.  The BOM recognition has nothing to do with shift
states.  Once the byte order is determined this is a property which stays
with the iconv_t descriptor for its lifetime."

POSIX says: "For state-dependent encodings, the conversion descriptor cd is placed into its initial shift state by a call for which inbuf is a null pointer," but neither UTF-16 nor UTF-32 is a state-dependent encoding (section 6.2 clarifies what is meant by state-dependent encodings).

On Ubuntu 24.04 with glibc iconv, the output of the test program is:

Conversion works iteratively (little-endian).
Conversion works iteratively (big-endian).

With GNU libiconv 1.17 it is:

Iterative conversion fails with reset (little-endian).
Conversion works iteratively (big-endian).
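
For readers without the attachment, a check of this kind can be sketched as follows. This is not the attached reset.c, only a minimal illustration under stated assumptions: the helper name survives_reset is hypothetical, the input is assumed to be exactly six bytes (a BOM followed by 'A' and 'B' in UTF-16), and the reset is issued with all four pointer arguments null.

```c
#include <iconv.h>
#include <string.h>

/* Hypothetical helper, not the attached reset.c.
   `bytes` is assumed to hold 6 bytes: a UTF-16 BOM, then 'A', then 'B'.
   Returns 1 if the output is "AB", 0 if it is not, -1 if iconv_open fails. */
static int survives_reset(const char *bytes)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16");
    if (cd == (iconv_t)-1)
        return -1;

    char out[16];
    char *inp = (char *)bytes;   /* iconv takes char **, input is not modified */
    char *outp = out;
    size_t inleft = 4;           /* first chunk: BOM + 'A' */
    size_t outleft = sizeof out;

    size_t r1 = iconv(cd, &inp, &inleft, &outp, &outleft);

    /* Reset the descriptor to its initial shift state. */
    iconv(cd, NULL, NULL, NULL, NULL);

    inleft = 2;                  /* second chunk: 'B', with no BOM in front */
    size_t r2 = iconv(cd, &inp, &inleft, &outp, &outleft);

    iconv_close(cd);

    if (r1 == (size_t)-1 || r2 == (size_t)-1)
        return 0;
    return (outp - out) == 2 && memcmp(out, "AB", 2) == 0;
}
```

Calling it with the little-endian bytes FF FE 41 00 42 00 and with the big-endian bytes FE FF 00 41 00 42 exercises both cases. Judging by the outputs reported above, the little-endian input would be expected to yield 0 with GNU libiconv 1.17 and 1 with glibc, and the big-endian input 1 with both. When building against GNU libiconv, linking with -liconv may be needed.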

The use case may feel a bit arbitrary, but a program that iteratively converts an input string might want to flag invalid input bytes in the output using ASCII text, e.g. '<XX>' for a byte XX. Such a program might also want to support state-dependent encodings, so it would issue a reset when it encounters an invalid byte, ensuring that the ASCII text in the output is interpreted correctly. But when this reset causes the byte order to be forgotten, the rest of the input is not converted properly.
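
The conversion loop described here could look roughly like the sketch below. It is an illustration only: the function name convert_flagging_invalid is hypothetical, the target encoding is assumed to be ASCII-compatible so that the marker can be written to the output directly, and an incomplete sequence at the end of the input (EINVAL) is flagged byte by byte in the same way as an invalid one.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: convert `in` through `cd` into the NUL-terminated
   buffer `out`, replacing each unconvertible input byte XX with "<XX>".
   Assumes the target encoding is ASCII-compatible.
   Returns 0 on success, -1 if `out` is too small or on another error. */
static int convert_flagging_invalid(iconv_t cd, const char *in, size_t inlen,
                                    char *out, size_t outsize)
{
    if (outsize == 0)
        return -1;

    char *inp = (char *)in;          /* iconv takes char **, input is not modified */
    size_t inleft = inlen;
    char *outp = out;
    size_t outleft = outsize - 1;    /* keep room for the terminating NUL */

    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            break;                   /* all remaining input was converted */
        if (errno != EILSEQ && errno != EINVAL)
            return -1;               /* E2BIG or an unexpected error */

        /* Reset to the initial shift state (emitting any shift sequence)
           so that the ASCII marker is interpreted correctly. This is the
           reset after which libiconv forgets the byte order. */
        if (iconv(cd, NULL, NULL, &outp, &outleft) == (size_t)-1)
            return -1;

        if (outleft < 4)
            return -1;
        char marker[5];
        snprintf(marker, sizeof marker, "<%02X>", (unsigned char)*inp);
        memcpy(outp, marker, 4);
        outp += 4;
        outleft -= 4;

        inp++;                       /* skip the offending byte and go on */
        inleft--;
    }

    /* Final flush, returning the descriptor to its initial state. */
    if (iconv(cd, NULL, NULL, &outp, &outleft) == (size_t)-1)
        return -1;
    *outp = '\0';
    return 0;
}
```

With a descriptor opened as iconv_open("ISO-8859-1", "UTF-8"), the three input bytes 'a', 0xFF, 'z' come out as "a<FF>z". With a UTF-16 input that starts with a little-endian BOM, the text following the first flagged byte is where the reported behaviour would show up.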

Thanks
Tomas

Attachment: reset.c
Description: Text Data

