bug-gnu-libiconv

Re: [bug-gnu-libiconv] iconv issue


From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] iconv issue
Date: Sat, 01 Oct 2016 17:23:45 +0200
User-agent: KMail/4.8.5 (Linux/3.8.0-44-generic; KDE/4.8.5; x86_64; ; )

Hi,

Kenneth Nellis wrote on 2016-06-10:
> $ file f
> f: exported SGML document, UTF-8 Unicode (with BOM) text, with CRLF line 
> terminators
> ...
> Accordingly, it seems strange, perhaps a bug?, that the former of the 
> following two lines doesn't work, but the latter does:
> 
> $ cat f | iconv -f UTF-8 -t Latin1 > x
> iconv: (stdin):1:0: cannot convert
> $ cat f | iconv -f UTF-8 -t UTF-16 | iconv -f UTF-16 -t Latin1 > x
> $

The output of the 'file f' command shows that the contents of f start with a
U+FEFF character. According to RFC 3629 [1] section 6:

  "It is therefore RECOMMENDED to avoid stripping an initial
   U+FEFF interpreted as a signature without a good reason, to ignore it
   instead of stripping it when appropriate (such as for display) and to
   strip it only when really necessary."

It is therefore OK that iconv does not strip away the leading U+FEFF character.

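The second command succeeds because the 'iconv -f UTF-8 -t UTF-16' command
leaves the U+FEFF character in place and the 'iconv -f UTF-16 ...' command
then strips it away. This is because, in UTF-16, a leading U+FEFF is
interpreted as a byte-order mark and not as part of the text. The first
command fails because Latin1 has no representation for U+FEFF.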
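
For illustration only, here is a minimal sketch of the same effect using
Python's built-in codecs rather than iconv. The sample bytes are a made-up
stand-in for the file f, and Python's codecs are only an analogy: they are
not guaranteed to match any iconv implementation in every detail.

```python
import codecs

# Hypothetical stand-in for f: UTF-8 BOM, then ASCII text with a CRLF.
data = codecs.BOM_UTF8 + b"<doc>text</doc>\r\n"

# The plain "utf-8" decoder keeps the leading U+FEFF as a character.
text = data.decode("utf-8")
assert text[0] == "\ufeff"

# Latin-1 cannot represent U+FEFF, so the direct conversion fails.
try:
    text.encode("latin-1")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False

# Detour through UTF-16: encoded as little-endian, the U+FEFF becomes
# the bytes FF FE, which the "utf-16" decoder consumes as a BOM.
utf16 = text.encode("utf-16-le")
via_utf16 = utf16.decode("utf-16").encode("latin-1")

# Alternative: the "utf-8-sig" decoder drops a leading BOM explicitly.
stripped = data.decode("utf-8-sig").encode("latin-1")
```

Here the detour works only because the UTF-16 decoder treats the leading
U+FEFF as a signature. If the BOM really has to go, stripping it explicitly
(as "utf-8-sig" does in this sketch) states the intent more clearly.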

Yes, I know such BOMs frequently occur in XML files written by Windows tools,
because some Windows developers have/had the mindset that a BOM was a good
thing, when in fact it is a bad thing (in the case of UTF-8).

Bruno

[1] https://tools.ietf.org/html/rfc3629
