
Re: [bug-gnu-libiconv] iconv not catching bad bytes for ISO-8859-1


From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] iconv not catching bad bytes for ISO-8859-1
Date: Fri, 14 Aug 2015 12:28:14 +0200
User-agent: KMail/4.8.5 (Linux/3.8.0-44-generic; KDE/4.8.5; x86_64; ; )

Hi,

Kenneth Reid Beesley wrote on 13.08.2015:
> Problem:  iconv not catching/detecting bad bytes when converting from a file 
> alleged to be ISO-8859-1 (but it’s not)
> 
> Dear All,
> 
> I’m using iconv (GNU libiconv 1.14), written by Bruno Haible, in a SUSE Linux 
> system.
> Also iconv (GNU libiconv 1.11) on a separate machine (OS X 10.10.4).
> 
> 1.  I create a file, input1252.txt, that contains hex byte values x91 and 
> x92.  This file is encoded in CP1252,
> where x91 and x92 are legal/defined bytes.
> 
> These two bytes are not defined in ISO-8859-1 
> 
> 2.  I run the following script
> 
> iconv -f ISO-8859-1 -t UTF-8 --byte-subst="<PROBLEM: 0x%x>"
> --unicode-subst="<PROBLEM: U+%04X>" input1252.txt > out.txt
> 
> i.e. telling iconv (incorrectly) that the input file is Latin 1, and telling 
> it to convert it
> to UTF-8.  I expect the x91 and x92 bytes to be recognized as 
> not-legal-in-Latin1,
> and I expect to see <PROBLEM: 0x91> and <PROBLEM: 0x92> in the out.txt file.

Your expectation is ill-founded. ISO-8859-1 has no unassigned code points.
That is, all 256 byte values are valid.

Witness:

1) Wikipedia https://en.wikipedia.org/wiki/ISO/IEC_8859-1 says
   "In 1992, the IANA registered the character map ISO_8859-1:1987, more
    commonly known by its preferred MIME name of ISO-8859-1 (note the extra
    hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on the Internet.
    This map assigns the C0 and C1 control characters to the unassigned code
    values thus provides for 256 characters via every possible 8-bit value."

2) When you go to
   http://www.haible.de/bruno/charsets/conversion-tables/index.html
   -> ISO-8859-* -> ISO-8859-1, you can see that all charset converters
   from different vendors implement the ISO-8859-1 <--> Unicode conversion
   in the same way.
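
The same mapping can be checked locally. The following is a minimal
sketch that uses Python's built-in "iso-8859-1" and "cp1252" codecs as a
stand-in for iconv (an assumption made for illustration; it is not libiconv
itself, and the variable names are made up). It shows that every byte value
decodes under ISO-8859-1, and that 0x91 and 0x92 only become quotation marks
when the input is declared as CP1252.

```python
# Decode every possible byte value as ISO-8859-1; no byte is rejected.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")

# Byte value N maps to Unicode code point U+00NN, for all N in 0..255.
code_points = [ord(ch) for ch in decoded]
assert code_points == list(range(256))

# The two bytes from the report, interpreted under both encodings.
sample = b"\x91\x92"
as_latin1 = sample.decode("iso-8859-1")  # C1 controls U+0091, U+0092
as_cp1252 = sample.decode("cp1252")      # quotation marks U+2018, U+2019

# UTF-8 output of the ISO-8859-1 interpretation.
latin1_utf8 = as_latin1.encode("utf-8")

print([f"U+{ord(ch):04X}" for ch in as_latin1])
print([f"U+{ord(ch):04X}" for ch in as_cp1252])
print(latin1_utf8.hex(" "))
```

Under these assumptions the conversion succeeds silently, which is why a
substitution option for invalid input never has anything to report here.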

Probably you know that the byte values 0x7F..0x9F in ISO-8859-1 don't
correspond to *graphic* characters in ISO-8859-1 (while some of them
correspond to graphic characters in Windows-1252). But iconv is the
wrong tool to make a distinction between graphic and non-graphic characters.
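
If the goal is to flag such bytes before conversion, a separate check is
needed. The following is a minimal sketch of one possible approach, not a
feature of iconv: the function name find_non_graphic_bytes and the sample
data are hypothetical, and the range 0x7F..0x9F is the one mentioned above.

```python
from typing import List, Tuple


def find_non_graphic_bytes(data: bytes) -> List[Tuple[int, int]]:
    """Return (offset, byte value) pairs for bytes in the range 0x7F..0x9F.

    These byte values are valid in ISO-8859-1 but do not correspond to
    graphic characters there, so their presence suggests that the data
    may really be in another encoding such as Windows-1252.
    """
    return [
        (offset, value)
        for offset, value in enumerate(data)
        if 0x7F <= value <= 0x9F
    ]


# Hypothetical sample: CP1252 quotation marks in otherwise ASCII text.
sample_data = b"It\x92s \x91quoted\x92"
hits = find_non_graphic_bytes(sample_data)
for offset, value in hits:
    print(f"offset {offset}: 0x{value:02x}")
```

A check like this only reports suspicious bytes; deciding which encoding the
file actually uses remains a judgment call.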

Bruno



