
Re: [bug-gnu-libiconv] Combining Diacritical Marks


From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] Combining Diacritical Marks
Date: Fri, 14 Oct 2016 04:06:29 +0200
User-agent: KMail/4.8.5 (Linux/3.8.0-44-generic; KDE/4.8.5; x86_64; ; )

Hi,

Benjamin Weber wrote:
> The problem is for characters where there is a decomposed and precomposed
> form. iconv does not consider them equivalent.

This is because iconv is not the right tool for doing Unicode NFC <--> NFD
conversions. The right tool for such conversions is explained in
http://unix.stackexchange.com/questions/90100/convert-between-unicode-normalization-forms-on-the-unix-command-line
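
As a minimal sketch of what an NFC <--> NFD conversion does (this is not
taken from the linked answer; it assumes Python 3 and its standard
unicodedata module, and the helper names to_nfc and to_nfd are made up for
illustration):

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Compose base characters and combining marks where Unicode allows it."""
    return unicodedata.normalize("NFC", text)


def to_nfd(text: str) -> str:
    """Decompose precomposed characters into base character plus marks."""
    return unicodedata.normalize("NFD", text)


decomposed = "o\u0308"   # 'o' followed by U+0308 COMBINING DIAERESIS
precomposed = "\u00f6"   # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS

# The two strings are canonically equivalent but not equal code point
# by code point; normalization maps one form onto the other.
print(decomposed == precomposed)          # False
print(to_nfc(decomposed) == precomposed)  # True
print(to_nfd(precomposed) == decomposed)  # True
```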

> This relates to http://stackoverflow.com/questions/9892897
> It seems that a bug report has already been filed.
> However, I could neither locate it on savannah nor on
> http://lists.gnu.org/archive/html/bug-gnu-libiconv/.

No such bug report has been filed. I'm answering on the mailing list.

>   printf '\x6F\xCC\x88\n' | iconv -f UTF-8 -t LATIN1
>   iconv: (stdin):1:1: cannot convert

This is as expected. iconv assumes that Unicode input is in NFC form.
Quoting Unicode Standard Annex #15
http://www.unicode.org/reports/tr15/tr15-44.html :

  "The W3C Character Model for the World Wide Web 1.0: Normalization
   [CharNorm] and other W3C Specifications (such as XML 1.0 5th Edition)
   recommend using Normalization Form C for all content, because this form
   avoids potential interoperability problems arising from the use of
   canonically equivalent, yet different, character sequences in document
   formats on the Web. See the W3C Character Model for the World Wide Web:
   String Matching and Searching [CharMatch] for more background."

In other words, NFC is the industry-wide standard for user-visible Unicode
strings/texts.
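
To illustrate the point, here is a hedged sketch of normalizing to NFC
before converting to Latin-1. It uses Python 3's codecs rather than iconv
itself, so it only mirrors the behaviour described above; the function name
utf8_to_latin1 is hypothetical:

```python
import unicodedata


def utf8_to_latin1(data: bytes) -> bytes:
    """Decode UTF-8, compose to NFC, then encode as ISO-8859-1 (Latin-1)."""
    text = data.decode("utf-8")
    composed = unicodedata.normalize("NFC", text)
    return composed.encode("latin-1")


# Same bytes as in the report: 'o' + U+0308 COMBINING DIAERESIS + newline.
raw = b"\x6f\xcc\x88\n"

# Without normalization, U+0308 has no Latin-1 equivalent, so the
# conversion fails, much like the "cannot convert" error from iconv.
try:
    raw.decode("utf-8").encode("latin-1")
except UnicodeEncodeError:
    print("decomposed input cannot be converted directly")

# After NFC normalization the sequence becomes U+00F6, which Latin-1
# represents as the single byte 0xF6.
print(utf8_to_latin1(raw))  # b'\xf6\n'
```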

> Whereas
> 
>   printf '\x6F\xCC\x88\n' | iconv -f utf8-mac -t LATIN1
> 
> works.

On Mac OS X, Apple has added a UTF-8-MAC encoding to iconv, which probably
implements the same conversion as Mac OS X does for file names in its
file system layer.

> Collected remarks (that are not intended to belittle iconv or the people
> behind it) from users helping to trace the problem:
> "Whoever came up with combining suffixes is evil and needs to be
> trout-slapped repeatedly."

Whoever made these comments has not understood the principles of Unicode.
To remedy this, have them read chapters 1 and 2 of the Unicode Standard
http://www.unicode.org/versions/Unicode9.0.0/ .

> Make Iconv Great Again

No USA politics on this mailing list, please.

Bruno



