From: Bruno Haible
Subject: [bug-gnu-libiconv] Re: Question regarding libiconv 1.13 and Hebrew (cp1255 -> utf8) translation
Date: Sun, 10 Apr 2011 13:55:53 +0200
User-agent: KMail/1.9.9

[CCing bug-gnu-libiconv]

Hello Ron,

> Attached is a ZIP with three files which illustrate the problem.
> 
> The source "heb1.utf" is converted to heb1.cp1255:
> 
>     iconv -f utf-8  -t cp1255  heb1.utf > heb1.cp1255
> 
> and converted back to UTF8:
> 
>     iconv -f cp1255 -t utf-8 heb1.cp1255 > heb2.utf
> 
> 
> Note that the character sequence:  05d9 05bc 05b9  is re-converted to
> fb39 05b9

Yes, the sequence of characters
  U+05D9 HEBREW LETTER YOD
  U+05BC HEBREW POINT DAGESH
is canonically equivalent to
  U+FB39 HEBREW LETTER YOD WITH DAGESH
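
The equivalence can be checked with a short sketch. It uses Python's standard
"unicodedata" module rather than iconv, and the helper name
canonically_equivalent is made up for this illustration. Two strings are
canonically equivalent when their NFD forms are identical:

```python
import unicodedata

decomposed = "\u05d9\u05bc"   # U+05D9 YOD followed by U+05BC DAGESH
precomposed = "\ufb39"        # U+FB39 YOD WITH DAGESH


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings by their NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The code points differ, so a plain comparison fails ...
print(decomposed == precomposed)
# ... but the two sequences are canonically equivalent.
print(canonically_equivalent(decomposed, precomposed))
```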

For an explanation of "canonical equivalence" and the normalization forms NFC
and NFD, see Unicode UAX #15 <http://www.unicode.org/reports/tr15/>.

In particular:
  "The W3C Character Model for the World Wide Web, Part II: Normalization
   [CharNorm] and other W3C Specifications (such as XML 1.0 5th Edition)
   recommend using Normalization Form C for all content, because this form
   avoids potential interoperability problems arising from the use of
   canonically equivalent, yet different, character sequences in document
   formats on the Web. See the W3C Requirements for String Identity,
   Matching, and String Indexing [CharReq] for more background."
 [CharNorm] = http://www.w3.org/TR/charmod-norm/

Normalization form NFC is recommended everywhere, and 'iconv' (both from
GNU libiconv and from GNU libc) produces this normalization form.

> I am using "vim" to edit Hebrew texts, and have been bothered for a
> while with a specific problem.
> 
> The problem is that some sequences map to Unicode composited characters,
> which makes editing (specifically searching!) more difficult than it
> should be.

Searching for substrings, and meeting user expectations while doing that,
has indeed become more complex than it was before Unicode; see Unicode TR #10
<http://www.unicode.org/reports/tr10/>. It is the duty of the
programs ("vim" in this case) to meet these user expectations.

GNU libunistring <http://www.gnu.org/software/libunistring/> may help
the vim implementors in doing this.
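
As a rough illustration of what such a program could do, here is a minimal
sketch in Python (not libunistring's C API, and not what vim does). It
normalizes both the text and the search string to NFD before matching. The
function name contains_canonical is hypothetical, and a real implementation
following TR #10 would also have to deal with the ordering of combining marks
and with match boundaries:

```python
import unicodedata


def contains_canonical(haystack: str, needle: str) -> bool:
    """Hypothetical helper: substring test on the NFD forms of both strings."""
    nfd_haystack = unicodedata.normalize("NFD", haystack)
    nfd_needle = unicodedata.normalize("NFD", needle)
    return nfd_needle in nfd_haystack


text = "\ufb39\u05b9"   # YOD WITH DAGESH followed by HOLAM
yod = "\u05d9"          # plain YOD

# A raw search does not find the YOD inside the precomposed character ...
print(yod in text)
# ... while the search on normalized strings does.
print(contains_canonical(text, yod))
```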

> While it may be correct, it really makes editing very difficult.  Is
> there a way to change this behavior of iconv?

You can, of course, convert your files from NFC to NFD before editing, and
convert them back from NFD to NFC after editing. A ready-made program for
doing so is 'uconv', part of ICU: "uconv -f utf8 -t utf8 -x nfd" converts
to NFD, and "uconv -f utf8 -t utf8 -x nfc" converts back to NFC.
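
If 'uconv' is not at hand, the same two steps can be sketched in Python with
the standard "unicodedata" module. This is only an illustration under the
assumption that the data is valid UTF-8; the function names utf8_to_nfd and
utf8_to_nfc are made up here:

```python
import unicodedata


def utf8_to_nfd(data: bytes) -> bytes:
    """Hypothetical helper: decode UTF-8, normalize to NFD, encode again."""
    return unicodedata.normalize("NFD", data.decode("utf-8")).encode("utf-8")


def utf8_to_nfc(data: bytes) -> bytes:
    """Hypothetical helper: decode UTF-8, normalize to NFC, encode again."""
    return unicodedata.normalize("NFC", data.decode("utf-8")).encode("utf-8")


# Before editing: U+FB39 is split into YOD followed by DAGESH.
before_editing = utf8_to_nfd("\ufb39".encode("utf-8"))
print(before_editing.decode("utf-8") == "\u05d9\u05bc")
```

To convert a whole file, read it in binary mode, pass the bytes through one of
the functions, and write the result back.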

Bruno
-- 
In memoriam Hendrik Nicolaas Werkman 
<http://en.wikipedia.org/wiki/Hendrik_Nicolaas_Werkman>


