From: Bruno Haible
Subject: [bug-gnu-libiconv] Re: Question regarding libiconv 1.13 and Hebrew (cp1255 -> utf8) translation
Date: Sun, 10 Apr 2011 15:01:03 +0200
User-agent: KMail/1.9.9

Hello Ron,

> if there isn't some way to
> inform libiconv that I do *not* wish to do the canonicalization?  From
> peering at the source code it appears that the answer is, currently, "no".
> ...
> I would be extremely grateful were you to add another
> modifier "//NFD" or something like that, so those of us who want simple
> translation will get it.

The answer is indeed "no", because
  - We should maximize the use of NFC and minimize the use of NFD texts,
    according to the recommendation of the W3C.
  - People who need NFD can convert iconv's output to NFD with a single
    command: "uconv -f utf8 -t utf8 -x nfd".
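For readers without ICU's uconv at hand, the same post-processing step can
be sketched in Python with the standard unicodedata module. This is only an
illustration and not part of libiconv; the helper name to_nfd is made up
for this example.

```python
import unicodedata


def to_nfd(text: str) -> str:
    """Return text in Unicode Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)


# U+FB2A HEBREW LETTER SHIN WITH SHIN DOT decomposes canonically into
# U+05E9 (shin) followed by U+05C1 (point shin dot).
precomposed = "\ufb2a"
decomposed = to_nfd(precomposed)
print([f"U+{ord(c):04X}" for c in decomposed])
```

Applied to text that iconv has already converted to UTF-8, this yields the
decomposed sequences the original poster asked for.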

> It seems somewhat unreasonable to expect that a Unicode-aware editor
> must also know about all the canonical forms and either create
> equivalence tables for the purpose of searching, or convert to NFD and
> manage that.

I disagree. An editor that is Unicode aware must do these kinds of things.
But fortunately there are libraries that help in doing that.
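A minimal sketch of what "these kinds of things" can look like for plain
substring search, assuming Python's standard unicodedata module rather than
libunistring: both strings are brought to the same normalization form before
they are compared. The function name canonical_find is hypothetical.

```python
import unicodedata


def canonical_find(haystack: str, needle: str) -> bool:
    """Report whether needle occurs in haystack, ignoring NFC/NFD differences."""
    # Normalizing both sides to the same form makes canonically
    # equivalent sequences compare equal code point by code point.
    h = unicodedata.normalize("NFD", haystack)
    n = unicodedata.normalize("NFD", needle)
    return n in h


# Precomposed text is found with a decomposed pattern, and vice versa.
print(canonical_find("caf\u00e9 au lait", "cafe\u0301"))
print(canonical_find("\u05e9\u05c1", "\ufb2a"))
```

This sketch deliberately ignores combining-sequence boundaries: a bare "e"
would match inside a decomposed "é". A real editor would need to decide
whether that is the desired behavior.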

> Further, it seems wrong that a round-trip from UTF-8 -> 
> CP1255 -> UTF-8 will produce files which are not byte-identical (and
> that there is no option to allow it).

That's simply a consequence of the existence of precomposed characters in
Unicode, and occurs also with ISO-8859-1. It may seem "wrong" to you, but
it was one of the design choices of Unicode (admittedly for historical and
backward compatibility reasons). That's life.
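The ISO-8859-1 case can be illustrated with a short Python sketch. It models
a converter that composes the text before encoding, using
unicodedata.normalize as a stand-in; it is not a description of libiconv's
internals.

```python
import unicodedata

original = "e\u0301"                       # 'e' + COMBINING ACUTE ACCENT (NFD)
original_bytes = original.encode("utf-8")  # two code points, three bytes

# ISO-8859-1 has no combining acute accent, so the text must be
# composed to U+00E9 before it can be encoded (modeled here with NFC).
latin1_bytes = unicodedata.normalize("NFC", original).encode("iso-8859-1")

# Decoding maps the single byte 0xE9 back to the precomposed U+00E9.
round_tripped_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")

print(original_bytes, latin1_bytes, round_tripped_bytes)
print(original_bytes == round_tripped_bytes)
```

The text before and after the round trip is canonically equivalent, but the
UTF-8 byte sequences differ because only the precomposed character exists in
the 8-bit encoding.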

> The specific problem I am encountering is the files I am dealing with
> are encoded in CP1255, and vim simply tells iconv() to do the
> conversion... I then end up with the normalized UTF-8 which, though it
> is correct, is difficult to deal with.

It is correct for vim to call iconv(). The user expectations for a Hebrew
text are the same for a CP1255 encoded file as for a UTF-8 (NFC) encoded
file, therefore it is a good implementation strategy to map one onto the other
and work with UTF-8 internally.

The "difficult to deal with" part needs to be handled in the editor.

> I will ask Bram Moolenaar (author of Vim) again about making the regexp
> engine be able to handle these sorts of sequences.  Last time I asked (a
> number of years ago), he was not receptive to modifying the regexp
> engine.  Perhaps your suggestion of 'libunistring' may be acceptable.

I hope that GNU libunistring may help Bram Moolenaar to improve vim's
regexp engine.

There are plans to add a regexp engine to libunistring, but these plans are
not yet very concrete.

Bruno
-- 
In memoriam Hendrik Nicolaas Werkman 
<http://en.wikipedia.org/wiki/Hendrik_Nicolaas_Werkman>


