[bug-gnu-libiconv] Re: Question regarding libiconv 1.13 and Hebrew (cp1255 -> utf8) translation
Sun, 10 Apr 2011 15:01:03 +0200
> if there isn't some way to
> inform libiconv that I do *not* wish to do the canonicalization? From
> peering at the source code it appears that the answer is, currently, "no".
> I would be extremely grateful were you to add another
> modifier "//NFD" or something like that, so those of us who want simple
> translation will get it.
The answer is indeed "no", because
- We should maximize the use of NFC and minimize the use of NFD texts,
according to the recommendation of the W3C.
- People who need NFD can convert iconv's output to NFD with a single
command: "uconv -f utf8 -t utf8 -x nfd".
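
For readers without ICU's uconv at hand, the same post-processing step can be sketched in Python with the standard unicodedata module. This is only an illustration, not part of libiconv; the function name utf8_to_nfd and the sample character are assumptions chosen for the example.

```python
import unicodedata


def utf8_to_nfd(data: bytes) -> bytes:
    """Decode UTF-8, apply canonical decomposition (NFD), re-encode as UTF-8."""
    text = data.decode("utf-8")
    return unicodedata.normalize("NFD", text).encode("utf-8")


# Illustrative input: U+FB2A (HEBREW LETTER SHIN WITH SHIN DOT), a precomposed
# character whose canonical decomposition is U+05E9 U+05C1.
composed = "\uFB2A".encode("utf-8")
decomposed = utf8_to_nfd(composed)
```

Piping iconv's output through such a filter gives roughly the effect of the uconv command above, at the cost of holding the whole text in memory.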
> It seems somewhat unreasonable to expect that a Unicode-aware editor
> must also know about all the canonical forms and either create
> equivalence tables for the purpose of searching, or convert to NFD and
> manage that.
I disagree. An editor that is Unicode aware must do these kinds of things.
But fortunately there are libraries that help in doing that.
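
As a rough sketch of what "these kinds of things" can look like in practice, a search routine can bring both the text and the pattern to the same normalization form before comparing. The helper name canonical_find is hypothetical, and Python's unicodedata stands in for a library such as libunistring.

```python
import unicodedata


def canonical_find(haystack: str, needle: str) -> bool:
    """Return True if needle occurs in haystack up to canonical equivalence."""
    # Normalize both sides to NFD so that precomposed and decomposed
    # spellings of the same text compare equal.
    nfd_haystack = unicodedata.normalize("NFD", haystack)
    nfd_needle = unicodedata.normalize("NFD", needle)
    return nfd_needle in nfd_haystack


# Precomposed text, decomposed pattern: still a match.
found = canonical_find("caf\u00e9 au lait", "cafe\u0301")
```

This sketch only answers yes or no. A real editor would also need to map match offsets back to the original buffer and decide whether a base letter alone should match a letter followed by combining marks.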
> Further, it seems wrong that a round-trip from UTF-8 ->
> CP1255 -> UTF-8 will produce files which are not byte-identical (and
> that there is no option to allow it).
That's simply a consequence of the existence of precomposed characters in
Unicode, and occurs also with ISO-8859-1. It may seem "wrong" to you, but
it was one of the design choices of Unicode (admittedly for historical and
backward compatibility reasons). That's life.
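
The ISO-8859-1 case can be illustrated with a small Python sketch. It assumes a converter that composes before encoding; the explicit NFC step below is a stand-in for that behavior, since Python's own iso-8859-1 codec does not compose.

```python
import unicodedata

# Decomposed input: "e" followed by U+0301 COMBINING ACUTE ACCENT.
original = "e\u0301".encode("utf-8")

# Stand-in for a composing converter: compose, then encode to ISO-8859-1,
# where the precomposed letter is the single byte 0xE9.
latin1 = unicodedata.normalize("NFC", original.decode("utf-8")).encode("iso-8859-1")

# Converting back yields the precomposed U+00E9, not the original sequence.
round_trip = latin1.decode("iso-8859-1").encode("utf-8")
```

The two UTF-8 byte strings differ, yet they are canonically equivalent: normalizing both to the same form makes them identical again.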
> The specific problem I am encountering is the files I am dealing with
> are encoded in CP1255, and vim simply tells iconv() to do the
> conversion... I then end up with the normalized UTF-8 which, though it
> is correct, is difficult to deal with.
It is correct for vim to call iconv(). The user expectations for a Hebrew
text are the same for a CP1255 encoded file as for a UTF-8 (NFC) encoded
file, therefore it is a good implementation strategy to map one onto the other
and work with UTF-8 internally.
The "difficult to deal with" part needs to be handled in the editor.
> I will ask Bram Moolenaar (author of Vim) again about making the regexp
> engine be able to handle these sorts of sequences. Last time I asked (a
> number of years ago), he was not receptive to modifying the regexp
> engine. Perhaps your suggestion of 'libunistring' may be acceptable.
I hope that GNU libunistring may help Bram Moolenaar to improve vim's
There are plans to add a regexp engine to libunistring, but these plans are
not very concrete up to now.
In memoriam Hendrik Nicolaas Werkman