[bug-gnu-libiconv] Re: Question regarding libiconv 1.13 and Hebrew (cp1255 -> utf8) translation
Sun, 10 Apr 2011 15:11:25 +0300
Thank you, Bruno, for your response. My comments are below:
> For explanation of "canonical equivalence" and the normalization forms NFC and
> NFD, see Unicode UAX #15 <http://www.unicode.org/reports/tr15/>.
I do understand that; what I am wondering is whether there is some way to
tell libiconv that I do *not* want it to do the canonicalization. From
reading the source code, it appears that the answer is currently "no".
> Searching for substrings, and meeting user expectations while doing that,
> has indeed become more complex than before Unicode, see Unicode TR #10
> <http://www.unicode.org/reports/tr10/>, and it is the duty of the
> programs ("vim" in this case) to meet these user expectations.
It seems somewhat unreasonable to expect that a Unicode-aware editor
must also know about all the canonical forms and either create
equivalence tables for the purpose of searching, or convert to NFD and
manage that. Further, it seems wrong that a round-trip from UTF-8 ->
CP1255 -> UTF-8 will produce files which are not byte-identical (and
that there is no option to request a byte-identical result).
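
To make the byte-identity point concrete, here is a minimal Python sketch
(not libiconv itself). It uses one letter, shin with shin dot, written two
canonically equivalent ways. The helper name canonically_equal is made up
for illustration; it is the kind of comparison a search routine would need
if it had to treat both spellings as the same text.

```python
import unicodedata

# Shin with shin dot, written two canonically equivalent ways.
DECOMPOSED = "\u05e9\u05c1"   # HEBREW LETTER SHIN + HEBREW POINT SHIN DOT
PRECOMPOSED = "\ufb2a"        # HEBREW LETTER SHIN WITH SHIN DOT


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    # The UTF-8 byte sequences differ, so a byte-wise search or diff sees
    # two different texts.
    print(DECOMPOSED.encode("utf-8").hex())
    print(PRECOMPOSED.encode("utf-8").hex())
    print(DECOMPOSED == PRECOMPOSED)
    # After normalization the two spellings compare equal.
    print(canonically_equal(DECOMPOSED, PRECOMPOSED))
```

A plain string comparison or a regexp that works on code points sees two
different strings here; only the normalized comparison treats them as equal.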
> You can, of course, convert your files from NFC to NFD before editing, and
> convert them back from NFD to NFC after editing. A ready-made program for
> doing so is 'uconv', part of ICU. "uconv -f utf8 -t utf8 -x nfc" and
> "uconv -f utf8 -t utf8 -x nfd".
The specific problem I am encountering is that the files I am dealing with
are encoded in CP1255, and vim simply tells iconv() to do the
conversion. I then end up with normalized UTF-8 which, though
correct, is difficult to work with.
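
As an illustration of the "simple translation" I have in mind, here is a
rough Python sketch. It assumes that Python's built-in "cp1255" codec is a
plain byte-to-code-point table with no composition step, which is my
understanding of it and not a statement about libiconv. The names
decode_plain, round_trips and SAMPLE are made up for the example.

```python
# Two CP1255 bytes: 0xF9 (shin) followed by 0xD1 (shin dot).
SAMPLE = b"\xf9\xd1"


def decode_plain(data: bytes) -> str:
    """Decode CP1255 one byte at a time, without combining characters."""
    return data.decode("cp1255")


def round_trips(data: bytes) -> bool:
    """Check that CP1255 -> Unicode -> CP1255 gives back the same bytes."""
    return decode_plain(data).encode("cp1255") == data


if __name__ == "__main__":
    text = decode_plain(SAMPLE)
    # Each input byte becomes exactly one code point.
    print([f"U+{ord(ch):04X}" for ch in text])
    print(round_trips(SAMPLE))
```

With a table-driven conversion like this, the letter and its point stay as
two separate code points, and converting back yields the original bytes.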
I will ask Bram Moolenaar (author of Vim) again about making the regexp
engine be able to handle these sorts of sequences. Last time I asked (a
number of years ago), he was not receptive to modifying the regexp
engine. Perhaps your suggestion of 'libunistring' may be acceptable.
Nevertheless, I would be extremely grateful were you to add another
modifier "//NFD" or something like that, so those of us who want simple
translation will get it.
Thank you for your time,