bug-coreutils

Re: horrible utf-8 performace in wc


From: Pádraig Brady
Subject: Re: horrible utf-8 performace in wc
Date: Wed, 7 May 2008 15:05:03 +0100
User-agent: Thunderbird 2.0.0.6 (X11/20071008)

Bo Borgerson wrote:
> Pádraig Brady wrote:
>> canonically équivalent
>> canonically équivalent
>>
>> Pádraig.
>>
>> p.s. I notice that gnome-terminal still doesn't handle
>> combining characters correctly, and my mail client thunderbird
>> is putting the accent on the q rather than the e, sigh.
> 
> They both render correctly here (Thunderbird 2.0.0.12).

Ha, the viewer is actually OK, but the composer combines the accent
with the succeeding character (Thunderbird 2.0.0.6).

> Is there a good library for combining-character canonicalization
> available?  That seems like something that would be useful to have in a
> lot of text-processing tools.  Also, for Unicode, something to shuffle
> between the normalization forms might be helpful for comparisons.

Yes, that will be needed when we get coreutils to fully support multibyte characters.
ICU has support for this, and Jim Meyering mentioned that there may
be support for this already in gnulib, but I haven't had a chance
to check it out yet.
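
To make the normalization point concrete, here is a minimal sketch in
Python rather than C, using the standard unicodedata module. It is only
an illustration of comparing after normalization, not what ICU or gnulib
would look like; the helper name canonically_equivalent is made up for
this example.

```python
import unicodedata

# Two spellings of the same word: precomposed U+00E9 versus
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
precomposed = "\u00e9quivalent"
decomposed = "e\u0301quivalent"


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    # The raw strings differ, but they match once normalized.
    print(precomposed == decomposed)
    print(canonically_equivalent(precomposed, decomposed))
    # Code point counts differ between the composed and decomposed forms.
    print(len(unicodedata.normalize("NFC", decomposed)),
          len(unicodedata.normalize("NFD", precomposed)))
```

The two strings hold 10 and 11 code points respectively, which is why a
character count can differ for text that renders identically.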

> I may be misinterpreting your patch, but it seems to me that
> decrementing count for zero-width characters could potentially lead to
> confusion.  Not all zero-width characters are combining characters, right?

Yes, you're probably right. I must write a little program to check for exceptions.
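
Such a check could be sketched roughly as follows. This is Python, not
the C that wc would use, and it makes two assumptions: the Python
standard library has no wcwidth(), so "zero width" is approximated by
the Unicode character name containing "ZERO WIDTH", and "combining" is
taken to mean general category Mn or Me, or a nonzero canonical
combining class. Both function names are invented for the example.

```python
import sys
import unicodedata


def is_combining(ch: str) -> bool:
    """True for nonspacing/enclosing marks or a nonzero combining class."""
    return (unicodedata.category(ch) in ("Mn", "Me")
            or unicodedata.combining(ch) != 0)


def zero_width_non_combining() -> list:
    """Code points named "ZERO WIDTH ..." that are not combining marks.

    Assumption: the character name is used as a stand-in for a real
    wcwidth() == 0 test, so this list is not exhaustive.
    """
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if "ZERO WIDTH" in unicodedata.name(ch, "") and not is_combining(ch):
            found.append(ch)
    return found


if __name__ == "__main__":
    for ch in zero_width_non_combining():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
              f"(category {unicodedata.category(ch)})")
```

Anything this prints is a candidate exception: a character that takes
no columns but does not attach to a preceding base character, such as
ZERO WIDTH SPACE (U+200B).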

Pádraig.



