coreutils

Re: multibyte processing - handling invalid sequences (long)


From: Pádraig Brady
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Fri, 22 Jul 2016 12:48:21 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0

On 22/07/16 04:23, Assaf Gordon wrote:
> Hello,
> 
>> On Jul 21, 2016, at 06:08, Pádraig Brady <address@hidden> wrote:
>> [...]
>> It seems like --normalization={NFKC,NFKD,NFC,NFD} functionality would
>> also be quite cohesive in such a util.
> 
> Attached an improved version with unicode normalization support.

Wow, very nice.

> Before continuing with other stuff (e.g. more tests, documentation, news, etc.),
> it's worth discussing if this is the path to take (or if we want to add this
> to each individual utility).

I'm not sure, but as I said, it would be nice if we could get away with
"replace" mode in the other utils. A separate util follows the idea of
validating/transforming input as early as possible, so as to simplify the
rest of the system. It also follows the idea that if something can be
done separately, it should be.

> Also, do we keep these options or modify them?
> e.g. 'uconv' uses different terminology for handling invalid sequences:
> stop, skip, substitute, escape (corresponding to abort, discard, replace,
> recode below).

Doesn't really matter.
I find your naming slightly more descriptive.
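
To make the four policies concrete, here is a rough sketch that is not
taken from the attached patch. It assumes Python's built-in codec error
handlers are close enough stand-ins for abort, discard, replace and
recode; the mapping and the decode_with_policy() helper are hypothetical
names used only for illustration.

```python
# Hypothetical mapping from the policy names in this thread to Python's
# codec error handlers. This is an illustration, not the utility's code.
POLICY_TO_HANDLER = {
    "abort": "strict",             # stop at the first invalid sequence
    "discard": "ignore",           # silently drop invalid bytes
    "replace": "replace",          # substitute U+FFFD
    "recode": "backslashreplace",  # emit an escape such as \xff
}


def decode_with_policy(data: bytes, policy: str) -> str:
    """Decode UTF-8 input, handling invalid bytes per the named policy."""
    return data.decode("utf-8", errors=POLICY_TO_HANDLER[policy])


# 0xFF can never appear in well-formed UTF-8.
sample = b"ab\xffcd"

for policy in ("discard", "replace", "recode"):
    print(policy, repr(decode_with_policy(sample, policy)))

try:
    decode_with_policy(sample, "abort")
except UnicodeDecodeError as err:
    print("abort", err.reason)
```

With this input, "discard" drops the bad byte, "replace" substitutes
U+FFFD, "recode" emits a \xff escape, and "abort" raises
UnicodeDecodeError.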

> To keep the implementation simple, unicode normalization requires UTF-8 
> locales - is this a valid requirement?

Given how prevalent UTF-8 is, I think this is fine.
In other tools, if there is an option, we should also tune for UTF-8 input.
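
For reference, a minimal sketch of what the four normalization forms do
to UTF-8 input, assuming Python's unicodedata module rather than the
code in the patch. The normalize_utf8() helper is a hypothetical name;
it simply rejects invalid input, which corresponds to the "abort"
policy above.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_utf8(data: bytes, form: str) -> bytes:
    """Decode UTF-8 bytes, apply one normalization form, and re-encode."""
    if form not in FORMS:
        raise ValueError(f"unknown normalization form: {form}")
    text = data.decode("utf-8")  # raises UnicodeDecodeError on invalid input
    return unicodedata.normalize(form, text).encode("utf-8")


composed = "\u00e9".encode("utf-8")     # U+00E9, e with acute (precomposed)
decomposed = "e\u0301".encode("utf-8")  # e + U+0301 combining acute accent
ligature = "\ufb01".encode("utf-8")     # U+FB01, the "fi" ligature

print(normalize_utf8(decomposed, "NFC") == composed)
print(normalize_utf8(composed, "NFD") == decomposed)
print(normalize_utf8(ligature, "NFKC"))
```

NFC and NFD only compose or decompose canonically equivalent sequences,
so the ligature survives them unchanged; the compatibility forms NFKC
and NFKD fold it to the two plain letters "fi".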

> And of course, what about the name?

I've a slight preference for unorm.

thanks!
Pádraig


