Re: multibyte processing - handling invalid sequences (long)


From: Pádraig Brady
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Sat, 23 Jul 2016 21:30:07 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0

On 23/07/16 19:05, Assaf Gordon wrote:
> 
>> On Jul 23, 2016, at 06:51, Pádraig Brady <address@hidden> wrote:
>> I was wondering about the tool being line/record oriented.
>>
>> Disadvantages are:
>>  requires arbitrary large buffers for arbitrary long lines
>>  relatively slow in the presence of short/normal lines
>>  sensitive to the current stdio buffering mode
>>  requires -z option to support NUL termination
>>
>> Processing a block at a time instead avoids such issues.
>> UTF-8 at least is self synchronising, so after reading a block
>> you just have to look at the last 3 bytes to know
>> how many to append to the start of the next block.
> 
> block-at-a-time would work well for detecting/fixing invalid multibyte 
> sequences, especially in UTF-8.
> But I'm not sure about other multibyte encodings (I'll have to investigate).
> 
> However, for unicode normalization, I am not sure there's a stream interface 
> to it (gnulib's uninorm takes a whole string to normalize). IIUC, 
> normalization requires being able to examine some unicode characters ahead.

Oh right I see.

You're saying that splitting per line is a natural way to ensure
processing doesn't split a decomposed character across buffers,
which matters for normalization.

To support that you'd have to do something like:

  filter = uninorm_filter_create ()
  while (read (fd, buf, BUFSIZE) > 0)
    for each mbchar in buf:
      uchar = mbtowchar (mbchar);
      if (!uchar)
        uchar = fix (mbchar);  // repair/replace the invalid sequence
      uninorm_filter_write (filter, uchar);
  uninorm_filter_flush (filter)
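
Fleshed out with gnulib's actual filter API, that might look roughly like
the following (untested sketch; it assumes gnulib's <uninorm.h> and
<unistr.h>, normalizes to NFC, substitutes U+FFFD for invalid input rather
than doing anything smarter, and mostly elides error handling; the
function names are only for illustration):

  #include <stdio.h>
  #include <unistd.h>
  #include <uninorm.h>
  #include <unistr.h>

  #define BUFSIZE (64 * 1024)

  /* Called by the filter for each normalized character:
     re-encode as UTF-8 and write to stdout.  */
  static int
  emit (void *data, ucs4_t uc)
  {
    uint8_t out[6];
    int len = u8_uctomb (out, uc, sizeof out);
    if (len <= 0 || fwrite (out, 1, len, stdout) != (size_t) len)
      return -1;
    return 0;
  }

  static void
  normalize_stream (int fd)
  {
    struct uninorm_filter *filter =
      uninorm_filter_create (UNINORM_NFC, emit, NULL);
    uint8_t buf[BUFSIZE];
    ssize_t n;

    while ((n = read (fd, buf, sizeof buf)) > 0)
      {
        size_t i = 0;
        while (i < (size_t) n)
          {
            ucs4_t uc;
            int len = u8_mbtoucr (&uc, buf + i, n - i);
            if (len < 0)   /* invalid or truncated: u8_mbtoucr already set
                              uc to U+FFFD; a real version would carry an
                              incomplete tail over to the next block */
              len = 1;
            uninorm_filter_write (filter, uc);
            i += len;
          }
      }

    uninorm_filter_flush (filter);
    uninorm_filter_free (filter);
  }

Note the flush only happens at EOF, which is what lets decomposed
characters span block boundaries without being forced out half-normalized.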

I don't know how that would perform compared to u8_normalize().
It might be faster since we're already processing each char?
Or it might be slower if u8_normalize() has some utf8 specific optimizations.
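
As for the block-boundary handling mentioned above (UTF-8 being self
synchronising, so at most the last 3 bytes of a block need to be held
back), something like this untested sketch could decide how many trailing
bytes to carry into the next read:

  #include <stddef.h>

  /* Return how many bytes at the end of BUF (length N) form the start
     of an incomplete UTF-8 sequence and should be prepended to the
     next block.  Invalid bytes are not held back; the decoder will
     flag them.  */
  static size_t
  utf8_incomplete_suffix (const unsigned char *buf, size_t n)
  {
    size_t i = n;

    /* Step back over at most 3 trailing continuation bytes (10xxxxxx).  */
    while (i > 0 && n - i < 3 && (buf[i - 1] & 0xC0) == 0x80)
      i--;
    if (i == 0)
      return 0;

    unsigned char lead = buf[i - 1];
    size_t seqlen = (lead & 0x80) == 0x00 ? 1
                  : (lead & 0xE0) == 0xC0 ? 2
                  : (lead & 0xF0) == 0xE0 ? 3
                  : (lead & 0xF8) == 0xF0 ? 4
                  : 1;               /* invalid lead byte: don't hold back */
    size_t have = n - (i - 1);       /* bytes present from the lead onward */

    return have < seqlen ? have : 0;
  }

The caller would memmove() those bytes to the front of the buffer before
the next read() and only feed the rest to the decoder (u8_mbtoucr()
returning -2 for an incomplete sequence could be used to the same effect).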

cheers,
Pádraig


