Re: multibyte processing - handling invalid sequences (long)


From: Pádraig Brady
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Wed, 27 Jul 2016 09:39:07 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0

On 27/07/16 03:47, Assaf Gordon wrote:
> 
>> On Jul 23, 2016, at 16:30, Pádraig Brady <address@hidden> wrote:
>>
>> On 23/07/16 19:05, Assaf Gordon wrote:
>>>
>>>> On Jul 23, 2016, at 06:51, Pádraig Brady <address@hidden> wrote:
>>>> I was wondering about the tool being line/record oriented.
>>>>
>>>> Disadvantages are:
>>>> requires arbitrarily large buffers for arbitrarily long lines
>>>> relatively slow in the presence of short/normal lines
>>>> sensitive to the current stdio buffering mode
>>>> requires -z option to support NUL termination
>>>>
>>>> Processing a block at a time instead avoids such issues.
>>>> UTF-8 at least is self-synchronising, so after reading a block
>>>> you just have to look at the last 3 bytes to know
>>>> how many to append to the start of the next block.
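
For concreteness, a minimal sketch of that boundary check (assuming plain
UTF-8 with sequences of at most 4 bytes; illustrative only, not code from
the attached patch):

#include <stddef.h>

/* Return how many bytes at the end of BUF start an incomplete UTF-8
   sequence, i.e. how many octets should be carried over and prepended
   to the next block.  */
static size_t
utf8_trailing_incomplete (const unsigned char *buf, size_t len)
{
  /* Only the last 3 bytes can start an incomplete sequence: anything
     further back is either complete or invalid and is handled in place.  */
  size_t max_back = len < 3 ? len : 3;
  for (size_t i = 1; i <= max_back; i++)
    {
      unsigned char c = buf[len - i];
      if (c < 0x80)
        return 0;                     /* ASCII byte: nothing pending.  */
      if (c >= 0xC0)                  /* Lead byte of a multibyte sequence.  */
        {
          size_t need = c >= 0xF0 ? 4 : c >= 0xE0 ? 3 : 2;
          return i < need ? i : 0;    /* Incomplete: carry over I bytes.  */
        }
      /* 0x80..0xBF is a continuation byte; keep scanning backwards.  */
    }
  return 0;
}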
> 
> Attached is a partial, crude implementation of stream-based processing.
> It currently only handles fixing invalid sequences; there is no Unicode
> normalization yet.
> 
> It contains both implementations to ease comparison (use "-S/--stream" for
> the new implementation, or omit it to use the previous line-based
> implementation).
> 
> The main functions are (to facilitate discussion):
> mbbuf_read - reads more data from the input, moving 'incomplete/left-over'
> octets from the previous read to the beginning of the buffer (somewhat like
> grep's fillbuf() but not as sophisticated); see the sketch after this list.
> STRM_unorm_buf - iterates over the octets in the current buffer.
> STRM_unorm_fd - repeatedly reads the file and calls STRM_unorm_buf.
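
To make the buffer handling concrete, here is a rough sketch of the
carry-over idea behind mbbuf_read (the struct and names below are
illustrative, not the attached code; safe_read and SAFE_READ_ERROR come
from gnulib's safe-read module):

#include <stdbool.h>
#include <string.h>
#include "safe-read.h"

struct mbbuf_sketch
{
  unsigned char *buf;   /* block of bufsize bytes plus carry headroom */
  size_t bufsize;       /* how many bytes to request per read */
  size_t len;           /* valid bytes currently in buf */
  size_t carry;         /* incomplete octets left over from the last block */
};

/* Move the left-over octets to the front and refill the rest from FD.
   Returns false at EOF (with nothing carried over) or on read error.  */
static bool
mbbuf_refill (struct mbbuf_sketch *b, int fd)
{
  if (b->carry)
    memmove (b->buf, b->buf + b->len - b->carry, b->carry);

  size_t n = safe_read (fd, b->buf + b->carry, b->bufsize);
  if (n == SAFE_READ_ERROR)
    return false;               /* caller reports errno */
  b->len = b->carry + n;
  b->carry = 0;                 /* recomputed after scanning the block */
  return b->len != 0;
}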
> 
> The tests use both methods and the results are identical (except for Unicode
> normalization, which is currently skipped for --stream).
> 
> A few issues are emerging:
> 1. If only validation is required (i.e. no Unicode normalization), it'll be
> wasteful to convert the input to wchar_t and back again; it'll be better to
> write the output as-is.  If Unicode normalization is requested, then going
> through wchar_t and uninorm's filter is needed. Perhaps two separate
> dedicated functions would be more efficient.

Yes, that makes sense.
You might simplify by skipping the conversion only in the UTF-8 case if that helps,
as that's going to be by far the most common case.
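
For the UTF-8 fast path, something along these lines would validate and copy
the bytes through without the round trip via wchar_t (a sketch only, using
u8_mbtoucr from gnulib/libunistring's unistr module):

#include <stdio.h>
#include <unitypes.h>
#include <unistr.h>

/* Copy P[0..LEN) to OUT, replacing each invalid UTF-8 sequence with U+FFFD
   and passing valid input through untouched.  In the stream case a trailing
   incomplete sequence (-2) would instead be carried to the next block.  */
static void
copy_valid_utf8 (const uint8_t *p, size_t len, FILE *out)
{
  static const char replacement[] = "\357\277\275";    /* U+FFFD */
  while (len)
    {
      ucs4_t uc;
      int n = u8_mbtoucr (&uc, p, len);
      if (n < 0)                /* -1: invalid, -2: incomplete at end */
        {
          fwrite (replacement, 1, sizeof replacement - 1, out);
          n = 1;                /* resynchronise on the next octet */
        }
      else
        fwrite (p, 1, n, out);  /* pass the original bytes through */
      p += n;
      len -= n;
    }
}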

> 2. Regarding skipping stdio buffering: I assume you were referring to the
> input side. The code now uses file descriptors and 'safe_read', thus bypassing
> stdio buffering on input. But it still uses stdio for output (this seems in
> line with tac, split, tr, etc.). If we want to bypass stdio on output as well,
> some extra code for internal buffering might be needed.

Well, if reading and writing blocks, then stdio buffering is only problematic
overhead.
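
i.e. something like the following skeleton, with no stdio on either side
(a sketch assuming gnulib's safe-read and full-write modules; the
validation/normalization step is elided):

#include <stdlib.h>
#include <errno.h>
#include <error.h>
#include "safe-read.h"
#include "full-write.h"

enum { BLOCKSIZE = 128 * 1024 };

static void
process_blocks (int in_fd, int out_fd)
{
  static unsigned char buf[BLOCKSIZE];
  size_t n;

  while ((n = safe_read (in_fd, buf, sizeof buf)) != 0)
    {
      if (n == SAFE_READ_ERROR)
        error (EXIT_FAILURE, errno, "read error");
      /* ... validate / normalize buf[0..n) here ... */
      if (full_write (out_fd, buf, n) != n)
        error (EXIT_FAILURE, errno, "write error");
    }
}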

> 
> 3. I believe that for this tool to be really useful, it should report the
> line number and column of offending/invalid octets. In that case, the code
> needs to count lines/columns, and will need to be aware of which
> line terminator is used - meaning "-z" is still needed.
> The attached code does count lines/columns (see struct mbbuffer), and is thus
> a bit cumbersome.

Interesting. Fundamentally the processing doesn't need to be line-oriented,
which avoids the buffering issues and possibly gives better performance, but I do
see it as useful to report the offsets of errors. When iterating over the buffer
we can still record the "lines" seen.

> Currently it seems this optimization leads to somewhat more complicated code.
> Once I have Unicode normalization implemented we can compare speeds
> and see which method is preferred.

thanks!
Pádraig



