coreutils

Re: Multibyte support for sort, uniq, join, tr, cut, paste, expand, unex


From: Assaf Gordon
Subject: Re: Multibyte support for sort, uniq, join, tr, cut, paste, expand, unexpand, fmt, fold, and pr
Date: Wed, 17 Jan 2018 01:37:39 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.5.0

Hello,

On 2018-01-17 12:45 AM, Sebastian Kisela wrote:
> I have checked Eric's effort on the multibyte support for coreutils. The work done seems solid.

Thank you for pitching in to the multibyte effort!
(and your previous patch for "cut -d" is on my TODO list, I haven't forgotten it).

> [...] and all of the tests that were using C locale failed
>
> I believe the reason is the approach to "error handling" as Eric expressed [...]
>
>     * Handling of invalid encodings. I generally stop with an error;
>       you wrap the foreign byte and pass it through to the output as
>       an opaque object.

There are three important principles here:

1. Adding multibyte support to GNU coreutils should not cause
regressions in existing scripts. Stopping with an error on an invalid
multibyte sequence is such a regression.

Consider tr/cut/head/tail - which have worked on GNU/Linux systems for
at least a couple of decades with any input, regardless of the user's
locale.
If all of a sudden they fail because the input isn't valid in the
current locale, that would be a disruptive regression
and a big disservice to our users.
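
To make the "pass it through as an opaque object" idea concrete, here is
a minimal sketch. It is not taken from any of the patches; the names
next_char and count_units are hypothetical and only illustrate the
error-handling policy: when mbrtowc() reports an invalid or incomplete
sequence, the conversion state is reset and a single byte is consumed
as-is instead of aborting.

```c
#include <stdbool.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode one character from buf[0..len) in the
   current locale.  Returns the number of bytes consumed (at least 1
   when len > 0).  On an invalid or incomplete sequence, *valid is set
   to false and exactly one byte is consumed, so the caller can copy
   that byte to the output unchanged.  */
static size_t
next_char (const char *buf, size_t len, mbstate_t *state,
           wchar_t *wc, bool *valid)
{
  if (len == 0)
    {
      *valid = false;
      return 0;
    }

  size_t n = mbrtowc (wc, buf, len, state);

  if (n == (size_t) -1 || n == (size_t) -2)
    {
      /* Invalid (-1) or incomplete (-2) sequence: reset the conversion
         state and treat the first byte as an opaque unit.  A real
         program would first try to refill its buffer on -2.  */
      memset (state, 0, sizeof *state);
      *valid = false;
      return 1;
    }

  *valid = true;
  /* mbrtowc returns 0 for a NUL byte, which still occupies one byte.  */
  return n == 0 ? 1 : n;
}

/* Hypothetical caller: count characters plus opaque bytes.  It never
   fails, whatever bytes the input contains.  */
size_t
count_units (const char *buf, size_t len)
{
  mbstate_t state;
  memset (&state, 0, sizeof state);

  size_t units = 0;
  size_t i = 0;
  while (i < len)
    {
      wchar_t wc;
      bool valid;
      i += next_char (buf + i, len - i, &state, &wc, &valid);
      units++;
    }
  return units;
}
```

With this policy a stray 0xFF byte in otherwise valid input counts as
one unit and can be written back untouched, so existing scripts keep
working on input that is not valid in the current locale.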

2. Proper multibyte processing is slow.
Remember that GNU programs cannot assume UTF-8 input (which is
self-synchronizing and could be optimized). The locale could be zh_TW.BIG5
or ja_JP.eucJP or one of several other legacy multibyte locales, where
character boundaries can only be found by decoding from the start of the input.
This means that repeatedly calling multibyte functions could
significantly slow down the processing.

It is highly desirable that the unibyte locale code path
(which has been optimized over time) remains as fast as possible.
See Eric's comment here:
https://lists.gnu.org/archive/html/bug-coreutils/2010-02/msg00078.html

This implies that it is desirable to keep the existing, efficient
unibyte code path alongside a separate multibyte code path.
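
As a rough illustration of the two code paths (a sketch only; the
function names and the NOT_FOUND constant are hypothetical and do not
come from coreutils), consider searching a buffer for a single-byte
delimiter. In a unibyte locale memchr() is enough. In a multibyte
locale such as BIG5, the second byte of a character can fall in the
ASCII range, so the buffer has to be walked character by character.
The choice between the two is made with MB_CUR_MAX.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#define NOT_FOUND ((size_t) -1)   /* hypothetical "no match" value */

/* Fast path: every byte is a character, so memchr is sufficient.  */
static size_t
find_delim_unibyte (const char *buf, size_t len, char delim)
{
  const char *p = memchr (buf, delim, len);
  return p ? (size_t) (p - buf) : NOT_FOUND;
}

/* Slow path: step over whole characters so that a trailing byte of a
   multibyte character is never mistaken for the delimiter.  Invalid
   or incomplete sequences are skipped one byte at a time.  */
static size_t
find_delim_multibyte (const char *buf, size_t len, char delim)
{
  mbstate_t state;
  memset (&state, 0, sizeof state);

  size_t i = 0;
  while (i < len)
    {
      size_t n = mbrlen (buf + i, len - i, &state);
      if (n == (size_t) -1 || n == (size_t) -2)
        {
          memset (&state, 0, sizeof state);
          n = 1;
        }
      else if (n == 0)
        n = 1;                    /* NUL byte */

      if (n == 1 && buf[i] == delim)
        return i;
      i += n;
    }
  return NOT_FOUND;
}

/* Dispatch once per call on the locale's maximum character length.  */
size_t
find_delim (const char *buf, size_t len, char delim)
{
  return (MB_CUR_MAX == 1
          ? find_delim_unibyte (buf, len, delim)
          : find_delim_multibyte (buf, len, delim));
}
```

The unibyte function is untouched by the multibyte logic, so its speed
does not depend on how the multibyte path is implemented.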

>     * Surrogate pairs. I trust wchar_t to be a sufficient character
>       type; you have a special case for UTF-16 systems.

Here I disagree with Eric's approach.

3. We cannot safely assume that wchar_t is sufficient, primarily because
of two environments: Windows and Cygwin (where wchar_t is 16 bits, so
characters outside the Basic Multilingual Plane need a surrogate pair),
and to a lesser extent AIX.

Please see previous discussions about it, here:
https://crashcourse.housegordon.org/coreutils-multibyte-support.html#cygwin
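
For readers unfamiliar with the issue, the following sketch shows what
a "special case for UTF-16 systems" has to do at minimum. It is a
generic illustration of UTF-16 surrogate arithmetic, not code from the
patches, and all function names are hypothetical. Where wchar_t is 16
bits wide, a character above U+FFFF arrives as two units that must be
combined before the character can be classified or measured.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool
is_high_surrogate (uint16_t u)
{
  return u >= 0xD800 && u <= 0xDBFF;
}

static bool
is_low_surrogate (uint16_t u)
{
  return u >= 0xDC00 && u <= 0xDFFF;
}

/* Combine a valid high/low surrogate pair into one code point.  */
static uint32_t
combine_surrogates (uint16_t hi, uint16_t lo)
{
  return 0x10000
         + (((uint32_t) (hi - 0xD800) << 10) | (uint32_t) (lo - 0xDC00));
}

/* Hypothetical decoder: read one code point from units[0..n), n > 0.
   Stores the number of 16-bit units consumed in *consumed.  An
   unpaired surrogate is returned unchanged as a single unit, in the
   same pass-through spirit as invalid bytes.  */
uint32_t
decode_utf16_at (const uint16_t *units, size_t n, size_t *consumed)
{
  if (n >= 2 && is_high_surrogate (units[0]) && is_low_surrogate (units[1]))
    {
      *consumed = 2;
      return combine_surrogates (units[0], units[1]);
    }
  *consumed = 1;
  return units[0];
}
```

Code that trusts a single wchar_t to hold a full character would see
the two halves as two separate (and meaningless) characters on such
systems.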

----

These are the guiding principles I've assumed when developing
my patches, based on past discussions and comments from other
maintainers.

If you (or others) want to revisit and debate them - I encourage all
to chime in.

regards,
 - assaf




