From: Assaf Gordon
Subject: Re: Multibyte support for sort, uniq, join, tr, cut, paste, expand, unexpand, fmt, fold, and pr
Date: Wed, 17 Jan 2018 01:37:39 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.5.0

Hello,
On 2018-01-17 12:45 AM, Sebastian Kisela wrote:
> I have checked Eric's effort on the multibyte support for coreutils.
> The work done seems solid.
Thank you for pitching in to the multibyte effort!
(and your previous patch for "cut -d" is on my TODO list, I haven't
forgotten it).
> [...] and all of the tests that were using C locale failed
>
> I believe the reason is the approach to "error handling" as Eric
> expressed [...]
>> * Handling of invalid encodings. I generally stop with an error;
>> you wrap the foreign byte and pass it through to the output as an opaque
>> object.
There are three important principles here:
1. Adding multibyte support to GNU coreutils should not cause
regressions to existing scripts. Stopping with an error on an invalid
multibyte sequence is such a regression.
Consider tr/cut/head/tail - which have worked on GNU/Linux systems for
at least a couple of decades with any input, regardless of the user's
locale.
If all of a sudden they fail because the input isn't valid in the
current locale - that would be a disruptive regression,
and a big disservice to our users.
2. Proper multibyte processing is slow.
Remember that GNU programs cannot assume UTF-8 input (which is
stateless and could be optimized). The locale could be zh_TW.BIG5
or ja_JP.eucJP or one of several other legacy multibyte locales,
some of which may be stateful.
This means that repeatedly calling multibyte functions could
significantly slow down the processing.
It is highly desirable that the unibyte locale code path
(which has been optimized over time) remains as fast as possible.
See Eric's comment here:
https://lists.gnu.org/archive/html/bug-coreutils/2010-02/msg00078.html
This implies that it is desirable to keep the existing, efficient
unibyte code path in addition to a separate multibyte code path.
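
To make principles 1 and 2 concrete, here is a rough sketch. It is not
code from any of the patches discussed in this thread, and the function
names (count_chars and friends) are hypothetical. It assumes only the
standard MB_CUR_MAX and mbrlen(3) interfaces: the unibyte locale takes a
fast path that never calls a multibyte function, and the multibyte path
counts an undecodable byte as one opaque unit instead of failing.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Unibyte fast path: every byte is one character, so no decoding
   is needed at all.  */
static size_t
count_chars_unibyte (size_t len)
{
  return len;
}

/* Multibyte path: decode with mbrlen, but never stop on bad input.
   An invalid or truncated sequence is counted as a single opaque
   byte and processing continues.  */
static size_t
count_chars_multibyte (const char *buf, size_t len)
{
  mbstate_t st;
  size_t chars = 0;
  size_t i = 0;

  memset (&st, 0, sizeof st);
  while (i < len)
    {
      size_t n = mbrlen (buf + i, len - i, &st);

      if (n == (size_t) -1 || n == (size_t) -2)
        {
          /* Undecodable: reset the conversion state and consume
             exactly one byte as an opaque unit.  */
          memset (&st, 0, sizeof st);
          n = 1;
        }
      else if (n == 0)
        n = 1;                  /* A NUL byte is a one-byte character.  */

      chars++;
      i += n;
    }
  return chars;
}

/* Hypothetical entry point: choose the code path once, based on
   the locale in effect (the caller is assumed to have already
   called setlocale).  */
size_t
count_chars (const char *buf, size_t len)
{
  if (MB_CUR_MAX == 1)
    return count_chars_unibyte (len);
  return count_chars_multibyte (buf, len);
}
```

A program would call setlocale (LC_ALL, "") once at startup and then
count_chars (buf, len) for each buffer; in the C locale the call
reduces to returning len.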
>> * Surrogate pairs. I trust wchar_t to be a sufficient character
>> type; you have a special case for UTF-16 systems.
> Here I agree with the approach from Eric.
3. We cannot safely assume such a thing (that wchar_t is sufficient),
primarily because of two environments:
Windows and Cygwin, and to a lesser extent because of AIX.
Please see previous discussions about it, here:
https://crashcourse.housegordon.org/coreutils-multibyte-support.html#cygwin
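
For illustration only (this is not taken from the patches, and the
helper names are hypothetical): on a system where wchar_t is 16 bits
wide, a character outside the Basic Multilingual Plane arrives as two
UTF-16 code units, which have to be recognized and combined before the
program can treat them as one character. The arithmetic is the standard
UTF-16 one:

```c
#include <stdbool.h>
#include <stdint.h>

/* True if U is the first (high) half of a UTF-16 surrogate pair.  */
bool
is_high_surrogate (uint32_t u)
{
  return u >= 0xD800 && u <= 0xDBFF;
}

/* True if U is the second (low) half of a UTF-16 surrogate pair.  */
bool
is_low_surrogate (uint32_t u)
{
  return u >= 0xDC00 && u <= 0xDFFF;
}

/* Combine a valid high/low surrogate pair into the code point it
   encodes.  The caller must have checked both halves first.  */
uint32_t
combine_surrogates (uint32_t hi, uint32_t lo)
{
  return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}
```

For example, the pair 0xD83D 0xDE00 combines to U+1F600. With a 32-bit
wchar_t no such step is needed, which is why the special case only
matters on UTF-16 systems.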
----
These are the guiding principles I've assumed when developing
my patches, based on past discussions and comments from other
maintainers.
If you (or others) want to revisit and debate them - I encourage all
to chime in.
regards,
- assaf