Re: Multibyte support for sort, uniq, join, tr, cut, paste, expand, unex

coreutils

From:	Eric Fischer
Subject:	Re: Multibyte support for sort, uniq, join, tr, cut, paste, expand, unexpand, fmt, fold, and pr
Date:	Sat, 30 Dec 2017 11:40:39 -0800

Thanks for the feedback. My changes are now in the "multibyte" branch at

  https://github.com/ericfischer/coreutils/tree/multibyte

branched from the savannah coreutils repository.

I've moved my lib changes (all multibyte or wide versions of existing
single-byte functions) into a shared file in the src directory. They could
be moved upstream if they turn out to be useful outside the scope of
coreutils.

If I'm reading your web page and code correctly, it sounds like the main
things we disagree upon are:

  * Character widths. I treat any printing character as being of equal
width (as it is on my display); you use wcswidth() to try to identify the
characters' widths.

  * Handling of invalid encodings. I generally stop with an error; you wrap
the foreign byte and pass it through to the output as an opaque object.

  * Case-insensitive comparison. I follow POSIX and map lower case to upper
case equivalents where available; you use a case-insensitive collator.

  * Surrogate pairs. I trust wchar_t to be a sufficient character type; you
have a special case for UTF-16 systems.

It is true that I should pay more attention to character widths in expand,
unexpand, fold, fmt, and pr. In particular I should make sure that
zero-width characters are treated as zero-width and that they stay attached
to the previous character so that combining accents will work. I don't
think any more character width awareness than that is portable between
displays.
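To make the width policy above concrete, here is a minimal sketch, not the code in the multibyte branch. It assumes wcwidth() is available and reports 0 for combining characters in the active locale. The names columns_for_width and display_columns are hypothetical, chosen only for illustration.

```c
#include <stddef.h>
#include <wchar.h>

/* wcwidth() is an XSI function; some C libraries only declare it when
   _XOPEN_SOURCE is defined before the first include. Repeating the
   prototype keeps this sketch self-contained either way. */
int wcwidth(wchar_t c);

/* Hypothetical helper: map a wcwidth() result to columns under the
   "every printing character is one column" policy. Zero-width
   characters stay zero, wider characters count as one, and
   nonprinting characters (negative result) count as zero. */
int columns_for_width(int w)
{
    return w > 0 ? 1 : 0;
}

/* Hypothetical helper: total columns occupied by a wide string. A
   combining accent adds nothing, so it stays in the same column as
   the character before it. */
size_t display_columns(const wchar_t *s)
{
    size_t cols = 0;
    for (; *s != L'\0'; s++)
        cols += (size_t) columns_for_width(wcwidth(*s));
    return cols;
}
```

A tool such as fold would also have to avoid breaking a line between a base character and a following zero-width character; the sketch only covers the column arithmetic.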

If wrapping foreign bytes is a requirement, I could do that, although it
seems like unnecessary complexity when LC_ALL=C is available for binary
files and other implementations get away with reporting errors when given
files with invalid encodings.
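The "stop with an error" behavior can be sketched with the standard mbrtowc() interface. This is an illustration under assumptions rather than the branch's actual code: count_characters is a hypothetical name, and the function decodes a buffer in the current locale's encoding and gives up at the first invalid or truncated sequence.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in buf[0..len) using the
   current locale's encoding. Returns the count, or -1 with errno set
   to EILSEQ if the input contains an invalid or truncated sequence. */
long count_characters(const char *buf, size_t len)
{
    mbstate_t state;
    long count = 0;

    memset(&state, 0, sizeof state);
    while (len > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, buf, len, &state);

        if (n == (size_t) -1 || n == (size_t) -2) {
            /* -1: invalid sequence; -2: input ends mid-character. */
            errno = EILSEQ;
            return -1;
        }
        if (n == 0)
            n = 1;  /* embedded NUL: mbrtowc() reports 0; assume one byte was consumed */
        buf += n;
        len -= n;
        count++;
    }
    return count;
}
```

A caller would report the error and exit when this returns -1. The pass-through alternative would instead have to carry the offending byte alongside the decoded characters, which is the extra complexity mentioned above.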

I don't think there is a good solution to case folding. On systems like
glibc that have working collation, sort will already fold case whether or
not you ask for it. On systems like macOS that have broken collation, there
is no collator to resort to when case mapping isn't sufficient.
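For reference, the case-mapping approach amounts to comparing strings after passing each character through towupper(). A minimal sketch follows, with compare_folded as a hypothetical name; it compares mapped code values directly and deliberately ignores locale collation order, which is exactly the limitation discussed above.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: compare two wide strings after mapping each
   character to its upper-case equivalent where one exists. Characters
   without an upper-case equivalent are compared unchanged. Returns a
   negative, zero, or positive value like wcscmp(). */
int compare_folded(const wchar_t *a, const wchar_t *b)
{
    for (;; a++, b++) {
        wint_t ua = towupper((wint_t) *a);
        wint_t ub = towupper((wint_t) *b);

        if (ua != ub)
            return ua < ub ? -1 : 1;
        if (*a == L'\0')
            return 0;  /* both strings ended together */
    }
}
```

The result depends on the LC_CTYPE setting, since towupper() only knows the mappings the current locale defines.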

I don't think there is a good solution to the surrogate pair problem
either. On systems where wide characters are only 16 bits, the wctype
functions will be wrong on characters beyond that limit, so there's only so
much the tools can do.
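For context, this is roughly what a UTF-16 special case has to do: detect the two halves of a surrogate pair and combine them into one code point. The sketch below is generic UTF-16 arithmetic, not taken from either implementation, and the function names are hypothetical.

```c
#include <stdint.h>

/* High (leading) surrogates occupy U+D800..U+DBFF. */
int is_high_surrogate(uint16_t u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

/* Low (trailing) surrogates occupy U+DC00..U+DFFF. */
int is_low_surrogate(uint16_t u)
{
    return u >= 0xDC00 && u <= 0xDFFF;
}

/* Combine a valid high/low pair into the code point it encodes.
   The caller is expected to have checked both halves first. */
uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    return 0x10000u
           + (((uint32_t) high - 0xD800u) << 10)
           + ((uint32_t) low - 0xDC00u);
}
```

Even with the code point recovered this way, a 16-bit wchar_t cannot hold it, so it still cannot be passed to the wctype functions on such a system.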

Have I missed or misrepresented anything important? Thanks!

Eric
