From: Eric Fischer
Subject: Re: Multibyte support for sort, uniq, join, tr, cut, paste, expand, unexpand, fmt, fold, and pr
Date: Wed, 17 Jan 2018 11:01:05 -0800

Thanks for the feedback.

To clear one thing up at the start: I am not Eric Blake, so the earlier
cut -d patch is not mine.

Thanks also for clarifying the license requirements. I will follow up with
Mapbox legal to find out how we can work with this.

Sebastian, I think you may have been testing against
https://github.com/ericfischer/coreutils-utf8 (my old fork of
coreutils-8.28), not https://github.com/ericfischer/coreutils/tree/multibyte
(a fork of the coreutils development repository). I will delete the old
repository now to prevent future confusion.

I say this because in the new branch, following Assaf's earlier comments, I
now pass undecodable bytes through to the output rather than reporting an
error. All the existing single-byte tests pass when I run them myself.

I have not tried the tests from Assaf's branch, but will run them now and
fix any errors that they diagnose in my code.

Assaf, I am sensitive to performance (and have added a cache for wcwidth()
because it is notably slow on my system). I will be happy to optimize any
other code paths that turn out to be a problem in practice.
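
As a rough illustration of the wcwidth() caching idea, here is a minimal
sketch in C. It is not the code from the multibyte branch: the names
cached_wcwidth, width_cache, width_cache_misses, and WIDTH_CACHE_SLOTS are
hypothetical, and the direct-mapped layout is only one plausible design. It
assumes a C11 compiler (for static_assert) and a POSIX system that declares
wcwidth().

```c
#define _XOPEN_SOURCE 700   /* asks the system headers to declare wcwidth() */
#include <assert.h>
#include <wchar.h>

/* Hypothetical table size; a power of two so a bit mask can pick the slot. */
#define WIDTH_CACHE_SLOTS 4096

static_assert((WIDTH_CACHE_SLOTS & (WIDTH_CACHE_SLOTS - 1)) == 0,
              "WIDTH_CACHE_SLOTS must be a power of two");

struct width_slot {
    wchar_t key;   /* character whose width is stored in this slot */
    int width;     /* result of wcwidth(key); may be -1 (non-printable) */
    int valid;     /* 0 until the slot has been filled once */
};

/* Zero-initialized, so every slot starts out invalid. */
struct width_slot width_cache[WIDTH_CACHE_SLOTS];

/* Number of times the real wcwidth() had to be called. */
unsigned long width_cache_misses;

/* Return wcwidth(wc), consulting a direct-mapped cache first. */
int cached_wcwidth(wchar_t wc)
{
    struct width_slot *slot =
        &width_cache[(unsigned long) wc & (WIDTH_CACHE_SLOTS - 1)];

    if (!slot->valid || slot->key != wc) {
        /* Miss: ask the C library and overwrite whatever was in the slot. */
        slot->key = wc;
        slot->width = wcwidth(wc);
        slot->valid = 1;
        width_cache_misses++;
    }
    return slot->width;
}
```

Because wcwidth() results depend on the current locale, a cache like this
would presumably need to be cleared after any setlocale() call that changes
LC_CTYPE.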

Would it resolve any of the wchar_t width concerns to use C11's char32_t
instead of wchar_t? I don't consider this ideal: it requires a newer
compiler, the wctype functions are still defined only for the wchar_t
range, and there is no char32_t counterpart to WEOF. (After reading the C
standard, I am also not sure whether wchar_t and char32_t are guaranteed to
be in the same order, except where both are advertised to be in Unicode
order.) The other alternative I can see is to decode UTF-8 by hand and pass
characters beyond the wchar_t range through as wider opaque blobs, the same
way raw bytes are. Compared to passing them through as halves of surrogate
pairs, this would fix the character numbering in cut and sort, but would
risk wctype errors on any Big5- or EUC-JP-oriented system where wchar_t is
not in fact in Unicode order.
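
To make the hand-decoding alternative concrete, here is a minimal sketch in
C. It is only an illustration under stated assumptions, not code from the
branch: the names decode_unit, count_units, struct unit, and the
wchar_limit parameter (standing in for the largest value the platform's
wchar_t can hold) are all hypothetical, and uint32_t is used in place of
char32_t. Each call yields one unit: a decoded character, a valid character
too large for wchar_t (kept as one opaque blob), or a single undecodable
byte passed through as-is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum unit_kind {
    UNIT_CHAR,         /* valid code point that fits in wchar_limit */
    UNIT_OPAQUE_CHAR,  /* valid code point beyond wchar_limit */
    UNIT_RAW_BYTE      /* undecodable byte, passed through unchanged */
};

struct unit {
    enum unit_kind kind;
    uint32_t value;    /* code point, or the raw byte value */
    size_t len;        /* number of input bytes this unit covers */
};

/* Decode one unit from the n > 0 bytes at s. */
struct unit decode_unit(const unsigned char *s, size_t n, uint32_t wchar_limit)
{
    assert(s != NULL && n > 0);

    /* Default result: treat the first byte as an opaque raw byte. */
    struct unit u = { UNIT_RAW_BYTE, s[0], 1 };
    unsigned char b = s[0];
    size_t need = 0;          /* continuation bytes expected */
    uint32_t cp = 0, min = 0; /* code point so far; smallest non-overlong value */

    if (b < 0x80) {
        u.kind = UNIT_CHAR;   /* ASCII */
        return u;
    } else if ((b & 0xE0) == 0xC0) {
        need = 1; cp = b & 0x1F; min = 0x80;
    } else if ((b & 0xF0) == 0xE0) {
        need = 2; cp = b & 0x0F; min = 0x800;
    } else if ((b & 0xF8) == 0xF0) {
        need = 3; cp = b & 0x07; min = 0x10000;
    } else {
        return u;             /* stray continuation or invalid lead byte */
    }

    if (n < need + 1)
        return u;             /* truncated sequence */
    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return u;         /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, out-of-range values, and surrogates. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return u;

    u.kind = cp > wchar_limit ? UNIT_OPAQUE_CHAR : UNIT_CHAR;
    u.value = cp;
    u.len = need + 1;
    return u;
}

/* Count units the way a character-numbering tool might: one per unit. */
size_t count_units(const unsigned char *s, size_t n, uint32_t wchar_limit)
{
    size_t count = 0;
    while (n > 0) {
        struct unit u = decode_unit(s, n, wchar_limit);
        s += u.len;
        n -= u.len;
        count++;
    }
    return count;
}
```

With wchar_limit set to 0xFFFF, as on a platform with a 16-bit wchar_t, a
four-byte sequence such as U+1F600 counts as one unit rather than as two
surrogate halves. The sketch does not address the wctype concern: an
opaque unit is never handed to the wctype functions at all.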

Eric

