From: Eric Fischer
Subject: Re: Multibyte support for sort, uniq, join, tr, cut, paste, expand, unexpand, fmt, fold, and pr
Date: Wed, 17 Jan 2018 11:01:05 -0800

Thanks for the feedback. To clear one thing up at the start: I am not Eric Blake, so the earlier cut -d patch is not mine.

Thanks also for clarifying the license requirements. I will follow up with Mapbox legal to find out how we can work with this.

Sebastian, I think you may have been testing against https://github.com/ericfischer/coreutils-utf8 (my old fork of coreutils-8.28), not https://github.com/ericfischer/coreutils/tree/multibyte (a fork of the coreutils development repository). I will delete the old repository now to prevent future confusion. I say this because in the new branch, following Assaf's earlier comments, I now pass undecodable bytes through to the output rather than reporting an error.

All the existing single-byte tests pass when I run them myself. I have not tried the tests from Assaf's branch, but will run them now and fix any errors that they diagnose in my code.

Assaf, I am sensitive to performance (and have added a cache for wcwidth() because it is notably slow on my system). I will be happy to optimize any other code paths that turn out to be a problem in practice.

Would it resolve any of the wchar_t width concerns to use C11's char32_t instead of wchar_t? I don't consider this ideal because it requires a newer compiler, and the wctype functions are still only defined for the wchar_t range, and no WEOF value is defined for it. (I am also not sure after reading the C standard whether wchar_t and char32_t are guaranteed to be in the same order, except in the case where they are both advertised to be in Unicode order.)

The other alternative I can see is to decode UTF-8 by hand and pass characters beyond the wchar_t range through as wider opaque blobs, the same way raw bytes are. Compared to passing them through as halves of surrogate pairs, this would fix the character numbering in cut and sort, but would risk the possibility of wctype errors on any Big5- or EUC-JP-oriented system where wchar_t is not in fact in Unicode order.

Eric
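
For illustration only, the "decode UTF-8 by hand" alternative could look roughly like the sketch below. It is not taken from the multibyte branch: the function name decode_utf8 and its return convention are assumptions made for this example. It returns the number of bytes consumed for a well-formed sequence and 0 for anything undecodable, so a caller could pass that single byte through to the output unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from coreutils.
   Decode one UTF-8 sequence from S[0..N).  On success, store the code
   point in *CP and return the number of bytes consumed (1 to 4).
   Return 0 if S[0] does not begin a well-formed sequence; the caller
   is then expected to treat S[0] as one opaque byte.  */
static size_t
decode_utf8 (const unsigned char *s, size_t n, uint32_t *cp)
{
  assert (s != NULL && cp != NULL);
  if (n == 0)
    return 0;

  unsigned char b = s[0];
  size_t len;
  uint32_t c, min;

  if (b < 0x80)
    {
      *cp = b;                  /* plain ASCII */
      return 1;
    }
  else if ((b & 0xE0) == 0xC0)
    { len = 2; c = b & 0x1F; min = 0x80; }
  else if ((b & 0xF0) == 0xE0)
    { len = 3; c = b & 0x0F; min = 0x800; }
  else if ((b & 0xF8) == 0xF0)
    { len = 4; c = b & 0x07; min = 0x10000; }
  else
    return 0;                   /* stray continuation byte, or 0xF8..0xFF */

  if (n < len)
    return 0;                   /* truncated sequence */

  for (size_t i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;               /* not a continuation byte */
      c = (c << 6) | (s[i] & 0x3F);
    }

  /* Reject overlong forms, surrogates, and values past U+10FFFF.  */
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;

  *cp = c;
  return len;
}
```

Because the result is a uint32_t rather than a wchar_t, a decoder of this shape does not depend on the platform's wchar_t width; whether the code point can then be handed to the wctype functions is a separate question, as noted above.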