[coreutils] Re: [PATCH] join: support multi-byte character encodings


From: Pádraig Brady
Subject: [coreutils] Re: [PATCH] join: support multi-byte character encodings
Date: Mon, 20 Sep 2010 09:45:31 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3

On 19/09/10 14:13, Bruno Haible wrote:
> Hi Pádraig,

Hi Bruno,

Thanks very much for the review.

>
> > This is my start at applying robust and efficient multi-byte
> > processing to coreutils.
>
> Actually, it is the continuation of the discussion and based on the patch
> from March 2009
> <http://lists.gnu.org/archive/html/bug-coreutils/2009-03/msg00102.html>.

Yes, I should have mentioned that, sorry.
Personally I started with a clean slate and the libunistring docs,
and copied/pasted/modified the ulc_casec..() calls when I got to that bit.
I also need to add some general notes to NEWS/ChangeLog about the
use of libunistring, which enables this.

>
> What has changed since then?
>
> Ad 1) Which functions to use for case comparison in coreutils?
> I see you replaced the calls to ulc_casecmp / mbmemcasecmp with calls to
> ulc_casecoll / mbmemcasecoll. This is correct because POSIX specifies that
> the sorting should obey the LC_COLLATE environment variable, i.e. use locale
> dependent collation rules.
>
> Ad 2) How should the case comparison in "sort -f" behave?
> I think you need to decide on this before changing 'join', because 'join'
> and 'sort' need to behave the same way, otherwise 'join' errs out when
> processing results from 'sort'.

That's a very good point, and it suggests we need to update
sort, join, and uniq as a unit.
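
For reference, the kind of call involved looks roughly like this
(a minimal sketch against libunistring's unicase.h, not code from
the patch; the strings are just placeholders):

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <unicase.h>

int
main (void)
{
  setlocale (LC_ALL, "");

  const char *a = "Hello";
  const char *b = "HELLO";
  int result;

  /* Case-insensitive comparison obeying LC_COLLATE, as POSIX requires
     for sort.  uc_locale_language () supplies the language so that
     tailorings like the Turkish i are applied; NULL means no
     normalization is applied first.  */
  if (ulc_casecoll (a, strlen (a), b, strlen (b),
                    uc_locale_language (), NULL, &result) < 0)
    return 1;

  printf ("casecoll: %d\n", result);
  return 0;
}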

>
> Ad 3) Executable size.
> This has now been resolved through the creation of libunistring as a shared
> library.
>
>
> About multibyte processing vs. UTF-8 processing:
> > I was wondering about unconditionally converting
> > input to UTF-8 before processing. I didn't do this yet as:
> >   Currently we support invalid input chars by processing as in the C locale.
>
> Correct. This is one of the major design decisions Paul, Jim, and I agreed
> upon in 2001. It is this requirement which forbids converting the input to
> a wchar_t stream, doing processing with wchar_t objects, and producing a
> stream of wchar_t objects that are finally converted to multibyte
> representation again.
>
> It is this requirement which also forbids converting the input to UTF-8,
> doing processing with Unicode characters, and converting the Unicode
> character stream to multibyte representation at the end. This approach is
> acceptable for a word processor that can refuse to open a file, or for more
> general applications. But for coreutils, where classical behaviour is to
> get reasonable processing in the "C" locale of files encoded in UTF-8,
> EUC-JP, or ISO-8859-2, this approach cannot be done.
>
> For this reason, gnulib has the modules 'mbchar', 'mbiter', 'mbuiter', and
> 'mbfile', which provide a "multibyte character" datatype that also
> accommodates invalid byte sequences.
>
> Emacs handles this requirement by extending UTF-8. But this approach is unique
> to Emacs: libunistring and other software support plain UTF-8, not extended
> UTF-8.

I had noticed modified UTF-8 when looking at this and thought it quite
useful. It encodes \0 as C0 80, for example, which allows processing
with standard C string functions.  I presume Emacs's extended UTF-8 does
something similar for invalid codes.  Now, most u8... libunistring functions
take a size and so work with strings containing embedded NULs.
However, u8_check() rejects C0 80 as invalid, and u8_mbsnlen() counts the
two bytes as separate characters. That last detail pretty much precludes
using modified UTF-8 with libunistring.
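
To illustrate (a minimal standalone example; only unistr.h is assumed):

#include <stdio.h>
#include <unistr.h>

int
main (void)
{
  /* Modified UTF-8 encodes U+0000 as the overlong pair C0 80.  */
  static const uint8_t s[] = { 'a', 0xC0, 0x80, 'b' };

  /* u8_check () returns a pointer to the first invalid unit, or NULL
     if the string is valid UTF-8, so the overlong C0 80 is rejected.  */
  printf ("valid UTF-8: %s\n", u8_check (s, sizeof s) ? "no" : "yes");

  /* u8_mbsnlen () counts each byte of C0 80 as a separate character,
     rather than as one NUL.  */
  printf ("chars: %zu\n", u8_mbsnlen (s, sizeof s));

  return 0;
}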

>
> > wanted to transform the input as little as possible.
>
> Yes, this is a wise guideline.
>
> >   I was unsure how to generally correlate portions of a UTF-8 string
> >   with its corresponding source string
>
> The u8_conv_from_encoding function
> <http://www.gnu.org/software/libunistring/manual/html_node/uniconv_002eh.html>
> stores offsets in an array that give you a mapping from the source string
> pointers to the UTF-8 string pointers.

Ah, that's useful, thanks.
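
For the archives, a sketch of how that mapping might be used
(hypothetical example; the ISO-8859-1 input and the iconveh_error
choice are just for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <uniconv.h>

int
main (void)
{
  const char *src = "caf\xe9";  /* "café" in ISO-8859-1 */
  size_t srclen = strlen (src);

  /* When OFFSETS is non-NULL, offsets[i] receives the offset in the
     UTF-8 result of the character starting at src[i], i.e. a map
     from source positions to UTF-8 positions.  */
  size_t *offsets = malloc (srclen * sizeof *offsets);
  if (!offsets)
    return 1;

  size_t u8len;
  uint8_t *u8 = u8_conv_from_encoding ("ISO-8859-1", iconveh_error,
                                       src, srclen, offsets,
                                       NULL, &u8len);
  if (!u8)
    return 1;

  for (size_t i = 0; i < srclen; i++)
    printf ("src[%zu] -> utf8[%zu]\n", i, offsets[i]);

  free (u8);
  free (offsets);
  return 0;
}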

> > I may revisit the idea of processing using UTF8 internally if we support
> > normalization internally to the utils.
>
> I would advise against normalization that forgets the original string. Of
> course, you can associate the original string with the UTF-8 equivalent, if
> that exists, and keep both in memory; that becomes merely a speed vs. memory
> trade-off.

I agree, though there are still normalization edge cases to deal with.

> > I do try to special case UTF-8 where beneficial
>
> I think this should be done in an encoding agnostic way. Don't test whether
> the locale encoding is "UTF-8". This is bad, because
>   - The GB18030 encoding is just as powerful as UTF-8; it's just a different
>     encoding.

Well, UTF-8 is more popular and supports faster processing,
so I think it's worth the special casing. GB18030 is not
a "nice" encoding to process, so I think applying the more
general logic to it is fine.

>   - The same handling of the Turkish i should be done in ISO-8859-9 locales
>     as in UTF-8 locales. The same handling of the Greek sigma should be done
>     in ISO-8859-7 locales as in UTF-8 locales. Users would be very surprised
>     if moving from unibyte encodings to UTF-8 would change the semantic of
>     processing their text.

Right. I'm only talking about the encoding here.
In this case the only special casing is efficiently searching for
the tab char when LC_CTYPE specifies UTF-8.
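
The property being exploited is roughly this (sketch only;
find_tab_utf8 is just an illustrative name):

#include <string.h>

/* In UTF-8 every byte of a multi-byte sequence has the high bit set,
   so an ASCII byte like '\t' can never occur inside another
   character; a plain memchr () is therefore both correct and fast.
   That doesn't hold for e.g. Shift_JIS or GB18030, where trailing
   bytes can fall in the ASCII range.  */
static const char *
find_tab_utf8 (const char *s, size_t len)
{
  return memchr (s, '\t', len);
}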

> > One consequence of linking with libunistring ... printf ...
> > incurs a 16% startup overhead
>
> We discussed this in the thread at
> <http://lists.gnu.org/archive/html/bug-libunistring/2010-09/msg00000.html>.
>
> > On a related note, I noticed that bootstrap doesn't update ./configure 
> > correctly
>
> See <http://lists.gnu.org/archive/html/bug-gnulib/2010-09/msg00090.html>.
>
>
> Now, going through your patch.
>
> - Function xfields. I agree with the use of mbiter for the multibyte case
>   and of 'char' variables for the unibyte case. But the algorithm is
>   duplicated, once for MB_CUR_MAX > 1, once for MB_CUR_MAX == 1. In 2001,
>   Paul and Jim said that they wanted
>     - correct handling of multibyte locales, even with invalid input,
>     - no speed decrease for unibyte locales,
>     - no code duplication,
>     - maintainable code.
>   At that time I wrote macros which expand to uses of 'mbiter' or plain
>   'char' variables, depending on macros. See attached diff. At that time,
>   Jim did not like the extra file. Two months ago, he told me an extra file
>   would be acceptable if we really find no better solution.

So my patch does the above, with code duplication.
But it's only a small chunk, and I much prefer explicit
duplication like this to macro abstractions
and a separate file.  It's a pity that we need to
support "wide blank" chars; otherwise we wouldn't need the
duplication at all. Also, having the duplicated code
for the unibyte case may make things more maintainable,
as one then has a unibyte reference for the algorithm, which
could well help in understanding the multibyte code.
Obviously large amounts of duplicated code are bad, so
one has to make a call as to when the extra abstraction is needed.
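
To make the shape of the duplication concrete, here is a stripped-down
sketch of the pattern (count_fields is illustrative, not the code in
the patch; gnulib's mbiter wraps the same mbrtowc-style loop):

#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

static size_t
count_fields (const char *line, size_t len)
{
  size_t fields = 0;
  int in_field = 0;

  if (MB_CUR_MAX == 1)
    {
      /* Unibyte: a plain char loop with no decoding cost.  */
      for (size_t i = 0; i < len; i++)
        {
          int blank = isblank ((unsigned char) line[i]);
          if (!blank && !in_field)
            fields++;
          in_field = !blank;
        }
    }
  else
    {
      /* Multibyte: decode character by character, treating invalid
         byte sequences as single non-blank bytes, per the 2001
         design decision.  */
      mbstate_t st;
      memset (&st, 0, sizeof st);
      for (size_t i = 0; i < len; )
        {
          wchar_t wc;
          size_t n = mbrtowc (&wc, line + i, len - i, &st);
          int blank;
          if (n == (size_t) -1 || n == (size_t) -2)
            {
              n = 1;
              blank = 0;
              memset (&st, 0, sizeof st);
            }
          else
            {
              if (n == 0)       /* embedded NUL */
                n = 1;
              blank = iswblank ((wint_t) wc);
            }
          if (!blank && !in_field)
            fields++;
          in_field = !blank;
          i += n;
        }
    }
  return fields;
}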

BTW, the duplication is just for speed. I just tested the
multi-byte chunk in the "C" locale and it incurred
an overhead of 24%.

>
> - Function xfields. The "FIXME: ... when splitting on non blanks, Surrounding
>   chars might be needed for context". I would say in this case, context should
>   be ignored. If someone specifies to split at Greek sigmas, he certainly
>   wants to consider all Greek sigmas, not some of them depending on their
>   context. So I would turn this "FIXME" into a note.

Fair enough.

>
> - Function keycmp. For speed, it should be useful to determine
>   uc_locale_language () once only.

Done. It's inconsequential though: 0.3%. (I really like the fact that
my laptop is stable enough to measure such differences.)

>
> - Function main, handling of 't' argument. Let's be clear about what POSIX
>   says <http://www.opengroup.org/onlinepubs/9699919799/utilities/join.html>:
>   The 't' option takes a single character as argument, and the LC_CTYPE
>   category of the current locale determines the boundaries of characters.
>   You _can_ allow "combined characters" as 't' arguments, but that is an
>   extension to POSIX, so one should make sure that
>     1. this extension does not cause regression to the POSIX part of the
>        functionality,
>     2. it is turned off when the environment variable POSIXLY_CORRECT is set.

Well, I can't see complex chars being used for purposes other than
delimiting. Also, since we need much the same code for searching for
"normal" multi-byte characters, I think it's sensible to allow
complex characters, even if POSIXLY_CORRECT is set.

>
> +#if HAVE_LIBUNISTRING
> +            else
> +              {
> +                if (!(u8tab = u8_str_from_locale (optarg)))
> +                  error (EXIT_FAILURE, errno, _("error converting tab %s"),
> +                         quote (optarg));
> +              }
> +#endif
>
>   IMO, if the conversion cannot be performed or libunistring is not present,
>   you should use mbrtowc to test whether optarg contains only one character.
>   Or equivalently, test    mbslen (optarg) > 1.

OK.
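
Something like this, presumably (a sketch; is_single_mb_char is an
illustrative name, and gnulib's mbslen (optarg) == 1 is the shorter
spelling):

#include <string.h>
#include <wchar.h>

/* Return nonzero if ARG consists of exactly one valid multibyte
   character, per the POSIX reading of -t.  */
static int
is_single_mb_char (const char *arg)
{
  mbstate_t st;
  memset (&st, 0, sizeof st);
  size_t n = mbrtowc (NULL, arg, strlen (arg), &st);
  if (n == 0 || n == (size_t) -1 || n == (size_t) -2)
    return 0;                   /* empty, invalid, or incomplete */
  return arg[n] == '\0';        /* one character consumed everything */
}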

>
> +#if HAVE_LIBUNISTRING
> +                    /* We support only single characters to support existing
> +                       documentation, and to restrict possible future
> +                       character combinations, like the current "\0".  */
> +                    if (u8_base_chars (u8tab) > 1)
> +#endif
> +                      error (EXIT_FAILURE, 0, _("multi-character tab %s"),
> +                             quote (optarg));
>
>   Putting an if condition inside #if is confusing. I would avoid that style.
>   And, as said above, better consider POSIXLY_CORRECT here.
>
> - Comments in function u8_base_chars: The POSIX notion of "character" is a
>   multibyte character. When we convert multibyte characters to Unicode
>   characters, it is possible to map one POSIX character to 2 Unicode
>   characters.
>   Your function u8_base_chars returns the number of so-called
>   "complex characters"
>   <http://opengroup.org/onlinepubs/007908775/xcurses/intov.html>.
>   It appears correct to use UC_PROPERTY_COMBINING here (as opposed to just the
>   combining class).
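
For what it's worth, the counting loop I have in mind looks something
like this (a hypothetical reconstruction with an explicit length
argument, not the exact code from the patch):

#include <unictype.h>
#include <unistr.h>

/* Count the "complex characters" in the UTF-8 string S of length N:
   characters that do not carry the Unicode combining property each
   start a new base-plus-combiners cluster.  */
static size_t
u8_base_chars (const uint8_t *s, size_t n)
{
  size_t count = 0;
  for (size_t i = 0; i < n; )
    {
      ucs4_t uc;
      /* u8_mbtouc () always makes progress, mapping invalid input
         to U+FFFD and consuming at least one byte.  */
      int len = u8_mbtouc (&uc, s + i, n - i);
      if (!uc_is_property_combining (uc))
        count++;
      i += len;
    }
  return count;
}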
>
> About the unit test: I would also add some tests with the EUC-JP encoding.
> It is so easy to assume that every 'const char *' is UTF-8 encoded, especially
> for future generations of programmers who did not grow up with ISO-8859-1, and
> especially when libunistring is in the minds of the programmers.

Yes, I will expand the unit test with other encodings.

thanks!
Pádraig.


