Re: mbcel module for Gnulib?
From: Bruno Haible
Subject: Re: mbcel module for Gnulib?
Date: Wed, 12 Jul 2023 00:14:45 +0200
[Removing diffutils-devel from CC.]
Paul Eggert wrote:
> However, mbiter's generality had a performance penalty.
>
> Some of the performance penalty is due to Gnulib's mbrtoc32 module
> replacing mbrtoc32 on glibc. As I understand it, this is due to glibc's
> mishandling of the C locale (it treats non-ASCII bytes as encoding
> errors). Such a bug should not affect diffutils, as diffutils uses
> mbrtoc32 only in multi-byte locales. So I'd like a way for diffutils to
> use the mbrtoc32 module without replacing mbrtoc32 on glibc. In the
> patch I just installed into diffutils on Savannah, this is done via a
> conditional "#undef mbrtoc32" (see attached) but this is a hack and
> there should be a better way.
>
> More of the performance penalty appears to be the mbiter module's
> support for arbitrary character encodings that don't happen in practice
I've added a benchmark of mbiter to gnulib, and removed a small
performance issue (mbsinit eating twice as much CPU time as needed).
The timings I see now are:
$ gltests/bench-mbiter abcdefghij 100000
Test    Time          What
----    ----          ----
Test a  user  0.653   ASCII text, C locale
Test b  user  0.618   ASCII text, UTF-8 locale
Test c  user  1.841   French text, C locale
Test d  user  1.487   French text, ISO-8859-1 locale
Test e  user  1.509   French text, UTF-8 locale
Test f  user 15.034   Greek text, C locale
Test g  user  9.708   Greek text, ISO-8859-7 locale
Test h  user  9.871   Greek text, UTF-8 locale
Test i  user  4.584   Chinese text, UTF-8 locale
Test j  user  4.747   Chinese text, GB18030 locale
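For orientation, the kind of per-character loop that such a benchmark exercises can be sketched as follows. This is not the code of bench-mbiter or of mbiter itself; count_chars is a hypothetical helper that calls the standard mbrtoc32 once per character, which is the "short invocation" pattern discussed below.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in the first LEN bytes of S,
   decoding one multibyte character per mbrtoc32 call in the current
   LC_CTYPE locale.  Bytes that cannot be decoded count as one
   character each.  */
static size_t
count_chars (const char *s, size_t len)
{
  mbstate_t state;
  size_t count = 0;
  size_t i = 0;

  memset (&state, 0, sizeof state);
  while (i < len)
    {
      char32_t ch;
      size_t n = mbrtoc32 (&ch, s + i, len - i, &state);

      if (n == (size_t) -1 || n == (size_t) -2)
        {
          /* Invalid or incomplete sequence: treat one byte as a
             character and restart from the initial state.  */
          memset (&state, 0, sizeof state);
          n = 1;
        }
      else if (n == (size_t) -3)
        n = 0;  /* Character produced from earlier input; no bytes used.  */
      else if (n == 0)
        n = 1;  /* A null character occupies one byte.  */

      i += n;
      count++;
    }
  return count;
}
```

Calling count_chars ("abcdefghij", 10) without a prior setlocale call runs in the C locale, roughly corresponding to test a.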
The performance problems that I see are:
- glibc's conversion functions are optimized for long sequences
(think of iconv()). They are not optimized for short invocations
(one multibyte character or less). This is a long-standing problem
that no one is working on.
- glibc's UTF-8 converter is very slow for texts with many non-ASCII
characters (tests b, e, h, i). I don't think we can do anything about it.
I think test h comes out twice as slow as test i because the same
text needs more characters in Greek than in Chinese
(every Hanzi character is worth 2 or more characters from an alphabet).
- In the C locale (tests a, c, f), conversions of bytes < 0x80 are
cheap, whereas conversions of bytes >= 0x80 are expensive, because
in this code path, glibc returns (size_t)-1 and mbrtoc32.c invokes
hard_locale.
Can we somehow reduce how often hard_locale needs to be called?
Or create a variant of mbrtoc32 that fetches the value of hard_locale
from some cache (maybe a __thread variable)?
Or can hard_locale itself be optimized (through dirty, glibc-specific
hacks)?
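The cache idea could look roughly like the sketch below. All names here are hypothetical: query_hard_locale is a simplified stand-in for gnulib's hard_locale (it only compares the LC_CTYPE locale name against "C" and "POSIX"), and _Thread_local is the C11 spelling of GCC's __thread.

```c
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* Simplified stand-in for gnulib's hard_locale (an assumption, not the
   real implementation): true unless LC_CTYPE is named "C" or "POSIX".  */
static bool
query_hard_locale (void)
{
  const char *name = setlocale (LC_CTYPE, NULL);
  return !(name != NULL
           && (strcmp (name, "C") == 0 || strcmp (name, "POSIX") == 0));
}

/* Per-thread cache: -1 = not yet computed, 0 = not hard, 1 = hard.  */
static _Thread_local int cached_hard = -1;

/* Return the cached answer, computing it on first use in this thread.  */
static bool
cached_hard_locale (void)
{
  if (cached_hard < 0)
    cached_hard = query_hard_locale ();
  return cached_hard != 0;
}

/* Must be called in this thread after the locale changes; otherwise
   the cached answer goes stale.  */
static void
invalidate_hard_locale_cache (void)
{
  cached_hard = -1;
}
```

The weak point of this sketch is invalidation: setlocale changes the locale for the whole process, so the caches of other threads would stay stale unless something resets them too.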
I do *not* see a performance problem with character encodings such as
ISO-8859-7 or GB18030 (tests g, j): the figures are comparable with UTF-8.
> I timed mbcel on the Emacs source code and it scanned the input
> significantly faster than mbiter did.
How can this be? The Emacs source code is mostly ASCII, and the figures
above (test a, b) show that for this case, mbiter is well optimized.
> I'm thinking that mbcel would be useful in Gnulib and in other GNU
> programs, and that we should create a mbcel module for it in Gnulib.
I would rather try to copy the worthwhile optimizations into mbiter and
mbuiter. The reason is that mbcel does not define a new abstraction; it
sits somewhere in between a standard mbrtoc32 and an mbiter_multi_next
invocation, and it would become more difficult to choose the right one
if there are three similar interfaces.
Candidates for optimization:
- The C locale handling
https://sourceware.org/bugzilla/show_bug.cgi?id=19932
https://sourceware.org/bugzilla/show_bug.cgi?id=29511
It's now a clear POSIX violation. Would it make sense to get this fixed
in glibc, so that gnulib's override can be dropped on future glibc
versions?
To me, that would seem like a better approach than to have applications
declare whether they insist on a POSIX compliant mbrtowc or not.
- Is a functional interface faster than one that gets a 'struct' passed
by reference? I would guess no, since gcc optimizes both cases well,
especially when inlining. But feel free to prove me wrong.
- Resetting an mbstate_t: Should we define a function
void mbszero (mbstate_t *);
that clears the relevant part of an mbstate_t (i.e. 24 bytes instead
of 128 bytes on BSD systems)?
Advantage: performance.
Drawback: Yet another gnulib-invented, nonstandard API.
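A minimal sketch of what such an mbszero could look like, under stated assumptions: MBSTATE_RELEVANT_SIZE is a hypothetical per-platform constant (for example the 24 bytes mentioned above for BSD systems), and it falls back to the full object size, which is always safe but gives no speedup.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical constant: the number of leading bytes of mbstate_t that
   the platform's conversion functions actually use.  The real value is
   platform-specific; the fallback clears the whole object.  */
#ifndef MBSTATE_RELEVANT_SIZE
# define MBSTATE_RELEVANT_SIZE sizeof (mbstate_t)
#endif

/* Put *PS into the initial conversion state by clearing only the
   relevant part of the object.  */
static inline void
mbszero (mbstate_t *ps)
{
  memset (ps, 0, MBSTATE_RELEVANT_SIZE);
}
```

A caller would use mbszero (&state) wherever it currently writes memset (&state, 0, sizeof state).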
Bruno