[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: mbcel module for Gnulib?, incomplete multibyte sequences
From: |
Paul Eggert |
Subject: |
Re: mbcel module for Gnulib?, incomplete multibyte sequences |
Date: |
Thu, 27 Jul 2023 12:19:30 -0700 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0 |
On 2023-07-24 17:01, Bruno Haible wrote:
biterf-bench-tests mbrtoc32-regular", together
with your mbcel benchmark patch from
https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00064.html
modified to use mbszero() (since some of the assumptions regarding
mbstate_t turned out to be unsafe)
Since that patch I fixed mbcel to account for the unsafe assumptions you
mentioned, so mbcel shouldn't need to use mbszero any more.
The only platforms where this might matter are BSDish platforms which
are lower priority for performance tuning anyway. That being said,
performance seems to be OK on these platforms; see below.
The fact that here mbsinit shows up despite NDEBUG, indicates that you
haven't had mbrtoc32-regular included your testdir.
Ah, fair enough, I did the test over again with current Gnulib (commit
042a7042ac582e86693326943979b737c237a214) by running:
gnulib-tool --create-testdir --dir a \
mbiterf-bench-tests mbuiterf-bench-tests mbrtoc32-regular
then applying the attached patch, and running:
gltests/bench-mbcel abcdefghij 1000000
This was with Ubuntu 23.04 and circa-2021 Xeon W-1350. Here are the results:
noop mbiterf mbuiterf mbcel mbucel
a 1.546 2.038 1.931 1.639 1.968
b 1.545 2.016 1.936 1.705 1.969
c 2.153 8.160 8.255 6.160 5.550
d 2,153 6,692 6,686 6,322 5,520
e 2.234 6.866 6.817 6.214 5.781
f 1.843 95.276 104.270 58.667 60.902
g 1,844 76,126 83,899 65,541 66,983
h 3.297 75.727 83.735 64.261 65.686
i 1.941 31.064 34.887 26.262 27.072
j 1.310 29.620 33.698 24.751 25.550
In some sense these benchmarks are biased against mbiter, as one can see
further performance improvements to it (e.g., glibc fixing the C locale)
that are fairly easy whereas mbcel will be harder to tune (e.g., improve
GCC's code generation on x86-64).
But in another sense these benchmarks are biased against mbcel, as this
benchmark compiles in a special environment that defines NDEBUG, which
most Gnulib-using code doesn't do, and that gives mbiter an unfair
advantage.
Furthermore, microbenchmarks like these don't capture the issue of
mbiter/mbcel in more realistic applications, where other factors come
into play. In a larger application mbiter's larger code footprint is
more likely to exhaust the instruction cache, which these
microbenchmarks don't measure.
Regarding the "is faster", I guess that Solaris and *BSD platforms will
comes out faster with mbu?iterf.
I guess they're about the same in practice. Here is the same benchmark
(except with a smaller count, 100000), run on cfarm106.cfarm.net, which
is Oracle Solaris 11.4 running on a circa 2011 SPARC T4-2.
noop mbiterf mbuiterf mbcel mbucel
a 0.281 0.722 0.833 0.831 0.827
b 0.280 0.701 0.826 0.822 0.826
c 0.391 3.130 3.926 3.324 3.430
d 0,391 3,470 4,306 3,678 3,791
e 0.405 3.943 4.899 4.102 4.259
f 0.338 38.610 49.275 38.868 40.131
Skipping test: locale el_GR.ISO-8859-7 not installed.
h 0.597 53.185 67.347 53.443 54.541
i 0.352 23.186 29.628 23.484 23.707
j 0.237 25.194 30.781 25.286 25.826
With this in mind mbcel performance should be good enough on these
secondary porting targets. As mbcel usage is easier to understand and
less likely to have screwups, it seems like a win.
mbscasecmp (...)
{
#if _GL_MBSTATE_ZERO_SIZE >= 12 /* BSD, Solaris */
... mbiterf based implementation ...
#else
... mbcel based implementation ...
#endif
}
Now, regarding the claim "the two versions have the same behavior":
Can you prove it? I mean, mathematically prove it?
Oh, you're right, Although I had assumed a single-byte locale or UTF-8,
when inputs contain encoding errors the two implementations are not
equivalent for multi-byte encodings like GB18030 that are not ASCII-safe.
Since there is a difference between the two, I prefer mbcel's
single-byte per encoding-error interpretation (let's call it "SEE") to
mbiter's multi-byte-per-encoding-error interpretation ("MEE"), as SEE is
simpler and is common practice elsewhere, notably Emacs. For example:
printf -- '-*- coding: gb18030 -*-\n\201\302\n\20109\n\2010\2119\n'
>GB18030.txt
LC_ALL=zh_CN.gb18030 emacs GB18030.txt
printf -- '-*- coding: utf-8 -*-\n\344\274\256\n\20109\n\303\242\n'
>UTF-8.txt
LC_ALL=zh_CN.utf8 emacs UTF-8.txt
Both instances of Emacs display this:
伮
\20109
â
where the first line is U+4F2E CJK IDEOGRAPH-4F2E, the second line's
"\201" is colored differently to represent an encoding error, and the
last line is U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX. Emacs here is
taking the SEE interpretation, not the MEE interpretation where (in
GB18030 only) the \201 byte followed by an ASCII "0" and ASCII "9"
should be treated as an encoding error followed by "9", not as an
encoding error followed by "0" and then "9".
The SEE/MEE distinction is a relatively minor issue in practice, as
there is no difference in UTF-8, nowadays few people use legacy
multi-byte encodings, and these few people don't care what mbscasecmp
does with encoding errors so long as it is a valid ordering relation
(which is true for both SEE and MEE).
For applications that need MEE, it'd be easy for mbcel to supply it.
Something like the following patch would do it, though it should be a
different function (e.g., mbcel_scanmee) instead of being a change to
mbcel_scan itself. However, diffutils doesn't need to bother with such a
change as it can use SEE like most other programs do.
--- a/lib/mbcel.h
+++ b/lib/mbcel.h
@@ -191,3 +191,3 @@ mbcel_scan (char const *p, char const *lim)
if (_GL_UNLIKELY ((size_t) -1 / 2 < len))
- return (mbcel_t) { .err = *p, .len = 1 };
+ return (mbcel_t) { .err = *p, .len = len == (size_t) -2 ? lim - p
: 1 };
y.diff
Description: Text Data
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, (continued)
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Paul Eggert, 2023/07/21
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Bruno Haible, 2023/07/21
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Paul Eggert, 2023/07/21
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Bruno Haible, 2023/07/24
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Paul Eggert, 2023/07/25
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Paul Eggert, 2023/07/22
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Bruno Haible, 2023/07/24
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Paul Eggert, 2023/07/24
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Bruno Haible, 2023/07/24
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Bruno Haible, 2023/07/24
- Re: mbcel module for Gnulib?, incomplete multibyte sequences,
Paul Eggert <=
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Paul Eggert, 2023/07/28
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Paul Eggert, 2023/07/26
- Re: From wchar_t to char32_t, Bruno Haible, 2023/07/10
- Re: From wchar_t to char32_t, Paul Eggert, 2023/07/11
- Re: From wchar_t to char32_t, Bruno Haible, 2023/07/11
- Re: From wchar_t to char32_t, Paul Eggert, 2023/07/11
- Re: From wchar_t to char32_t, Bruno Haible, 2023/07/13
- Re: From wchar_t to char32_t, Paul Eggert, 2023/07/13
- Re: From wchar_t to char32_t, new module mbszero, Bruno Haible, 2023/07/16
- Re: From wchar_t to char32_t, new module mbszero, Paul Eggert, 2023/07/16