bug-gnulib
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: mbcel module for Gnulib?, incomplete multibyte sequences


From: Paul Eggert
Subject: Re: mbcel module for Gnulib?, incomplete multibyte sequences
Date: Thu, 27 Jul 2023 12:19:30 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0

On 2023-07-24 17:01, Bruno Haible wrote:
biterf-bench-tests mbrtoc32-regular", together
with your mbcel benchmark patch from
https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00064.html
modified to use mbszero() (since some of the assumptions regarding
mbstate_t turned out to be unsafe)

Since that patch I fixed mbcel to account for the unsafe assumptions you mentioned, so mbcel shouldn't need to use mbszero any more.

The only platforms where this might matter are BSDish platforms which are lower priority for performance tuning anyway. That being said, performance seems to be OK on these platforms; see below.


The fact that here mbsinit shows up despite NDEBUG, indicates that you
haven't had mbrtoc32-regular included your testdir.

Ah, fair enough, I did the test over again with current Gnulib (commit 042a7042ac582e86693326943979b737c237a214) by running:

  gnulib-tool --create-testdir --dir a \
    mbiterf-bench-tests mbuiterf-bench-tests mbrtoc32-regular

then applying the attached patch, and running:

  gltests/bench-mbcel abcdefghij 1000000

This was with Ubuntu 23.04 and circa-2021 Xeon W-1350. Here are the results:

       noop  mbiterf mbuiterf    mbcel   mbucel
 a    1.546    2.038    1.931    1.639    1.968
 b    1.545    2.016    1.936    1.705    1.969
 c    2.153    8.160    8.255    6.160    5.550
 d    2,153    6,692    6,686    6,322    5,520
 e    2.234    6.866    6.817    6.214    5.781
 f    1.843   95.276  104.270   58.667   60.902
 g    1,844   76,126   83,899   65,541   66,983
 h    3.297   75.727   83.735   64.261   65.686
 i    1.941   31.064   34.887   26.262   27.072
 j    1.310   29.620   33.698   24.751   25.550

In some sense these benchmarks are biased against mbiter, as one can see further performance improvements to it (e.g., glibc fixing the C locale) that are fairly easy whereas mbcel will be harder to tune (e.g., improve GCC's code generation on x86-64).

But in another sense these benchmarks are biased against mbcel, as this benchmark compiles in a special environment that defines NDEBUG, which most Gnulib-using code doesn't do, and that gives mbiter an unfair advantage.

Furthermore, microbenchmarks like these don't capture the issue of mbiter/mbcel in more realistic applications, where other factors come into play. In a larger application mbiter's larger code footprint is more likely to exhaust the instruction cache, which these microbenchmarks don't measure.


Regarding the "is faster", I guess that Solaris and *BSD platforms will
comes out faster with mbu?iterf.

I guess they're about the same in practice. Here is the same benchmark (except with a smaller count, 100000), run on cfarm106.cfarm.net, which is Oracle Solaris 11.4 running on a circa 2011 SPARC T4-2.

       noop  mbiterf mbuiterf    mbcel   mbucel
 a    0.281    0.722    0.833    0.831    0.827
 b    0.280    0.701    0.826    0.822    0.826
 c    0.391    3.130    3.926    3.324    3.430
 d    0,391    3,470    4,306    3,678    3,791
 e    0.405    3.943    4.899    4.102    4.259
 f    0.338   38.610   49.275   38.868   40.131
 Skipping test: locale el_GR.ISO-8859-7 not installed.
 h    0.597   53.185   67.347   53.443   54.541
 i    0.352   23.186   29.628   23.484   23.707
 j    0.237   25.194   30.781   25.286   25.826

With this in mind mbcel performance should be good enough on these secondary porting targets. As mbcel usage is easier to understand and less likely to have screwups, it seems like a win.


   mbscasecmp (...)
   {
     #if _GL_MBSTATE_ZERO_SIZE >= 12 /* BSD, Solaris */
     ... mbiterf based implementation ...
     #else
     ... mbcel based implementation ...
     #endif
   }

Now, regarding the claim "the two versions have the same behavior":

Can you prove it? I mean, mathematically prove it?
Oh, you're right, Although I had assumed a single-byte locale or UTF-8, when inputs contain encoding errors the two implementations are not equivalent for multi-byte encodings like GB18030 that are not ASCII-safe.

Since there is a difference between the two, I prefer mbcel's single-byte per encoding-error interpretation (let's call it "SEE") to mbiter's multi-byte-per-encoding-error interpretation ("MEE"), as SEE is simpler and is common practice elsewhere, notably Emacs. For example:

printf -- '-*- coding: gb18030 -*-\n\201\302\n\20109\n\2010\2119\n' >GB18030.txt
  LC_ALL=zh_CN.gb18030 emacs GB18030.txt
printf -- '-*- coding: utf-8 -*-\n\344\274\256\n\20109\n\303\242\n' >UTF-8.txt
  LC_ALL=zh_CN.utf8 emacs UTF-8.txt

Both instances of Emacs display this:

  伮
  \20109
  â

where the first line is U+4F2E CJK IDEOGRAPH-4F2E, the second line's "\201" is colored differently to represent an encoding error, and the last line is U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX. Emacs here is taking the SEE interpretation, not the MEE interpretation where (in GB18030 only) the \201 byte followed by an ASCII "0" and ASCII "9" should be treated as an encoding error followed by "9", not as an encoding error followed by "0" and then "9".

The SEE/MEE distinction is a relatively minor issue in practice, as there is no difference in UTF-8, nowadays few people use legacy multi-byte encodings, and these few people don't care what mbscasecmp does with encoding errors so long as it is a valid ordering relation (which is true for both SEE and MEE).

For applications that need MEE, it'd be easy for mbcel to supply it. Something like the following patch would do it, though it should be a different function (e.g., mbcel_scanmee) instead of being a change to mbcel_scan itself. However, diffutils doesn't need to bother with such a change as it can use SEE like most other programs do.

 --- a/lib/mbcel.h
 +++ b/lib/mbcel.h
 @@ -191,3 +191,3 @@ mbcel_scan (char const *p, char const *lim)
    if (_GL_UNLIKELY ((size_t) -1 / 2 < len))
 -    return (mbcel_t) { .err = *p, .len = 1 };
+ return (mbcel_t) { .err = *p, .len = len == (size_t) -2 ? lim - p : 1 };

Attachment: y.diff
Description: Text Data


reply via email to

[Prev in Thread] Current Thread [Next in Thread]