Re: mbcel module for Gnulib?, incomplete multibyte sequences

bug-gnulib

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: mbcel module for Gnulib?, incomplete multibyte sequences

From:	Paul Eggert
Subject:	Re: mbcel module for Gnulib?, incomplete multibyte sequences
Date:	Thu, 27 Jul 2023 12:19:30 -0700
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0

On 2023-07-24 17:01, Bruno Haible wrote:

biterf-bench-tests mbrtoc32-regular", together
with your mbcel benchmark patch from
https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00064.html
modified to use mbszero() (since some of the assumptions regarding
mbstate_t turned out to be unsafe)

Since that patch I fixed mbcel to account for the unsafe assumptions youmentioned, so mbcel shouldn't need to use mbszero any more.

The only platforms where this might matter are BSDish platforms whichare lower priority for performance tuning anyway. That being said,performance seems to be OK on these platforms; see below.

The fact that here mbsinit shows up despite NDEBUG, indicates that you
haven't had mbrtoc32-regular included your testdir.

Ah, fair enough, I did the test over again with current Gnulib (commit042a7042ac582e86693326943979b737c237a214) by running:


  gnulib-tool --create-testdir --dir a \
    mbiterf-bench-tests mbuiterf-bench-tests mbrtoc32-regular

then applying the attached patch, and running:

  gltests/bench-mbcel abcdefghij 1000000

This was with Ubuntu 23.04 and circa-2021 Xeon W-1350. Here are the results:

       noop  mbiterf mbuiterf    mbcel   mbucel
 a    1.546    2.038    1.931    1.639    1.968
 b    1.545    2.016    1.936    1.705    1.969
 c    2.153    8.160    8.255    6.160    5.550
 d    2,153    6,692    6,686    6,322    5,520
 e    2.234    6.866    6.817    6.214    5.781
 f    1.843   95.276  104.270   58.667   60.902
 g    1,844   76,126   83,899   65,541   66,983
 h    3.297   75.727   83.735   64.261   65.686
 i    1.941   31.064   34.887   26.262   27.072
 j    1.310   29.620   33.698   24.751   25.550

In some sense these benchmarks are biased against mbiter, as one can seefurther performance improvements to it (e.g., glibc fixing the C locale)that are fairly easy whereas mbcel will be harder to tune (e.g., improveGCC's code generation on x86-64).

But in another sense these benchmarks are biased against mbcel, as thisbenchmark compiles in a special environment that defines NDEBUG, whichmost Gnulib-using code doesn't do, and that gives mbiter an unfairadvantage.

Furthermore, microbenchmarks like these don't capture the issue ofmbiter/mbcel in more realistic applications, where other factors comeinto play. In a larger application mbiter's larger code footprint ismore likely to exhaust the instruction cache, which thesemicrobenchmarks don't measure.

Regarding the "is faster", I guess that Solaris and *BSD platforms will
comes out faster with mbu?iterf.

I guess they're about the same in practice. Here is the same benchmark(except with a smaller count, 100000), run on cfarm106.cfarm.net, whichis Oracle Solaris 11.4 running on a circa 2011 SPARC T4-2.


       noop  mbiterf mbuiterf    mbcel   mbucel
 a    0.281    0.722    0.833    0.831    0.827
 b    0.280    0.701    0.826    0.822    0.826
 c    0.391    3.130    3.926    3.324    3.430
 d    0,391    3,470    4,306    3,678    3,791
 e    0.405    3.943    4.899    4.102    4.259
 f    0.338   38.610   49.275   38.868   40.131
 Skipping test: locale el_GR.ISO-8859-7 not installed.
 h    0.597   53.185   67.347   53.443   54.541
 i    0.352   23.186   29.628   23.484   23.707
 j    0.237   25.194   30.781   25.286   25.826

With this in mind mbcel performance should be good enough on thesesecondary porting targets. As mbcel usage is easier to understand andless likely to have screwups, it seems like a win.

   mbscasecmp (...)
   {
     #if _GL_MBSTATE_ZERO_SIZE >= 12 /* BSD, Solaris */
     ... mbiterf based implementation ...
     #else
     ... mbcel based implementation ...
     #endif
   }

Now, regarding the claim "the two versions have the same behavior":

Can you prove it? I mean, mathematically prove it?

Oh, you're right, Although I had assumed a single-byte locale or UTF-8,when inputs contain encoding errors the two implementations are notequivalent for multi-byte encodings like GB18030 that are not ASCII-safe.

Since there is a difference between the two, I prefer mbcel'ssingle-byte per encoding-error interpretation (let's call it "SEE") tombiter's multi-byte-per-encoding-error interpretation ("MEE"), as SEE issimpler and is common practice elsewhere, notably Emacs. For example:

printf -- '-*- coding: gb18030 -*-\n\201\302\n\20109\n\2010\2119\n'>GB18030.txt

  LC_ALL=zh_CN.gb18030 emacs GB18030.txt

printf -- '-*- coding: utf-8 -*-\n\344\274\256\n\20109\n\303\242\n'>UTF-8.txt

  LC_ALL=zh_CN.utf8 emacs UTF-8.txt

Both instances of Emacs display this:

  伮
  \20109
  â

where the first line is U+4F2E CJK IDEOGRAPH-4F2E, the second line's"\201" is colored differently to represent an encoding error, and thelast line is U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX. Emacs here istaking the SEE interpretation, not the MEE interpretation where (inGB18030 only) the \201 byte followed by an ASCII "0" and ASCII "9"should be treated as an encoding error followed by "9", not as anencoding error followed by "0" and then "9".

The SEE/MEE distinction is a relatively minor issue in practice, asthere is no difference in UTF-8, nowadays few people use legacymulti-byte encodings, and these few people don't care what mbscasecmpdoes with encoding errors so long as it is a valid ordering relation(which is true for both SEE and MEE).

For applications that need MEE, it'd be easy for mbcel to supply it.Something like the following patch would do it, though it should be adifferent function (e.g., mbcel_scanmee) instead of being a change tombcel_scan itself. However, diffutils doesn't need to bother with such achange as it can use SEE like most other programs do.


 --- a/lib/mbcel.h
 +++ b/lib/mbcel.h
 @@ -191,3 +191,3 @@ mbcel_scan (char const *p, char const *lim)
    if (_GL_UNLIKELY ((size_t) -1 / 2 < len))
 -    return (mbcel_t) { .err = *p, .len = 1 };

+ return (mbcel_t) { .err = *p, .len = len == (size_t) -2 ? lim - p: 1 };

y.diff
Description: Text Data

[Prev in Thread]

Current Thread

[Next in Thread]

Re: mbcel module for Gnulib?, incomplete multibyte sequences, (continued)

Prev by Date: Re: fnmatch: Overcome wchar_t limitations
Next by Date: Re: fnmatch: Overcome wchar_t limitations
Previous by thread: Re: mbcel module for Gnulib?, incomplete multibyte sequences
Next by thread: Re: mbcel module for Gnulib?, incomplete multibyte sequences
Index(es):
- Date
- Thread