bug-coreutils

Re: horrible utf-8 performace in wc


From: Bruno Haible
Subject: Re: horrible utf-8 performace in wc
Date: Sat, 7 Jun 2008 00:43:30 +0200
User-agent: KMail/1.5.4

Pádraig Brady wrote:
> There have been some interesting "counting UTF-8 strings" threads
> over at reddit lately, all referenced from this article:
> http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html

But before these techniques can be used in practice in packages such as
coreutils, two problems would have to be solved satisfactorily:

  1) "George Pollard makes the assumption that the input string is valid UTF-8".
     This assumption cannot be upheld, as long as you use the same type
     ('char *') for UTF-8 encoded strings and normal C strings, or when
     you occasionally convert between one and the other.

     For example: Assume NAME is really a valid UTF-8 string.
     A program then does

       static char buf[20];
       snprintf (buf, sizeof (buf), "%s", NAME);
       utf8_strlen (buf);

     Boing! You already have a buffer overrun: the snprintf can truncate
     a UTF-8 character, and the utf8_strlen function then skips over the
     terminating NUL byte and scans buf[21...infinity], and likely crashes.
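
To make point 1 concrete, here is a minimal sketch of a counter that
tolerates the truncation. The names utf8_strlen_safe and
count_after_truncation are hypothetical, not taken from the article or
from coreutils. The sketch assumes that a truncated or otherwise invalid
sequence should simply count as one character; the per-byte checks it
needs are exactly what the fast variants leave out.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: counts UTF-8 characters in a NUL-terminated string
   without ever stepping past the terminating NUL.  A truncated or invalid
   sequence is counted as one character (an assumption of this sketch).  */
size_t
utf8_strlen_safe (const char *s)
{
  size_t count = 0;

  while (*s != '\0')
    {
      unsigned char c = (unsigned char) *s;
      /* Length announced by the lead byte; stray bytes count as length 1.  */
      size_t len = (c < 0x80 ? 1
                    : (c & 0xE0) == 0xC0 ? 2
                    : (c & 0xF0) == 0xE0 ? 3
                    : (c & 0xF8) == 0xF0 ? 4
                    : 1);
      size_t i = 1;

      /* Consume only continuation bytes that are really present.  A NUL
         byte is not a continuation byte, so the loop stops in front of it.  */
      while (i < len && ((unsigned char) s[i] & 0xC0) == 0x80)
        i++;

      s += i;
      count++;
    }
  return count;
}

/* Reproduces the scenario above: NAME is copied into a 20-byte buffer,
   possibly cutting its last character in half, and then counted.  */
size_t
count_after_truncation (const char *name)
{
  char buf[20];

  snprintf (buf, sizeof (buf), "%s", name);
  return utf8_strlen_safe (buf);
}
```

With a NAME of 17 ASCII bytes followed by a three-byte character, buf
ends in the first two bytes of that character, and the loop stops at the
NUL instead of running off the end of the buffer.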

  2) We already have the problem that we want to keep good performance when
     handling strings in the "C" locale or, more generally, in a unibyte locale.
     So we get code duplication:
       - code for unibyte locales,
       - code for multibyte locales that uses mbrtowc().
     If you want to optimize UTF-8 locales particularly, i.e. optimize away
     the function calls inherent in mbrtowc(), then we get code triplication:
       - code for unibyte locales,
       - code for UTF-8 locales,
       - code for multibyte locales other than UTF-8, that uses mbrtowc().
     So, code size increases, and the testing requirements increase as well.
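
The triplication in point 2 could look roughly like the sketch below.
All function names are hypothetical, and the locale test through
nl_langinfo (CODESET) assumes a POSIX system. The UTF-8 path counts every
byte that is not a continuation byte, so it relies on the input being
valid UTF-8, which is the very assumption questioned in point 1.

```c
#include <langinfo.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Unibyte locales: every byte is one character.  */
size_t
count_unibyte (const char *buf, size_t n)
{
  (void) buf;
  return n;
}

/* UTF-8 locales: count the bytes that are not continuation bytes.
   Assumes valid UTF-8; makes no function call per character.  */
size_t
count_utf8 (const char *buf, size_t n)
{
  size_t count = 0;

  for (size_t i = 0; i < n; i++)
    if (((unsigned char) buf[i] & 0xC0) != 0x80)
      count++;
  return count;
}

/* Other multibyte locales: let mbrtowc() decode each character.  */
size_t
count_mbrtowc (const char *buf, size_t n)
{
  mbstate_t state;
  size_t count = 0;

  memset (&state, 0, sizeof (state));
  while (n > 0)
    {
      size_t r = mbrtowc (NULL, buf, n, &state);

      if (r == (size_t) -2)       /* incomplete character at the end */
        break;
      if (r == (size_t) -1)       /* invalid byte: skip it, reset state */
        {
          memset (&state, 0, sizeof (state));
          r = 1;
        }
      else
        {
          if (r == 0)             /* an embedded NUL occupies one byte */
            r = 1;
          count++;
        }
      buf += r;
      n -= r;
    }
  return count;
}

/* The three-way dispatch: each branch has to be maintained and tested.  */
size_t
count_chars (const char *buf, size_t n)
{
  if (MB_CUR_MAX == 1)
    return count_unibyte (buf, n);
  if (strcmp (nl_langinfo (CODESET), "UTF-8") == 0)
    return count_utf8 (buf, n);
  return count_mbrtowc (buf, n);
}
```

A caller would invoke setlocale (LC_ALL, "") once at startup and then
call count_chars on each buffer; without that call the program stays in
the "C" locale and always takes the unibyte branch.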

Bruno





