From: Bruno Haible
Subject: Re: horrible utf-8 performance in wc
Date: Sat, 7 Jun 2008 00:43:30 +0200
User-agent: KMail/1.5.4
Pádraig Brady wrote:
> There have been some interesting "counting UTF-8 strings" threads
> over at reddit lately, all referenced from this article:
> http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
But before these techniques can be used in practice in packages such as
coreutils, two problems would have to be solved satisfactorily:
1) "George Pollard makes the assumption that the input string is valid UTF-8".
This assumption cannot be upheld, as long as you use the same type
('char *') for UTF-8 encoded strings and normal C strings, or when
you occasionally convert between one and the other.
For example: Assume NAME is really a valid UTF-8 string.
A program then does
    static char buf[20];
    snprintf (buf, sizeof buf, "%s", NAME);
    utf8_strlen (buf);
Boing! You already have a buffer overrun: the snprintf can truncate a
UTF-8 character, and the utf8_strlen function then skips over the
terminating NUL byte and scans buf[20...infinity], and likely crashes.
2) We already have the problem that we want to keep good performance when
handling strings in the "C" locale or, more generally, in a unibyte locale.
So we get code duplication:
- code for unibyte locales,
- code for multibyte locales that uses mbrtowc().
If you want to optimize UTF-8 locales particularly, i.e. optimize away
the function calls inherent in mbrtowc(), then we get code triplication:
- code for unibyte locales,
- code for UTF-8 locales,
- code for multibyte locales other than UTF-8, which uses mbrtowc().
So, code size increases, and the testing requirements increase as well.
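To illustrate the triplication concretely, the dispatch would look
roughly like the following sketch. The function names and the per-locale
counting routines are purely illustrative (they are not coreutils or
gnulib code); the point is that each branch needs its own implementation
and its own tests.

    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <wchar.h>

    /* Unibyte locales: every byte is one character.  */
    static size_t
    count_unibyte (const char *buf, size_t len)
    {
      return len;
    }

    /* UTF-8 locales: count the bytes that are not continuation bytes.  */
    static size_t
    count_utf8 (const char *buf, size_t len)
    {
      size_t n = 0;
      for (size_t i = 0; i < len; i++)
        n += ((unsigned char) buf[i] & 0xC0) != 0x80;
      return n;
    }

    /* Other multibyte locales: the generic path through mbrtowc().  */
    static size_t
    count_multibyte (const char *buf, size_t len)
    {
      size_t n = 0;
      mbstate_t st;
      memset (&st, 0, sizeof st);
      while (len > 0)
        {
          wchar_t wc;
          size_t k = mbrtowc (&wc, buf, len, &st);
          if (k == (size_t) -1 || k == (size_t) -2)
            break;            /* invalid or incomplete sequence */
          if (k == 0)
            k = 1;            /* embedded NUL byte */
          buf += k;
          len -= k;
          n++;
        }
      return n;
    }

    static size_t
    count_chars (const char *buf, size_t len)
    {
      if (MB_CUR_MAX == 1)
        return count_unibyte (buf, len);
      if (strcmp (nl_langinfo (CODESET), "UTF-8") == 0)
        return count_utf8 (buf, len);
      return count_multibyte (buf, len);
    }

    int
    main (void)
    {
      setlocale (LC_ALL, "");
      const char *s = "example";
      printf ("%zu\n", count_chars (s, strlen (s)));
      return 0;
    }

Note that even this skeleton's branches already disagree on how invalid
input is counted (count_utf8 counts lead bytes regardless, while
count_multibyte stops at the first invalid sequence), which is exactly
the kind of divergence that inflates the testing requirements.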
Bruno