[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Using libunistring for string comparisons et al
From: |
Mark H Weaver |
Subject: |
Re: Using libunistring for string comparisons et al |
Date: |
Fri, 18 Mar 2011 08:05:01 -0400 |
User-agent: |
Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux) |
Thien-Thi Nguyen <address@hidden> writes:
> () Mark H Weaver <address@hidden>
> () Thu, 17 Mar 2011 21:38:28 -0400
>
> If we may assume that the searched string is valid UTF-8, and when only
> ASCII characters are excluded (e.g. "."), then three additional states
> are required in the generated DFA. Let us call them S1, S2, and S3.
>
> [handling these states]
>
> When non-ASCII characters are excluded, additional states must be added:
> one for each unique prefix of the excluded multibyte characters. It's
> quite straightforward.
>
> I don't understand what "excluded" means here. Is this a property
> of the string, the regexp, the (dynamic, environmental) operation, or ...?
The excluded characters are the ones listed inside of a [^...] form.
For example, [^abc] excludes only the ASCII characters #\a, #\b, and
#\c. [^abcλ] excludes also the non-ASCII character #\λ.
"." is equivalent to a [^...] form that excludes only newlines.
I should also note that although UTF-8 causes a few complications when
compiling regexps, UTF-32 creates much worse problems that require a
different DFA data structure and a slower scan.
In the UTF-8 case, each state in the DFA can contain a fixed lookup
table with 256 entries (one for each byte) as is used for ASCII or
Latin1. However, in the UTF-32 case, it is _not_ practical for each
state to contain a lookup table with over 1 million entries (one for
each code point).
Mark
- Re: Using libunistring for string comparisons et al, (continued)
- Re: Using libunistring for string comparisons et al, Mark H Weaver, 2011/03/15
- Re: Using libunistring for string comparisons et al, Mike Gran, 2011/03/15
- Re: Using libunistring for string comparisons et al, Mark H Weaver, 2011/03/15
- Re: Using libunistring for string comparisons et al, Ludovic Courtès, 2011/03/16
- Re: Using libunistring for string comparisons et al, Mark H Weaver, 2011/03/17
- Re: Using libunistring for string comparisons et al, Ludovic Courtès, 2011/03/17
- Re: Using libunistring for string comparisons et al, Mark H Weaver, 2011/03/17
- Re: Using libunistring for string comparisons et al, Thien-Thi Nguyen, 2011/03/17
- Re: Using libunistring for string comparisons et al, Mark H Weaver, 2011/03/17
- Re: Using libunistring for string comparisons et al, Thien-Thi Nguyen, 2011/03/18
- Re: Using libunistring for string comparisons et al,
Mark H Weaver <=
- Re: Using libunistring for string comparisons et al, Ludovic Courtès, 2011/03/20
- Re: Using libunistring for string comparisons et al, Andy Wingo, 2011/03/30
- Re: Using libunistring for string comparisons et al, Ludovic Courtès, 2011/03/17
- Re: Using libunistring for string comparisons et al, Andy Wingo, 2011/03/19
- Re: Using libunistring for string comparisons et al, Mark H Weaver, 2011/03/19
- Re: Using libunistring for string comparisons et al, Noah Lavine, 2011/03/19
- Re: Using libunistring for string comparisons et al, Mark H Weaver, 2011/03/19
- Re: Using libunistring for string comparisons et al, Andy Wingo, 2011/03/19
- Re: Using libunistring for string comparisons et al, Mark H Weaver, 2011/03/19
- Re: Using libunistring for string comparisons et al, Mark H Weaver, 2011/03/19