guile-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Using libunistring for string comparisons et al


From: Mark H Weaver
Subject: Re: Using libunistring for string comparisons et al
Date: Thu, 17 Mar 2011 21:38:28 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux)

Thien-Thi Nguyen <address@hidden> writes:
> In unibyte land, "." matches a byte.  OK.
>
> In multibyte land done "bytewise", "." matches ____________.
> (What goes in the blank?)

"." (and more generally [^...]) is equivalent to (a|b|c|d|...) where
every valid UTF-8 character is present in the disjunction except for the
excluded characters.  However, this case does indeed require special
consideration for the sake of efficiency in the regexp compiler.

If we may assume that the searched string is valid UTF-8, and when only
ASCII characters are excluded (e.g. "."), then three additional states
are required in the generated DFA.  Let us call them S1, S2, and S3.  S1
accepts one arbitrary byte, i.e. it has an outgoing edge for every byte
which leads to the final state.  S2 accepts any two bytes, i.e. it has
an outgoing edge for every byte which leads to S1.  Similarly for S3.

Now, the start state of "." (or [^...]) has outgoing edges for every
byte from 0x80 to 0xFF, leading to S1, S2, or S3, depending on the high
bits of the first byte.  Specifically, byte values 0xC0 through 0xDF
lead to S1, 0xE0 through 0xEF lead to S2, and 0xF0 through 0xF7 lead to
S3.

When non-ASCII characters are excluded, additional states must be added:
one for each unique prefix of the excluded multibyte characters.  It's
quite straightforward.

   Regards,
     Mark



reply via email to

[Prev in Thread] Current Thread [Next in Thread]