guile-devel

From: Mark H Weaver
Subject: Re: Using libunistring for string comparisons et al
Date: Thu, 17 Mar 2011 13:58:42 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux)

address@hidden (Ludovic Courtès) writes:
>> We keep wide (UTF-32) stringbufs as-is, but we change narrow stringbufs
>> to UTF-8, along with a flag that indicates whether it is known to be
>> ASCII-only.
>
> The whole point of the narrow/wide distinction was to avoid
> variable-width encodings.  In addition, we’d end up with 3 cases (ASCII,
> UTF-8, or UTF-32) instead of 2, which seems quite complex to me.

Most functions would not care about the known_ascii_only flag, so really
it's just two cases.  (As you know, I'd prefer to have only one case).

> What do you think of moving to narrow = ASCII, as I suggested earlier?

The problem is that mixed narrow/wide cases would be much more common in
your scheme, and none of us has a good solution for them.  All the
solutions that handle those cases efficiently involve an unacceptable
increase in code complexity.  In your scheme, a large number of common
operations would require widening strings, which costs both space and
time for exactly the operations that matter most.

You may not realize the extent to which UTF-8's special properties
mostly eliminate the usual disadvantages of variable-width encodings.
Please allow me to explain how the most common string operations can be
implemented on UTF-8 strings.

* case-sensitive string comparison: Scan for the first unequal byte.
  The scan can be done bytewise, without paying any attention to UTF-8
  encoding.  Then you figure out how the differing characters compare to
  one another, which can be done in constant time.  Anyway, this is
  implemented by libunistring, and not our concern.

* case-insensitive string comparison: Same as above, but you might find
  that the differing characters are in the same equivalence class, and
  thus might have to restart the scan.  Anyway, this is implemented by
  libunistring, and not our concern.

* substring search: This can be implemented bytewise, exactly as if it
  was a fixed-width encoding.

* regexp search: The search itself can be implemented bytewise, exactly
  as if it was a fixed-width encoding.  Compiling the regexp can
  _almost_ be implemented as if the UTF-8-encoded regexp was in a
  fixed-width encoding, with just one added complication: a multibyte
  character followed by `*', `?' etc, must be compiled in such a way
  that the suffix operator applies to the whole character, and not just
  its final byte.  (In practice, it's probably more straightforward to
  handling compiling somewhat differently than outlined here, but you
  get the idea).

* parsing: Similarly, the parsing itself can be done bytewise, but
  compiling the grammar may require some considerations similar to the
  ones needed for compiling regexps.

Can you think of a real-world example where the variable-width encoding
of UTF-8 causes problems?

    Best,
     Mark


