guile-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Using libunistring for string comparisons et al


From: Mike Gran
Subject: Re: Using libunistring for string comparisons et al
Date: Sat, 12 Mar 2011 13:28:05 -0800 (PST)

> From:Mark H Weaver <address@hidden>

> 
> address@hidden (Ludovic Courtès) writes:
> > I find Cowan’s proposal for string iteration and the R6RS editors
> > response interesting:
> >
>http://www.r6rs.org/formal-comments/comment-235.txt
> 
> Cowan was proposing a complex new API.  I am not, nor did Gauche.
> An efficient implementation of string ports is all that is needed.
> 
> > I also think strings should remain what they currently are, with O(1)
> > random access.
> 
> I understand your position, and perhaps you are right.
> 
> Unfortunately, the alternatives are not pleasant.  We have a bunch of
> bugs in our string handling functions.  Currently, our case-insensitive
> string comparisons and case conversions are not correct for several
> languages including German, according to the R6RS among other things.

(Just as an aside, in discussions like this, I think it important to distinguish
between
- where Guile doesn't match R6RS
- where R6RS doesn't match Unicode
- and where Unicode doesn't match reality

Each of these battles needs to be fought in the proper battlefield.)

> We could easily fix these problems by using libunistring, which provides
> the operations we need, but only if we use a single string
> representation, and one that is supported by libunistring (UTF-8,
> UTF-16, or UTF-32).

We do, in a matter of speaking, have a single string representation: UTF-32.
The 'narrow' encoding is UTF-32 with the initial 3 bytes of zero removed.

> Our use of two different internal string representations is another
> problem.  Right now, our string comparisons are painfully inefficient.
> Take a look at compare_strings in srfi-13.c.  It's also broken w.r.t.
> case-insensitive comparisons.  In order to fix this and make it
> efficient, we'll need to make several different variants:
> 
>   * case-sensitive
>      * narrow-narrow
>      * narrow-wide
>      * wide-wide (use libunistring's u32_cmp2 for this)
> 
>   * case-insensitive
>      * narrow-narrow
>      * narrow-wide
>      * wide-wide (use libunistring for this)
> 
> The case-insensitive narrow-narrow comparison must be able to handle
> this, for example (from r6rs-lib):
> 
>   (string-ci=? "Straße" "Strasse") => #t
> 
> I'm not yet sure what's involved in implementing the case-insensitive
> narrow-wide case properly.

It is not too difficult in practice, I think.  Converting narrow (Latin-1
aka truncated UTF-32) to wide (UTF-32) involves adding back
in the missing zero bytes.

For the sake of history, here's how we got to where we are now.
- R6RS says characters are Unicode codepoints
- R6RS says string ops are O(1)
- The only Unicode encoding that uses codepoints as its atomic units and 
  is O(1) is UTF-32
- UTF-32 wastes space for most normal circumstances
Thus we invented this wide (UTF-32) narrow (UTF-32 with initial
zeros stripped) encoding scheme we have now.  It may seem
to be suboptimal in terms of complexity and familiarity, but, 
what you see is an attempt to be optimal in terms of memory and
O(1).

I actually at one point had a nearly complete version of Guile 1.8
that used UTF-8 and another that used UTF-32.  There are some
other reasons why UTF-8 is bad, which I could bore you with
ad naseum.

Thanks,

Mike




reply via email to

[Prev in Thread] Current Thread [Next in Thread]