Re: Using libunistring for string comparisons et al

guile-devel

From:	Mike Gran
Subject:	Re: Using libunistring for string comparisons et al
Date:	Wed, 16 Mar 2011 08:22:26 -0700 (PDT)

> From:Ludovic Courtès <address@hidden>

> > I know of two categories of bugs.  One has to do with case conversions
> > and case-insensitive comparisons, which must be done on entire strings
> > but are currently done for each character.  Here are some examples:
> >
> >   (string-upcase "Straße")         => "STRAßE"   (should be "STRASSE")
> >   (string-downcase "ΧΑΟΣΣ")        => "χαοσσ"    (should be "χαoσς")
> >   (string-downcase "ΧΑΟΣ Σ")       => "χαοσ σ"   (should be "χαoς σ")
> >   (string-ci=? "Straße" "Strasse") => #f         (should be #t)
> >   (string-ci=? "ΧΑΟΣ" "χαoσ")      => #f         (should be #t)
> 
> (Mike pointed out that SRFI-13 does not consider these bugs, but that’s
> linguistically wrong so I’d consider it a bug.  Note that all these
> functions are ‘linguistically buggy’ anyway since they don’t have a
> locale argument, which breaks with Turkish ‘İ’.)
> 
> Can we first check what would need to be done to fix this in 2.0.x?
> 
> At first glance:
> 
>   - “Straße” is normally stored as a Latin1 string, so it would need to
>     be converted to UTF-* before it can be passed to one of the
>     unicase.h functions.  *Or*, we could check with bug-libunistring
>     what it would take to add Latin1 string case mapping functions.
> 
>     Interestingly, ‘ß’ is the only Latin1 character that doesn’t have a
>     one-to-one case mapping.  All other Latin1 strings can be handled by
>     iterating over characters, as is currently done.

There is also the micro sign (U+00B5), which becomes a Greek small mu
(U+03BC) when case folded.  The result is still a single character, but
it is the only Latin-1 character whose case folding falls outside Latin-1.
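
To make the two special cases concrete, here is a minimal sketch in C of
per-character full case folding for Latin-1 input.  The helper name
latin1_fold_char is hypothetical; it is not part of Guile or libunistring,
and it is only meant to show where 'ß' and the micro sign depart from the
one-to-one, stay-in-Latin-1 pattern.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Guile or libunistring function).
   Case-folds one Latin-1 character into OUT as UTF-32 code points
   and returns how many code points were written (1 or 2).  */
static size_t
latin1_fold_char (unsigned char c, uint32_t out[2])
{
  if (c == 0xDF)                /* sharp s folds to two characters: "ss" */
    {
      out[0] = 's';
      out[1] = 's';
      return 2;
    }
  if (c == 0xB5)                /* micro sign folds to Greek small mu */
    {
      out[0] = 0x03BC;
      return 1;
    }
  /* ASCII A-Z and Latin-1 uppercase letters (0xC0-0xDE, except the
     multiplication sign 0xD7) fold to the character 0x20 above.  */
  if ((c >= 'A' && c <= 'Z') || (c >= 0xC0 && c <= 0xDE && c != 0xD7))
    {
      out[0] = (uint32_t) c + 0x20;
      return 1;
    }
  out[0] = c;                   /* everything else folds to itself */
  return 1;
}
```

Only the first two branches ever produce something that a narrow
(Latin-1) string of the same length cannot hold.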

> 
>     With this in mind, we could hack our way so that strings that
>     contain an ‘ß’ are stored as UTF-32 (yes, that’s a hack.)
> 
>   - For ‘string-downcase’, the Greek strings above are wide strings, so
>     they can be passed directly to u32_toupper & co.  For these, the fix
>     is almost two lines.
> 
>   - Case insensitive comparison is more difficult, as you already
>     pointed out.  To do it right we’d probably need to convert Latin1
>     strings to UTF-32 and then pass it to u32_casecmp.  We don’t have to
>     do the conversion every time, though: we could just change Latin1
>     strings in-place so they now point to a wide stringbuf upon the
>     first ‘string-ci=’.
> 
> Thoughts?

What about the SRFI-13 case-insensitive comparisons (the ones whose names
don't end in a question mark, like string-ci<)?  Should they keep the
per-character behavior that SRFI-13 specifies, or should they stay
consistent with the question-mark-terminated comparisons?

Mark is right that fixing this will not be pretty.  The case-insensitive
string comparisons, for example, could be patched like the attached
snippet.  If you don't find it too ugly an approach, I could work on
a real patch.
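
For readers without the attachment, the general idea of comparing on
folded code points rather than on raw characters can be sketched as
follows.  This is an illustration only, not the attached patch: the names
fold_iter, next_folded and latin1_ci_cmp are hypothetical, the input is
assumed to be two Latin-1 buffers, and the ordering is plain code-point
order after folding, with no locale or normalization handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical iterator over the case-folded form of a Latin-1 buffer.  */
struct fold_iter
{
  const unsigned char *s;
  size_t len, pos;
  uint32_t pending;             /* second 's' left over from a sharp s */
  int has_pending;
};

/* Store the next folded code point in *CP; return 0 at end of input.  */
static int
next_folded (struct fold_iter *it, uint32_t *cp)
{
  unsigned char c;

  if (it->has_pending)
    {
      *cp = it->pending;
      it->has_pending = 0;
      return 1;
    }
  if (it->pos >= it->len)
    return 0;
  c = it->s[it->pos++];
  if (c == 0xDF)                /* sharp s: emit 's' now, 's' next time */
    {
      *cp = 's';
      it->pending = 's';
      it->has_pending = 1;
    }
  else if (c == 0xB5)           /* micro sign: Greek small mu */
    *cp = 0x03BC;
  else if ((c >= 'A' && c <= 'Z') || (c >= 0xC0 && c <= 0xDE && c != 0xD7))
    *cp = (uint32_t) c + 0x20;
  else
    *cp = c;
  return 1;
}

/* Compare two Latin-1 buffers by their folded code points.
   Returns a negative value, zero, or a positive value.  */
static int
latin1_ci_cmp (const unsigned char *a, size_t alen,
               const unsigned char *b, size_t blen)
{
  struct fold_iter ia = { a, alen, 0, 0, 0 };
  struct fold_iter ib = { b, blen, 0, 0, 0 };

  for (;;)
    {
      uint32_t ca = 0, cb = 0;
      int ha = next_folded (&ia, &ca);
      int hb = next_folded (&ib, &cb);

      if (!ha || !hb)
        return ha - hb;         /* the shorter folded string sorts first */
      if (ca != cb)
        return ca < cb ? -1 : 1;
    }
}
```

With this sketch, the Latin-1 bytes for "Straße" and "STRASSE" compare as
equal, because the sharp s expands to two code points during folding
instead of being compared as a single character.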

Thanks,

Mike

strorder.c.patch
Description: Text Data
