Re: Using libunistring for string comparisons et al
From: Mike Gran
Subject: Re: Using libunistring for string comparisons et al
Date: Thu, 17 Mar 2011 11:07:58 -0700 (PDT)
> From: Ludovic Courtès <address@hidden>
> >> Can we first check what would need to be done to fix this in 2.0.x?
> >>
> >> At first glance:
> >>
> >> - “Straße” is normally stored as a Latin1 string, so it would need to
> >> be converted to UTF-* before it can be passed to one of the
> >> unicase.h functions. *Or*, we could check with bug-libunistring
> >> what it would take to add Latin1 string case mapping functions.
> >>
> >> Interestingly, ‘ß’ is the only Latin1 character that doesn’t have a
> >> one-to-one case mapping. All other Latin1 strings can be handled by
> >> iterating over characters, as is currently done.
> >
> > There is the micro sign, which, when case folded, becomes a Greek mu.
> > It is still a single character, but it is the only Latin-1 character that,
> > when folded, becomes a non-Latin-1 character.
>
> Blech.
>
> It would have worked better with narrow == ASCII instead of
> narrow == Latin1. It’s a change we can still make, I think.
It would be easy enough to do. If someone were to fight for
a narrow encoding of Latin-1, I would expect it to be you, since
you're the only committer whose name requires ISO-8859-1.
So if you're okay with it, who am I to complain?
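To make the two Latin-1 special cases from the quoted text concrete, here is a minimal sketch of simple (one-to-one) case folding for a single Latin-1 code unit. The function names `latin1_simple_fold` and `latin1_fold_stays_narrow` are hypothetical and are not part of Guile or libunistring; the sketch only illustrates why ‘ß’ and the micro sign break a per-character loop over a narrow string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simple (one-to-one) case folding of one Latin-1 code unit.
   Returns the folded code point, which may lie outside Latin-1.
   Hypothetical helper, not a Guile or libunistring function. */
static uint32_t
latin1_simple_fold (unsigned char c)
{
  if (c >= 'A' && c <= 'Z')
    return (uint32_t) c + 0x20;
  if (c >= 0xC0 && c <= 0xDE && c != 0xD7)  /* U+00C0..U+00DE, skipping the multiplication sign */
    return (uint32_t) c + 0x20;
  if (c == 0xB5)                            /* micro sign folds to Greek small mu */
    return 0x03BC;
  return c;  /* includes U+00DF sharp s, whose full folding "ss" is two characters */
}

/* Nonzero if every character folds one-to-one to a Latin-1 code point,
   i.e. a per-character loop over a narrow buffer would be sufficient. */
static int
latin1_fold_stays_narrow (const unsigned char *s, size_t len)
{
  assert (s != NULL || len == 0);
  for (size_t i = 0; i < len; i++)
    if (s[i] == 0xDF || latin1_simple_fold (s[i]) > 0xFF)
      return 0;
  return 1;
}
```

Under this sketch, a string such as “Straße” or one containing the micro sign would have to take the slower path, while every other Latin-1 string could be folded in place.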
>
> >> - Case insensitive comparison is more difficult, as you already
> >> pointed out. To do it right we’d probably need to convert Latin1
> >> strings to UTF-32 and then pass them to u32_casecmp. We don’t have to
> >> do the conversion every time, though: we could just change Latin1
> >> strings in-place so they now point to a wide stringbuf upon the
> >> first ‘string-ci=’.
> >>
> >> Thoughts?
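The in-place widening idea in the quoted bullet could look roughly like the sketch below. The `demo_string` struct and `demo_string_widen` function are hypothetical stand-ins and do not reflect Guile's actual stringbuf layout; the only fact relied on is that a Latin-1 byte value equals its Unicode code point, so widening is a plain copy.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical string representation; NOT Guile's actual stringbuf. */
struct demo_string
{
  size_t len;
  int is_wide;            /* 0: `narrow` holds Latin-1, 1: `wide` holds UTF-32 */
  unsigned char *narrow;  /* heap-allocated, owned by the struct */
  uint32_t *wide;         /* heap-allocated, owned by the struct */
};

/* Switch a narrow string to a wide buffer in place.
   Returns 0 on success (or if already wide), -1 on allocation failure. */
static int
demo_string_widen (struct demo_string *s)
{
  assert (s != NULL);
  if (s->is_wide)
    return 0;

  /* Allocate at least one element so an empty string still gets a buffer. */
  uint32_t *w = malloc ((s->len ? s->len : 1) * sizeof *w);
  if (w == NULL)
    return -1;

  for (size_t i = 0; i < s->len; i++)
    w[i] = s->narrow[i];  /* Latin-1 byte value == Unicode code point */

  free (s->narrow);
  s->narrow = NULL;
  s->wide = w;
  s->is_wide = 1;
  return 0;
}
```

After the first widening, later case-insensitive comparisons would find the string already wide and could hand `s->wide` and `s->len` to a UTF-32 routine such as u32_casecmp without converting again.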
> >
[...]
>
> Indeed it’s quite inelegant. ;-)
>
> How about changing to narrow == ASCII and then string comparison would
> be:
>
> if (narrow (s1) != narrow (s2))
> {
It would be easier and cleaner, as you demonstrate.
I guess the question is about future-proofing. If the complications with the
Latin-1 / UTF-32 dual encoding are constrained to upcase/downcase and string-ci
comparison ops, then it doesn't seem worth it to change it. But if it is going
to cause endless problems down the road, ASCII/UTF-32 is simpler.
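
As an illustration of why narrow == ASCII is the simpler case, a case-insensitive comparison of two ASCII-only buffers needs no table and never leaves the narrow encoding. This is only a sketch under that assumption; `ascii_fold` and `ascii_casecmp` are hypothetical names, not existing Guile functions, and this is not a completion of the truncated snippet quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Fold one ASCII byte to lower case; non-letters are returned unchanged. */
static unsigned char
ascii_fold (unsigned char c)
{
  return (c >= 'A' && c <= 'Z') ? (unsigned char) (c + 0x20) : c;
}

/* Case-insensitive comparison of two ASCII-only buffers.
   Returns -1, 0 or 1, ordering by folded byte value and then by length. */
static int
ascii_casecmp (const unsigned char *a, size_t alen,
               const unsigned char *b, size_t blen)
{
  assert ((a != NULL || alen == 0) && (b != NULL || blen == 0));
  size_t n = alen < blen ? alen : blen;

  for (size_t i = 0; i < n; i++)
    {
      unsigned char ca = ascii_fold (a[i]);
      unsigned char cb = ascii_fold (b[i]);
      if (ca != cb)
        return ca < cb ? -1 : 1;
    }

  if (alen == blen)
    return 0;
  return alen < blen ? -1 : 1;
}
```

With a Latin-1 narrow encoding the same loop is not enough, because some characters have no one-to-one fold within the narrow range.
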
A lot of this debate is about expectations, I think. For my part, I think that
the string-ci ops only have real value for English language and ASCII text.
For non-English non-ASCII processing, sorting case-insensitively by numeric
codepoint values in the absence of locale sorting rules seems like an odd thing
to want to do.
So I guess I'm not bothered with the ugly C necessary to make ISO-8859-1 work.
It is bad for string-ci ops but not too bad for upcase/downcase. I also am
not too concerned that string-ci comparison ops for non-English non-ASCII
processing may be inefficient. It does seem vital that string-locale comparison
ops be efficient, though.
Thanks,
Mike