bug-gnu-emacs

bug#17130: 24.4.50; Deficient Unicode case folding


From: Nathan Trapuzzano
Subject: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sat, 29 Mar 2014 11:29:43 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4.50 (gnu/linux)

Eli Zaretskii <eliz@gnu.org> writes:

>> σ, ς, and Σ would all have σ in the CANONICALIZE slot, since they all
>> fold to σ.
>
> So you would need to search all characters to find those which have σ
> in the CANONICALIZE slot -- not very efficient, to say the least.

Doesn't this already happen?  If not, then what is the CANONICALIZE slot
doing that couldn't be done with the regular upcase/downcase slots by
themselves?

> IOW, what you suggest will provide a one-way mapping, whereas we need
> a two-way mapping.

Not sure I follow.  Seems to me the CANONICALIZE slot is sufficient, at
least in principle.
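
A rough illustration of that idea, as a Python sketch rather than Emacs's
actual case-table code.  The names `canonicalize` and `equivalents` are
hypothetical, and Python's `str.casefold` is only standing in for the
CANONICALIZE slot.  It assumes the one-way map can be inverted once, ahead
of time, so that finding everything equivalent to σ would not need a scan
of all characters per search:

```python
from collections import defaultdict

# Hypothetical stand-in for the CANONICALIZE slot: each character maps to
# its case-folded form (str.casefold applies Unicode case folding).
chars = ["σ", "ς", "Σ"]
canonicalize = {c: c.casefold() for c in chars}

# Invert the one-way map once to recover the sets of equivalent characters.
equivalents = defaultdict(set)
for c, canon in canonicalize.items():
    equivalents[canon].add(c)

print(canonicalize)
print(sorted(equivalents["σ"]))
```

Whether Emacs builds or could build such a reverse index from the slot is
exactly the open question here; the sketch only shows that the data is
enough in principle.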

>> > Besides, don't we also need to know that ς can only be present at the
>> > end of a word?
>> 
>> Don't think so.  AFAIK, Unicode says nothing about ordering except when
>> it comes to combining characters.  But even if it did prescribe such a
>> rule, I don't think it would have anything to do with case folding.
>
> Who said this is only about case folding?

I should have said just "case", not "case folding".

> Emacs should use this data for up-casing and down-casing as well, for
> example, so that M-l downcases Σ to ς, not σ, when it is at the end of
> the word.  Wouldn't users of Greek expect that?

Maybe.  I'm just saying that Unicode itself doesn't prescribe or even
recommend such behavior.  It defines case conversions independently of
ordering.

That said, making M-l downcase terminal Σ to ς would be a nice feature
that could be enabled, e.g., by enabling a minor mode or by adding to
some *-functions variable whose functions get called before the normal
behavior of M-l is applied, etc.  But it shouldn't have anything to do
with Unicode-compliant case-insensitive searching.

>> Right, but what I'm asking is: if Emacs doesn't do Unicode case folding,
>> what is the purpose of the CANONICALIZE slot except as a kind of
>> placeholder that gets autofilled?
>
> Whenever you need the canonical equivalent of a character, such as in
> case-insensitive search, you need that slot.

But there's nothing about the slot that mandates that only _pairs_ can
be case-equivalent under case folding.  Indeed, the manual speaks of
"sets" of characters that might be equivalent under case-folding, hence
my understanding that σ, ς, and Σ can all have σ in their CANONICALIZE
slot, and that's all it would take.

(Btw, I'm using "case-insensitive" to mean the same as "under
case-folding".)

>> Are there other kinds of case folding--other than traditional
>> upper/lower and Unicode--that I'm not aware of?
>
> There's "title case", of course.  

I think title case would require an extra slot in the case table.

> There are also characters whose case pair is not a single character,
> but several, like the upper-case variant of ß in German.

Good point.  "ß" should fold to "ss".  I guess for the CANONICALIZE slot
to suffice, it would have to map to a string, not a code point.
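
To make the string-versus-code-point issue concrete, here is a small Python
sketch (not Emacs code).  The helper `fold` is a hypothetical name, and
`str.casefold` is used as an assumed model of what a string-valued
CANONICALIZE slot would return for each character:

```python
def fold(text):
    """Fold each character on its own, as a per-character table would.

    The folded form of one character may be longer than one character,
    which is why the table entries would have to be strings.
    """
    return "".join(ch.casefold() for ch in text)

print(len(fold("ß")))                      # length of the folded form of ß
print(fold("Straße") == fold("STRASSE"))   # compare under folding
```

A slot that can only hold a single code point has no way to represent the
two-character result for ß, so a comparison like the one above could not
be expressed with it.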

> Personally, I think we need an additional slot for what you want, and
> code to use it.

Given the point about ß, you're probably right.  Unless we can make
entries in the CANONICALIZE slot be strings rather than code points.




