Re: idn.el and confusables.txt
From: Ted Zlatanov
Subject: Re: idn.el and confusables.txt
Date: Sat, 14 May 2011 20:22:44 -0500
User-agent: Gnus/5.110018 (No Gnus v0.18) Emacs/24.0.50 (gnu/linux)
On Sat, 14 May 2011 23:59:22 +0300 Eli Zaretskii <address@hidden> wrote:
>> Let's say C1, C2, and C3 are confusables mapped to C1. Then the mapping
>> is C1 -> (C2, C3); C2 -> C1; and C3 -> C1.
>>
>> The algorithm is "if a character maps to an atom, it's confusable with
>> that atom; if it maps to a list, the whole list is confusable with this
>> character."
EZ> Should it be a list or a string? How would you use this mapping?
It could be any type of sequence, I guess. Strings are more compact but
for small amounts of data (typically 1-3 characters) I'm not sure if
that matters. For 1 character in particular I'm pretty sure it's more
efficient to store the character directly than any sequence.
markchars.el would use it as follows: look at all the characters of a
word. If any are of a different script S2 from the majority script S1,
highlight them (we do this now with `markchars-face-confusable').
New functionality: now if any of the S2 characters are multi-script
confusables that map to a character in the majority script S1, highlight
them specially with the new variable
`markchars-face-confusable-multi-script' and give them a tooltip to say
they are confusable with a particular character.
New functionality: if any of the word characters, regardless of script,
are confusables of the single-script type, highlight them with
`markchars-face-confusable'. But see below about normalization.
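
A rough sketch of the first two highlighting rules above, written in Python rather than Emacs Lisp purely for illustration. Everything except the two face names is an assumption: `rough_script`, `MULTI_SCRIPT`, and `classify_word` are hypothetical names, the script of a character is guessed from the first word of its Unicode name (a crude stand-in for the real script property), and the table holds a single sample entry. The third rule (single-script confusables) is left out.

```python
import unicodedata
from collections import Counter

def rough_script(ch):
    """Guess a script from the first word of the Unicode name (an approximation)."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return "UNKNOWN"

# Hypothetical multi-script table: confusable character -> its look-alike.
MULTI_SCRIPT = {"\u0430": "a"}  # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A

def classify_word(word):
    """Return (index, char, face, tooltip) for each character to highlight."""
    # Only letters take part in the script vote; digits and punctuation are ignored.
    letters = [(i, ch, rough_script(ch)) for i, ch in enumerate(word) if ch.isalpha()]
    if not letters:
        return []
    majority = Counter(script for _, _, script in letters).most_common(1)[0][0]
    marks = []
    for i, ch, script in letters:
        if script == majority:
            continue
        target = MULTI_SCRIPT.get(ch)
        if target is not None and rough_script(target) == majority:
            # Minority-script character that imitates a majority-script one.
            tooltip = "confusable with " + unicodedata.name(target)
            marks.append((i, ch, "markchars-face-confusable-multi-script", tooltip))
        else:
            # Minority-script character with no known look-alike mapping.
            marks.append((i, ch, "markchars-face-confusable", None))
    return marks
```

For a word such as "p\u0430ypal" (Cyrillic а in an otherwise Latin word), only the second character would be marked, with the multi-script face and a tooltip naming LATIN SMALL LETTER A.
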
EZ> The RHS of a mapping can be several characters, in which case there's
EZ> no reverse mapping and no "confusables mapped to a character", I
EZ> think.
OK. I was thinking of using the transitivity information but that's not
very useful so never mind.
>> In addition to the character mapping we also need a confusable data
>> type, which can be SL/SA (single-script) or ML/MA (mixed-script).
EZ> What would be a possible use of that?
Single-script confusables can be an accident and are usually due to
combining, e.g. parenthesized numbers:
2485 ; 0028 006C 0038 0029 ; SL #* ( ⒅ → (l8) ) PARENTHESIZED NUMBER
EIGHTEEN → LEFT PARENTHESIS, LATIN SMALL LETTER L, DIGIT EIGHT, RIGHT
PARENTHESIS # →(18)→
...although there are many cases where that's not true:
0399 ; 0031 ; SA # ( Ι → 1 ) GREEK CAPITAL LETTER IOTA → DIGIT ONE
# →l→
0417 ; 0033 ; SA # ( З → 3 ) CYRILLIC CAPITAL LETTER ZE → DIGIT THREE
#
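
For reference, the entries quoted above could be split into source, target, and type along these lines. This is a minimal Python sketch, assuming the layout visible in the quoted lines (semicolon-separated fields, hexadecimal code points, a trailing `#` comment, one entry per unwrapped line); `Confusable` and `parse_confusable_line` are hypothetical names, not part of idn.el or markchars.el.

```python
from typing import NamedTuple, Optional

class Confusable(NamedTuple):
    source: str  # the confusable character(s)
    target: str  # what it is confusable with (may be several characters)
    kind: str    # SL, SA, ML or MA

def parse_confusable_line(line: str) -> Optional[Confusable]:
    """Parse one unwrapped data line; return None for comments and blank lines."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    if len(fields) < 3:
        return None
    # Each of the first two fields is a space-separated list of hex code points.
    source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
    target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
    return Confusable(source, target, fields[2])
```

Note that the target of the 2485 entry comes out as a four-character string, which is the case where no reverse mapping exists.
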
As a general rule I'd say that if the mapping is to a single character
with the SL/SA single-script property, chances are it's a true
confusable. Otherwise it could be legitimate and we'd need to convert
the string to a normalized form, which is probably slow (do you know?).
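
To illustrate what normalization buys here, a small Python sketch using the standard `unicodedata` module. The choice of NFKC is an assumption (the message does not say which normalization form is meant), and `normalizes_away` is a hypothetical helper name.

```python
import unicodedata

def normalizes_away(ch: str) -> bool:
    """True if NFKC rewrites CH, i.e. it is a compatibility form of other text."""
    return unicodedata.normalize("NFKC", ch) != ch

# PARENTHESIZED NUMBER EIGHTEEN decomposes to "(18)", so it is a legitimate
# compatibility character rather than a deliberate look-alike.
print(unicodedata.normalize("NFKC", "\u2485"))  # prints "(18)"

# GREEK CAPITAL LETTER IOTA and CYRILLIC CAPITAL LETTER ZE are unchanged by
# NFKC, so normalization alone does not explain them away.
print(normalizes_away("\u0399"), normalizes_away("\u0417"))  # prints "False False"
```

Under this assumption, a single-script entry whose source character normalizes to ordinary text could be treated as benign, and only the remaining ones flagged.
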
Mixed-script confusables are more dangerous because they look exactly
like the other character and are less likely to be an accident, e.g.
FF01 ; 0021 ; ML #* ( ！ → ! ) FULLWIDTH EXCLAMATION MARK → EXCLAMATION
MARK # →ǃ→
0430 ; 0061 ; ML # ( а → a ) CYRILLIC SMALL LETTER A → LATIN SMALL
LETTER A #
so I would make them more noticeable and would skip any normalization.
Thus my new functionality proposals above.
There are also whole-script confusables, e.g. "scope" in Latin and
"scope" in Cyrillic (example from http://unicode.org/reports/tr39/) but
I think those are covered by the rules above already and don't merit
special treatment.
Finally, confusables.txt has transitivity mappings that explain how the
mapping was derived. I don't think that's particularly useful for
markchars.el. I can't think of any other uses for the confusables.txt
data beyond those listed above.
Based on all this, I think it's best to make the confusables char-table
values atoms or sequences (strings or lists) but split them into two
char-tables for the single-script and multi-script mappings.
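
The proposed layout can be sketched as follows, again in Python for illustration only, with plain dicts standing in for Emacs char-tables. `build_tables` is a hypothetical name, and storing an integer code point for a one-character target is just one way to model the "atom" case.

```python
SINGLE_SCRIPT_TYPES = ("SL", "SA")
MIXED_SCRIPT_TYPES = ("ML", "MA")

def build_tables(entries):
    """Split (source_code_point, target_string, type) entries into two tables.

    A one-character target is stored as an atom (its code point); a longer
    target is stored as a sequence (here, a string).
    """
    single, multi = {}, {}
    for source, target, kind in entries:
        if kind in SINGLE_SCRIPT_TYPES:
            table = single
        elif kind in MIXED_SCRIPT_TYPES:
            table = multi
        else:
            continue  # unknown type: ignored in this sketch
        table[source] = ord(target) if len(target) == 1 else target
    return single, multi

# Sample entries taken from the lines quoted in this message.
single, multi = build_tables([
    (0x0399, "1", "SA"),
    (0x2485, "(l8)", "SL"),
    (0x0430, "a", "ML"),
])
```

A lookup then needs only the character and the table for the kind of check being made, without carrying the type alongside every value.
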
Ted