Re: seq feature: print letters
From: Assaf Gordon
Subject: Re: seq feature: print letters
Date: Mon, 30 Jun 2014 13:39:52 -0600
>> On Jun 30, 2014, at 5:24, Pádraig Brady <address@hidden> wrote:
>>
>> On 06/30/2014 11:23 AM, address@hidden wrote:
>> I'd like to suggest a patch to allow seq to generate letter sequences.
> I notice about 45 copies of the A-Z alphabet, would it be worth introducing
> aliases to avoid copies?
Yes, we can consolidate them.
> What about case? The current code only has upper case. Case is a can of worms,
> I know, with not necessarily 1:1 mapping etc.
Once we leave the realm of Latin-script languages, upper/lower case indeed
becomes very complicated, or even meaningless. I thought that
'tr [:upper:] [:lower:]' would handle it better (but I now realize tr doesn't
support UTF-8 well, if I understand correctly).
I think that for the first step, we should not deal with upper/lower case
issues.
> The data being leveraged is well defined and, at present, reasonable to include
> directly in the seq binary (about 12K I'm guessing),
> though have you looked at whether libunistring contains the appropriate
> data/logic for this?
> This might be more significant if case or more characters were considered for
> example.
This first draft stores a NUL-terminated UTF-8 string for each character. I saw
that the libunistring code stores bit-fields for some of its functions, though I
haven't studied it yet.
I will try to improve the storage method in follow-up patches.
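To give a rough feel for the storage cost of the "one NUL-terminated UTF-8
string per character" approach, here is a small sketch in Python (not the
patch's C code). The helper name nul_terminated_size is made up for this
illustration, and the Cyrillic range is only a convenient sample of two-byte
characters, not the list used in the patch.

```python
def nul_terminated_size(letters):
    """Bytes needed to store each letter as its own NUL-terminated UTF-8 string."""
    return sum(len(ch.encode("utf-8")) + 1 for ch in letters)

# A-Z: one byte per letter plus one NUL terminator each.
latin = [chr(cp) for cp in range(ord("A"), ord("Z") + 1)]

# U+0410..U+042F (Cyrillic capitals), used here only as a sample of
# characters that need two bytes in UTF-8.
cyrillic_sample = [chr(cp) for cp in range(0x0410, 0x0430)]

print(nul_terminated_size(latin))            # 52
print(nul_terminated_size(cyrillic_sample))  # 96
```

By this count, each duplicated A-Z list costs 52 bytes, so about 45 copies
would come to roughly 2.3K before consolidation.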
> I had a quick look at the CLDR. Are you only considering the "Index exemplar"
> chars here?
> http://www.unicode.org/cldr/charts/25/by_type/core_data.alphabetic_information.index.html
Exactly.
> Maybe it would be better to default to the "standard exemplars"?
> http://cldr.unicode.org/translation/characters#TOC-Exemplar-Characters
The reason I liked the "index" list is that it most directly answers the
question "what is the alphabet in language X?" (as in, what are the letters
that would be taught in schools as "the alphabet", or that a person on the
street would give if asked to list the letters of the alphabet).
It also lends itself to one-liners such as:
# How many letters are in the Arabic alphabet:
seq --alphabet=ar | wc -l
# What is the eleventh letter in the Russian alphabet:
seq --alphabet=ru | awk 'NR==11'
Technically, the functionality of "is_alpha()" does not correspond 1:1 to "the
alphabet", which is part of the problem... In English, there are no
complications, but in many other languages it becomes complicated.
Using other Unicode categories (e.g. the 'main' letters or even 'auxiliary'
letters) answers a slightly different question, more akin to "what symbols are
acceptable in language X?" - not a bad question, just different from the
previous one.
For example, in Hebrew the "index" list contains 22 letters (which agrees with
the answer to "how many letters are in the Hebrew alphabet"), but the
"main/standard" list has 5 more symbols: the specific "final" forms that 5
Hebrew letters take when they appear at the end of a word.
So using the "main" list would list 5 letters twice. I believe other languages
such as Arabic would present similar issues.
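A minimal sketch of the Hebrew case, in Python rather than the patch's C. It
does not read CLDR data: as an assumption for illustration, the Unicode range
U+05D0..U+05EA stands in for the "main" list, and the "index" list is
approximated by dropping the characters whose Unicode name marks them as final
forms. The variable names are made up for this example.

```python
import unicodedata

# Stand-in for the "main" list (assumption): Hebrew letters U+05D0..U+05EA,
# which include the five final forms as separate code points.
main_list = [chr(cp) for cp in range(0x05D0, 0x05EB)]

# Final forms carry "FINAL" in their Unicode character name,
# e.g. "HEBREW LETTER FINAL KAF".
final_forms = [ch for ch in main_list if "FINAL" in unicodedata.name(ch)]

# Approximation of the "index" list: each letter listed once, no final forms.
index_list = [ch for ch in main_list if ch not in final_forms]

print(len(main_list))    # 27
print(len(final_forms))  # 5
print(len(index_list))   # 22
```

The 22 vs. 27 split is the same difference that would show up in a
"how many letters" count depending on which list is used.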
From a technical point of view, it's easy to include both "index" and
"standard" letters (with different command-line options), it's just a matter of
adding more lists.
What do you think?
Thanks,
-Gordon