[Groff] Collating sequence for sorting words, etc.


From: Ted Harding
Subject: [Groff] Collating sequence for sorting words, etc.
Date: Sun, 12 Aug 2001 13:05:21 +0100 (BST)

Hi Folks,

A query which has just come up on comp.software.international
reminds me that there is an issue which I have been meaning
to raise for discussion on groff. It concerns the sorting
of words into "alphabetical order" for index-entries,
lists of references, etc.

So-called "alphabetical order" (more correctly "collating order")
varies from language to language, and even from context to context.

For instance, in French "c-cedilla" is considered equivalent to
plain "c", so that sorted "c-cedilla" words are mingled with "c" words,
while in Turkish "c-cedilla" follows "c" in collating order, so that
all words beginning with "c-cedilla" follow all words beginning with "c".
Similarly, in Danish, for example, characters like a-ring and the
ae-digraph are collated at the end of the alphabet (after "z"). And in
Swedish, "o-dieresis" follows "z", while in German it is equivalent to
plain "o" (as far as I know).

And, as an example of context, I had occasion to prepare an index
for a book about Turkey. It was required that Turkish words and names
should be collated according to Turkish rules as above (so that
"c-cedilla" followed "c", "dotted i" followed "dotless-i", etc.),
while _all_ other words and names followed "English" practice, even
if Danish, Swedish, etc. (so that "ae-digraph" was equivalent to
the two-letter sequence "ae", "a-dieresis" was equivalent to "a",
and so on).

The only way to cope with this sort of thing is to use proper
software designed for the job. One example is the "makeindex"
program that is part of the "TeX" text-formatting suite; this
has all the flexibility required. And it can be used with groff.

The basic method used is that a word is _sorted_ on a "sort key"
which may not be identical to the word itself; while what gets
printed in the index is something different -- the "index entry".
Creating an index involves entering both of these in every case where
they are different (where they are identical, the index entry is
by default also the sort key). [Thus, "c-cedilla" was replaced
by "c~" in the sort key for Turkish words, since this forces the
order, while "c-cedilla" was replaced by plain "c" for non-Turkish
words.]
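
As a rough illustration of the sort-key idea (a minimal Python sketch,
not how 'makeindex' itself is implemented), each word can be paired with
a key that is used only for ordering. The function name make_entry, the
"turkish" flag and the sample words are all made up for this example;
the "c~" substitution is the one described above, and it works here
because "~" compares higher than every ASCII letter.

```python
def make_entry(word, turkish=False):
    """Return (sort_key, index_entry) for one word (hypothetical helper)."""
    key = word.lower()
    if turkish:
        # "~" compares higher than any ASCII letter, so "c~" lands after "cz".
        key = key.replace("ç", "c~")
    else:
        # Non-Turkish words: treat c-cedilla as a plain "c".
        key = key.replace("ç", "c")
    return key, word


# (word, is_turkish) pairs; the words are illustrative only.
words = [
    ("Çanakkale", True),
    ("Cyprus", False),
    ("Curaçao", False),
    ("Cumhuriyet", True),
]

entries = [make_entry(word, turkish) for word, turkish in words]

# Sort on the key, but print only the index entry.
index = [entry for key, entry in sorted(entries)]
print(index)
```

The Turkish word beginning with c-cedilla ends up after every plain "c"
word, while the c-cedilla inside the non-Turkish word does not affect
its position.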

Such issues aside, the basic requirement for correct collation of
characters in an alphabet would be to specify a "collation sequence",
which is a list of the characters in the order in which they should
be sorted. Characters considered equivalent (such as "a-dieresis" and
"a" in German) would be grouped together, and there might also be a
group of characters which should be ignored altogether. You can then
use different sequences for different languages.
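
To make the idea concrete, here is a minimal Python sketch of a
collation sequence with equivalence groups and an ignored set. The
function make_collation_key and the two sequences are assumptions made
up for illustration; they are deliberately simplified (no special
handling of "ß", digraphs, or tie-breaking between equivalent
characters) and are not complete rules for either language.

```python
import string


def make_collation_key(groups, ignored=""):
    """Build a sort-key function from an ordered list of equivalence groups.

    Each element of `groups` is a string of characters that collate
    together; the position of the group in the list is its rank.
    """
    rank = {}
    for position, group in enumerate(groups):
        for ch in group:
            rank[ch] = position
    unknown = len(groups)  # unlisted characters sort after all listed ones

    def key(word):
        return [rank.get(ch, unknown) for ch in word.lower() if ch not in ignored]

    return key


letters = list(string.ascii_lowercase)

# Simplified "German": umlauted vowels grouped with the plain vowels.
german = [{"a": "aä", "o": "oö", "u": "uü"}.get(c, c) for c in letters]

# Simplified "Swedish": a-ring, a-dieresis, o-dieresis follow "z".
swedish = letters + ["å", "ä", "ö"]

words = ["zebra", "öl", "ost", "ofen"]
german_order = sorted(words, key=make_collation_key(german, ignored="-"))
swedish_order = sorted(words, key=make_collation_key(swedish, ignored="-"))

print(german_order)   # "öl" is mingled with the "o" words
print(swedish_order)  # "öl" comes after "zebra"
```

Switching language is then just a matter of passing a different
sequence to the same function.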

There is no such provision in groff. I first hit this years ago
when noticing that "refer" had its own notion of collation, with
no provision for changing this. In any case, I subsequently
changed to using 'makeindex' and 'mkbib', which are more flexible
anyway and allow the sort of thing I describe above.

Nevertheless, I think that being able to specify a "collation
sequence" (using much the same sort of mechanism as specifying
hyphenation rules, also language-specific) could be a useful
addition. You could easily define macros to switch language
context in the middle of a document (exactly as you can for
switching hyphenation files if you have to).

Such a "collation file" would need to cope with characters
specified by groff escape sequences, so that you could
have something like

   A a
   B b
   C c
   Ç ç \(C, \(c, \[C-cedilla] \[c-cedilla]

(where entries on the same line are equivalent, and the order
of lines is the sorting order, with groff escape-sequences
depending on what names you have defined your characters).
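
Purely as a sketch of how such a file might be read (the file format is
only the proposal above, and nothing like this exists in groff), the
following Python fragment turns the sample into a rank table and splits
input text into single characters and groff-style escapes. The names
parse_collation, tokenize and sort_key are hypothetical, and the
escape names are taken verbatim from the sample rather than from any
particular font description.

```python
import re

# The sample collation file from above, one equivalence group per line.
SAMPLE = r"""
A a
B b
C c
Ç ç \(C, \(c, \[C-cedilla] \[c-cedilla]
"""

# One token is \[name], or \(xx (two-character name), or any single character.
TOKEN = re.compile(r"\\\[[^\]]+\]|\\\(..|.")


def parse_collation(text):
    """Map each token in the file to the rank of the line it appears on."""
    rank = {}
    lines = [line for line in text.splitlines() if line.strip()]
    for position, line in enumerate(lines):
        for token in line.split():
            rank[token] = position
    return rank


def tokenize(word):
    """Split a word into characters and groff-style escape sequences."""
    return TOKEN.findall(word)


RANK = parse_collation(SAMPLE)
UNKNOWN = max(RANK.values()) + 1  # unlisted tokens sort last


def sort_key(word):
    return [RANK.get(token, UNKNOWN) for token in tokenize(word)]


words = ["çb", r"\(c,a", "cb", "ba", "ab"]
print(sorted(words, key=sort_key))
```

Here the escape form and the literal "ç" receive the same rank because
they sit on the same line of the file, which is the equivalence the
proposed format is meant to express.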

On the whole, however, I still favour using external programs
for this kind of job, and maybe this kind of provision within
groff itself is not needed (except maybe for 'refer').

What are other people's views?

Best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <address@hidden>
Fax-to-email: +44 (0)870 167 1972
Date: 12-Aug-01                                       Time: 13:05:21
------------------------------ XFMail ------------------------------
