Re: [Groff] handling of composing and combined Unicode characters


From: Werner LEMBERG
Subject: Re: [Groff] handling of composing and combined Unicode characters
Date: Tue, 10 Jan 2006 07:58:52 +0100 (CET)

> Find attached two files find.vi.after-preconv and
> find.vi.after-preconv-decomposed, containing two test cases:
>   1) \[u1EBF] = 'e' \u0302 \u0301
>   2) 'x' \u0302 \u0301 (no precomposed character in Unicode)
> 
> I call
>    $ troff -mandoc -Tutf8 < find.vi.after-preconv[-decomposed]
> 
> The expected behaviour of troff should be that it emits a composed glyph,
> created by src/roff/troff/input.cpp:composite_glyph_name(), right?
> In both cases the behaviour is different:
>   1) find.vi.1:6: warning: can't find special character `u0065_0302_0301'
>   2) find.vi.1:6: warning: can't find special character `u0302'
>      find.vi.1:6: warning: can't find special character `u0301'
> 
> Is my assumption right?

groff sees \[u1EBF] and automatically decomposes it to a generic
*entity name*, namely `u0065_0302_0301'.  It doesn't care whether
\[u0302] and \[u0301] actually exist.  This process is documented in
groff.info (Using Symbols):

   * A glyph representing more than a single input character will be
     named

          `u' COMPONENT1 `_' COMPONENT2 `_' COMPONENT3 ...

     Example: `u0045_0302_0301'.

     For simplicity, all Unicode characters which are composites must
     be decomposed maximally (this is normalization form D in the
     Unicode standard); for example, `u00CA_0301' is not a valid glyph
     name since U+00CA (LATIN CAPITAL LETTER E WITH CIRCUMFLEX) can be
     further decomposed into U+0045 (LATIN CAPITAL LETTER E) and
     U+0302 (COMBINING CIRCUMFLEX ACCENT).  `u0045_0302_0301' is thus
     the glyph name for U+1EBE, LATIN CAPITAL LETTER E WITH CIRCUMFLEX
     AND ACUTE.

   * groff maintains a table to decompose all algorithmically derived
     glyph names which are composites themselves.  For example, `u0100'
     (LATIN CAPITAL LETTER A WITH MACRON) will be automatically
     decomposed into `u0041_0304'.  Additionally, a glyph name of the
     GGL is preferred to an algorithmically derived glyph name; groff
     also automatically does the mapping.  Example: The glyph
     `u0045_0302' will be mapped to `^E'.

   * glyph names of the GGL can't be used in composite glyph names;
     for example, `^E_u0301' is invalid.
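
To make the naming rule above concrete, here is a minimal Python sketch
that derives such an entity name via normalization form D.  It is only
an illustration: groff itself uses its own built-in decomposition table
(which may differ from the Unicode data shipped with Python), and the
helper names `derive_glyph_name' and `is_maximally_decomposed' are
hypothetical, not groff functions.

```python
import unicodedata


def derive_glyph_name(text):
    """Return a `uXXXX_YYYY...' style name for TEXT (hypothetical helper)."""
    # Normalization form D decomposes every composite maximally.
    decomposed = unicodedata.normalize("NFD", text)
    # Join the code points as uppercase hex, at least four digits each.
    return "u" + "_".join("%04X" % ord(ch) for ch in decomposed)


def is_maximally_decomposed(name):
    """Check that a `uXXXX_YYYY...' name is unchanged by NFD.

    Assumption: NFD also reorders combining marks canonically, so a name
    with marks in non-canonical order is rejected here as well.
    """
    chars = "".join(chr(int(part, 16)) for part in name[1:].split("_"))
    return unicodedata.normalize("NFD", chars) == chars


print(derive_glyph_name("\u1ebf"))            # u0065_0302_0301
print(derive_glyph_name("\u1ebe"))            # u0045_0302_0301
print(derive_glyph_name("\u0100"))            # u0041_0304
print(derive_glyph_name("x\u0302\u0301"))     # u0078_0302_0301
print(is_maximally_decomposed("u00CA_0301"))  # False
```

The first line of output is the name from the first warning above; the
fourth shows the name one would expect for the second test case if the
base character and its combining marks were composed into one glyph.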

Either you register the missing glyph name (`u0065_0302_0301' in your
first test case) with .char directly in your document (or in a proper
macro file, say, `vi.tmac'), or you add it to the devutf8 font
description files.  I prefer the latter.  Currently, I don't have time
to add complete Vietnamese support myself, but doing so should be
straightforward.  Patches welcome :-)


    Werner



