Re: [Groff] Re: groff: radical re-implementation


From: Werner LEMBERG
Subject: Re: [Groff] Re: groff: radical re-implementation
Date: Wed, 18 Oct 2000 00:46:46 +0200 (CEST)

> > Well, I insist that GNU troff doesn't support multibyte encodings
> > at all :-) troff itself should work on a glyph basis only.  It has
> > to work with *glyph names*, be it CJK entities or whatever.
> > Currently, the conversion from input encoding to glyph entities
> > and the further processing of glyphs is not clearly separated.
> > From a modular point of view it makes sense if troff itself is
> > restricted to a single input encoding (UTF-8) which is basically
> > only meant as a wrapper to glyph names (cf. \U'xxxx' to enter
> > Unicode encoded characters).  Everything else should be moved to a
> > preprocessor.
> 
> This paragraph says two things:
>  - GNU troff will support UTF-8 only.  Thus, multibyte encodings
>    will not be supported.  [Though UTF-8 is multibyte :-p ]

This was a typo, sorry.  I meant that I don't want to support
multiple multibyte encodings.

>  - Groff handles glyphs, not characters.
> I don't understand the relationship between these two.  UTF-8 is a
> code for characters, not glyphs.  ISO8859-1 and EUC-JP are also
> codes for characters.  There is no difference among UTF-8,
> ISO8859-1, and EUC-JP.

Well, this is *very* important.  The most famous example is that the
character `f', followed by the character `i', will be translated into
a single glyph `fi' (which incidentally has a Unicode number for
historical reasons).  A lot of other ligatures don't have a character
code.  Or consider a font which has 10 or more versions of the `&'
character (such a font really exists).  Do you see the difference?  A
font can have multiple glyphs for a single character.  For other
scripts like Arabic it is necessary to do a lot of contextual analysis
to get the right glyphs.  Indic scripts like Tamil have about 50 input
character codes which map to as many as 3000 glyphs!
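To make the distinction concrete, here is a tiny illustrative groff
fragment (nothing new in it -- the `\(fi' escape already exists;
whether the ligature is actually formed depends on the output font):

  .\" on a device whose fonts provide the ligature (e.g. PostScript),
  .\" the characters `f' and `i' in `file' are combined into the
  .\" single glyph `fi' by the formatter
  file
  .\" the same glyph can also be entered directly by its glyph name
  \(fi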

Consider the CJK part of Unicode.  A lot of Chinese, Korean, Japanese,
and Vietnamese glyphs have been unified, but you have to select a
proper locale to get the right glyph -- many Japanese people have been
misled because a lot of glyphs in the Unicode book which have a JIS
character code don't look `right' for Japanese.

For me, groff is primarily a text processing tool, and such a program
works with glyphs to be printed on paper.  A `character' is an
abstract concept, basically.  Your point of view, I think, is
completely different: You treat groff as a filter which just
inserts/removes some spaces, newline characters etc.

> However, I won't stick to wchar_t or ucs-4 for internal code, though
> I have no idea about your '31bit glyph code'.  (Maybe I have to
> study Omega...)

A `glyph code' is just an arbitrary registration number for a glyph
specified in the font definition file.  It is independent of the
input encoding.  Adobe has `official' glyph lists like `Adobe
standard' or `Adobe Japan1'.  CID-keyed PostScript fonts use CMaps
to map the input encoding to these glyph IDs.
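As a purely illustrative example, a charset entry in a groff font
description file ties such a glyph name to metrics and to a code that
only has meaning for the output device, not for the input encoding
(the numbers below are made up):

  co    747,705    0    169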

> The name '--locale' is confusing since it has no relation to locale,
> i.e., a term which refers to a certain standard technology.

I welcome any suggestions for better names...

>  - Japanese and Chinese text contains few whitespace characters.
>    (Japanese and Chinese words are not separated by whitespace.)
>    Therefore, a different line-breaking algorithm should be used.
>    (A hyphen character is not used when a word is broken across
>    lines.)  (Modern Korean text does contain whitespace between
>    words --- though not exactly words, strictly speaking.)

Not really a different line-breaking algorithm, but more glyph
properties (to be set with `.cflags'): disallowing breaks before or
after a glyph is enough to implement kinsoku shori; to implement
shibuaki properly we probably need to extend the `.cflags' syntax so
that glyph properties can be set for whole glyph classes.

For the non-CJK experts: `kinsoku shori' means that some CJK glyphs
must not start a line (for example, an ideographic comma or a closing
bracket) or must not end a line (opening brackets).  `shibuaki' means
`quarter space'; this is the space between CJK characters and Latin
characters -- there are Japanese standards which define all these
things in great detail.
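A rough sketch of the kind of `.cflags' settings meant here, assuming
Unicode-based glyph names of the form \[uXXXX] (which is part of the
proposal, not of today's groff); the class syntax for whole ranges is
exactly what still has to be designed:

  .\" ordinary ideographs: a line may be broken before or after them
  .cflags 6 \[u4E00] \[u4E01] \[u4E03]
  .\" ideographic comma and closing bracket must not start a line,
  .\" so allow a break after them only
  .cflags 4 \[u3001] \[u300D]
  .\" opening bracket must not end a line, so allow a break before it
  .cflags 2 \[u300C]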

>  - Hyphenation algorithm differs from language to language.

What exactly do you mean?  The only really difficult case is Thai
(and similar languages): you need at least a dictionary to find word
breaks there.  All other languages can easily be managed with the
current algorithm, I believe.

>  - Almost all CJK characters (ideographs, hiragana, katakana,
>    hangul, and so on) have double width on a tty.  Since you won't
>    use wchar_t, you cannot use wcwidth() to get the width of
>    characters.

This is not a problem.  Just give the proper glyph width in the tty
font definition files.
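Hypothetical tty font description entries illustrating this (the
name/width/type/code layout is the usual one; the 24-unit cell width
and the `u3042' glyph name are just assumptions for the sketch):

  co       24    0    0x00A9
  u3042    48    0    0x3042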

>  - Latin-1 people may use 0xa9 for '\(co'.  However, this character
>    cannot be read in other encodings.  The current groff converts
>    '\(co' to 0xa9 on the latin1 device and to '(C)' on the ascii
>    device.  How will this work in the future groff?  Use u+00a9?
>    The postprocessor (see below) cannot convert u+00a9 to '(C)'
>    because the width is different and the typesetting would be
>    broken.  It is very difficult to design around this problem...

For tty devices, the route is as follows.  Let's assume that the input
encoding is Latin-1.  Then the input character code `0xa9' will be
converted to Unicode character `U+00a9' (by the preprocessor).  A
hard-coded table maps this character code to a glyph with the name
`co'.  Now troff looks up the metric info in the font definition file.
If the target device is an ASCII-capable terminal, the width is three
characters (the glyph `co' is defined with the .char request to be
equal to `(C)'); if it is a Unicode-capable terminal, the width is one
character.  After formatting, a hard-coded table maps the glyphs back
to Unicode.
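For illustration, the ASCII fallback mentioned above could look
roughly like this in the tty macro files (the request itself already
exists; treat the line as a sketch, not as the actual file contents):

  .\" on an ASCII terminal, render the `co' glyph as three characters
  .char \(co (C)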

Note that the last step may fail for glyphs which have no
corresponding Unicode value.

> >   . Finally, we need to divide the -T option into a --device and
> >     --output-encoding.
> 
> What is the default encoding for tty?  I suggest this should be
> locale-sensitive.  (Or, this can be UTF-8 and Groff can invoke a
> postprocessor.)

I favor UTF-8 + postprocessor.  Terminal capabilities should be
selected with macro packages; for example, an ASCII terminal would get
the options

  -m ascii --device=tty --output-encoding=ascii

The tmac.ascii file would be very similar to tmac.tty + tmac.tty-char.

A Latin-2 terminal would get

  -m latin2 --device=tty --output-encoding=latin2

A Unicode terminal emulating an ASCII terminal would get

  -m ascii --device=tty --output-encoding=utf8

etc.

Using a postprocessor, we need only a single font definition file for
all tty devices.

> > Yes.  The `iconv' preprocessor would then do some trivial, hard-coded
> > conversion.
> 
> You mean, the preprocessor is iconv(1) ?

Basically yes, with some adaptations to groff.

> The preprocessor, provisional name 'gpreconv', will be designed as:
> - includes hard-coded converters for latin1, ebcdic, and utf8.
> - uses iconv(3) if possible (compiled on an internationalized OS).
> - parses the --input-encoding option.
> - the default input encoding is latin1 if compiled on a
>   non-internationalized OS.
> - the default input encoding is locale-sensitive if compiled on an
>   internationalized OS.

Exactly.

> Thus I designed the above 'gpreconv'.  Oh, I have to design
> 'gpostconv' also.

It should be very similar to the preprocessor.
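Putting the pieces together, the whole tty route might then look
something like this (a sketch only; the command names come from the
provisional design above, `document.tr' is a made-up file name, and
where exactly the output conversion hooks in is still open):

  gpreconv --input-encoding=latin1 document.tr \
      | troff -m ascii ... \
      | grotty \
      | gpostconv --output-encoding=ascii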


    Werner
