[Groff] What's missing for Unicode support of groff?


From: Bruno Haible
Subject: [Groff] What's missing for Unicode support of groff?
Date: Thu, 7 Jul 2005 13:36:36 +0200
User-agent: KMail/1.5.4

You probably shuddered and laughed when you saw the hacks contained in
groff-utf8.tar.gz, but they show which areas need work before groff can
handle man pages in Japanese and Vietnamese by default.

  1) Recognition of the input file encoding.
  2) The font system and the utf8/html devices.
  3) Rendering and the other devices.

1) Currently on a Linux system you find man pages in the following encodings:
     - ISO-8859-1 (German, Spanish, French, Italian, Brazilian, ...),
     - ISO-8859-2 (Hungarian, Polish, ...),
     - KOI8-R (Russian),
     - EUC-JP (Japanese),
     - UTF-8 (Vietnamese),
     - ISO-8859-7, ISO-8859-9, ISO-8859-15, ISO-8859-16 (man7/*),
   and none of them contains an encoding marker.

   The agreement was to recognize the encoding from a note in the first
   line, such as
           '\" -*- coding: EUC-JP -*-
   groff will then emit an error when it is fed non-ASCII input without a
   coding: marker, so that man page maintainers are notified that they
   need to add one.
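
   To make the marker rule concrete, here is a minimal sketch of the check.
   It is an illustration only, not code from groff or groff-utf8.tar.gz:
   the function names are hypothetical, and the assumption that the marker
   must sit in a roff comment line ('\" or .\") is mine.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a groff API): return the encoding named in a
// first line such as   '\" -*- coding: EUC-JP -*-
// or an empty string if the line carries no such marker.
std::string coding_from_first_line(const std::string& line) {
    // Assumption: the marker lives in a roff comment line, '\" or .\"
    if (line.size() < 3 || (line[0] != '\'' && line[0] != '.') ||
        line[1] != '\\' || line[2] != '"')
        return "";

    const std::string delim = "-*-";
    const std::string key = "coding:";

    // Locate the opening and closing -*- delimiters.
    std::size_t open = line.find(delim);
    if (open == std::string::npos) return "";
    std::size_t close = line.find(delim, open + delim.size());
    if (close == std::string::npos) return "";

    // The "coding:" key must appear between the two delimiters.
    std::size_t k = line.find(key, open + delim.size());
    if (k == std::string::npos || k >= close) return "";

    // Skip blanks after the key, then read the encoding name.
    std::size_t b = k + key.size();
    while (b < close && (line[b] == ' ' || line[b] == '\t')) ++b;
    std::size_t e = b;
    while (e < close && line[e] != ' ' && line[e] != '\t' && line[e] != ';') ++e;
    return line.substr(b, e - b);
}

// True if any byte lies outside 7-bit ASCII.
bool has_non_ascii(const std::string& text) {
    for (unsigned char c : text)
        if (c > 0x7F) return true;
    return false;
}

// The policy described above: non-ASCII input without a coding: marker
// should be reported as an error.
bool needs_coding_error(const std::string& first_line, const std::string& rest) {
    return coding_from_first_line(first_line).empty() &&
           (has_non_ascii(first_line) || has_non_ascii(rest));
}
```

   A pure-ASCII page without a marker passes unchanged under this rule, so
   existing English man pages would not need to be touched.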

2) The font system of groff was designed for devices where groff has to
   map each character to a font. However, for the utf8 and html devices,
   this is not the case: here groff has to skip this step. The current
   font system has not been updated and is therefore in the way:
     - Characters that are not mentioned in the "charset" section of
       the font files for these devices are dropped from the output. This
       is wrong.
     - If the "charset" section of each font file listed a million
       Unicode characters, the initialization time of 'troff' and of
       the postprocessors would be prohibitively high.

   IMO, the solution is to
     - remove the "charset" section of the font files for utf8 and html,
     - split the "font" C++ class into a class hierarchy
          class font; // abstract
          class concrete_font: font; // useful for other devices,
                                     // with "charset" section
          class algorithmic_font: font; // useful for utf8, html devices,
                                        // without "charset" section,
                                        // determines the width of each
                                        // character algorithmically.
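
   Spelled out as compilable C++, the hierarchy above might look roughly
   like the following. This is a sketch under stated assumptions, not
   groff's actual "font" class: the member functions contains() and
   width(), the add_glyph() helper, and the width rule (measured in
   terminal cells, with a few hard-coded Unicode ranges) are all
   illustrative. A real implementation would need proper East Asian
   Width and combining-class data.

```cpp
#include <cassert>
#include <map>

// Abstract base: what the rest of the formatter would ask of any font.
class font {
public:
    virtual ~font() {}
    virtual bool contains(char32_t cp) const = 0;
    virtual int width(char32_t cp) const = 0;
};

// Table-driven font for devices that have a "charset" section.
class concrete_font : public font {
public:
    // Hypothetical loader hook: one call per "charset" entry.
    void add_glyph(char32_t cp, int w) { widths_[cp] = w; }

    bool contains(char32_t cp) const override { return widths_.count(cp) != 0; }

    int width(char32_t cp) const override {
        std::map<char32_t, int>::const_iterator it = widths_.find(cp);
        return it == widths_.end() ? 0 : it->second;
    }

private:
    std::map<char32_t, int> widths_;
};

// Table-free font for utf8/html: every Unicode scalar value is accepted,
// so nothing is dropped and nothing has to be loaded at startup.
class algorithmic_font : public font {
public:
    bool contains(char32_t cp) const override {
        // All code points except the surrogate range.
        return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
    }

    int width(char32_t cp) const override {
        // Simplified rule (assumption), in terminal cells:
        if (cp >= 0x0300 && cp <= 0x036F) return 0;  // combining diacritics
        if ((cp >= 0x3040 && cp <= 0x30FF) ||        // Hiragana, Katakana
            (cp >= 0x4E00 && cp <= 0x9FFF) ||        // CJK Unified Ideographs
            (cp >= 0xAC00 && cp <= 0xD7A3))          // Hangul syllables
            return 2;
        return 1;
    }
};
```

   With this split, the startup cost of algorithmic_font is constant, while
   concrete_font keeps the existing per-glyph metrics for the other devices.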

3) For devices such as DVI, PS, X100, implement rendering of composed
   characters, for bidi languages (Hebrew, Arabic, Farsi) and for Indic
   languages (with vowel reordering).
   The obvious royal road for this is to use GNOME's Pango.

I know that work has begun on 3). Since languages such as Chinese and
Russian ("Unicode level 1") need only 1) and 2), my priority would be on
1) and 2), and I volunteer to work on them.

Werner says:
> Something like this [Tomohiro Kubota's iconv preprocessor, i.e. 1)]
> should become part of groff as soon as it supports Unicode on the input
> side.

What else is needed to support Unicode on the input side?

Bruno