Re: [Groff] unicode support


Re: [Groff] unicode support - questions

From:	Werner LEMBERG
Subject:	Re: [Groff] unicode support - questions
Date:	Tue, 24 Jan 2006 14:43:56 +0100 (CET)

> So far, I have a first draft of a patch that makes groff work with
> Unicode fonts without having to first register thousands of
> characters.

Great!

> Before submitting the patch slice after slice, may I have your
> opinion about four questions?
> 
>   1) In nametoindex.cpp and troff/charinfo.h, the terms "ascii_char"
>      and "ascii_code" are used for unibyte characters in the input
>      encoding.
>      As far as I understand,
>        - values >= 128 are possible and valid,

Yes.  Just to make sure: The stuff defined in nametoindex.cpp (in
particular font::name_to_index) is used only while loading a font,
either within troff or within a device driver.  It is *not* used for
input data handling.  The redefinition of font::name_to_index (and
font::number_to_index) in troff's input.cpp is for handling `virtual'
entities defined with the `.char' request.

>        - when the "latin1" device or "cp1047" device or "latin2"
>          device (found in some Linux distributions) is used, values
>          >= 128 denote characters of this encoding.

Yes.  Macro files like latin1.tmac immediately map the input character
to the corresponding glyph entity.  Hyphenation patterns are applied
to the original input characters, though.

[The (in)valid input characters are listed in file
 src/libs/libgroff/invalid.cpp.  Note that almost all `invalid'
 characters are used internally (see file src/roff/troff/input.h) --
 in a Unicode implementation using 32-bit integers, it's probably
 easiest to use negative values for those.

 A special case is handling of hyphenation patterns, which have a
 separate input code translation using `.hpfcode'.  This is to allow
 easier access to LaTeX hyphenation pattern files.  Note that .hpfcode
 is only active during the reading of hyphenation pattern files.]
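
A minimal sketch of the negative-value idea, assuming a signed 32-bit
input type.  The names input_code, MAX_UNICODE, INTERNAL_ESCAPE,
INTERNAL_COMPATIBLE_SAVE, is_internal_code, and is_valid_input are
hypothetical and do not come from the groff sources; surrogate code
points are not treated specially here.

```cpp
#include <cstdint>

// Hypothetical type for one input slot; wide enough for all of Unicode.
typedef std::int32_t input_code;

// Unicode code points occupy the range 0 .. 0x10FFFF.
const input_code MAX_UNICODE = 0x10FFFF;

// Hypothetical internal markers.  Because they are negative, they can
// never collide with a code point read from the input.
const input_code INTERNAL_ESCAPE = -1;
const input_code INTERNAL_COMPATIBLE_SAVE = -2;

// True for values reserved for troff's internal use.
inline bool is_internal_code(input_code c)
{
  return c < 0;
}

// True for values that may legitimately appear in input data.
inline bool is_valid_input(input_code c)
{
  return c >= 0 && c <= MAX_UNICODE;
}
```

With such a split, the list of `invalid' input characters would no
longer need to reserve slots in the 0..255 range.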

>      So I would like to rename these to "single_char" and
>      "single_char_code" respectively. Is that OK?  Do you find
>      "unibyte" a better term?

Assuming that this is a temporary change until we have real 32-bit
input slots, it's up to you.

>   2) When CP1047 is used, and commands like .trin \[char72]\[,c] are
>      active, does the font::name_to_index API see the character name
>      before or after the translation? I.e. does it see "char72" or
>      ",c"?

font::name_to_index is not affected by character translation.
Otherwise the following snippet would give a wrong result, assuming
that font FOO isn't preloaded by groff, that is, it isn't listed in
the DESC file (entries in this file are handled before any character
translation takes place).

  .trin XY
  .ft FOO
  .trin XX
  X

Please note that over the years I've removed all entries in the font
definition files which are hardwired to input characters.  In other
words, they contain neither `charXXX' entities nor character codes
outside of the ASCII range.  The old code in troff which allows
something else is there just for backwards compatibility.

>   3) My current patch creates two subclasses 'enumerated_font' and
>      'unicode_font' of 'class font'.
> 
>      An enumerated font has all its characters enumerated in the
>      font file.  A unicode font covers all combined Unicode
>      characters (consisting of a base character and zero or more
>      combining characters).

It's not clear to me what you want to do.  Please give an example and
elaborate.

The most important thing IMHO is to define glyph classes which share
the same properties (metrics, linebreak data, etc).  We need this
especially for Asian scripts.
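
One possible shape for such glyph classes, as a rough sketch only: a
table of code ranges in which every glyph of a range shares one set of
properties.  All names (glyph_properties, glyph_class_range,
glyph_class_table) and the two properties shown are assumptions made
for illustration; nothing here is existing groff code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Properties shared by every glyph of a class (hypothetical selection).
struct glyph_properties {
  int width;         // advance width in font units
  bool break_after;  // whether a line break is allowed after the glyph
};

// A contiguous range of codes that belongs to one class.
struct glyph_class_range {
  std::uint32_t first;
  std::uint32_t last;
  glyph_properties props;
};

class glyph_class_table {
  std::vector<glyph_class_range> ranges;
public:
  // Register the inclusive range [first, last] with shared properties.
  void add(std::uint32_t first, std::uint32_t last, glyph_properties p)
  {
    glyph_class_range r = { first, last, p };
    ranges.push_back(r);
  }

  // Return the properties for a code, or a null pointer if no class
  // covers it.  The pointer is invalidated by a later call to add().
  const glyph_properties *lookup(std::uint32_t code) const
  {
    for (std::size_t i = 0; i < ranges.size(); i++)
      if (code >= ranges[i].first && code <= ranges[i].last)
        return &ranges[i].props;
    return 0;
  }
};
```

A single entry covering, say, 0x4E00..0x9FFF would then describe many
thousands of ideographs without listing each of them in the font file.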

>   4) Currently the API of nametoindex.cpp has a different
>      implementation at the end of troff/input.cpp.

See above.

>      My current patch needs to go back from the index to the
>      character name, and so an additional inverse table mapping
>      index -> character name needs to be introduced.  This takes up
>      memory and causes extra memory references.  I would be inclined
>      to replace this "int index" with a pointer to an abstract
>      class, say abstract_char, of which the 'class charinfo' (on the
>      troff side) and 'class backend_char' (for the backends) would
>      be subclasses.  This would not only consume less memory but
>      also make the code more robust (as it is easier to misuse an
>      'int' accidentally).  What do you think about this?

Sounds reasonable, so go ahead!  As a transitional step it would be
good to add lots of typedefs to classify the input, output, and glyph
data types.  Later on it should be straightforward to invent other
data structures if necessary.
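
To illustrate, here is a rough sketch of how the abstract class and the
typedefs could fit together.  The class names abstract_char, charinfo,
and backend_char are taken from the proposal quoted above; everything
else (the members, the typedef names input_code, output_code, and
glyph_t) is hypothetical, and the real charinfo class in troff looks
different.

```cpp
#include <string>

// Transitional typedefs that classify the data types (hypothetical names).
typedef int input_code;   // a character code as read from the input
typedef int output_code;  // a code as sent to the output device

// Common base class; a glyph is referred to by pointer, not by index.
class abstract_char {
public:
  virtual ~abstract_char() {}
  virtual const std::string &name() const = 0;
};

// troff side: the glyph carries its own name, so no inverse
// index -> name table is needed.
class charinfo : public abstract_char {
  std::string nm;
public:
  explicit charinfo(const std::string &n) : nm(n) {}
  const std::string &name() const { return nm; }
};

// Backend side: additionally remembers the device output code.
class backend_char : public abstract_char {
  std::string nm;
  output_code oc;
public:
  backend_char(const std::string &n, output_code c) : nm(n), oc(c) {}
  const std::string &name() const { return nm; }
  output_code code() const { return oc; }
};

// What used to be "int index" in the font API.
typedef const abstract_char *glyph_t;
```

Since glyph_t is a pointer type, passing an arbitrary integer where a
glyph is expected fails to compile, which is the robustness gain
mentioned in the proposal.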


    Werner
