[Groff] preconv supported encodings


From: Bruno Haible
Subject: [Groff] preconv supported encodings
Date: Sat, 31 Dec 2005 16:05:17 +0100
User-agent: KMail/1.5

Hi Werner,

When I look at the emacs_to_mime conversion table, it already seems to
contain too many entries. Nobody in their right mind will ever write a
manpage in CP851 or MAC-ROMAN encoding.

Think about the long-term cost of supporting an encoding. Now is the moment
when we have complete freedom to decide which encodings to support. Later,
we can no longer shrink the set of supported encodings, because of backward
compatibility requirements.

There is no need to support all encodings that Emacs provides, since
  1. probably a majority of users already use text editors other than Emacs,
  2. we cannot support all Emacs encodings anyway (think of viqr or emacs-mule),
     therefore the set of supported encodings has to be limited in any case.

If you choose a large set like the current one, you will not get many requests
for adding a new encoding. But the maintainers will always have to support all
of them. You see already how much it costs to support CP1047.

On the other hand, if you choose a smaller set, you might get a few more
requests for new encodings.

For GNU gettext, I chose to make the set of supported encodings as small
as possible. Started in 2000, it has 42 encodings. (See
gettext/gettext-tools/src/po-charset.c.) For groff, starting in 2006, you
can probably get away with 20 encodings. If I were you, I would start
with the following set, comment out the other entries of emacs_to_mime,
and re-enable them only on demand.

  US-ASCII
  ISO-8859-1       (for English, Spanish, Norwegian etc.)
  ISO-8859-2       (for Hungarian etc.)
  ISO-8859-5       (for Serbian etc.)
  ISO-8859-7       (for Greek)
  ISO-8859-9       (for Turkish)
  ISO-8859-13      (for Latvian etc.)
  ISO-8859-15      (for French, German, etc.)
  KOI8-R           (for Russian)
  EUC-JP           (for Japanese)
  GB18030          (for simplified Chinese)
  UTF-8            (for all others)
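
As a rough illustration of "comment out, re-enable on demand", here is a
minimal Python sketch of such a reduced table. It is not groff's actual
emacs_to_mime table: the names EMACS_TO_MIME and mime_name are hypothetical,
and the Emacs-side spellings and the two commented-out entries are
assumptions chosen for illustration only.

```python
# Hypothetical allow-list modelled on the proposed set above.
# Keys are assumed Emacs coding system names; values are MIME charset names.
EMACS_TO_MIME = {
    "us-ascii": "US-ASCII",
    "iso-8859-1": "ISO-8859-1",
    "latin-1": "ISO-8859-1",
    "iso-8859-2": "ISO-8859-2",
    "iso-8859-5": "ISO-8859-5",
    "iso-8859-7": "ISO-8859-7",
    "iso-8859-9": "ISO-8859-9",
    "iso-8859-13": "ISO-8859-13",
    "iso-8859-15": "ISO-8859-15",
    "koi8-r": "KOI8-R",
    "euc-jp": "EUC-JP",
    "gb18030": "GB18030",
    "utf-8": "UTF-8",
    # Disabled until someone actually asks for them (illustrative entries):
    # "cp1047": "IBM1047",
    # "mac-roman": "MACINTOSH",
}


def mime_name(emacs_name):
    """Return the MIME charset for an Emacs coding system name, or None."""
    key = emacs_name.strip().lower()
    # Emacs coding system names may carry an end-of-line suffix.
    for suffix in ("-unix", "-dos", "-mac"):
        if key.endswith(suffix):
            key = key[: -len(suffix)]
            break
    return EMACS_TO_MIME.get(key)
```

With this layout, mime_name("latin-1-unix") yields "ISO-8859-1", while a
disabled name such as "cp1047" yields None, so the caller can report an
unsupported encoding instead of guessing.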

EUC-JP is problematic because not everyone agrees about the conversion
(see http://www.haible.de/bruno/charsets/conversion-tables/EUC-JP.html),
but the Japanese people are so vocal that it's better to not give them
an opportunity to complain.

This list contains no CPxxx encodings, in particular no WINDOWS-xxxx encodings.
Microsoft continues to extend these encodings over and over again, with the
result that, say, a text written today in CP950 on a Windows-XP machine is
not readable as CP950 on an earlier version of the same OS. For this reason,
the use of these encodings for manpages would be suboptimal.

With this approach you can reduce the amount of stuff that will be considered
LEGACY in 5 years.

Bruno




