Re: groff supports Italian input documents now


From: Oliver Corff
Subject: Re: groff supports Italian input documents now
Date: Sat, 3 Jul 2021 14:42:40 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.2.1

Hi Branden,

thank you so very much for your exhaustive answer!

Now I understand much better that I actually did not miss too much with
my approach. That makes me wonder whether it would be useful to develop a
macro package for switching language settings that covers the fields
mentioned (inter-sentence spacing, quotation marks, decimal and currency
notation, fixed names of certain labels, hyphenation, ... anything
obvious forgotten?) for those languages our fellow groff users are
working with. I am aware of the Swedish and some other extensions, but
they are limited to one macro package.

For "german quotes" (hence the name) I use the following construction:

.ds gq \[Bq]\\$1\[lq]

which requires \*[gq "for more than one word"], but that's fine with me.
The correct quotation marks do contribute to a pleasant reading
experience, as does the proper choice of dashes of various lengths.
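
To illustrate how the string above is used, here is a minimal sketch. The
file name german-demo.roff and the sample sentence are made up for this
example; only the .ds line is taken from the text above. The groff call is
skipped when groff is not installed.

```shell
#!/bin/sh
# Write a small test document (hypothetical file name).
# The quoted 'EOF' keeps the backslashes exactly as typed.
cat > german-demo.roff <<'EOF'
.\" German quotes: low opening mark, high closing mark.
.ds gq \[Bq]\\$1\[lq]
Er sagte: \*[gq "Guten Morgen"].
EOF

# Format for a UTF-8 terminal, but only if groff is available.
if command -v groff >/dev/null 2>&1; then
    groff -Tutf8 german-demo.roff
fi
```

On a UTF-8 terminal the phrase should appear between a low opening and a
high closing quotation mark. The argument needs the double quotes only
because it contains a space.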

I'll make some experiments and let you know.

Best regards,

Oliver.


On 7/3/21 4:50 AM, G. Branden Robinson wrote:
Hi, Oliver!

At 2021-07-02T15:38:38+0200, Oliver Corff wrote:
Hi Branden,

may I ask an obviously ignorant question: how would the settings for
Italian differ from those for other languages, except for hyphenation
rules and perhaps the proper choice of quotation marks, decimal
separators, names of dates and fixed headlines ("References",
"Abstract" etc.)?

You've pretty much covered it, if we swap in "inter-sentence spacing
amount" for "decimal separator".  It seems that the EU has standardized
on "no additional inter-sentence space" in its typography, so our Czech,
German, French, Italian, and Swedish localization files all say
        .ss 12 0
.

The vast bulk of groff localization is concerned with two things:
localization of strings provided by macro packages, and setting up
hyphenation patterns.

There is a brief how-to document in the groff source tree[1].  I updated
it earlier this year, and Edmond Orignac's contribution of Italian
prompted me to further improve it in the past few days.

We see little in localization files about the decimal separator or
quotation marks.  roff systems pretty much stick to integer arithmetic
and at the level of request syntax and diagnostic output, groff is not
localized anyway.  The historical full-service macro packages seem not
to have been concerned with abstracting quotation, probably because of
the limited glyph repertoire available when they were first developed.
mdoc does, but since its domain is man pages it tends to encode glyph
identities into the semantics of its macro calls (e.g., "Sq" for "single
quote this").

How does groff a) detect that the input file is Italian

Important to note here--it doesn't.  groff doesn't detect this--it has
to be told.

and b) decide which settings to apply?

I revamped groff input localization a few months ago.  It occurred to me
that the mechanism groff had innovated for this purpose (specify options
like -mfr for French) was duplicative of an existing and much more
widely understood infrastructure for tackling such issues: locale(7).

Here's the relevant NEWS item (recently updated in one detail).

---
o The groff locale (the default input language) is now determined using
   the system locale.  The LC_ALL and LANG environment variables are
   checked, in that order.  If set, the value's first two characters
   determine the groff locale.  If these variables are not set, if the
   first one found is set to "C", or if no groff localization file exists
   for the language, groff falls back to English, loading en.tmac.

   Those who want groff's default locale to differ from LC_ALL/LANG
   should edit the troffrc file to source the appropriate groff locale
   macro file (cs.tmac, de.tmac, den.tmac, fr.tmac, it.tmac, ja.tmac,
   sv.tmac, zh.tmac).

   The default hyphenation mode (as used by the .hy request) for users of
   English is thus changed from "1", which was inappropriate for the
   TeX-based hyphenation patterns groff has used since at least 1991, to
   "4".  However, invoking .hy without an argument remains synonymous
   with ".hy 1".
---
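
The lookup order in the NEWS item can be sketched as a small shell
function. This is an illustration only, not groff's actual code: the
function name groff_locale, the TMAC_DIR variable, and its default path
are assumptions made for this example, and an empty variable is treated
the same as an unset one.

```shell
#!/bin/sh
# Sketch of the lookup order described in the NEWS item.
# TMAC_DIR is an assumed directory holding the localization macro files.
TMAC_DIR=${TMAC_DIR:-/usr/share/groff/current/tmac}

# Usage: groff_locale LC_ALL_VALUE LANG_VALUE
# Prints the two-letter groff locale that would be chosen.
groff_locale() {
    # LC_ALL is consulted first, then LANG; the first non-empty one wins.
    # Assumption: an empty value counts as "not set".
    value=${1:-$2}

    # Neither variable set, or the first one found is "C": English.
    if [ -z "$value" ] || [ "$value" = "C" ]; then
        echo en
        return 0
    fi

    # Only the first two characters matter, e.g. "it_IT.UTF-8" gives "it".
    lang=$(printf '%s' "$value" | cut -c1-2)

    # No localization file for that language: fall back to English.
    if [ -f "$TMAC_DIR/$lang.tmac" ]; then
        echo "$lang"
    else
        echo en
    fi
}

# Report the choice for the current environment.
groff_locale "${LC_ALL-}" "${LANG-}"
```

Passing the two values as arguments, rather than reading the environment
inside the function, makes it easy to try combinations without changing
the shell's own locale.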

I have anticipated, but not yet heard, a protest along the lines that
just because a (for instance) French document is being typeset, the user
might not want to change their locale to begin with "fr".  Did I not
consider the impact on LC_MESSAGES and the possibility of unwelcome
diagnostic messages in French from the groff pipeline?

The answer I prepared for this is a simple one: LC_MESSAGES doesn't
matter because none of the programs or libraries in the GNU roff project
have localized diagnostic messages.  Moreover, I've never seen a demand
for this expressed.  While I reckon I've read every open Savannah
ticket against groff at least once by now[2], I haven't consumed the
entire archived history of this mailing list; I nevertheless surmise
that it has seldom if ever been requested.

My expectation is that people who prefer a "C" or English locale for
most of their shell interactions can keep it, and specify an environment
variable on a per-command basis as needed to format sundry non-English
documents.

I ask because if I typeset a German document then I usually just
insert the request for German hyphenation at the beginning, and I have
one string variable which encloses the argument in proper quotation
marks.  That's more or less enough to get going. It would be nice,
though, to make groff autodetect German and set the appropriate
requests.

There's nothing wrong with the way you're doing things, especially if
you're not using the "me", mm, mom, or ms packages.  It sounds like the
only thing you're missing, and maybe you just didn't mention it, is the
aforementioned ".ss 12 0" request.

Sorry for asking if this is obvious to everybody (but me).

I suspect that groff's support for localized input documents was not all
that obvious in 1.22.4 or earlier, and likely still isn't.

Let me dissect the example I posted earlier; I packed several salient
points into it.

$ file EXPERIMENTS/italian.roff
EXPERIMENTS/italian.roff: troff or preprocessor input, UTF-8 Unicode text
$ LANG=it_IT ./build/test-groff -b -ww -k EXPERIMENTS/italian.roff
L’Italia  è  diventata  un mercato di sfruttamento coloniale, una
[...]
che non è compromessa nell’avventura della guerra, che non ha ab‐

A. I ran "file" because I wanted to establish that the input was not in
    a legacy input encoding.  Much of the existing groff documentation is
    preoccupied with input encoding issues.
B. I specified "-k" to groff so that it would run preconv, converting
    non-ASCII code points in the input to groff Unicode special character
    escapes.
C. Instead of saying something like "groff -mit", we can use a standard
    environment variable to assert the locale.  For groff's purposes,
    simply "LANG=it" will suffice.

According to my experiments, I don't need the following in the it.tmac
file.

.hcode á á  Á á
.hcode à à  À à
.hcode è è  È è
.hcode é é  É é
.hcode í í  Í í
.hcode ì ì  Ì ì
.hcode ó ó  Ó ó
.hcode ò ò  Ò ò
.hcode ú ú  Ú ú
.hcode ù ù  Ù ù

The contributor, Edmond Orignac, seemed uncertain as to whether
any .hcode requests were necessary, and I am starting to think they
aren't.  They exist in other localization macro files because the
hyphenation pattern files (from TeX) contain non-ASCII code points, so
the pattern file parser[3] has to be told how to interpret them.  The
ones for Italian don't contain such code points.

That would in turn mean that we don't need this in it.tmac either:

.\" Default encoding
.mso latin1.tmac

Removing it would move us one small step toward the future that Werner
Lemberg envisioned 21 years ago[4].

Unfortunately just updating all of our other pattern files, while a good
thing to do on other grounds, won't get us much closer because (except
for English) they continue to contain non-ASCII code points.  Worse,
they're UTF-8-encoded now, and our pattern file parser doesn't know how
to handle that.

Regards,
Branden

[1] https://git.savannah.gnu.org/cgit/groff.git/tree/tmac/LOCALIZATION
[2] though Dave Kemper's recall of them is superior to mine
[3] src/roff/troff/env.cpp:3784-3989
[4] https://savannah.gnu.org/bugs/?60536

--
Dr. Oliver Corff
-- China Consultant --
Wittelsbacherstr. 5A
D-10707 Berlin
Tel.: +49-30-8572726-0
Fax : +49-30-8572726-2
mailto:oliver.corff@email.de