Re: Special characters


From: G. Branden Robinson
Subject: Re: Special characters
Date: Fri, 22 Sep 2023 02:46:34 -0500

Hi Merijn,

At 2023-09-22T08:43:37+0200, H.Merijn Brand wrote:
> On Thu, 21 Sep 2023 18:51:42 +0000, Lennart Jablonka <humm@ljabl.com> wrote:
> > > [groff 1.23] renders all manual pages useless  
> > 
> > If you aren’t careful, you could evoke the impression your hyperbole
> > is to be taken literally.
> 
> :)

There are many man pages that don't use the '~' input character at
all.[1]  For instance, of the 58 page sources[2] in the groff source
tree, 54 manage to do without it.  These constitute literally hundreds
of pages of documentation (everything but eqn(1), groff_char(7),
groff_diff(7), roff(7)).

The catastrophizing tone of your message is unwarranted and might
discourage people from correcting your inflated claims, since they don't
appear well-founded in evidence in the first place--and we all know
people for whom evidence-based reasoning is an unfamiliar process.

> And my wording was too rigid. Even if *I* do feel these translations
> are silly, they will have value to others, otherwise the changes would
> not have been decided to.

That was my reasoning when promulgating the change.  The idea is to let
people who care about this issue observe it and deal with it, and to
give those who don't (possibly including package maintainers for entire
*nix distributions) a recipe for degrading the glyphs to ASCII.

> > In the man page, you can use the correct characters;  for the output
> > for bad man pages, you can put translations in m{an,doc}.local.
> 
> Are there examples?

Yes, and you quoted the means of your own remedy.

> > > I download the source code to read "NEWS", and I think the relevant
> > > section is
> > > --8<---
> > > o The an (man) and doc (mdoc) macro packages no longer remap the -, ',
> > >   and ` input characters to Basic Latin code points on UTF-8 devices,
> > >   but treat them as groff normally does (and AT&T troff before it did)
> > >   for typesetting devices, where they become the hyphen, apostrophe or
> > >   right single quotation mark, and left single quotation mark,
> > >   respectively.  This change is expected to expose glyph usage errors in
> > >   man pages.  See the "PROBLEMS" file for a recipe that will conceal
> > >   these errors.

...there you go.

The "PROBLEMS" file is in the groff distribution archive--that which the
GNU Project actually releases.  If a downstream redistributor fails to
include this file in their binary packages or similar artifacts, that
might be considered a defect.

Here is what "PROBLEMS" has to say.[3]

--snip--
[groff 1.19.2]

* When viewing man pages, some characters on my UTF-8 terminal emulator
  look funny or copy-and-paste wrong.  Why?

Some Unicode Basic Latin ("ASCII") input characters are mapped to
non-Basic Latin code points in output for consistency with other output
devices, like PDF.  See groff_man_style(7) and groff_char(7) for correct
input conventions and background.  If you use the correct groff special
character escape sequences to input them, you will get correct output no
matter what device the input is formatted for.

However, many man pages are written in ignorance of the correct special
characters to obtain the desired glyphs.  You can conceal these errors
by adding the following to your site-local man(7) configuration.  The
file is called "man.local"; its installation directory depends on how
groff was configured when it was built.

--- start ---
.if '\*[.T]'utf8' \{\
.  char ' \[aq]
.  char - \-
.  char ^ \[ha]
.  char ` \[ga]
.  char ~ \[ti]
.\}
--- end ---

You may also wish to do the same for "mdoc.local".

In man pages (only), groff maps the minus sign special character '\-' to
the Basic Latin hyphen-minus (U+002D) because man pages require this
glyph and there is no historically established *roff input character,
ordinary or special, for obtaining it when a hyphen and minus sign are
both separately available.  To obtain a true minus sign, use the special
character escape sequences '\(mi' or '\[mi]'.
--end snip--

As you can see, the issue is actually quite old; groff 1.19.2 was
released in 2005.  Later releases responded to complaints like yours by
putting logic much like the above directly into the man macro package,
which implied a greater degree of official endorsement than I think was
intended.
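
To make the "correct input conventions" concrete, here is a minimal
sketch of man page source that spells each of the five characters with
its special character escape sequence.  The page name "foo", its option,
and the shell commands are invented for illustration; groff_man_style(7)
and groff_char(7) remain the authorities.

--- start ---
.TH foo 1 2023-09-22 "foo 1.0"
.SH Name
foo \- frobnicate files
.SH Synopsis
.B foo
.RB [ \-\-verbose ]
.I file
.SH Examples
.EX
$ foo \-\-verbose \[ti]/notes.txt
$ grep \[aq]\[ha]foo\[aq] \[ga]ls\[ga]
.EE
--- end ---

Written this way, the hyphen-minus, neutral apostrophe, circumflex,
grave accent, and tilde should come out as their Basic Latin forms on
any output device, with no help needed from man.local.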

> • I personally do not care about Unicode quotes, dashes, and other
>   special tokens for where it has to do with English text.

By "do not care", I take you to mean that you have no preference whether
they show up as characters from the Unicode General Punctuation block
or from Basic Latin (a.k.a. "ASCII").

> • I however *do* care about special characters that are explicitly
>   intended, like bullets, currency indicators, and Unicode glyphs
>   in names like Żáïłēñőŗ

Yes.  groff developers care about those too, which is why there is a
mechanism for specifying a broader menagerie of characters than appears
as keycap engravings on a keyboard.  See groff_char(7).
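
As an illustrative sketch (the sample text is made up, and groff_char(7)
is the reference for the glyph names), characters like those can be
requested explicitly, either by name or by Unicode code point.  Whether
a given output device can render each one is a separate question.

--- start ---
.\" bullet, euro sign, pound sign, yen sign
\[bu] Price: \[Eu]12, \[Po]10, or \[Ye]1500
.br
.\" named glyphs where groff has them, \[uXXXX] otherwise
\[bu] Author: \[u017B]\['a]\[:i]\[/l]\[u0113]\[~n]\[u0151]\[u0157]
--- end ---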

> • My documentation is written in .pod or .md and then translated
> 
>   $ pod2markdown  < CSV_XS.pm    > doc/CSV_XS.md
>   $ pod2html      < CSV_XS.pm    > doc/CSV_XS.html
>   $ pod2man       < CSV_XS.pm    > doc/CSV_XS.3
>   $ nroff -mandoc < doc/CSV_XS.3 > doc/CSV_XS.man
> 
>   That last line now rewritten to use a perl filter to match my needs
> 
>   $ nroff2man     < doc/CSV_XS.3 > doc/CSV_XS.man
> 
>   This implies that I have no control (yet) over what ends up in the
>   CSV_XS.3 file

You might share this "nroff2man" script with this mailing list to get
feedback from *roff practitioners of long experience.

> > I’m inclined to agree: That sounds like something quite a few people
>    |
>   I'm is what *I* would like to see
> 
> Those special quotes are also used in error messages nowadays (wget,
> gcc, ...) and those also cause extra work, as they are not recognized
> in double-click actions in e.g. xterm, where the ascii alternatives
> are, see e.g. some example lines in ~/.Xdefaults for average Joe:
> 
>  XTerm*on2Clicks:                regex [^ \n*\043\047`|@#:;& ]+
>  XTerm*on3Clicks:                regex [^ \n*\043\047`]+
>  XTerm*on4Clicks:                regex [^ \n]+
>  XTerm*on5Clicks:                line
> 
> To take special quotes in those attributes, many people will have to
> do a lot of work, and some older tools do not even support it.

The question of xterm's determination of highlighting boundaries is
outside the scope of this mailing list.  Unfortunately, Thomas Dickey
maintains no mailing list for xterm.  You might raise the issue on the
bug-ncurses list[4] for want of a better place.

> > might want for different reasons, so it would be good if we had an
> > option for that.
> 
> I know about -Tascii (or GROFF_TYPESETTER=ascii), but that will also
> disable bullets and stuff. I just don't want any translations that are
> now default but not explicit in the source. As I do not know the full
> range of which characters are changed to which other characters, why
> and under what criteria, it is hard to exactly tell you what I
> personally would like to enable or disable.

In my experience the input characters that cause difficulty are the five
listed above.  Two others, \ and ", tend to cause _syntactical_ problems
when misused, so inexperienced man page authors notice these issues much
more readily.

The groff_char(7) man page covers these issues in exhaustive detail.  I
commend its section "History" to your attention.
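
Each "char" request in the "PROBLEMS" recipe defines one character on
its own, so you can keep only the lines you want.  As a sketch only, a
site that wants ASCII hyphens, carets, and tildes but is content with
typographic quotation marks might use the following in "man.local".  The
particular selection is an assumption for illustration, not a
recommendation.

--- start ---
.\" Degrade only -, ^, and ~; leave ' and ` as groff renders them.
.if '\*[.T]'utf8' \{\
.  char - \-
.  char ^ \[ha]
.  char ~ \[ti]
.\}
--- end ---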

> That will be the translator tools. As said, I write pod or markdown.
> There is a complete snakepit in that toolchain if the source code
> contains actual UTF-8, as it is likely that part of that is lost or
> b0rked along the way.

If the source code contains actual UTF-8, then pod2man/podlators should
produce man(7) source that uses correct special character escape
sequences.  However, this is a thorny problem because the special
character repertoire is not completely portable to all *roff
implementations.  (Unfortunately, circa 1980 when Kernighan rewrote
Ossanna's troff for device-independence, he decided not to mandate a
list of special character identifiers that every output device should
support.  A similar problem afflicts font identifiers.)

Because Perl means to be portable to proprietary Unices that still cling
to the (long unmaintained) troff from Unix System V, there are bound to
be problems in this area, even if correct input practices are employed.
In AT&T troff there's no way to even inquire if a certain special
character is available.
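
groff, by contrast, can ask.  A minimal sketch, assuming groff (the "c"
conditional operator is a GNU extension, and the sentence being typeset
is invented):

--- start ---
.\" Use the euro sign if the output device has it; else spell it out.
.ie c \[Eu] The fee is \[Eu]10.
.el The fee is EUR 10.
--- end ---

That is no help to a generator whose output must also format under AT&T
troff, which is the portability problem described above.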

The lines of communication between pod2man/podlators maintainer Russ
Allbery and groff developers are open and have seen recent use.[5]

> As a perl5 developer, I have tons of sources of more than 100 versions
> of perl and only the pod files add up to 231920 lines of
> documentation.  That excludes the pod documentation inside the pm
> files which probably runs into several million lines of documentation.

That's a good argument for pod2man(1) producing the best output that it
can; I told Russ to consider me a resource for relevant improvements.[6]

> > That would be sad, seeing as 1.23 has so many improvements in its
> > documentation that make it easier for the reader to grasp good
> > practice.
> 
> Documentation++
[...]
> It would be awesome if I could have the full list of potential
> translations and the matching lines in the mandoc file, so I can choose
> which ones to keep and which ones to disable

See above, and please consider setting some time aside to read the
groff_char(7) page.

Regards,
Branden

[1] Contrast with the '\~' escape sequence, which produces an
    unbreakable adjustable space, but not a glyph.
[2] One, groff_man.7.in, produces two different pages at build time.
[3] https://git.savannah.gnu.org/cgit/groff.git/tree/PROBLEMS?h=1.23.0#n82
[4] https://lists.gnu.org/archive/html/bug-ncurses/
[5] https://lists.gnu.org/archive/html/groff/2022-12/msg00151.html
[6] https://lists.gnu.org/archive/html/groff/2022-12/msg00169.html
