Re: Groff UTF-8 support? - Groff documentation section 5.1.9 Input Encodings


From: ropers
Subject: Re: Groff UTF-8 support? - Groff documentation section 5.1.9 Input Encodings
Date: Fri, 8 Mar 2024 09:39:15 +0000

> > > On 3/7/24, ropers <ropers@gmail.com> wrote:
> > >> "latin1" sounds awfully ISO-8859-1ish, and (I fear) not very much like
> > >> the Latin-1 Supplement Unicode block
> > >
> > On 07/03/2024, Dave Kemper wrote:
> > > Correct.  Since there are two different things that include "Latin-1"
> > > in their name, perhaps this wording could be more explicit.  On the
> > > other hand, the context is input encodings, and a Unicode block is not
> > > itself an input encoding.
> >
> > It might be preferable to demine rather than rely on contextual hints
> > as to the presence of UXO:
>
> What's "UXO"?  Google suggests that it's "unexploded ordnance".  While
> I'm sure we have some of that in groff,[1] I don't see the application
> here.
>
> There's not even an "aka" [sic] here to throw me a bone...

Sorry about that.  Yes, unexploded ordnance.  I was speaking
metaphorically and meant to say that it might be better to remove the
pitfall opportunity, rather than be mollified and assuaged by the
knowledge that there are perfectly good contextual hints for Harry to
avoid falling in.
No, wait -- that's metaphorical too.  I meant to say that perhaps it's
a good idea to go ahead and try to fix the wording despite the fact
that Dave was quite right to say a Unicode block isn't an encoding as
such.

I did not mean to imply or allude to anything more nefarious, like
bugs in the code ready to blow up.

I did mean to use the metaphor to segue to this diff:

> > $ diff -u groff.texi.orig groff.texi
> > --- groff.texi.orig   2024-03-05 18:20:59.940460376 +0000
> > +++ groff.texi        2024-03-08 00:21:12.782360544 +0000
> > @@ -5509,9 +5509,10 @@
> >  @cindex ISO @w{8859-1} (@w{Latin-1}), input encoding
> >  @cindex input encoding, @w{Latin-1} (ISO @w{8859-1})
> >  @pindex latin1.tmac
> > -ISO @w{Latin-1}, an encoding for Western European languages, is the
> > -default input encoding on non-@acronym{EBCDIC} platforms; the file
> > -@file{latin1.tmac} is loaded at startup.
> > +ISO 8859-1, aka @w{Latin-1}, an extended ASCII encoding chiefly for
> > +Western European languages, is still @code{groff}'s default input encoding on
> > +non-@acronym{EBCDIC} platforms; the file @file{latin1.tmac} is loaded
> > +at startup.
>
> I dislike the term "extended ASCII",

De gustibus non est disputandum.

> and I don't think it sheds any
> light on the subject.  It is redundant with respect to the referenced
> character encodings.  All of the ISO 8859 encodings are 8-bit extensions
> of ASCII.

Sure, but at the point where you're trying to explain to the reader
what an ISO 8859-something is, at that point it maybe isn't safe to
rely on knowledge that flows from existing familiarity with ISO 8859.

> Next, why the explicit callout of "8859" in the text in _addition_ to
> the concept index where it is already tagged?

Consistency.

> Every word in the manual needs to pay its freight.

Fair.  (I would've said fair point, but... ;-)

> What purpose is "aka" serving?  First, it is more correctly written
> "a.k.a." and secondly it is not adding any value here.  A noun phrase
> can be renamed in any of several ways (bracketed in parentheses, em
> dashes, or commas, depending on its surroundings).  I find this
> construction to clutter the narrative.

Nolo contendere. (I was tempted to say just *nolo*, but "c'mon Jack"! ;-P)

> > -To use ISO @w{Latin-2}, an encoding for Central and Eastern European
> > -languages, invoke @w{@samp{.mso latin2.tmac}} at the beginning of your
> > -document or supply @samp{-mlatin2} as a command-line argument to
> > +To use ISO 8859-2, aka @w{Latin-2}, an encoding for Central and Eastern
> > +European languages, invoke @w{@samp{.mso latin2.tmac}} at the beginning of
> > +your document or supply @samp{-mlatin2} as a command-line argument to
> >  @code{groff}.
>
> This is a copy-and-paste of the foregoing item, and reveals a comfort
> with robotic text replacement that promptly leads you astray.
>
> >  @item latin5
> > @@ -5544,8 +5545,8 @@
> >  @cindex ISO @w{8859-9} (@w{Latin-5}), input encoding
> >  @cindex input encoding, @w{Latin-5} (ISO @w{8859-9})
> >  @pindex latin5.tmac
> > -To use ISO @w{Latin-5}, an encoding for the Turkish language, invoke
> > -@w{@samp{.mso latin5.tmac}} at the beginning of your document or
> > +To use ISO 8859-5, aka @w{Latin-5}, an encoding for the Turkish language,
> > +invoke @w{@samp{.mso latin5.tmac}} at the beginning of your document or
> >  supply @samp{-mlatin5} as a command-line argument to @code{groff}.
> >
> >  @item latin9
> > @@ -5554,9 +5555,9 @@
> >  @cindex ISO @w{8859-15} (@w{Latin-9}), input encoding
> >  @cindex input encoding, @w{Latin-9} (ISO @w{8859-15})
> >  @pindex latin9.tmac
> > -ISO @w{Latin-9} succeeds @w{Latin-1}; it includes a Euro sign and better
> > -glyph coverage for French.  To use this encoding, invoke @w{@samp{.mso
> > -latin9.tmac}} at the beginning of your document or supply
> > +ISO 8859-9, aka @w{Latin-9} succeeds @w{Latin-1}; it includes a Euro sign
> > +and better glyph coverage for French.  To use this encoding, invoke
> > +@w{@samp{.mso latin9.tmac}} at the beginning of your document or supply
> >  @samp{-mlatin9} as a command-line argument to @code{groff}.
> >  @end table
>
> ...as here.  You have misidentified the relevant character encoding
> standards.
>
> The numbered parts of the ISO 8859 character encoding standards are not
> identical to the numeric suffixes applied to them.
>
> ISO 8859-5 is not ISO Latin-5, but Latin/Cyrillic.
>
> ISO 8859-9 is not ISO Latin-9, but Latin-5/Turkish.
>
> https://en.wikipedia.org/wiki/ISO/IEC_8859
>
> I am strongly dubious of proposed corrections to groff documentation
> that make it unquestionably incorrect.
>
> I suggest you check your sources better before making recommendations.
>
> > Внимание!
> > I have not actually previewed this!

I went to diffs of groff.texi rather than sticking with purely
descriptive narratives so as to try and be more helpful.  My above
warning however was precisely intended to serve as a
"Don't-Really-Know-What-I'm-Doing-Doge" sign.
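
For anyone else tempted to equate the "N" in Latin-N with the part
number of ISO 8859, the mismatch pointed out above can be checked
mechanically.  The following is only a sketch in Python (not anything
groff itself ships or uses); the dictionary is my own illustration,
keyed by the names of the four groff macro files mentioned in the diff,
and it assumes Python's standard codec aliases.

```python
import codecs

# Informal "Latin-N" names and the ISO 8859 parts they denote, per the
# correction quoted above.  Illustrative table only, not groff data.
LATIN_TO_ISO_PART = {
    "latin1": "iso8859-1",
    "latin2": "iso8859-2",
    "latin5": "iso8859-9",   # Turkish; ISO 8859-5 is Latin/Cyrillic
    "latin9": "iso8859-15",  # the Latin-1 successor with the Euro sign
}

for alias, part in LATIN_TO_ISO_PART.items():
    # codecs.lookup() resolves an alias to Python's canonical codec name.
    assert codecs.lookup(alias).name == part

# One byte, three meanings, depending on which standard is in force.
turkish = b"\xdd".decode("iso8859-9")   # LATIN CAPITAL LETTER I WITH DOT ABOVE
cyrillic = b"\xdd".decode("iso8859-5")  # CYRILLIC SMALL LETTER EN
western = b"\xdd".decode("iso8859-1")   # LATIN CAPITAL LETTER Y WITH ACUTE

# Byte 0xA4 is where Latin-9 gains the Euro sign over Latin-1.
euro = b"\xa4".decode("iso8859-15")     # EURO SIGN
currency = b"\xa4".decode("iso8859-1")  # CURRENCY SIGN
```

Had I run something like this first, the "8859-5" and "8859-9" slips
in my diff would have tripped the assertions straight away.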

> If you're going to propose alterations to documentation, I suggest you
> develop confidence in them through empirical testing of your claims
> first.

It's a judgement call.  Is it preferable to not report possible issues
for lack of confidence and polish, or is it better to report, complete
with full disclosure and fair warning as above, and be willing to
accept rejection or correction?  You tell me what you prefer.

> > Truth be told, info(1) is Greek to me.  I've tried
> > $ info groff.texi #,
>
> Several of our man pages describe how to read the Texinfo manual in info
> format.
>
> groff(1):
> groff_diff(7):
> groff_font(5):
> groff_me(7):
> groff_mm(7):
> groff_mom(7):
> groff_ms(7):
> groff_out(5):
> groff_tmac(5):
> groff_trace(7):
> troff(1):
> See also
> [...]
>      Groff: The GNU Implementation of troff, by Trent A. Fisher and
>      Werner Lemberg, is the primary groff manual.  You can browse it
>      interactively with “info groff”.
>
> groff(7):
>
> Description
>      groff is short for GNU roff, a free reimplementation of the AT&T
>      device‐independent troff typesetting system.  See roff(7) for a
>      survey of and background on roff systems.
>
>      This document is intended as a reference.  The primary groff
>      manual, Groff: The GNU Implementation of troff, by Trent A. Fisher
>      and Werner Lemberg, is a better resource for learners, containing
>      many examples and much discussion.  It is written in Texinfo; you
>      can browse it interactively with “info groff”.  Additional formats,
>      including plain text, HTML, DVI, and PDF, may be available in
>      /home/branden/groff-HEAD/share/doc/groff-1.23.0.
>
> Have you consulted any of these man pages?

Yes.  And insofar as you may have been asking rhetorically to
underline what you just quoted here: As per Explain xkcd, the fact
that this kind of advice is not fit for purpose -- well, that's the
joke.  Or one of xkcd 912's big jokes anyway.

IMNSHO every one of those hints --repeated on so many ingenue man
pages-- is an indictment, a confession, an indication that ordinary
decent (l)users like the rest of us have to be forced to use info(1).
Which is an indication of conflict outcome more than anything.

Here's John Malkovich as Pascal "Info" Sauvage:
"All these stupid little lusers have to do is stan online and do what
they are told for one miserable day, but can they do that?  Can they
do that?  My fragrant info(1) arse they can't!" --from Johnny English (1)

The more you tighten your grip, Brandkin, the more Play-Doh will slip
through your fingers, and the playdoh, green is people.  We The People
who say D'oh!

> The groff source distribution also provides its Texinfo manual in four
> other formats: PDF, TeX DVI, HTML, and plain text.
>
> You can furthermore find all of these formats at the groff web page.
>
> https://www.gnu.org/software/groff/manual/index.html

Sure, but they're pre-converted from an unedited groff.texi.  The
thing I've given up on was figuring out how to preview my edits to
groff.texi and possibly make them less iffy.  I might even have
spotted some of my above mistakes, but there we are.

> I understand people not wanting to develop a competence with the GNU
> Info reader (or Emacs, for that matter), but I think any suggestion that
> groff has made its own documentation inaccessible by use of that
> format severely disserves the truth.

Well, that's not quite how I put it, but *nolo*.

> > which made it say "Cannot find node 'Top'." at
> > the bottom (pun intended?), and then I couldn't figure out how to
> > actually view the groff info manual.  Not that I've tried much, but
> > still.
> > IMNSHO it is incredibly ironic, and--if one could hurt a program's
> > feelings--almost insulting for groff's manual to be maintained in info
> > format.  Not exactly dogfooding, no?
>
> Welcome to the Larry McVoy club.  You can search the groff mailing list
> archives for his name for a series of recurring threads raising this
> complaint every few years.

Honestly, I'd never run across him before, but sounds like a man who
knows what he's talking about.

> Something that no one making this gripe ever seems to notice is that
> I've been copying more and more content over the past several years into
> groff's man pages from our Texinfo manual, and maintaining the content
> in parallel.  In the list archives you can find our former maintainer
> Werner Lemberg's rationale for maintaining _a_ groff Texinfo manual, and
> I find his reasoning sound.  Long story short, there are two major
> factors: (1) the formats differ in purpose; a Texinfo manual, _ideally_,
> reads comfortably as a book.  It is (can be) organized in such a way
> that you can start out reading it at the beginning knowing very little
> about the subject, and the manual will spin you up.

That's precisely what I'm presently reading the Groff documentation
for, online (for the latest version), and in HTML format (mostly; some
content is better suited to the PDF, which I've cross-checked on
occasion).

Of course, beyond man page conventions, there are plenty of macro
packages that would allow dogfooding even of book-like content (hi
mom!).

Possibly my most visceral dislike of info(1) has to do with the forced
"need" to learn yet another counter-intuitive and insufficiently
discoverable tool (and break comfortably established habits) just to
learn about tool usage.  Chicken/egg, btw?

> To sum up, negative opinions of groff's Texinfo manual seem to arise
> nearly exclusively from people who refuse to engage with the material.

Again, happy with the content itself, not happy with the info(1) tool,
and bemused by the not-quite-dogfooding groff(1) irony, especially
given groff's heft.  Note *when* I complained and over *what*.

> With such people, I have little hope of fruitful dialogue.

For the aforementioned reasons, I don't see myself as one of such people.
$ man straw
No manual entry for straw

> > At the peril of slighting the
> > local champion, my opinions on info(1) reduce to <xkcd.com/912>, and I
> > suspect
> > $ info mcas
> > is a synonym for
> > $ kill -9 346 #,
> > and in light of his prescience, I remain unconvinced *Primer* wasn't
> > based on the exploits of one Randall Munroe + colleague.
>
> I cannot make sense of the foregoing.  I have no moral objection to an
> opinionated rant--the more informative and artfully written, the
> better--but I would suggest that one should be proofread to the point
> where one's audience has a fighting chance of comprehending it.

The relevant xkcd strip which all but predicted pilots fighting MCAS
<enwp.org/Maneuvering_Characteristics_Augmentation_System> (or
intransigent IT, at least) was published years before the Spirit of
Renton rolled down a runway in the eponymous locality.
The Max-8 very suddenly and unceremoniously killed 346 before it was
grounded.
The #, shell comment syntax was just a quirk, a way I could put
sentence punctuation on the same line and not feel bad about it.
Primer is the famous(ly nerdy) time travel film with a cult fanbase
and little chance of a viewer's understanding it while watching it
just the once and without googling supplemental documentation.
Randall Munroe is the author of xkcd, but there were two main
protagonists in Primer (SPOILER: before they turned on the first box),
hence the +1.
All of this is also googleable -- and didn't you just criticise me for
not putting enough effort into figuring out the info-rmation *you*
like? :o) HHOK/HHOS, take your pick.

Honestly--and awkwardly--if my in-joke humour annoys you, you're not
going to like the other email I've started drafting -- it's freakin'
full of 'em.
In my defence, I stand ready to explain everything.  Deal?

> > Based on my admittedly not quite unlimited insight into Unicode
> > issues, if taken literally, a mission statement "to extend groff from
> > 8bit to 32bit input characters" strikes me as an already outmoded if
> > not stillborn strategy.  It might be much better to go all-in on
> > variable-width encoding, read: UTF-8, just like everybody else.
>
> <pinches bridge of nose>
>
> What good would that do when the formatter needs to look up character
> properties using the character code as an index?  That's going to make
> the formatter's internals more complex, slower, and prone to error for
> no benefit apart from the useless boast that "groff is UTF-8 all the way
> through".  Which will be a lie in many cases no matter what...
>
> > Whatever limited *strictly internal* use there may still be for UTF-32
> > in some buffers, structs or variables,
>
> Frankly, that's precisely the damned point.

I think there's actually so little daylight between our positions on
that point, if I actually argued back on the above, I'd only be
engaging in a narcissism of small differences.
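
To spell out the position we seem to share: accept UTF-8 at the edges,
but hold fixed-width code points internally so that a character code
can serve directly as a table index.  Below is a toy sketch in Python,
not groff's actual implementation; decode_input() and the
HYPOTHETICAL_IS_LETTER table are names I made up for illustration.

```python
from array import array


def decode_input(raw: bytes) -> array:
    """Decode UTF-8 once, at the input boundary, into fixed-width ints."""
    # Typecode "l" is a signed integer of at least 32 bits.
    return array("l", (ord(ch) for ch in raw.decode("utf-8")))


# Hypothetical property table indexed directly by code point.  It is
# sized just large enough to cover the Euro sign at U+20AC.
HYPOTHETICAL_IS_LETTER = [chr(cp).isalpha() for cp in range(0x2100)]

text = "naïve €"
raw = text.encode("utf-8")   # variable width: 10 bytes
codes = decode_input(raw)    # fixed width: 7 code points

# Indexing the byte string lands on a lead byte, not on a character.
third_byte = raw[2]          # 0xC3, first half of the two-byte "ï"

# Indexing the decoded array yields a whole character code, which can
# then be used as a plain index into the property table.
third_code = codes[2]        # 0xEF, "ï"
i_diaeresis_is_letter = HYPOTHETICAL_IS_LETTER[third_code]
euro_is_letter = HYPOTHETICAL_IS_LETTER[codes[6]]

# Encoding back to UTF-8 happens only at the output boundary.
round_trip = "".join(chr(cp) for cp in codes).encode("utf-8")
```

The internal width is then an implementation detail that no user ever
sees, which I take to be the point made above.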

> Whether GNU troff can parse
> UTF-8 as input, and whether it can or should produce UTF-8 as (the
> encoding of its) output, are almost _entirely_ unrelated questions.
>
> Users rightfully care a lot about the former.  About the latter, I think
> some users claim to care, but they don't really, because none of them
> write postprocessors for GNU troff (nor any other troff).  What they
> care about is the final output format: PDF documents and streams to
> terminals being of foremost importance.

Where I started caring about the pipe-work plumbing solution was
somewhat near the entrance to the rabbit hole I went down to get here.
More on this soon -- possibly.

> groff has already for many years produced UTF-8 output to terminals.
>
> And UTF-8 isn't even relevant to PDF, which uses UTF-16 (I thought LE,
> but another source I've hit recently says BE, so I need to nail this
> down--our resident PDF expert is unfortunately on sabbatical).
>
> > anything not UTF-8 is probably best kept to a minimum.
>
> This statement is either a truism or meaningless.  UTF-8 is very much
> worth supporting at endpoints.  The size of the "minimum" of internal
> implementation detail is neither here nor there.  And if you're unlucky
> enough to be dealing with a file format that hitched its wagon to UTF-16
> because Adobe engineers think a lot like those at Microsoft, you won't
> get far with a puritanical UTF-8 approach.
>
> > But perhaps I'm barking at shadows here.
>
> Possibly.
>
> > Nothing in this <https://lists.gnu.org/r/groff/2004-05/msg00074.html>
> > is smoking-gun evidence that would compel a jury of me, myself and I
> > to conclude Werner et al. WEREN'T aware of that already, or if not
> > then, then certainly now.
>
> I believe he was aware of the issues.
>
> > I was a few paragraphs into that before I realised the author of the
> > above comment is Ingo Schwarze, an OpenBSD dev I've previously talked
> > to, and whose judgement on this I trust A LOT.
>
> Ingo is a worthy interlocutor and sounding board, but I've caught him
> out in error more than once (as he has me).  I take his views seriously
> but do not regard him as an oracle.
>
> I would point out that Ingo's lengthy jeremiad
> <https://savannah.gnu.org/bugs/?40720#comment4> was against `wchar_t`
> and against Bernd Warken's suggestion to create our own data type for
> storing code points, to which Ingo retorted:
>
> "You do not create a new type when a type for exactly that purpose
> already exists in the C standard."
>
> Fortunately, groff has procrastinated this work for so jaw-droppingly
> long that int32_t, a standard C and C++ type, will likely be supported
> on every platform of interest (where it isn't already) by the time
> anyone can think of landing such an internal refactoring.
>
> And int32_t is exactly where I aim to go if I do this work myself.
> Might use the sign bit for "tokens" and/or "nodes", groff internal data
> structures that get encoded into macros and diversions.
>
> But as an _internal refactoring_, no one should notice when we do it,
> unless their libc (or compiler?) chokes on the type.  And for them we'll
> always have gnulib, I trust.
>
> > I really only dove into the groff manual thanks to an observed
> > (kernel.org) ascii(7) man page bug I only have a partial fix for,
> > which is why I'm still reading, all of which I'll possibly talk about
> > at a later date.
>
> I have plans for that page myself, but if you make them unnecessary that
> will be fine with me.
>
> The issue with the vertical rule in the table being drawn wrongly is a
> known issue with several prerequisites ahead of fixing it.  _Maybe_ I'll
> get them all whacked for 1.24.  I sure hope so.
>
> https://savannah.gnu.org/bugs/?65189
>
> Regards,
> Branden
>
> [1] https://savannah.gnu.org/bugs/?65322

Thanks for your time.
Ian
