groff
Re: mandoc -man -Thtml bug: inconsistent vertical space before .TP


From: Ingo Schwarze
Subject: Re: mandoc -man -Thtml bug: inconsistent vertical space before .TP
Date: Sat, 11 Nov 2023 00:11:40 +0100

Hi Branden,

G. Branden Robinson wrote on Sat, Oct 28, 2023 at 02:34:45PM -0500:
> At 2023-10-26T18:37:58+0200, Ingo Schwarze wrote:

>> In particular, when designing a markup language for documentation, i
>> consider it critical to carefully compare the design to HTML, LaTeX,
>> and mdoc(7) before making final decisions, and there may be a few
>> more that might also be worth looking at for comparison.

> I was not attempting to confess ignorance of HTML, but making the
> much simpler statement that I had not previously bothered to attempt
> a mapping from man(7) macro names to HTML elements.

But *that* is exactly the kind of comparison needed when you design
new macros for the man(7) language: in the context of "i have this
novel idea what man(7) should maybe become able to do, and i have a
rough idea for my new macro or for extending an existing macro," the
"compare the design" i mentioned above means to ask: Can HTML, LaTeX,
mdoc(7) do the same thing, and if so, how?  If the answer is "no",
that does not necessarily mean it's a bad idea, but if the answer is
"i never thought about that", that surely sounds disquieting.

FWIW, Kristaps and i have not only "attempted" that, but
completed that very task to a state that i would call ready for
production (but certainly not perfect):

  https://cvsweb.bsd.lv/mandoc/man_html.c

That work started in 2009 and was already quite usable in 2010.

The result isn't great, mind you.  It writes lots of <b> and <i>,
almost none of which are correct according to HTML 5.  It also
writes a few <div> and <span> and <table> that are rather unexpressive
and only mitigated by class= attributes for CSS purposes.

So the idea is absolutely not new and certainly one among several
aspects to consider when designing or extending man(7) macros.

> [2] That may sound like a surprising statement, or at least one begging
>     follow-up.  The idea is this: grohtml attempted to solve the HTML
>     translation problem by working _completely generally_ with any valid
>     *roff input, which is hopelessly loose and not block-structured.

At least part of what you say in this footnote makes sense to me,
though i don't feel ready to judge whether a project based on these
ideas could succeed.

But one aspect is certain: the main reason why mandoc(1) HTML output
is so much better than grohtml(1) is exactly what you described: it
restricts itself to two macro sets and preserves macro information
from the input right through to the formatters.

I think the AST approach taken by mandoc(1) is more structured,
simpler, and likely more powerful than your "x" extension command
idea, but that doesn't mean the "x" idea is doomed to failure.

> Worry not.  I'm reading.  The 1,500-page spec of the "Living Document"
> is a bit discouraging in its length, however.

Yes, HTML 5 has grown fat, as most languages do when they are no
longer young.  Still, i find the HTML standard easier to read than
most.  If you want something truly horrific, try reading the XSLT or
ASN.1 standards.  PDF and CSS are also way worse than HTML.

> Alex Colomar thinks groff_man_style(7) is dauntingly long at 20.

And a certain G. Branden Robinson thinks that mdoc(7) is dauntingly
long at 33.

Don't forget the target audience, though.  We expect the average
programmer not interested in markup to use groff_man_style(7)
and mdoc(7) for their daily side task of documenting their own
program.

I only expect the *designer* of man(7) to consider how their
design compares to HTML, LaTeX etc.  Users of man(7) certainly
need not torment themselves with reading the HTML standard.

>>> LS -> <UL>
>>> TP -> <LI> ... </LI>
>>> LE -> </UL>
> [...]
>> Well, *if* you really want to totally redesign the very foundations
>> of man(7) and change it from almost presentation-only and almost
>> in-line-macro only to the totally different paradigms of semantic
>> markup and block oriented, that is definitely one among the many tasks
>> involved in redesigning.

> At this point I don't think "totally redesigning" man(7) is necessary,
> either in general or to achieve the specific aim above.  man(7) already
> has no list-structuring macros, so I don't have to delete anything.
> Just add a bit of information that, for anything other than HTML output,
> can be harmlessly ignored anyway.

With "totally redesigning" i mean turning the paradigm upside down.
Even with your latest additions, man(7) is still an almost
exclusively presentational language with an almost exclusively
in-line no-block structure.

What you are aiming for is apparently a mostly semantic language with
mostly block-nesting structure.  How is that not "totally redesigning"?

Yes, there are a few exceptions already, but they feel unsystematic:
These are semantic explicit blocks but not fully nesting: .EE .MT .SY .UR
These are semantic in-line macros: .MR .OP
These are structural, nesting, implicit blocks: .SH .SS
This one is an explicitly nesting block but purely presentational: .RS
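
The four groups above can be tabulated, which makes the lack of a
system easy to see.  The following is only a sketch in Python: the
MacroInfo type and its field names are made up for illustration, and
only the grouping itself is taken from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MacroInfo:
    """Hypothetical record; the three fields are illustrative labels only."""
    kind: str     # "semantic", "structural", or "presentational"
    scope: str    # "explicit block", "implicit block", or "in-line"
    nesting: str  # "full", "partial", or "n/a"


# Grouping exactly as given in the list above; nothing else is implied
# about how any real man(7) implementation classifies these macros.
MACROS = {
    **{m: MacroInfo("semantic", "explicit block", "partial")
       for m in ("EE", "MT", "SY", "UR")},
    **{m: MacroInfo("semantic", "in-line", "n/a")
       for m in ("MR", "OP")},
    **{m: MacroInfo("structural", "implicit block", "full")
       for m in ("SH", "SS")},
    "RS": MacroInfo("presentational", "explicit block", "full"),
}


def by_kind(kind):
    """Return the sorted names of all macros with the given kind."""
    return sorted(m for m, info in MACROS.items() if info.kind == kind)
```

Asking for by_kind("semantic") returns a mix of partially nesting
explicit blocks and in-line macros, while the only fully nesting
explicit block in the table, RS, is presentational.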

So there is certainly no coherent design for how to achieve semantic
power and block nesting support yet, and denying that any such coherent
design is needed feels dangerous to me.

Look at it from the other direction.  In HTML, <B> originally meant
"bold face" and <I> "italic", so these were presentational elements.
When the paradigm of the language was totally redesigned, they were
redefined to mean "attention grab" and "alternate voice or mood",
respectively (incidentally, losing their mnemonic value in the
process).  Similarly, with your .LS idea, you appear to be about
to redefine the meaning of .TP from "tagged paragraph" (which is
presentational) to "list item" (which is structural).

I'm not saying that can't be done, but please don't gloss over the
point that it's a redesign, lest you exacerbate the risk of making
it a bad redesign.
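
What the redefinition amounts to can be shown with a small sketch in
Python.  It assumes the proposed .LS and .LE macros, which are only an
idea discussed in this thread, and it ignores nearly everything a real
formatter does (macro arguments, escape sequences, the special role of
the tag line after .TP).  The function name ls_to_html is invented for
this example.

```python
import html


def ls_to_html(lines):
    """Translate the hypothetical .LS/.TP/.LE macros to HTML list markup.

    Input is a list of lines without trailing newlines; output is a
    list of HTML fragments.  Text lines are escaped and passed through.
    """
    out = []
    # One flag per currently open list: is an <li> open inside it?
    open_items = []
    for line in lines:
        if line == ".LS":
            out.append("<ul>")
            open_items.append(False)
        elif line == ".TP":
            # Simplification: outside .LS a real formatter would fall
            # back to the presentational tagged paragraph instead.
            if not open_items:
                raise ValueError(".TP outside .LS")
            if open_items[-1]:
                out.append("</li>")
            out.append("<li>")
            open_items[-1] = True
        elif line == ".LE":
            if not open_items:
                raise ValueError(".LE without .LS")
            if open_items.pop():
                out.append("</li>")
            out.append("</ul>")
        else:
            out.append(html.escape(line))
    if open_items:
        raise ValueError("unclosed .LS")
    return out
```

Even this toy has to keep a stack of open lists and items, so .TP is
no longer interpreted on its own but relative to an enclosing block.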

> The same goes for the keep macros KS/KE that I want.
> Not setting type on a page?  Ignore them.

I forgot what your KS/KE idea was, but if "keep" is in any way
related to the mdoc(7) .Bk macro, then please consider that
that macro is almost deprecated.  It is almost never needed
for anything, and adding something similar to another language
would look like re-inventing a square peg in a round hole to me.

> Okay.  Well, crashing ideas up against reality is one thing feature
> branches are good for.

I absolutely don't believe in feature branches, i hate them furiously,
and i think most OpenBSD developers do - but let's not digress.

> You've said this before.  I put "semantic lift" in scare quotes
> purposefully--mockingly.  I think it's a term that is, in practice,
> used without a clear definition so as to aid the production of hype
> in promoting "solutions".  You know, kind of like "open source".
> 
> _Linguistically_, I don't know that stress emphasis falls within the
> domain of semantics.

Well, in linguistics, various concepts of "stress" are defined in
phonology (not semantics), and then semantics explains what these
phonetic devices mean in a given language.

But even though "stress" is a phonetic concept, if a markup language
conveys that a word is to be set in italic font style because it
carries stress emphasis (also called prosodic stress), as opposed
to being italic because it is, for example, the name of a ship, that
is semantic markup.

> Maybe.  I'd be interested to hear from a competent
> linguist.

Wait wait wait.  The word "semantics" is used as a technical term
in at least three different disciplines: linguistics, philosophy,
and programming language theory.  The last of these meanings is,
for example, already used in the 1989 C standard.

The use of the word "semantic" in the term "semantic markup" is
distinct from the use in programming language theory, even though
it is related.  Just like a programming language has a syntax, a
markup language has a syntax, too.  Just like the meaning of every
syntax element tends to be explained in "semantics" sections in
programming language specifications, the specification of a markup
language typically explains the meaning of every syntax element, too.
The HTML standard even tries to formalize that:

  https://html.spec.whatwg.org/multipage/dom.html#elements-in-the-dom
  (read that up to the first example, the word "semantics" is
   defined right above the example)

Unavoidably, the definition of "semantics" in markup languages
is much more fuzzy than in programming languages.  In programming
languages, it can be formally defined in terms of the output of
the program.  In markup languages, a formal definition is hardly
possible; what exactly, in mathematical terms, does it mean to
convey "meaning" to a human?

But that doesn't mean the term is useless, even if the definition
is naive, like "a semantic/structural markup element is an element
that helps to convey to the reader what the function of the content
of the element is with respect to meaning/text structure, respectively."

Given that markup languages aim to communicate to humans (just like
natural languages do, and unlike programming languages), maybe a
linguist could help to sharpen the definition further, but sorry,
i'm not a linguist, so i'm stuck with the naive definition for now.

Do you really think that making the distinction of

  .MR cat 1

being semantic markup and

  .LS .TP .LE

being structural markup and

  .IR cat (1)

being presentational markup and the whole decade-long discussion of
separating content and presentation is nothing but useless marketing
hype?  Colour me surprised.
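
The practical difference between the first and the last form shows up
as soon as a program wants to find cross references.  A minimal sketch
in Python, with the function names parse_mr and guess_from_ir invented
for this example and macro arguments split naively on whitespace:

```python
import re


def parse_mr(line):
    """Semantic form: '.MR name section' states name and section outright."""
    fields = line.split()
    if len(fields) < 3 or fields[0] != ".MR":
        return None
    return fields[1], fields[2]


# Heuristic only: a word, then a parenthesised token starting with a digit.
_XREF_GUESS = re.compile(r"^\.IR\s+(\S+)\s+\((\d\w*)\)")


def guess_from_ir(line):
    """Presentational form: '.IR cat (1)' only says 'italic, then roman'.

    That the text is a manual page reference has to be guessed from
    its shape, and the guess can be wrong.
    """
    m = _XREF_GUESS.match(line)
    return (m.group(1), m.group(2)) if m else None
```

Both functions return a (name, section) pair for the cat example, but
guess_from_ir(".IR x (2)") also returns one, although nothing in that
input says it refers to a manual page.  With .MR no guessing is needed.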

Yours,
  Ingo


