groff
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [idea] troff -Troff


From: G. Branden Robinson
Subject: Re: [idea] troff -Troff
Date: Sun, 18 Feb 2024 17:14:37 -0600

Good stuff here from James and Larry.

At 2024-02-17T17:47:32-0500, James K. Lowden wrote:
> On Fri, 16 Feb 2024 10:49:52 -0600
> "G. Branden Robinson" <g.branden.robinson@gmail.com> wrote:
>
> > That's before lex.  That's even before yacc.  That's before the first
> > editions of major texts in parser theory now taught to computer
> > science undergraduates were even written.
>
> Yes, but that overstates the case.  We're talking about the country's
> premier computer research lab, not an undergradute introduction to
> parsing.  They were well aware of the state of the art.

That's true.  But it was still early days compared to now.

And, to be fair, I think there was a thread on TUHS in the couple/3
such years that once people got the power to make little languages with
tools like yacc, they got carried away for a while, and made too many.

(I found it, and quote it below.)

Or, perhaps more precisely, they demonstrated that what I'm calling a
"formal" grammar (likely in an abuse of terminology) isn't a magic
bullet; you can still make an unwieldy language.

> BNF was used to define Algol-60. LR(1) parsing had been investigated
> by Knuth.  yacc implements a published agorithm.  It's very name, "yet
> another" tells us others came before it.

Conceded.

> > the roff language does not have a formal grammar
>
> Not true, unless by "formal"  you mean "expressed as BNF".

More or less, but I was thinking more something that was
machine-verifiable to show that the grammar had no shift/reduce or
reduce/reduce conflicts.

> The parser reads input and produces output.  Input not conforming to
> its syntax is rejected.  That couldn't be true without a grammar to
> compare it to.

That's true, but not in a useful sense in my view.  A recursive descent
parser does whatever it does, but that doesn't mean it's intelligible to
a human.

Let me share a somewhat long quote from Steven C. Johnson.  In fact now
that I check, this is from the TUHS thread to which I alluded; this was
back in 2021.

"I found Dennis[ Ritchie]'s compiler to be a thing of beauty when
compiling statements, but hell on wheels when doing expressions.  One of
the motivations for writing Yacc was that I wanted to add the exclusive
or into the expression syntax and we had agreed on the character (^) and
the precedence.  Practically the whole expression grammar needed to be
rewritten, and small typos created un-debuggable chaos.  The state of
the art at the time was to write multi-layered grammars for each
precedence level.  A paper was published on how to shrink the resulting
parsing tables by eliminating redundant states.   I realized that the
same results could be obtained by writing an ambiguous expression
grammar and using a precedence table to resolve the ambiguities.  The
initial response in the academic community to programming with ambiguous
grammars was somewhere between skeptical and horrified -- as if I had
shown porn to a Victorian.  So Al [Aho] and I worked out a proof that we
would get the same optimized parser in a much more intuitive way.

I do agree with Rob [Pike] that some of the languages that Yacc gave
birth to should never have been born.  Remember, though, that the
dominant language at the time was FORTRAN, and it had all sorts of
strange implementation issues in their hand-written compilers.  Things
like subscript indices had to be single variables in some places, and in
others could have a constant added to the index.  One of Yacc's best
features was that it made consistency of usage the path of least
resistance when designing the language, and the grammar was often easier
to understand than the programming manual.  At Bell Labs, Barbara Ryder
wrote a program that would read FORTRAN and detect things that would not
work on one or more of the six major FORTRAN's at the time.  It was an
inspiration for me, later, do the same thing with Lint.

I do suggest that having languages like C++ that have bloated up to over
1000 pages in the programmer reference doesn't feel like a real advance,
especially since the real language problems of today are how to program
very large numbers of processor-like objects on a single chip.  We need
new ways to think, and I doubt that we'll get there by making C++
require a 2000-page manual."

> > I don't know that roff can be categorized in one of the traditional
> > classes
>
> I see no reason why not.
>
> troff input is described quite succintly in groff(7).

I'm glad you think so.  It's 154 KiB of *roff source at present.

> Each request takes zero or more parameters.  Those parameters may
> themselves include requests; that's recursion.  It's all very well
> defined.  Maybe not as tidy and orthogonal as you might like, but
> languages without warts are, I'm sure you'll agree, vanishingly rare.

I do agree.

> I'm not sure that matters, though.  The question is whether or not man
> macros can be expanded to their groff equivalents.  groff input need
> not conform to any formal theory for the answer to that question to be
> Yes.

I'll believe it when I see it.  The sorry legacy of the many things
called "man2html" tells me that people struggle mightily with this.
Thomas Dickey has an entertaining page on this subject.[1]

But it is a leap from that to "impossible".

> > A bit more theory before I get to the concrete.
>
> Perhaps I missed it, but did you get to the concrete?

Heh.  The pseudo-assembly language for a stack machine was what I had in
mind as "concrete".

> Alejandro said, simply,
>
> >  I wanted a program that reads man(7) source and produces roff(7)
> > source
>
> Isn't that exactly what pic does?  If there's a reason man isn't
> susceptible to the same treatment, can you provide a concrete example?
> Because I can't think of one, and an "arbitrary number of macro
> expansions" isn't one.

I don't have much facility with pic.  Does it admit recursion?

> For example, would a man(7) parser need to recognize groff(7)
> requests?  I'm not a man fan, but pic and tbl specifically say they
> pass low-level requests transparently.

Ingo Schwarze is well placed to answer that, because he's supported raw
*roff requests with great resistance, only doing so when necessary to
give mandoc(1) coverage of pages OpenBSD ships that might threaten the
mission of replacing groff if mandoc couldn't, actually, usefully render
them.  I think there's far more "raw roff" support in it than he or
Kristaps Dzonsons is happy with.

> From my point of view, as a guy who's been using Bison for the last 3
> years to parse Cobol, a project to parse man(7) and produce groff(7)
> is 100% feasible.  It doesn't even look particullarly hard; the whole
> description is under 2000 words.

I have so little familarity to Cobol that I simply can't evaluate this
claim.

At 2024-02-17T17:47:57-0500, James K. Lowden wrote:
> On Sun, 18 Feb 2024 09:37:10 -0500
> Douglas McIlroy <douglas.mcilroy@dartmouth.edu> wrote:
>
> > Translation involves parsing input into an AST according to one
> > grammar and unparsing  to generate output according to another.
> > Chomsky's work uses transformational grammars primarily for
> > generation. I'm not aware of any implementation of the inverse:
> > parsing according to a transformational grammar.
>
> Maybe Chomsky doesn't matter?
>
> ISTM that "parsing input into an AST according to one grammar and
> unparsing  to generate output according to another", if it has no
> theoretical examples, has many in practice.  It's so common they've
> acquired the hip name: transpiler.

Is that really how transpilers work?  I haven't worked with any.  The
one I've read the most about is for ECMAscript, and the impression I
have is that revisions to the language are explicitly designed so that,
and only accepted if, straightforward rewriting rules can be applied to
ECMAscript 6 or whatever to produce input that Rhino or the JavaScript
interpreter in Internet Explorer 5 would handle.

> For an example close to home, I give you cfront: C++ in, C out.

All of these, as far as I know, are not invertible transformations, and
a lot of my answer was predicated on that fact.

> I think you're saying there's no inverse to yacc.  (Even that gives
> yacc too much credit.  yacc doesn't produce an AST; it just lets the
> user create one.)  Be that as it may, it certainly hasn't prevented
> language-to-language translation.

Without an invertible transformation, then as I said to Alex, the only
solution within GNU troff is to stitch in some logic at many points in
the formatter, certainly before any interpolation (of a macro, string,
or register--for a start) is done, to record the state of the input
stream at that point.

Of course this doesn't have to be done inside GNU troff.  You could do
it with something else, that understands everything GNU troff does.  It
will be the same problem.  Or so I claim.

There are other aspects of formatter state to worry about, like the
vertical drawing position.  How do you run a trap in reverse?

> > Unfortunately, one doesn't consciously write roff according to the
> > model I have outlined. This means that parsing it is more like
> > parsing a natural language than a strictly defined programming
> > language.
>
> I'm very surprised to hear you make that argument.  Whatever the user
> may consciously do, the machine interprets the input and produces the
> output.  groff isn't making heuristic large language model guesses
> about the input.  It parses it, quite mechanically, regardless of the
> fluidity of the prose or the author's attention to convention and
> practice.

I think Doug is referring to an input document that has been fine-tuned,
still within the language but otherwise any means necessary, to make the
output attractive.

Admittedly, the majority of man page authors have about zero concern for
attractive output, and for those who compose man pages in alternative
markup languages that have to be transformed _to_ man(7) (or mdoc(7)),
it would seem to be an anti-goal.

At 2024-02-18T13:19:08-0800, Larry McVoy wrote:
> > > the roff language does not have a formal grammar
> >
> > Not true, unless by "formal"  you mean "expressed as BNF".  The
> > parser reads input and produces output.  Input not conforming to its
> > syntax is rejected.  That couldn't be true without a grammar to
> > compare it to.
>
> Gonna have to go with James here.  I have scripts that take troff -ms
> input and produce slide decks.

I don't grasp what that's a counterexample of.

If someone wants to put me in my place by delivering Alex a tool that
will do what he wants, go for it!  It's hard for me to imagine that such
a thing could not be turned to the objective of improving man page
quality, and tools for that please me like punch.

Regards,
Branden

[1] https://invisible-island.net/scripts/man2html.html

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]