From: G. Branden Robinson
Subject: Proposed: change `pm` request argument semantics (was: process man(7) (or any other package of macros) without typesetting)
Date: Thu, 17 Aug 2023 18:44:14 -0500

Hi Alex,

At 2023-08-17T21:12:35+0200, Alejandro Colomar wrote:
> I've had this desire for a long time, and maybe now I have a strong
> reason to ask for it.
[...]
> The problem is that at no point you can have the .roff source, after
> the man(7) macros have been expanded.  Would it be possible to split
> the groff(1) pipeline to have one more preprocessor, let's call it
> woman(1) (because man(1) is already taken), so that it translates
> man(7) to roff(7)?

In other words, you want to see what a *roff document looks like after
all macro expansions have been (recursively) performed.

I wanted this, too, back in 2017 when I first started working on groff.

The short answer is "no".

The longer answer is that this is hard because GNU troff, like AT&T
troff, never builds a complete syntax tree for the document the way
"modern" document formatters do.  nroff and troff were written and
deployed on DEC PDP-11 machines, whose resources are today comparable to
those of an embedded microcontroller.  Therefore they handled as little
input at one time as possible.  Roughly, this meant that input was
collected and macro-expanded as soon as it was seen; then, as soon as it
was time to break an output line, a lot of formatter state related to
parsing was flushed, and the formatter started reading input again.
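
To illustrate the streaming style described above, here is a toy model
in Python.  It is only a sketch and says nothing about how GNU troff or
AT&T troff is structured internally: the names `expand` and `fill`, the
dictionary of "macros", and the word-based input are all assumptions
made for illustration.  The point is that nothing resembling a document
tree is ever built; only the pending output line is held in memory.

```python
def expand(words, macros):
    """Yield words one at a time, expanding macros (recursively) on sight."""
    for word in words:
        if word in macros:
            # Hypothetical macro table: a name maps to a list of words.
            yield from expand(macros[word], macros)
        else:
            yield word


def fill(words, width):
    """Yield filled lines while remembering only the pending line."""
    line = []   # words collected for the current output line
    used = 0    # width consumed so far by the current line
    for word in words:
        needed = len(word) if not line else used + 1 + len(word)
        if line and needed > width:
            yield " ".join(line)
            # Everything known about the finished line is dropped here.
            line = []
            needed = len(word)
        line.append(word)
        used = needed
    if line:
        yield " ".join(line)


macros = {"GREET": ["hello", "world"]}
for out in fill(expand(["GREET", "again", "GREET"], macros), 11):
    print(out)
```

Because the expansion is consumed lazily and discarded line by line,
there is no point at which the fully expanded input exists as a whole.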

Understanding *roff a little better 6 years later, I can more easily
imagine ways to run AT&T troff out of memory on a PDP-11.  Ultra-long
diversions would be one way,[1] because formatted diversion contents
have to be kept in memory until they're called for.  A multiplicity of
moderately sized diversions would do it too.  Conditional blocks would
be another problem.  When encountering a brace escape sequence \{, the
formatter has to scan ahead in the input.  Or at least GNU troff does.
Maybe AT&T troff did something clever, but its source code is famously
opaque.

I'll say it before Ingo does: mandoc(1) (as I understand it) _does_
build a syntax tree for the entire document before producing output,
which enables some of the nice features that it has.

I see Lennart has replied with some further exploration of the
challenges here.  Rather than duplicate his comments, let me move on to
something vaguely related but, I hope, potentially useful.

Can we do something that might help without re-architecting GNU troff?

I think we can.  I've been mulling this for months, and now that I'm on
the threshold of implementing a `for` request as a string iterator,[2]
I think I want something else first, largely to help me test it.

I want a string/macro/diversion dumper.

groff(7):
       .pm        Report, to the standard error stream, the names and
                  sizes in bytes of defined macros, strings, and
                  diversions.

groff_diff(7):
       In AT&T troff the pm request reports macro, string, and diversion
       sizes in units of 128‐byte blocks, and an argument reduces the
       report to a sum of the above in the same units.  GNU troff
       ignores any arguments and reports the sizes in bytes.

That's fine, but what if we want to look _inside_ a macro, string, or
diversion?

I propose to implement this:

       .pm name   Report the contents of macro, string, or diversion
                  name to the standard error stream.  If name is
                  undefined, an error is produced (to distinguish this
                  case from an empty object).  Newlines and ordinary
                  characters are written as-is on lines indented one
                  space.  Special characters are represented in \[xx]
                  notation regardless of the selected escape character
                  or input syntax.  Tabs, leaders, unprintable control
                  characters, and nodes are described on lines with no
                  indentation.
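
To make the proposed output format concrete, here is a minimal sketch
in Python.  It is not GNU troff code.  It assumes the stored object is
available as a plain Python string, and the names `dump_object` and
`NAMED` are hypothetical.  Special characters and nodes are left out,
since they need the formatter's internal representation.

```python
# Assumption: tab is 0x09 and leader is 0x01 (Control+A), as in *roff input.
NAMED = {"\t": "tab", "\x01": "leader"}


def dump_object(contents):
    """Return the dump of `contents` as a list of output lines."""
    lines = []
    pending = ""  # ordinary characters collected for the current line

    def flush():
        nonlocal pending
        if pending:
            lines.append(" " + pending)  # ordinary text: indent one space
            pending = ""

    for ch in contents:
        if ch == "\n":
            # A newline ends the current indented line, even an empty one.
            lines.append(" " + pending)
            pending = ""
        elif ch in NAMED:
            flush()
            lines.append(NAMED[ch])  # description, no indentation
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            flush()
            # Other control characters in caret notation, e.g. ^B.
            lines.append("^" + chr(ord(ch) ^ 0x40))
        else:
            pending += ch
    flush()
    return lines


print("\n".join(dump_object("1.1\tIntroduction\x01iii")))
```

The one-space indentation is what lets a reader tell literal text such
as "tab" apart from the description of a tab character.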

I suggest that this won't break existing code because:

A.  GNU troff has ignored arguments to `pm` for ~33 years; and
B.  The format of debugging output (`troff -a`, `pm`, `pnr`, `pev`,
    `ptr`) is not, and likely should not be, rigidly specified.

Example of an interactive session using the feature (purely notional,
typed into my editor window):

$ groff
.ds foo hello \(aq apostrophe\" string contents are read in copy mode
.pm foo
 hello \[aq] apostrophe
.de bar
.  ft B
.  nop Hello, world!
.  ft
..
.pm bar
 .de bar
 .  ft B
 .  nop Hello, world!
 .  ft
 ..
.ds toc*entry 1.1^IIntroduction^Aiii
.pm toc*entry
 1.1
tab
 Introduction
leader
 iii
.de OB\"noxious old fart who knows tricks
.  if ^B\\$1^Bfatal^B .ab \" get out in a panic
.  ex \" exit more calmly
..
.pm OB
 .de OB
 .  if 
^B
 \\$1
^B
 fatal
^B
 .ab 
 .  ex 
 ..

A problem with the above format is that trailing spaces before newlines
would not be obvious.  I'm thinking that won't be too hard to address;
the dumper can count spaces until it encounters something that isn't
a space, newline, or the end of the object.  We could then have
something like this.

.pm OB
 .de OB
 .  if
space
newline
^B
 \\$1
^B
 fatal
^B
 .ab
space
newline
 .  ex
space
newline
 ..

It would be more consistent, and possibly better, to just mark all
newlines thus.
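
The space-counting idea, in the variant that marks every newline, can
be sketched as follows.  This is Python for illustration only, not GNU
troff code; the function name `dump_marking_whitespace` is hypothetical,
and tabs, leaders, control characters, and nodes are ignored to keep the
example short.

```python
def dump_marking_whitespace(contents):
    """Dump `contents`, reporting trailing spaces and every newline."""
    lines = []
    parts = contents.split("\n")
    for index, body in enumerate(parts):
        # Every part except the last was terminated by a newline.
        has_newline = index < len(parts) - 1
        text = body.rstrip(" ")
        trailing = len(body) - len(text)  # spaces before newline or end
        if text:
            lines.append(" " + text)       # ordinary text: indent one space
        lines.extend(["space"] * trailing)  # one description per space
        if has_newline:
            lines.append("newline")
    return lines


print("\n".join(dump_marking_whitespace(".  ex \n..\n")))
```

Spaces are only described once it is known that nothing but a newline
or the end of the object follows them, so interior spaces stay inline.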

I admit I don't really know yet what I'll be dealing with when it comes
to dumping nodes (which will be all over the place in diversions).

But, then, that aspect of groff seems to have mystified many over the
years.[2]  I very much hope that being able to "debug print" them will
start to clear away the smoke and confusion.  I want to do more than
just say that a node has been encountered.  I want something like this.

.di mydiv
ca-fe
.ft B
heavy
.di
.pm mydiv
node {type=glyph, id='c', font-position=1}
node {type=glyph, id='a', font-position=1}
node {type=glyph, id='\hy', font-position=1}
node {type=glyph, id='f', font-position=1}
node {type=glyph, id='e', font-position=1}
newline
node {type=glyph, id='h', font-position=3}
node {type=glyph, id='e', font-position=3}
node {type=glyph, id='a', font-position=3}
node {type=glyph, id='v', font-position=3}
node {type=glyph, id='y', font-position=3}

True node data will, I'm sure, be much more complex and verbose.  Likely
my first cut would be lamer.

.pm mydiv
node
node
node
node
node
newline
node
node
node
node
node

But I would want to swiftly improve that to report at least some basic
type information about the node, once I know what that looks like.

Any objections?

Regards,
Branden

[1] Nobody _except_ mandoc(1) seems to handle this well.  Credit where
    it's due.  https://savannah.gnu.org/bugs/?64229

[2] https://lists.gnu.org/archive/html/groff/2020-10/msg00105.html
