[Groff] Simplifying groff documentation


From: Eric S. Raymond
Subject: [Groff] Simplifying groff documentation
Date: Fri, 22 Dec 2006 18:19:15 -0500
User-agent: Mutt/1.4.2.2i

Werner LEMBERG <address@hidden>: 
> > Who is the person currently responsible for groff?  Is it still you?
> 
> Yep.  Awaiting your commands.

So I see from the groff project page, which I should have checked
first.  Copying to Ted Harding and the groff list, which I just
subscribed to.

I want to drastically simplify the markup used in several pieces of
groff documentation, eliminating a lot of the hairy custom macros they
presently use.

groffer.1
groff_out.5
groff_tmac.5
groff.7.gz
groff_char.7
groff_mdoc.7
groff_trace.7

Technically this won't be hard; I could make the required changes in a few 
hours.  But I hear you asking "Why fix what ain't broken?". 

The immediate technical answer is "the macro hackery is getting in the
way of lossless translation to a Web-ready format".  The more extended
answer raises some philosophical issues about groff's place in the world.

Extended rant follows...

You may (or may not) be aware that one of the background tasks I've
been pursuing for years is an effort to clean up the mess that is Unix
documentation formats.  It's the 21st century; all the documentation
on my system ought to present as a hypertexted local Web through my
browser.  But a "big bang" solution -- everybody rewriting their
stuff in HTML or whatever -- can't be imposed, if for no other reason
than that the coordination problem is too hard.

That means that in order for us to get to hypertext Nirvana, there
have to be lossless (or near-to-lossless) translation paths from every
legacy format to HTML.  If we have that, then we can subsume everything
and allow the legacy formats to die quietly (or survive as composition
markups that nobody actually delivers in).

The hardest format to webify in the Unix world is also the most
important one -- man pages. (By way of GNUish contrast, Texinfo is
much easier.)  There are a large number of tools that attempt this
out there.  In general, they do a crappy job.

Five years ago I decided to solve this problem.  And I did.  I wrote a
program called 'doclifter' that takes man-page sources in one end and
emits XML-DocBook out the other.  XML-DocBook to HTML is, of course,
easy.

(But man-to-DocBook *wasn't* easy.  Doclifter is nearly 8000 lines of
Python embodying both more parsing technology than many compilers for
general-purpose programming languages and an entire rule-based
production system.  I have seen master's-thesis projects in AI with
less AI in them than doclifter.  No joke.)

Why go through DocBook?  Because, it turns out, the way to *not* 
do a crappy job of translation is to do structural analysis on the
markup.  DocBook carries the structural information needed to do 
stylesheet-based HTML generation at *much* higher quality than (say)
latex2html ever manages with its purely presentation-level approach.
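
To make "structural analysis" concrete, here is a deliberately tiny
Python sketch.  It is a toy written for illustration, not doclifter's
actual logic: the function name lift_sections is hypothetical, and it
only recognizes .SH headings followed by plain text lines, which it
wraps in DocBook refsect1/title/para elements.

```python
from xml.sax.saxutils import escape


def lift_sections(man_source):
    """Toy lifter: map each .SH section to a DocBook <refsect1>.

    Hypothetical helper for illustration only; real man pages need far
    more analysis (fonts, lists, synopses, custom macros).
    """
    sections = []  # each entry is [title, [body text lines]]
    for line in man_source.splitlines():
        if line.startswith(".SH"):
            # Start a new section; drop optional quotes around the title.
            sections.append([line[3:].strip().strip('"'), []])
        elif sections and line.strip() and not line.startswith("."):
            # Plain text line: attach it to the section currently open.
            sections[-1][1].append(line.strip())
        # Every other macro line is ignored by this toy.

    parts = []
    for title, body in sections:
        parts.append("<refsect1><title>%s</title>" % escape(title))
        if body:
            parts.append("<para>%s</para>" % escape(" ".join(body)))
        parts.append("</refsect1>")
    return "\n".join(parts)


if __name__ == "__main__":
    sample = ".TH DEMO 1\n.SH NAME\ndemo, a toy page\n.SH DESCRIPTION\nFiles & directories.\n"
    print(lift_sections(sample))
```

The point is only that the output names what each piece *is* (a
section, its title, a paragraph), which is what a stylesheet needs;
a presentation-level converter would record only fonts and spacing.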

In the five years since I wrote doclifter, I've been using it to
do periodic audits of the man-page corpus, or at least as much of it
as is represented by a full-boat Red-Hat/Fedora-Core installation.  In
FC6 this is over 13,000 man pages.

The purpose of these audits is twofold:

(1) Improve doclifter's performance (its clean-translation rate is
now 96%).

(2) Feed fix patches back to man-page maintainers to clean up 
broken markup (I've had nearly 300 patches accepted).

The end goal is to be able to announce that transitioning away
from man pages to HTML is a *solved problem*.  When I get the
failure rate for look-ma-no-hands translation below 1%, I figure
we can declare victory and go to the next phase.

Clue about the next phase: last year I got a change into the man(1)
sources that tells it what to do when it finds an HTML source
where it's expecting a man page, e.g., hand off to a browser. The
technical preconditions are nearly in place to kill off man pages
as a presentation format. Think about that :-)
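
The kind of dispatch described above can be sketched in a few lines
of Python.  This is a hypothetical illustration, not the actual
man(1) change: the function names and the "sniff the start of the
file" heuristic are assumptions made for the example.

```python
def looks_like_html(source):
    """Guess whether a page source is HTML rather than roff markup.

    Assumed heuristic: the first non-blank content starts with an HTML
    doctype or an <html> tag.
    """
    head = source.lstrip()[:256].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


def choose_viewer(source):
    """Return 'browser' for HTML sources and 'roff' for everything else."""
    return "browser" if looks_like_html(source) else "roff"


if __name__ == "__main__":
    print(choose_viewer("<html><body>hello</body></html>"))
    print(choose_viewer(".TH LS 1\n.SH NAME\nls"))
```

A real pager front end would then launch the chosen program; that
part is left out here because it depends on local configuration.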

After five years of effort, I am down to fewer than 4% translation
failures.  I'm to the point where pushing individual man-page 
cleanups to individual projects is actually more efficient than
crocking doclifter to handle yet another weird edge case.

To give you an idea of the numbers, my last full test was on 13,466 man
pages.  Of these, 391 (2.9%) require fix patches.  I expect about half
of these fix patches will be applied upstream within the next 90 days;
others will take longer, depending on project release intervals.

There remains a tiny hard core of 47 pages (0.3%) that can't be
fix-patched.  They remain unliftable.  Of these, 25 are from netpbm
and 7 (0.05%) are from groff.
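
For anyone who wants to check the proportions, here is a minimal
Python sketch.  The counts are the ones quoted in this message; the
constant and function names are illustrative only.

```python
# Counts quoted in the message; nothing here is measured independently.
TOTAL_PAGES = 13466   # man pages in the last full test
NEED_PATCH = 391      # pages that require fix patches
UNLIFTABLE = 47       # pages that cannot be fix-patched
FROM_NETPBM = 25      # unliftable pages from netpbm
FROM_GROFF = 7        # unliftable pages from groff


def share(count, total=TOTAL_PAGES):
    """Return count as a percentage of total."""
    return 100.0 * count / total


if __name__ == "__main__":
    rows = [
        ("need fix patches", NEED_PATCH),
        ("unliftable", UNLIFTABLE),
        ("unliftable, netpbm", FROM_NETPBM),
        ("unliftable, groff", FROM_GROFF),
        ("all failures", NEED_PATCH + UNLIFTABLE),
    ]
    for label, count in rows:
        print("%-20s %6d  %5.2f%%" % (label, count, share(count)))
```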

Thus, groff is my second-largest source of man pages that can't be lifted
to DocBook. The largest is netpbm, and I'm working with its maintainer
to fix that now.

So this is the answer to "why fix it?".  Because the groff pages 
presently do elaborate, bizarre things that doclifter can't cope with.
In this they are *unique*.  I mean *unique*.  Everywhere else the
problem is almost entirely broken markup, not things people did
deliberately.

I want to fix the groff documentation so that it's no longer in the
way of automatic lifting of *everything* to HTML.  (As a side benefit,
the markup in the groff documentation will become easier to maintain.)
The only downside might be a slight decrease in the visual quality of
the printed versions -- in particular, command synopses might no
longer look quite as pretty.

The philosophical issue this raises about groff's place in the world
is simple: are we willing to accept that it's a legacy rather than
a primary format?

I don't ask this question dismissively.  I probably grok *roff hackery
as well as anybody who isn't Brian Kernighan -- groff carries two tools
I wrote (pic2graph and eqn2graph) and I wrote your guide to pic.  I think
man macros will still have a place as a composition format, even if nobody
presents from them any more.

But I think it's time to move on.  This little change will help us
get to a fully-hypertexted, Web-centric documentation corpus. Let's
do it.

(And brace yourselves for the *real* political bunfight, which 
is when I try to kill off GNU info...)
-- 
                <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>



