Re: [Groff] Building a troff parser


From:	Eric Andrew Lewis
Subject:	Re: [Groff] Building a troff parser
Date:	Tue, 3 Mar 2015 01:00:35 -0500
Ingo said:

> For which specific purpose do you want to build this tool, and which set of
> manual pages do you want to process with it?


I would like to build a command-line port of Idan Kamara's wonderful
explainshell.com, which, given a command (e.g. rm -rf *), breaks it down
into its constituent functional parts and explains each one.

In short, I'd like to make a program that does this:

$ explain "rm -rf *"
rm -rf *
└── rm       remove files or directories
    ├── -r   remove directories and their contents recursively
    ├── -f   ignore nonexistent files, never prompt
    └── *    Remove (unlink) files matching this text pattern.
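
A rough sketch of just the rendering step, in Python. The function
render_tree and its arguments are hypothetical names chosen for
illustration; nothing here comes from explainshell itself, and the
descriptions are simply passed in by hand.

```python
def render_tree(command, program, summary, parts):
    """Return the explanation tree as a list of text lines.

    parts is a list of (token, description) pairs. All names here are
    hypothetical; this only handles layout, not parsing.
    """
    # Width of the longest token, so the descriptions line up in one column.
    width = max([len(tok) for tok, _ in parts], default=0)
    # "    ├── " is 8 characters wide; "└── " is 4. Leave 3 spaces of gap.
    column = max(8 + width, 4 + len(program)) + 3
    lines = [command, f"└── {program}".ljust(column) + summary]
    for i, (tok, desc) in enumerate(parts):
        # The last child gets a corner; the others get a tee.
        branch = "└──" if i == len(parts) - 1 else "├──"
        lines.append(f"    {branch} {tok}".ljust(column) + desc)
    return lines


if __name__ == "__main__":
    example = render_tree(
        "rm -rf *",
        "rm",
        "remove files or directories",
        [
            ("-r", "remove directories and their contents recursively"),
            ("-f", "ignore nonexistent files, never prompt"),
            ("*", "Remove (unlink) files matching this text pattern."),
        ],
    )
    print("\n".join(example))
```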


The back-end for explainshell.com gets structured data about programs and
their flags by fetching Ubuntu's web page for a command (rm.1.html
<http://manpages.ubuntu.com/manpages/precise/en/man1/rm.1.html>) and
parsing its HTML. Some manual fixes are made for edge cases the parser
didn't account for. A bash lexer breaks apart the command to be explained
into an AST which is matched against the commands parsed from the manpages.
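
To illustrate the matching step in a much reduced form, here is a Python
sketch. It uses the standard shlex module as a stand-in for a real bash
lexer, so there is no AST, no pipelines and no redirections. SAMPLE_DB,
split_command and explain are hypothetical names, and the sketch assumes
that no short option takes an argument, which is not true in general.

```python
import shlex

# Hand-written stand-in for data that would come from parsed manual pages.
SAMPLE_DB = {
    "rm": {
        "summary": "remove files or directories",
        "options": {
            "-r": "remove directories and their contents recursively",
            "-f": "ignore nonexistent files, never prompt",
        },
    },
}


def split_command(command):
    """Split a command line into (program, argument tokens).

    Clustered short options such as -rf are expanded into -r and -f.
    Assumption: no short option in the cluster takes an argument.
    """
    tokens = shlex.split(command)
    program, args = tokens[0], tokens[1:]
    expanded = []
    for arg in args:
        if arg.startswith("-") and not arg.startswith("--") and len(arg) > 2:
            expanded.extend("-" + ch for ch in arg[1:])
        else:
            expanded.append(arg)
    return program, expanded


def explain(command, database):
    """Pair each argument token with a description from the database."""
    program, args = split_command(command)
    entry = database[program]
    parts = []
    for arg in args:
        if arg.startswith("-"):
            text = entry["options"].get(arg, "(no description found)")
        else:
            # Non-option words are treated as operands, nothing more.
            text = "operand"
        parts.append((arg, text))
    return entry["summary"], parts
```

Calling explain("rm -rf *", SAMPLE_DB) returns the summary for rm along
with one (token, description) pair per argument. A real tool would need a
proper shell grammar to cope with quoting, subshells and options that
consume the following word.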

That works for explainshell.com's context: a single-instance web
service. I'm looking for an approach better suited to an
easy-to-install program that doesn't require querying Ubuntu's man web pages.
Idan suggested building an explainshell HTTP API which a program could use,
but since the man / mdoc source files are on a user's computer already, I'm
considering parsing these files on-the-fly. Hence my interest in parsing
man / mdoc.
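
For the simplest mdoc(7) case, the idea of pairing `Fl' entries with their
text in a `Bl -tag' list (suggested in the thread quoted below) might look
roughly like this in Python. This is a line-based sketch, not a real mdoc
parser: extract_flags is a hypothetical name, SAMPLE_MDOC is a hand-written
fragment rather than an actual manual page, only the first flag on an
`.It Fl' line is kept, and macro lines inside a description are skipped.

```python
# Hand-written mdoc fragment for illustration only.
SAMPLE_MDOC = """.Sh DESCRIPTION
.Bl -tag -width Ds
.It Fl f
Attempt to remove the files without prompting for confirmation.
.It Fl r
Remove directories and their contents recursively.
.El
"""


def extract_flags(mdoc_source):
    """Map each flag introduced by '.It Fl x' to its plain-text description.

    Assumptions: one flag per .It line, descriptions are plain text lines,
    and any other macro line (starting with '.') is ignored.
    """
    flags = {}
    current = None
    for line in mdoc_source.splitlines():
        words = line.split()
        if words[:2] == [".It", "Fl"] and len(words) >= 3:
            # Start collecting text for a new flag.
            current = "-" + words[2]
            flags[current] = []
        elif line.startswith((".It", ".El")):
            # A different list item or the end of the list closes the entry.
            current = None
        elif current is not None and line and not line.startswith("."):
            flags[current].append(line.strip())
    return {flag: " ".join(text) for flag, text in flags.items()}
```

Real pages use inline macros inside descriptions, several flags per item
and nested lists, so anything beyond a toy would want an existing parser
rather than string matching like this.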

> Thinking about it, multiple possible GSoC topics come to mind in this area,
> but it doesn't look like you are a student, right? So it may not help you
> if i were to set up a mentoring proposal, or would it?


I'm not a student, but I'd be interested in a mentoring relationship or
informal advice on the project that doesn't require me to be a
degree seeker :)

> doclifter is the best tool available for: man(7) pic(7) ms(7) me(7) mm(7)
> It is clearly the wrong tool for mdoc(7)


Why is doclifter the wrong tool for mdoc(7)? doclifter's documentation
states it supports mdoc(7).

I'm going to investigate using Eric S. Raymond's doclifter for the moment.

Thanks everyone, this has been a great overview.

Eric Andrew Lewis
ericandrewlewis.com
610.715.8560

On Fri, Feb 27, 2015 at 6:57 AM, Ingo Schwarze <address@hidden> wrote:

> Hi,
>
> > Eric Andrew Lewis wrote on Thu, 26 Feb 2015 07:49:18 -0500:
>
> >> I'm interested in building a troff parser to extract information
> >> from manpages (e.g. what do the flags mean when we say `rm -rf *`?).
> >>
> >> I'm curious, would the marked up source be the format to parse?
>
> Which format?  One thing making this a complex task is that there
> are so many languages involved, for example:
>
>  - mdoc macro language - called mdoc(7) below
>  - man macro language - man(7)
>  - low-level roff requests - roff(7)
>  - tbl table description language - tbl(7)
>  - eqn equation description language - eqn(7)
>  - pic picture description language - pic(7)
>  - ms macro language - ms(7)
>  - me macro language - me(7)
>  - mm macro language - mm(7)
>
> For which specific purpose do you want to build this tool, and
> which set of manual pages do you want to process with it?  The
> difficulty very much depends on the answer to these questions.
>
> If you want to avoid handling all of low-level roff(7) - see
> below for why - you have to handle the output of various man(7)
> code generators specially, in particular:
>
>  - pod2man(1) output from perlpod(1) input documents
>  - DocBook output
>  - ...
>
>
> Ralph Corderoy wrote on Thu, 26 Feb 2015 13:00:58 +0000:
>
> > That's a hard problem.  You may want to look at Eric Raymond's
> > doclifter.  http://www.catb.org/esr/doclifter/
>
> If you want to handle any and all features of low-level roff(7),
> it is indeed very hard.  Even doclifter handles only part of
> that.
>
> Getting back to the above list of languages, doclifter is the
> best tool available for: man(7) pic(7) ms(7) me(7) mm(7)
> It is clearly the wrong tool for mdoc(7), see below.
>
>
> Doug McIlroy wrote on Thu, 26 Feb 2015 15:46:56 -0500:
>
> > The syntax of troff and of the man-pages macros is in
> >
> >       man 7 groff
>
> Ironically, the full documentation of roff syntax for documentation
> is not in roff, but in texinfo format:
>
>   http://www.gnu.org/software/groff/manual/html_node/
>
> Also see the Heirloom troff manual,
>
>   http://n-t-roff.github.io/heirloom/doctools/troff.pdf
>
> That's an update of the one the OP cited,
>
>   http://cm.bell-labs.com/sys/doc/troff.pdf
>
> >       man 7 groff_man
> >       man 7 groff_mdoc
>
> An alternative, intentionally compatible definition of these
> two languages is provided at:
>
>   http://mdocml.bsd.lv/man/mdoc.7.html
>   http://mdocml.bsd.lv/man/man.7.html
>
> In cases of doubt, comparing both may help understanding.
>
> > The markup, however, is not faithfully used.  In groff -man,
> > you'll find boldface specified by .B , \fB, and perhaps .ft B
> > or .ft 3.
>
> Indeed, and .BR, .RB, .BI, .IB, .SH, .SS, and maybe even .SY.
> The main problem with man(7) is that it's not a semantic, but
> a presentational language in the first place.
>
> > And you'll find .I used for names of parameters
> > as well as for names of man pages (though parse context will
> > usually resolve the ambiguity).  groff -mdoc  tries for more
> > precision than man, but I suspect is sloppily used because
> > there are so many details to learn.
>
> That suspicion seems natural, but having seen lots and lots of
> both mdoc(7) and man(7) documentation, i don't share this
> suspicion.
>
> The largest single body of mdoc(7) documentation descends from
> the BSD system documentation of the Berkeley Computer Systems
> Research Group.  It is still used in OpenBSD, FreeBSD, NetBSD,
> Dragonfly, and Minix 3.  In this body, mdoc(7) is not sloppily
> used, since the author of the mdoc(7) language is also the
> original author of these documents: Cynthia Livingston.
>
> Of course, the original body of AT&T Version 7 UNIX documentation
> written in man(7) was also very clean, but none of that remains
> in use in any major current system.
>
> Besides, in practice, there is a certain correlation between code
> quality and documentation quality.  People who focus on clean,
> small, and secure code often value clean and concise documentation
> as well - and sometimes favour mdoc(7) over man(7).  People prone
> to overengineering, bloat and sloppy work tend to produce either
> no documentation at all or bulky, incomplete, poorly formatted
> documentation.  They seem to prefer man(7) over mdoc(7) and often
> use low-quality code generators, in particular DocBook.
>
> So in practice, you find:
>  - A large amount of clean man(7) documentation.
>  - A very large amount of sloppy man(7) documentation.
>  - A large amount of clean mdoc(7) documentation.
>  - Some, but very little sloppy mdoc(7) documentation.
> which means that the average mdoc(7) document is much *less*
> sloppily written than the average man(7).  Besides, since the
> bulk of mdoc(7) documentation is BSD documentation, it tends
> to be very actively maintained.  Even though already reasonable,
> quality of markup consistency is actively being worked on at
> least in OpenBSD.
>
>
> Kristaps Dzonsons wrote on Thu, 26 Feb 2015 15:13:56 +0100:
>
> > If the pages are in mdoc(7) (which you indicated), just use
> > libmandoc(3) (http://mdocml.bsd.lv) to parse the file and
> > extract flags (`Fl') in the SYNOPSIS and correlate them to
> > their explanation in the DESCRIPTION's `Bl -tag' list.
> > Not difficult at all.
>
> Indeed, but depending on what exactly you want to do,
> still a considerable amount of work, even if you want to
> handle mdoc(7) only.
>
> Thinking about it, multiple possible GSoC topics come to mind
> in this area, but it doesn't look like you are a student, right?
> So it may not help you if i were to set up a mentoring proposal,
> or would it?
>
>
> Steffen Nurpmeso wrote on Fri, 27 Feb 2015 11:31:36 +0100:
>
> > For the mdocmx(7) project i have written a simple mdoc(7)
> > parser in awk(1), the entire thing 18966 bytes [...]
>
> That's clearly bad advice.  There are still missing parts
> in mandoc, but the mdoc(7) parser is among the parts that are
> most stable and best understood.  Rewriting *that* over and over
> again is not going to solve a problem.  Besides, an mdoc(7)
> parser written in awk(1) already exists, written in 1991
> by Henry Spencer:
>
>   http://manpages.bsd.lv/history/spencer_22_10_2011.txt
>   http://manpages.bsd.lv/history.html#x1991_awf
>
> Yours,
>   Ingo
>

