[Groff] Troff to xml


From: Pierre-Jean
Subject: [Groff] Troff to xml
Date: Wed, 13 Jun 2012 19:29:23 +0200
User-agent: Heirloom mailx 12.5 7/5/10

Dear troffers,

I'd like to share here some thoughts about the translation
of a troff file to xml. This is not, in my opinion, a very
difficult task; the problem is rather that there is no
single way to do it, hence no standard way, hence a lot of
partial solutions. We know one of them: 'groff -T xhtml';
some of us might know that plan9 ships 'htmlroff'; we can
also think of 'manserver', 'mstohtml' and other such tools
that we can still find on the web.

All these solutions have the same two problems:
- they do not handle the mini-languages of the preprocessors;
  they usually render tables and equations as pictures
  instead.
- they have chosen to include some layout information in
  the xml, which makes for ugly xhtml.
On the one hand, there's not enough information (the data of
tables, equations, pictures and references is lost); on the
other hand, there's too much information (layout is usually
not needed in an xml file).

This strange use of xml is understandable when xml is
restricted to xhtml. But the power of xml is its
versatility, its ability to be transformed. We should
prefer a good xml file, even in an unknown format, and
transform it using xslt, rather than an xml file specific
to one application: html here, odt somewhere else.

But what is stranger is that nobody seems to have thought
of the simplest way to translate a troff file to an xml
one: using nroff itself. Here is an example:

 .de PP
 \\*[END]
 <P>
 .ds END </P>
 ..

Writing a macro to convert to xml is much easier than
writing an ordinary troff macro, as we don't need to handle
layout.

There are, of course, a lot of small problems in doing
things that way. One of them concerns fonts written '\fX'.
A pre-processor can transform them to '\*X', so that troff
can replace these strings with some xml tag. Another problem
concerns the trailing newline, which might result in an
unwanted space, for example after the <P> in the macro
above. A simple post-processor could take care of this.
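
As a rough illustration of these two helpers, here is a
minimal Python sketch (Python is only my choice for the
example). The function names are hypothetical. The first
one only handles one-letter font escapes such as \fI, not
\f(XX or \f[name]; the second one assumes that an opening
tag sits at the end of its line.

```python
import re

# Matches a one-letter font escape such as \fI or \fR.
FONT_ESCAPE = re.compile(r"\\f([A-Za-z])")

# Matches an opening tag (not a closing or self-closing one)
# that is immediately followed by a newline.
OPEN_TAG_NEWLINE = re.compile(r"(<[A-Za-z][^<>/]*>)\n")


def fonts_to_strings(text):
    """Pre-processor step: rewrite \\fX as \\*X so that a macro
    package can define the strings I, B, R, ... as xml tags."""
    return FONT_ESCAPE.sub(r"\\*\1", text)


def strip_newline_after_open_tag(text):
    """Post-processor step: drop the newline that follows an
    opening tag, so it does not turn into an unwanted space."""
    return OPEN_TAG_NEWLINE.sub(r"\1", text)


if __name__ == "__main__":
    print(fonts_to_strings(r"a \fIword\fR here"))
    print(strip_newline_after_open_tag("<P>\nHello\n</P>\n"), end="")
```

The macro package then has to define the strings I, B, R
and so on; which tags they expand to is left open here.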

One last problem concerns the mini-languages. But it's not a
hard task to write pre-processors that translate them to
xml. Groff eqn can produce MathML. I've made some tests with
the heirloom refer: it's easy to transform its output to
xml. A 'tbl to html' tool might exist somewhere, even if I
didn't find one. The major difficulty could be a 'pic to
svg' tool.

So, we could translate some file to xml using a nice command
line:

 prexml file.tr | xrefer | eqn -T MathML | xtbl | xpic | \
 nroff -mxml | postxml > file.xml

Let's close this mail with a discussion concerning refer.
Refer is the only pre-processor that I use. I use it all
the time; I use troff because refer is one of the best tools
to deal with references. How should we translate references?
There are two solutions: we can use troff to format the
reference (insert 'idem.' if needed, delete some
information, add some punctuation, etc.), or we can decide
that the xml file should be as neutral as possible, and use
xml tools to do these transformations. Even if I don't know
how to use xslt, the second solution seems to be the
correct one, as we don't lose information in the translation
from troff to xml, and because we use the power of xml. It's
not hard to patch refer to let it produce something like
this:

 <A>Author</A>
 <T>Title</T>
 <C>City</C>
 <I>Issuer</I>
 <D>Date</D>
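
For illustration only, the same kind of record can also be
produced without patching refer, by reading one entry of a
refer database directly. The Python sketch below is a
stand-in, not the patch mentioned above: it assumes one
'%X value' field per line, ignores continuation lines, and
refer_record_to_xml is a hypothetical name.

```python
from xml.sax.saxutils import escape


def refer_record_to_xml(record):
    """Turn one refer database entry into neutral xml fields.

    Each line such as '%A Author' becomes '<A>Author</A>'.
    Lines that do not start with '%' plus a letter are skipped.
    """
    fields = []
    for line in record.splitlines():
        if len(line) > 2 and line[0] == "%" and line[1].isalpha():
            key = line[1]
            value = line[2:].strip()
            # Escape &, < and > so the result stays well-formed.
            fields.append(f"<{key}>{escape(value)}</{key}>")
    return "\n".join(fields)


if __name__ == "__main__":
    entry = "%A Author\n%T Title\n%C City\n%I Issuer\n%D Date"
    print(refer_record_to_xml(entry))
```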

And it's probably possible to use xslt to get something like
this:

 <smallcaps>Author</smallcaps>,
 <italic>Title</italic>,
 City: issuer, date.
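
The tool intended for this step is xslt. As a stand-in,
here is a small Python sketch of the same rearrangement,
using the standard xml.etree module. It assumes that the
fields are wrapped in a hypothetical <ref> element and that
each field occurs at most once (several <A> authors are not
handled); format_reference is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def format_reference(xml_fields):
    """Rearrange neutral <A>, <T>, <C>, <I>, <D> fields into a
    formatted reference with presentation tags."""
    # The fields have no common parent, so wrap them in one.
    root = ET.fromstring(f"<ref>{xml_fields}</ref>")
    field = {child.tag: escape(child.text or "") for child in root}

    def get(tag):
        # Missing fields are rendered as empty text.
        return field.get(tag, "")

    return (
        f"<smallcaps>{get('A')}</smallcaps>,\n"
        f"<italic>{get('T')}</italic>,\n"
        f"{get('C')}: {get('I')}, {get('D')}."
    )


if __name__ == "__main__":
    record = ("<A>Author</A>\n<T>Title</T>\n<C>City</C>\n"
              "<I>Issuer</I>\n<D>Date</D>")
    print(format_reference(record))
```

Because the xml kept every field, other citation styles can
be derived from the same record by changing only this step.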

I think that this way of transforming troff to xml is the
easiest one and the correct one. It's the troff way, because
the end user can build his own xml macros, but it's also the
xml way, because it lets xml do the most difficult thing,
namely conforming to some standard (like odt).

As a proof of concept, here is a macro file that converts a
subset of ms (SH, PP, QP, FS, FE and some fonts written as
\*X) to flat odt. Use it like this:

groff -Tutf8 sxml f.tr > f.fodt

Open the result with your office suite, and enjoy!

Pierre-Jean.

Attachment: sxml
Description: Text document

