
Re: [Groff] Extracting Groff markup from ghostscript


From: Larry Kollar
Subject: Re: [Groff] Extracting Groff markup from ghostscript
Date: Mon, 21 Jun 2004 20:29:36 -0400

address@hidden wrote:

Just wondering if there's a way to "reverse engineer" a ghostscript file and
recreate the groff source file?

You keep pushing my "got to answer this" button. :-) In a nutshell,
it should be possible but nobody has actually gone and done it.
I've actually taken a step or two in that direction, but have a way
to go before I could start producing groff markup.

Start with a copy of the "ps2ascii" script, and remove the string
"-dSIMPLE" from the end of the OPTIONS line. Instead of just
dumping text, ps2ascii-copy then outputs a very simple command
language consisting of the following commands:

        F height width (fontname)    <-- font name literally in ( )
        P                            <-- page break
        S x y (string) width         <-- display text at the specified point
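
The three commands above are simple enough to parse line by line. What
follows is a minimal sketch in Python (not the awk used for the scripts
described in this message). The function names, the tuple layout, and the
handling of only the \( \) \\ escapes are assumptions made for illustration;
real ps2ascii output may contain other commands or escapes, which this
sketch simply skips or leaves untouched.

```python
import re

# Numeric fields are assumed to be plain decimal numbers.
NUM = r"(-?\d+(?:\.\d+)?)"
FONT_RE = re.compile(rf"^F\s+{NUM}\s+{NUM}\s+\((.*)\)\s*$")
TEXT_RE = re.compile(rf"^S\s+{NUM}\s+{NUM}\s+\((.*)\)\s+{NUM}\s*$")


def unescape(s):
    """Undo only the simplest PostScript string escapes: \\( \\) and \\\\."""
    return re.sub(r"\\([()\\])", r"\1", s)


def parse_line(line):
    """Turn one output line into a tuple, or None if it is not F, P, or S.

    F -> ("F", height, width, fontname)
    P -> ("P",)
    S -> ("S", x, y, string, width)
    """
    line = line.rstrip("\n")
    if line.strip() == "P":
        return ("P",)
    m = FONT_RE.match(line)
    if m:
        return ("F", float(m.group(1)), float(m.group(2)), unescape(m.group(3)))
    m = TEXT_RE.match(line)
    if m:
        return ("S", float(m.group(1)), float(m.group(2)),
                unescape(m.group(3)), float(m.group(4)))
    return None


def parse(lines):
    """Parse an iterable of lines, dropping anything unrecognized."""
    records = (parse_line(line) for line in lines)
    return [r for r in records if r is not None]
```

Feeding it the saved output of the modified script, for example
parse(open("out.txt")) with a hypothetical file name, gives a flat list of
records that the later steps can work on.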

The first thing you'll notice is that the text is extremely fragmented;
it might even break words apart in some cases. I wrote a simple
awk script to join strings that need joining; it's very easy to parse
this stuff with awk. Another script throws away everything above a
certain point and below another point (to get rid of headers and
footers).
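
As a rough illustration of those two passes, here is a Python sketch (again,
not the original awk). It assumes each text fragment is a tuple of
(x, y, string, width) taken from an S command, that fragments on one line
share the same y value, and that a fragment starting where the previous one
ended belongs to the same word. The tolerance value and the function names
are made up for the example; the right joining rule depends on the document.

```python
def crop(fragments, y_min, y_max):
    """Keep only fragments whose y lies inside the body of the page.

    This discards running headers and footers, which sit outside
    the [y_min, y_max] band.
    """
    return [f for f in fragments if y_min <= f[1] <= y_max]


def join_fragments(fragments, tolerance=5):
    """Merge neighbouring fragments that sit on the same baseline.

    If the next fragment starts within `tolerance` units of where the
    previous one ended, the strings are concatenated directly (a word
    that was split apart). If it starts further to the right, a space
    is inserted. Anything else starts a new fragment.
    """
    joined = []
    for x, y, s, w in fragments:
        if joined:
            px, py, ps, pw = joined[-1]
            gap = x - (px + pw)
            if y == py and gap >= -tolerance:
                sep = "" if gap <= tolerance else " "
                # The merged fragment spans from the old start to the new end.
                joined[-1] = (px, py, ps + sep + s, x + w - px)
                continue
        joined.append((x, y, s, w))
    return joined
```

Cropping first and joining second keeps header and footer text from being
glued onto body lines that happen to share a baseline with them.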

At that point, you can identify paragraph breaks by vertical gaps
and font/size changes by the F command. You'll probably have
to work a little harder to identify headings (NH etc); if the original
output has numbered headings you can work with those. Without
numbered headings, you'll have to key on font and size changes.
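
One possible shape for that last step, sketched in Python under several
assumptions: the text has already been joined into whole lines of
(y, text), the normal line spacing (leading) is known, a one-line paragraph
that begins with a section number such as "2.1" is a heading, and the target
is the ms macro package (.PP and .NH), since NH is mentioned above. The
function names and the slack factor are hypothetical, and font or size
changes from the F command are not handled here.

```python
import re

# A section number such as "1", "1." or "2.3.1", then the heading text.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


def group_paragraphs(lines, leading, slack=1.5):
    """Split (y, text) lines into paragraphs at large vertical gaps.

    A gap bigger than leading * slack between consecutive lines is
    taken as a paragraph break.
    """
    paragraphs = []
    prev_y = None
    for y, text in lines:
        if prev_y is None or abs(prev_y - y) > leading * slack:
            paragraphs.append([])
        paragraphs[-1].append(text)
        prev_y = y
    return paragraphs


def protect(text):
    """Keep text lines from being read as roff requests."""
    return "\\&" + text if text[:1] in (".", "'") else text


def to_ms(paragraphs):
    """Emit ms markup: .NH for numbered one-line paragraphs, .PP otherwise."""
    out = []
    for para in paragraphs:
        m = HEADING_RE.match(para[0]) if len(para) == 1 else None
        if m:
            level = m.group(1).count(".") + 1
            out.append(f".NH {level}")
            out.append(protect(m.group(2)))
        else:
            out.append(".PP")
            out.extend(protect(t) for t in para)
    return "\n".join(out)
```

The heading test is deliberately crude: a one-line paragraph like
"1984 was a long year" would be mistaken for a heading, so combining it with
a font or size check would be the obvious next refinement.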

Hope that gets you started.
--
Larry Kollar     k  o  l  l  a  r  @  a  l  l  t  e  l  .  n  e  t
Unix Text Processing: "UTP Revival"
http://home.alltel.net/kollar/utp/


