Re: [Groff] On copying text from PDF files that started with groff


From: Ralph Corderoy
Subject: Re: [Groff] On copying text from PDF files that started with groff
Date: Thu, 25 Jan 2007 13:54:30 +0000

Hi Stephen,

> Looking at the postscript it is evident that the postscript
>     (.C)-.18 G(hec) -6.672 E 3.336(kc)-.24 G(eliac antibody panel and
>     vitamin panel.)-3.336
> is causing problems for several later processes.
> 
> I was surprised to see the strings in postscript output as they are.

It appears groff is using PostScript's ashow operator,

    http://www.capcode.de/help/ashow

to produce the space between words, i.e. it paints ".C" and "kc" while
adding a fixed extra displacement after each character as it moves
through the string.  The URL gives a picture that makes this clear.  The
amount of spread is large enough to create the same gap as if ". C" and
"k c" were painted with the show operator.
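
As a rough illustration of what ashow does (a sketch only, not grops's
or any interpreter's actual code), the following Python places each
glyph and then advances by the glyph's width plus the extra amount ax.
The glyph widths are made-up values, and ashow_positions is a
hypothetical helper name; only the 3.336 comes from the PostScript
quoted above.

```python
import math


def ashow_positions(text, widths, ax, x0=0.0):
    """Mimic the horizontal effect of `ax 0 (text) ashow`.

    Returns the x position at which each glyph is painted, plus the
    current point after the whole string has been shown.
    """
    x = x0
    placed = []
    for ch in text:
        placed.append((ch, x))
        # ashow adds ax after *every* glyph, including the last one.
        x += widths[ch] + ax
    return placed, x


# Hypothetical glyph widths in points; real values depend on the font.
widths = {"k": 5.0, "c": 4.44}

placed, end_x = ashow_positions("kc", widths, ax=3.336)
gap = placed[1][1] - (placed[0][1] + widths["k"])
print(placed, end_x, gap)
assert math.isclose(gap, 3.336)
```

The gap between "k" and "c" comes out as exactly ax, which is why the
two letters look like the end of one word and the start of the next.
Note that ax is also added after the final glyph, which would explain
the negative horizontal moves that follow in the quoted PostScript.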

> So, with all that, is there an option to get groff to stop adjusting
> within words?  For my needs, adjusting whitespace size is all that
> is needed.  Or should all this be referred to a grops mailing list?

I think some of the problem is spacing between words, e.g. the ".C"
above.  Presumably, some of the PDF scavengers don't realise that the
text given to ashow may be parts of two words even when there's no
whitespace in the string.
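
A text extraction tool could, in principle, allow for this.  The Python
below is only a sketch of one possible heuristic, not how any existing
tool works.  It assumes the drawing operations have already been reduced
to ("show", text) and ("ashow", ax, text) tuples, ignores the horizontal
moves between them, and treats an ax of at least min_gap points as a
word break.  The tuple format, extract_text, and the min_gap default are
all assumptions made up for the example.

```python
def extract_text(ops, min_gap=1.0):
    """Rebuild text from simplified show/ashow operations.

    Characters inside an ashow string are separated by a space when the
    per-character spread ax is at least min_gap (an assumed threshold).
    """
    pieces = []
    for op in ops:
        if op[0] == "ashow":
            _, ax, text = op
            sep = " " if ax >= min_gap else ""
            pieces.append(sep.join(text))
        elif op[0] == "show":
            pieces.append(op[1])
        else:
            raise ValueError(f"unknown operation: {op[0]!r}")
    return "".join(pieces)


# Strings and the 3.336 spread are taken from the quoted PostScript.
ops = [
    ("show", "hec"),
    ("ashow", 3.336, "kc"),
    ("show", "eliac antibody panel and vitamin panel."),
]
print(extract_text(ops))
```

With the threshold set very high the same input yields
"heckceliac antibody ...", which is presumably the kind of run-together
text the scavengers produce.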

Perhaps someone here will suggest a groff workaround.  Otherwise,
contacting the authors of the text extraction tools may be worthwhile.
Or produce a parallel plain-text version of your PDF document and index
that instead, e.g. have PHP write ASCII directly, or have PHP write
troff and format that to both PostScript and ASCII.

Cheers,


Ralph.
