[Groff] On copying text from PDF files that started with groff


From: Stephen Holland
Subject: [Groff] On copying text from PDF files that started with groff
Date: Wed, 24 Jan 2007 20:18:59 -0600

I have been using groff to programmatically create PDF documents for my medical practice. In my workflow, a PHP-driven web page generates a text file that groff processes. The resulting PostScript file is then run through pstopdf, which gives me the PDF I need. I have been delighted at how easily it works.

Recently I was using my Mac's Spotlight search engine to look for files by keyword and found I was having trouble locating pages. Also, when I copy text from a PDF generated this way, the text copies with odd spacing.

This behavior seems to be related to the kerning adjustments generated by groff. An example of the problem follows.

The following text was part of a document I passed through groff ( === delimits the example text ):

===
Findings: Mild diffuse thickening of the esophagus with linear furrows in the mid esophagus and circumferential rings in the proximal esophagus. The mucosal vascular pattern is effaced. Patchy erythema in the third portion of the duodenum. Biopsies taken in the duodenum, stomach, and esophagus. Recommendations: This appears to be eosinophilic esophagitis. Start Protonix, 40 mg qd. Avoid milk and eggs. Check celiac antibody panel and vitamin panel.
===

groff obligingly creates a PS file containing:

===
(Findings: Mild diffuse thic)72 416 Q -.24(ke)
-.24 G(ning of the esophagus with linear furro).24 E
(ws in the mid esopha-)-.18 E(gus and circumf)72 430 Q(erential r)-.36 E
(ings in the pro).18 E(ximal esophagus)-.36 E 6.672(.T)-.18 G(he m)
-6.672 E(ucosal v)-.12 E(ascular patter)-.3 E(n).3 E(is eff)72 444 Q
3.336(aced. P)-.36 F(atch)-.48 E 3.336(ye)-.36 G .36(ry)-3.336 G
(thema in the third por)-.36 E(tion of the duoden).48 E 3.336
(um. Biopsies)-.12 F(tak)3.336 E(en in the)-.24 E(duoden)72 458 Q
(um, stomach, and esophagus)-.12 E(.)-.18 E
(Recommendations: This appears to be eosinophilic esophagitis)72 486 Q
6.672(.S)-.18 G -2.856(tar t)-6.672 F(Protonix, 40 mg)3.336 E 3.336
(qd. A)72 500 R -.3(vo)-.48 G(id milk and eggs).3 E 6.672(.C)-.18 G(hec)
-6.672 E 3.336(kc)-.24 G(eliac antibody panel and vitamin panel.)-3.336
===

When this is run through pstopdf, a PDF appears. Copying the paragraph above out of that PDF gives:

===
Findings: Mild diffuse thickening of the esophagus with linear furrows in the mid esopha- gus and circumferential rings in the proximal esophagus. The mucosal vascular pattern is effaced. Patchyerythema in the third portion of the duodenum. Biopsiestaken in the
duodenum, stomach, and esophagus.
Recommendations: This appears to be eosinophilic esophagitis. Star tProtonix, 40 mg
qd. Avoid milk and eggs. Checkceliac antibody panel and vitamin panel.
===

Note that the words 'Patchy erythema' and 'Biopsies taken' are run together, and 'Start Protonix' has become 'Star tProtonix'.


Checking what the Mac's text parser sees shows the same problems. mdfind, the import process for Mac OS X, finds the following words:

===
Findings: Mild diffuse thickening of the esophagus with linear furrows in the mid esopha- gus and circumferential rings in the proximal esophagus. The mucosal vascular pattern is effaced. Patchyerythema in the third portion of the duodenum. Biopsiestaken in the duodenum, stomach, and esophagus. Recommendations: This appears to be eosinophilic esophagitis. Star tProtonix, 40 mg qd. Avoid milk and eggs. Checkceliac antibody panel and vitamin panel.
===

This is a problem because the indexing program is not getting correct input, so the index of my files misses that this patient document should be found with the term 'celiac'. It will only find the document with the term 'checkceliac', as a single word.
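The indexing failure described above can be illustrated with a toy whole-word index. This is only a sketch: how Spotlight actually tokenizes text is not known here, and the `index_terms` helper is a hypothetical stand-in for it. The two strings are copied from the source text and from the extracted text quoted earlier.

```python
import re

# Last two sentences of the original text passed to groff.
SOURCE = "Avoid milk and eggs. Check celiac antibody panel and vitamin panel."

# The same sentences as they come back out of the PDF.
EXTRACTED = "Avoid milk and eggs. Checkceliac antibody panel and vitamin panel."


def index_terms(text):
    """Hypothetical indexer: lowercase the text and collect whole words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


source_terms = index_terms(SOURCE)
extracted_terms = index_terms(EXTRACTED)

# Words a search should match but that never reach the index.
missing = sorted(source_terms - extracted_terms)

# Words the index contains that were never in the document.
spurious = sorted(extracted_terms - source_terms)

print("missing from index:", missing)
print("spurious in index:", spurious)
```

With a whole-word index like this one, a search for 'celiac' cannot match the extracted text, while the run-together 'checkceliac' can.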

Looking at the PostScript, it is evident that the fragment
(.C)-.18 G(hec) -6.672 E 3.336(kc)-.24 G(eliac antibody panel and vitamin panel.)-3.336
is causing problems for several later processes.
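A rough way to see why this fragment loses the word break is to pull out the string operands and join them in order. This sketch assumes that a text extractor simply concatenates the parenthesized strings and ignores the numeric positioning offsets; whether pstopdf or Spotlight really work that way is an assumption, not something established here. The `ps_strings` helper is hypothetical and handles only simple string literals without nested parentheses.

```python
import re

# The PostScript fragment quoted above, as emitted by groff.
PS_FRAGMENT = (
    r"(.C)-.18 G(hec) -6.672 E 3.336(kc)-.24 G"
    r"(eliac antibody panel and vitamin panel.)-3.336"
)


def ps_strings(fragment):
    """Return the contents of each parenthesized PostScript string.

    Simplification: backslash escapes are kept as-is and nested
    parentheses are not supported.
    """
    return re.findall(r"\(((?:\\.|[^()\\])*)\)", fragment)


pieces = ps_strings(PS_FRAGMENT)

# Join the strings the way a naive extractor might, ignoring all offsets.
joined = "".join(pieces)

print(pieces)
print(joined)
```

None of the strings between 'hec' and 'eliac' contains a space character: the 'k' of 'Check' and the 'c' of 'celiac' share one string, so the gap between the two words exists only as a positioning offset, not as text.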

I was surprised to see the strings in the PostScript output as they are.

So, with all that, is there an option to get groff to stop adjusting within words? For my needs, adjusting the size of the whitespace is all that is needed. Or should all this be referred to a grops mailing list?

Steve Holland



