Re: [Groff] Having a problem with parsing output to html...


From: Keith Marshall
Subject: Re: [Groff] Having a problem with parsing output to html...
Date: Fri, 25 Mar 2011 09:53:18 +0000

On 25 March 2011 04:38, Werner LEMBERG wrote:
>
> Justin,
>
> a simple example says more than a thousand words...  So please give us
> an example we can examine.

Hear!  Hear!

> At a first glance, it seems you have an encoding problem (but this
> doesn't explain the strange things you see).  The default encoding of
> groff is latin1, and your input file is probably UTF8.  Starting with
> version 1.20, groff can handle UTF8 by using a new preprocessor.
>
> The HTML output driver is still experimental (and basically
> unmaintained currently due to lack of time and interest); it is easily
> possible that you've found a bug.

Equally -- perhaps more -- likely, Justin has encountered a hyphenation
issue.  This:

> On the 11th in my groff file, an "â" character is found after 64
> characters have been printed, within the word hamburger, the text gets
> parsed and printed as "hamâburger". If I change hamburger to donations
> I have the "â" character show up at the 60th character on the line,
> with donations being "donaâtions".

is reminiscent of an issue I myself observed earlier this week.  I had
run some informally structured ASCII text through a sed filter, and then
through nroff (v1.20.1), to produce an alternative layout.  Although I
had suppressed hyphenation (.hy 0), I did have several explicit ASCII
hyphen characters in the input stream; each of these was replaced, in
the output stream, by the three-byte octal sequence 342 200 220 (which
I guess is the UTF-8 encoding of u2010 -- the Unicode hyphen which
groff_char(7) documents as the output form for hyphen).

Viewing this output with "less", on my UTF-8 aware console, it looked
absolutely fine, but after I uploaded it as a package description file
on my SourceForge downloads page, Firefox rendered each hyphen with
unwanted whitespace surrounding it; Internet Explorer replaced each
hyphen with three characters of garbage, among them the "â" observed
by Justin, IIRC.

So yes, I guess what you actually see depends on encoding (and on how
the viewer interprets the u2010 sequence, however it is encoded).  In my
case, I wanted real ASCII hyphens in my output stream; adding "-Tascii"
to my nroff command gave me that.
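
The byte arithmetic above can be checked with a minimal Python sketch.
It is only an illustration: the variable names are made up, and Latin-1
is an assumed stand-in for whatever encoding a viewer might wrongly
guess; the thread does not say which one Internet Explorer used.

```python
# The three bytes reported above, written in octal as in the message.
raw = bytes([0o342, 0o200, 0o220])

# Decoded as UTF-8, they form a single character: U+2010 HYPHEN.
as_utf8 = raw.decode("utf-8")

# Decoded as Latin-1 (an assumption, standing in for a viewer that
# guesses the wrong encoding), each byte becomes its own character;
# the first of the three is "â".
as_latin1 = raw.decode("latin-1")

print(f"U+{ord(as_utf8):04X}")        # code point of the UTF-8 reading
print(len(as_latin1), as_latin1[0])   # character count and first character
```

Read as UTF-8, the sequence is one hyphen; read byte by byte, it is
three characters beginning with "â", which would fit the garbage
described above.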

-- 
Regards,
Keith.


