Re: [Groff] HTML fonts


From: Gaius Mulley
Subject: Re: [Groff] HTML fonts
Date: Sun, 23 Jan 2000 20:45:27 +0000 (GMT)

Werner writes:

> Yes, please post, or maybe it's better to update design.ms, filling up
> the TODO list?

OK, here it is. I didn't add them to design.ms after all, as
I think it is more usable as a plain text document - one
can cut and paste examples etc.

Here is the design.ms file, now without the "To do" section,
followed by a long TODO file. Werner, perhaps you could check
the new TODO file into grohtml, as well as update design.ms?

Hope you find them useful,

Gaius


design.ms
.nr PS 12
.nr VS 14
.LP
.TL
Design of grohtml
.sp 1i
.SH
What is grohtml
.LP
Grohtml is a back end for groff which generates html.
The aim of grohtml is to produce respectable html given
fairly typical groff input.
.SH
Limitations of grohtml
.LP
Although basic text can be translated in a straightforward
fashion, there are some areas where grohtml has to guess how
pieces of text relate to one another. In particular, whenever
grohtml encounters text tables, indented paragraphs or
two column mode it will try to use the html table construct
to preserve columns. Grohtml also attempts to work out which
lines should be automatically formatted by the browser.
Because it relies on reasonable guesses, it will
occasionally make mistakes.
.PP
Tables, pictures and equations produced by tbl, pic and eqn
are rendered as images, which may be considered a limitation.
.SH
Overview of html.cc
.LP
This section gives a brief overview of how html.cc operates.
The html device driver works as follows:
.IP (i) .5i
First it creates a linked list of all words on a page.
.IP (ii) .5i
It runs through the page and finds the leftmost margin. Later
on, when generating the page, it removes the margin.
.IP (iii) .5i
It scans the page and builds two kinds of regions: ascii text and graphical.
The graphical regions consist of tbl's, eqn's and pic's
(basically anything that cannot be textually displayed).
It also scans the page to find lines (such as a footer line)
and places these into tiny graphical regions. Certain fonts
are also treated as graphical regions, as html has no easy
equivalent, for example Greek math symbols.
.LP
Finally all graphical regions are translated into png files and
all text regions into html text.
.PP
To give grohtml a sporting chance of accurately deciding which
is a graphical region and which is text, the front end programs
tbl, eqn and pic have all been tweaked to encapsulate pictures, tables
and equations with the following lines:
.sp
.nf
\f[CR]\&.if '\\*(.T'html' \\X(graphic-start(\c

\&.if '\\*(.T'html' \\X(graphic-end(\c
\fP
.fi
.sp
these appear to grohtml as:
.sp
.nf
\f[CR]\&x X graphic-start

\&...

\&x X graphic-end\fP
.fi
.sp
.LP
In addition to graphic-start and graphic-end there are two
other "special characters" which are used.
.sp
\f[CR]\&x X index:N\fP
.sp
where N is a number. The purpose of this sequence is to stop
devhtml from automatically producing links to headings which
have a header level >N.
The line:
.sp
\f[CR]\&x X html:STRING\fR
.sp
.LP
allows a STRING to be passed through to the output file with
no processing whatsoever. It lets users include html
commands, via a macro, such as:
.sp
\f[CR]\&.URL "Latest Emacs" "ftp://somewonderful.gnu.software"\fP
.sp
.LP
Here the URL macro bundles its arguments into the STRING above.
For more information consult \f[CR]tmac/tmac.arkup\fP.
.PP
While scanning through a page the html device copies headings and titles
into a list of links which are later written to the beginning
of the html document.
.SH
Table handling code
.LP
Provided that the -t option is not present when grohtml is run, the
driver will attempt to find textual tables and generate html tables.
This allows .RS and .RE commands to operate with auto formatting. It
should also allow grohtml to process .2C correctly. However, the table handling code
has to examine the troff output and \fIguess\fR when a table starts and
finishes. It is worth knowing the limitations of this approach, as it
sometimes makes the wrong decision.
.LP
Here are some of the rules that grohtml uses for terminating an html table:
.LP
.IP "(i)" .5i
A table is terminated when grohtml finds a line which is entirely in bold
font (it believes that this is a header which is outside of a table).
This might be considered incorrect behaviour, especially if you use .2C,
which generates a heading in the left column when the corresponding
right row is blank.
.IP "(ii)" .5i
A table is terminated when grohtml sees that the complete line
has been spanned by words, i.e. no gaps exist.
.IP "(nb)" .5i
The documentation of these rules is particularly incomplete and needs
finishing when time permits.
.SH
Dependencies
.LP
Grohtml depends upon grops and gs, which are invoked to
generate all png files. Png files are generated whenever a table, picture,
equation or line is encountered.
--------------- cut here--------------- cut here--------------- cut here

To do list

------------------------------------------------------------------
finish working out the max and min x, y, extents for splines.
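
A minimal sketch of the extent calculation for one spline piece. It
assumes the piece can be treated as a quadratic Bezier segment with end
points p0 and p2 and control point p1; whether html.cc represents its
splines this way is an assumption. The function name
quadratic_bezier_extents is hypothetical, and Python is used purely for
illustration (html.cc itself is C++).

```python
def quadratic_bezier_extents(p0, p1, p2):
    """Return (min_x, min_y, max_x, max_y) for one quadratic Bezier segment.

    Hypothetical helper: p0 and p2 are the end points, p1 the control point.
    """
    def axis_values(a, b, c):
        # The end points always lie on the curve.
        values = [a, c]
        # The derivative 2(1-t)(b-a) + 2t(c-b) vanishes at t = (a-b)/(a-2b+c).
        denom = a - 2 * b + c
        if denom != 0:
            t = (a - b) / denom
            if 0 < t < 1:
                # Interior turning point: evaluate the curve there.
                values.append((1 - t) ** 2 * a + 2 * (1 - t) * t * b + t ** 2 * c)
        return values

    xs = axis_values(p0[0], p1[0], p2[0])
    ys = axis_values(p0[1], p1[1], p2[1])
    return (min(xs), min(ys), max(xs), max(ys))
```

Taking the min and max of the control points alone would overestimate
the box; the turning point gives the tight extent on each axis.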
------------------------------------------------------------------
check and test thoroughly all the character descriptions in devhtml
(originally taken from devX100)
------------------------------------------------------------------
improve tmac.arkup
------------------------------------------------------------------
also improve documentation.
------------------------------------------------------------------
fix the bugs which are exposed by Eric Raymond's pic guide,
"Making Pictures With GNU PIC". It appears that grohtml becomes confused
about which sections of the document are text and which sections need
to be rendered as an image.
------------------------------------------------------------------
it would be nice to modularise the source. A natural division might be
to extract the table handling code from html.cc into table.cc.
The table.cc could be expanded to recognise output from tbl and try
and generate html tables with lines/rules/boxes. The code as it stands
should cope with very simple plain text tables. But of course at present
it does not get a chance to do this because the output of gtbl is
bracketed by \fCgraphic-start\fR and \fCgraphic-end\fR.
------------------------------------------------------------------
introduce anti aliasing for the images as mentioned by Werner.
------------------------------------------------------------------
improve generation of html. Perhaps by using a stack of current
html commands and using a kind of peephole optimizer on the stack?
Certainly the html should be buffered and optimized.
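
One way the peephole idea might look, as a rough sketch only. It
assumes the buffered html is held as a list of tokens (tags and text)
and applies two rules: a close tag immediately followed by the same open
tag is dropped, and so is an open tag immediately followed by its close
tag. The names peephole and cancels are hypothetical, tags carrying
attributes are deliberately left alone, and Python is used for
illustration rather than the C++ of html.cc.

```python
def cancels(prev, cur):
    """True if two adjacent tokens are a redundant tag pair."""
    # "</B>" followed by "<B>": the element was closed and reopened at once.
    if prev.startswith("</") and cur == "<" + prev[2:]:
        return True
    # "<B>" followed by "</B>": an empty element.
    if cur.startswith("</") and prev == "<" + cur[2:]:
        return True
    return False


def peephole(tokens):
    """Return the token list with redundant adjacent tag pairs removed."""
    out = []  # used as a stack; out[-1] is the most recently emitted token
    for tok in tokens:
        if out and cancels(out[-1], tok):
            out.pop()  # drop the pair instead of emitting it
        else:
            out.append(tok)
    return out
```

For example, peephole(["<B>", "-E", "</B>", "<B>", "x", "</B>"]) gives
["<B>", "-E", "x", "</B>"]. Because the output is a stack, removing one
pair can expose another, which is then removed on the next token.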
------------------------------------------------------------------


Informal to do bug list and done list
=====================================

This is very informal and I've included some comments. It mainly consists
of emailed bugs and wish lists. All very useful and welcome.

------------------------------------------------------------------
Dean writes: (address@hidden)

I noticed also that the TOC appears immediately after the title, splitting
it from the author and abstract.  Any chance it can be moved down?

gaius> this should be straightforward. (Not done yet though)
------------------------------------------------------------------
Werner writes:

Gaius,

checking a weird man page written by myself in German (using German
hyphenation patterns also :-), I found some more bugs:

.) Look at the following:
 
[\c
...\^\c
] 
[\c
.BI -P \ \%Plattform-ID\^\c
]

   This translates to

[<font size=3><B>-E</B> <font size=3><I>Kodierungs-ID</I> <font size=3>]
                                                         ^
   (groff breaks the line after the final `]'.)

   There are two errors in it: First of all, the `\ ' command should
   be translated to `&nbsp;'.  Secondly, a blank has crept in (marked
   with `^').  Apparently, this is related to whether it is the last
   item of a line or not.

--fixed-- 4 01 2000

.) The command `\(->', translates to the `registered' sign (or rather
   the character `0xAE') instead of a right arrow.

--nearly fixed-- 4/01/2000

gaius>   if we know the standard html character encoding for farrow which
gaius>   will work on *all* browsers then this can be fixed inside devhtml/TR
gaius>   etc. Otherwise I guess we could translate this character into ->
gaius>   in tmac.html ?
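
The two options above could be sketched like this. It is illustration
only: the table and the function translate_special are hypothetical and
are not part of devhtml or tmac.html. The entity `&rarr;` is the HTML 4
named entity for the right arrow; whether every browser renders it is
exactly the open question, so the plain `->` fallback is kept alongside.

```python
# Hypothetical table: groff special character name -> (html entity, ascii fallback)
SPECIAL_CHARS = {
    "->": ("&rarr;", "->"),
    "<-": ("&larr;", "<-"),
}


def translate_special(name, use_entities=True):
    """Return the html text for a groff special character such as \\(->."""
    entity, fallback = SPECIAL_CHARS[name]
    # With use_entities=False the character degrades to plain ascii.
    return entity if use_entities else fallback
```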

.) The following code produces ugly results -- is it possible to make
   the HTML result similar to the ascii output?

.in +4m
.ta 3iC
.I "Plattform   Plattform-ID (pid)"
\&.sp
.ta 3iR
Apple Unicode   0
.br
Macintosh       1
.br
ISO     2
.br
Microsoft       3
.PP

--fixed--  14/01/2000
------------------------------------------------------------------

Werner writes:

Nevertheless, still some bugs in it.  As usual, I'm referring to man.1
of the mandb package; my command to create man.html was

  groff -U -t -man -Thtml -P-r -P200 man.1 > man.html

.) The `-w , --where, --location' node at the beginning of man.html
   shouldn't be there at all.

> .) Some paragraphs still contain hyphenated words (e.g. first
>    paragraph of the `DESCRIPTION' section).

Oops!  Please ignore this.  I forgot to include `-mhtml' :-)

.) Is it possible to have anti-aliased PNG images?

.) The item `man --help' in the `EXAMPLES' section doesn't start a new
   paragraph.

.) In the description of the -r switch (in the `OPTIONS' section),
   there is a new paragraph in the middle of a sentence.

.) What about centering the images?  Or does it depend on the table
   itself?

gaius> yes, grohtml places images at their relative position on the page.

.) In the `OPTIONS' section, `-c, --catman' and `-d, --debug' are
   glued together which shouldn't happen.
--fixed--

.) Sometimes, an empty line is missing between items, e.g. between the
   description of the -e and the -f options.

.) After the `-w, --where, --location' line, there is a superfluous
   empty line.

.) The indentation in the `FILES' section is inconsistent.  The same
   is true for `-V, --version' a few lines above.
------------------------------------------------------------------

Werner writes:

PNGs created by grohtml have apparently a white background -- isn't it
possible to make the background transparent optionally?

Another suggestion: What do you think about calling the PNG files
<groff_input_file>-<index>.png or something like this?  I can't see an
advantage in the current naming scheme except for debugging purposes
where it may be necessary to stay with the old files.

--fixed-- 04 01 2000

gaius> however I've had to retain a default grohtml-pid-index.png for all
gaius> stdin as we don't know the filename.. sadly looks like everything..
gaius> Nearly done by including a new tcommand 'F filename'
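
A sketch of the naming rule as described in this item, for
illustration only. The function png_name is hypothetical, and keeping
the input file's extension in the image name is an assumption; the only
details taken from the discussion are the <groff_input_file>-<index>.png
form and the grohtml-pid-index.png default for stdin.

```python
import os


def png_name(input_file, index, pid=None):
    """Return the image file name for the index'th graphical region."""
    if input_file is None or input_file == "-":
        # Reading stdin: no file name is known, so fall back to the pid form.
        if pid is None:
            pid = os.getpid()
        return "grohtml-%d-%d.png" % (pid, index)
    # Assumption: drop any directory part but keep the extension.
    return "%s-%d.png" % (os.path.basename(input_file), index)
```

So png_name("man.1", 3) gives "man.1-3.png", and png_name(None, 2,
pid=1234) gives "grohtml-1234-2.png".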

------------------------------------------------------------------

from Steve Blinkhorn <address@hidden>

One thought that came immediately to mind after our first trials.
If grohtml depends on grops, should there not be an easy interface to
allow PostScript code to be interpreted into the output?   For
instance, we generate our letterhead, including a logo, on the fly in
groff.   The logo is pure PostScript.   We use PostScript for colour
manipulation, and recently for generating a lot of graphics for
printing.

gaius>  should be interesting - if we can generate PS then GS it
gaius>  we should be in business
------------------------------------------------------------------
the logical place and name for a file describing tmac.arkup is
groff_markup.man placed into the `tmac' subdirectory, and your html.ms
looks like being this kind of file.

So I won't check it in currently -- may I ask you to convert this file
to a man page?

-- fixed --

Another related problem: I can imagine that a lot of people start to
write man pages with HTML output in mind also.  Nevertheless, it
should be still possible to display such pages correctly with a plain
text man pager.  As a consequence, such man pages should contain at
the beginning something like

  .do mso tmac.arkup

What do you think?

    Werner

-- fixed -- 
gaius> fixed by using troffrc-end I believe
--------------------------------------------------------------------
Gaius,

in troffrc, it appears to me that tmac.html is loaded if the output
device is HTML.  So why must I load it again (using -mhtml) to
suppress hyphenation for HTML output?  Can you provide a fix for this?

    Werner

gaius> fixed as above
--------------------------------------------------------------------

from (address@hidden) Rainer Daeschler

I recognized a problem limiting the usage for
"non-English aliens". The generation of PNG or GIF
skips all special characters like

      äöü ÄÖÜ ß

French, Spanish, and Scandinavian national letters, too.

--fixed-- 14/01/2000

An option which forces tables into HTML code instead of building
an image would be most valuable. Of course it would not preserve
the original layout in many cases, but it would ease modification of
the HTML output to the user's demands afterwards.

--fixed-- 14/01/2000

gaius> use the new -T option to grohtml (-P-T to groff)

-----------------------------------------------------------------
from Werner

   but `pre-defined' appears as `pre&shy; line' (note the space
   character after the soft hyphen).  Something in the code makes
   problems here...

   (IIRC, I've sent you this man.1 file a few weeks ago).
gaius> Werner fixed this by adding .cflags 0 -\(hy\(em\(en to tmac.html

.) The formatting of the paragraph after the first table is completely
   wrong.  It appears that the first few words are set in two columns;
   additionally, the indentation is incorrect.

.) Similarly, the description of `-l' in the OPTIONS section is
   indented incorrectly.  Wrong indentations happen still quite
   frequently.

.) In the description of the `-D' option, there is a blank line in the
   middle of a paragraph.



     Werner
-----------------------------------------------------------------
from Werner and Eddie
> > > .LP
> > > .URL Germany "ftp://groff.ffii.org/pub/groff/"
> > > |
> > > .URL USA "ftp://ftp.gnu.org/gnu/groff/"
> > 
> > Problem: the first "|" of each line is missing a leading white
> > space.
> > 
> > How to ensure the spaces get put there?
> 
> This is a feature of grohtml (unfortunately -- AFAIK, Gaius hasn't found
> a good workaround yet).  HTML stuff gets written as specials which
> don't consume space for troff, causing some miscalculation if placed
> at the beginning of a paragraph.  A workaround is to write
> 
> .LP
> \&
> .URL ...
> |
> .URL ...

gaius> fixed by adding \& to HTML as per Werner's suggestion






