Re: [Groff] devpdf U-fonts and Russian


From: Tadziu Hoffmann
Subject: Re: [Groff] devpdf U-fonts and Russian
Date: Fri, 6 Oct 2017 21:56:56 +0200
User-agent: Mutt/1.5.24 (2015-08-30)

> This is interesting. I thought that PDF had a more efficient way
> of storing data than PostScript and as a result allowed for
> faster reading and writing, although I've never looked into the
> details. I haven't yet switched over from using grops to gropdf,
> but I was beginning to think it was an inevitable path to better
> document processing. Can you explain your preference for
> PostScript a little further?

It's complicated.

PostScript is a programming language, and as such there's
no real limit to what you can do with it.  One issue with
respect to documents is page independence (or rather, the
lack thereof): something you do on page 3 of the document
can influence what happens on page 27.  This is no big deal
for stuff printed sequentially on a printer, but it wreaks
havoc in documents which a user can browse in random page
order on a screen.

Of course, there is no compulsion that later pages must
depend on earlier ones -- indeed, I don't think there is a
good reason nowadays not to format the document so that all
pages are independent of each other -- but page independence
is not a requirement, only an option.  Adobe has issued a
set of guidelines (the Document Structuring Conventions)
which may be used to inform a document processor whether
pages must be processed sequentially or whether they can be
accessed in random order.  This is also only a convention,
and again strictly optional -- a printer (which processes
pages in sequence) does not require it.

As a result, many PostScript generators have opted for the easy
way out: just dump everything into the stream in the order it
is needed for sequential printing, and screw page independence.
This approach of course makes document (re)processing (such as
extracting individual pages) just so much harder.

My guess is that PDF was an attempt to master the resulting
chaos.  PDF requires all pages to be independent, and also
requires the document producer to explicitly state which
resources are used by each page.  These are noble goals, but
the file format chosen to achieve this is complex, with byte
pointers to different objects all over the place.  Apart from
the built-in compression, this means you can't simply edit the
file in a text editor (as you can with PostScript) if something
in the document is not to your liking -- if you add or remove
characters, many of the pointers will be off, rendering the
document unreadable.  (Then again, Adobe is in the business
of selling you software to edit PDF files, so simplicity in
the file format is not necessarily in their interests.)
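
To illustrate the byte-pointer problem, here is a minimal Python
sketch.  It is not a full PDF writer: the helper names
(build_pdf, offsets_valid) and the two toy objects are made up
for this example, and only the general layout (numbered objects
followed by a cross-reference table of byte offsets) follows
the PDF format.

```python
def build_pdf(objects):
    """Assemble a tiny PDF-like file: header, objects, xref table, trailer."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))            # byte position where this object starts
        out += b"%d 0 obj\n%b\nendobj\n" % (num, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"          # entry for the free object 0
    for off in offsets:
        out += b"%010d 00000 n \n" % off     # offset, generation number, in-use flag
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out), offsets


def offsets_valid(data, offsets):
    """True if every recorded offset still points at the start of 'N 0 obj'."""
    return all(data.startswith(b"%d 0 obj" % (i + 1), off)
               for i, off in enumerate(offsets))


objects = [b"<< /Type /Catalog /Pages 2 0 R >>",
           b"<< /Type /Pages /Kids [] /Count 0 >>"]
pdf, offsets = build_pdf(objects)

# Simulate a hand edit: one extra space inside object 1 shifts everything after it.
edited = pdf.replace(b"/Catalog", b"/Catalog ", 1)

print(offsets_valid(pdf, offsets))     # offsets match the untouched file
print(offsets_valid(edited, offsets))  # object 2 is no longer where the table says
```

A single inserted byte is enough to make every later offset
stale, which is why a hand-edited PDF usually has to be repaired
by a tool that rebuilds the table.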

Some of the complexity in the file format can be seen as
catering toward making the life of document creator programs
easier (and consequently, a little harder for the document
viewer), instead of making the document creator do just a
little more work in order to allow a simpler file format.
(I'm referring to the practice of making stream lengths an
indirect object, allowing the document creator to calculate
the stream length as the stream is being output, then putting
the computed length at the *end* of the stream with its own
object number and entry in the object table, instead of having
the document creator compute this beforehand and put it at
the beginning of the stream, so that the viewer can know the
expected stream length before reading the stream and without
having to do an object lookup.)
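
The two ways of recording a stream length can be sketched in
Python as follows.  The function names and the object numbers
are assumptions chosen for this illustration; only the
"/Length n" versus "/Length n 0 R" distinction comes from the
PDF format.

```python
import io


def write_stream_direct(out, num, data):
    # The length is known up front, so it sits in the stream dictionary itself.
    out.write(b"%d 0 obj\n<< /Length %d >>\nstream\n" % (num, len(data)))
    out.write(data)
    out.write(b"\nendstream\nendobj\n")


def write_stream_indirect(out, num, len_num, chunks):
    # The length is deferred: the dictionary only refers to object len_num.
    out.write(b"%d 0 obj\n<< /Length %d 0 R >>\nstream\n" % (num, len_num))
    total = 0
    for chunk in chunks:                 # data can be produced incrementally
        out.write(chunk)
        total += len(chunk)
    out.write(b"\nendstream\nendobj\n")
    # Only now is the length known; emit it as its own object after the stream.
    out.write(b"%d 0 obj\n%d\nendobj\n" % (len_num, total))
    return total


direct = io.BytesIO()
write_stream_direct(direct, 4, b"0 0 m 10 10 l S")

indirect = io.BytesIO()
total = write_stream_indirect(indirect, 4, 5, [b"0 0 m ", b"10 10 l S"])
```

The direct variant forces the producer to buffer the whole
stream before writing it; the indirect variant lets the producer
write as it goes, at the cost of an extra object lookup for
the reader.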

Furthermore, PDF does away with the single greatest feature
of PostScript: programmability.  (Just imagine groff without
macros.)  Sure, allowing loops and conditionals and whatnot
in a document can cause unpredictability, but when done right
it can make document structure so much simpler, for example
with subroutines that accept arguments to draw repeated
graphic objects with slight variations.  (Datapoints in a
scatterplot, for example, with different shapes, colors,
and sizes.  Gnuplot, for instance, makes nice use of this
capability, and you can easily tweak the appearance by
editing the PostScript code.)

PostScript's integration of programming constructs and graphics
functions is extremely elegant.  The idea of treating all
graphic objects (including the letters of text to be printed)
as paths to be filled and stroked really shines when combined
with the ability to manipulate these paths and the coordinate
transformation matrices on-the-fly through algorithms.
And the stack-based postfix approach works well when you
have to pass around, load and store, and otherwise perform
computations on and manipulate data that can ultimately be
used to draw stuff on the page.

PDF uses the same postfix syntax in the actual page streams.
But without the ability to manipulate objects, the stack-based
approach loses its utility -- putting stuff on the stack
to retrieve later doesn't make much sense if you can't do
anything with it.  (Of course, PDF is derived from PostScript;
if you already have a PostScript interpreter, it means you
can reuse large parts of it.)

In a nonprogrammable language, passing data to a graphics
function only when and where it is needed seems much more
straightforward.  A simple syntax consisting of a function name
with arguments following (e.g., as in HP's graphics language
HP-GL, or as in groff) would be much easier to parse for the
document viewer, in particular because you can use an optimized
syntax for each function.  (Some functions only accept numbers,
some only text, etc., so debugging will also be easier.)
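
A name-first syntax of this kind could be parsed along the
lines of the Python sketch below.  The command names (move,
line, text) and their argument types are hypothetical, chosen
only to show a per-function parser; they are not taken from
HP-GL, groff, or PDF.

```python
# Hypothetical command set: the name comes first and fixes the argument type.
PARSERS = {
    "move": lambda args: [float(a) for a in args],   # numbers only
    "line": lambda args: [float(a) for a in args],   # numbers only
    "text": lambda args: [" ".join(args)],           # text only
}


def parse_command(line):
    """Split 'name arg arg ...' and parse the arguments as the name dictates."""
    name, *args = line.split()
    if name not in PARSERS:
        raise ValueError(f"unknown command: {name}")
    return name, PARSERS[name](args)


print(parse_command("move 10 20"))
print(parse_command("text hello world"))
```

Because the parser knows the command before it sees the
arguments, a malformed argument (say, a word where a number is
expected) can be reported right where it occurs.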

Add to this recent developments like cross-reference streams,
"compatible" PDFs which include both a cross-reference
table *and* a cross-reference stream (wtf?), embedded XML
for "semantic" purposes, and PDF comes across as a terrible
hodgepodge of different syntaxes.  Of course all of this is
understandable from its history and evolution, but PDF is
far from being an elegant file format.



PS: I used to think that PostScript drivers would output
ugly code, but the atrocities committed under the name PDF
are worse.  Many PDF creator programs ignore all the operators
Adobe has provided to make text printing simpler, and output
only the most braindead code imaginable.

PPS: I forgot to mention JavaScript: yet another different
language grafted on.  I predict that one of these days Adobe
will see the usefulness of programmability within the PDF page
streams, and add still another language to provide this.  :-)




