Re: UTF-8 in grout and a performance regression (was: synchronous and as

groff

From:	Deri
Subject:	Re: UTF-8 in grout and a performance regression (was: synchronous and asynchronous grout)
Date:	Mon, 23 Dec 2024 13:31:07 +0000
Apologies: Branden and onf were firing off emails at a rapid pace and this took 
rather longer to write, so some of the points below were already covered in 
their discourse.

On Friday, 20 December 2024 02:45:56 GMT G. Branden Robinson wrote:
> > If that's the case, I wonder why you're concerned about a tiny
> > fraction of a tiny fraction of people not being able to display those
> > characters..?
> 
> Because I think a lot of the occurrences of someone staring at grout are
> going to come from people attempting to troubleshoot problems.  

Hi Branden,

I think the incidence of staring at grout among our users is very small, and 
without a good knowledge of grout commands may be restricted to just searching 
for the text where their document is not behaving as they want. Without 
allowing UTF-8 in the grout, even this would prove difficult:-

printf ".pdfbookmark 1 Καλά χριστούγεννα" | test-groff -Tpdf -k -fTINO -Z
x T pdf
x res 72000 1 1
x init
p1
x X ps:exec [/Dest /pdf:bm1 /View [/FitH 5001 u] /DEST pdfmark
x X ps:exec [/Dest /pdf:bm1 /Title (\[u039A]\[u03B1]\[u03BB]\[u03AC] \[u03C7]\[u03C1]\[u03B9]\[u03C3]\[u03C4]\[u03BF]\[u03CD]\[u03B3]\[u03B5]\[u03BD]\[u03BD]\[u03B1]) /Level 1 /OUT pdfmark
x trailer
V792000
x stop

Where has my Greek "Happy Christmas" gone? Wouldn't it be better to have:-

x X ps:exec [/Dest /pdf:bm1 /Title (Καλά χριστούγεννα) /Level 1 /OUT pdfmark

In the grout?

> The very
> first time they see the output format may be when they are in a
> frustrating situation.  Under such circumstances, a representation
> format that will work even over a serial line to a bad DEC VT100
> emulator is a good thing to have.  Rendering a blue  and a brown  is🎈 
> not a high priority.

Unless it is the 🎈  which is mis-behaving, and you want to find 
it in the grout.

> Now, it is true that sometimes, the nature of the problem people will be
> troubleshooting will in fact have something to do with correct glyph
> selection.  Not all the time.  Maybe not even most of the time.  

True, but finding the correct part of the grout to examine will require 
finding the text around where the problem lies.

To "debug" a misbehaving roff document the first port of call is to turn 
warnings on, then stare at your document for mistakes, use groff debugging 
tools (which you have added to), judicious use of .tm to check register values 
are what you expect (and in the correct units), check bracket use in numeric 
expressions (groff's arithmetic parser may not behave the way you are used 
to).

All this can be done without a peek at grout, and will find most problems. If 
none of the above solves the issue, it's always open to you to provide an 
example of the issue to this list. Where grout examination comes into its own 
is checking for real bugs in the code, since it gives a clean division between 
blaming troff or the output driver for the problem. This is not something our 
average users do.

I would class Peter as an expert in the groff language; I wonder how often he 
looks at grout to tackle issues with mom. (Of course he might say "all the 
time", at which point I would have to concede it may be more prevalent than I 
suspect!!)

> If
> you're staring at grout, I suspect output positioning problems or, as
> we've seen recently with Peter Schaffter's novel use of `char`, tricky
> sequencing issues involving the asynchrony of the command stream are
> more likely.

Yes, this was a bug in troff, and as I said above, grout is essential for 
pinning down bugs, but that is what we are for. Assuming a document issue is 
not down to a code bug, but rather a mistake by the user, the user will have 
more luck finding the issue by doing the things suggested above, i.e. at the 
document level, rather than once removed in grout.

> But, some of the time, sure, yes.  In those cases, depicting 🤡s and
> 🎈s [gratuitous 8-bit microcomputer game reference for the aged] in a
> self-representing manner would be nice.  Thus my openness to it being a
> dynamically configurable choice.

You could check the user's locale preferences and encode appropriately.
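
As a rough illustration of that idea (a sketch only, not how troff or any output driver actually does it), the choice could look like the following. The function names are hypothetical; the only thing taken from this thread is the \[uXXXX] notation shown in the grout above.

```python
import locale


def to_groff_escapes(text):
    # Replace each non-ASCII character with a groff-style [uXXXX] special
    # character escape; ASCII passes through unchanged.
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.append("\\[u%04X]" % ord(ch))
    return "".join(out)


def encode_for_locale(text, encoding=None):
    # Hypothetical policy: emit the text as-is when the user's locale
    # encoding is UTF-8, otherwise fall back to ASCII-only escapes.
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    normalized = encoding.lower().replace("-", "").replace("_", "")
    if normalized == "utf8":
        return text
    return to_groff_escapes(text)


if __name__ == "__main__":
    print(encode_for_locale("Καλά χριστούγεννα"))
```

In a UTF-8 locale the bookmark title stays searchable as typed; anywhere else it degrades to the escaped form that works on a plain ASCII terminal.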

> > [1] This was your stated reason for not committing "stringhex" from my
> > branch, even though I told you I had a version of pdf.tmac which did
> > not pollute the grout file with hex,
> 
> Does this mean we agreed that emitting hexadecimal sequences in the way
> "stringhex" did was not great for readability?  Its "pollution" was not
> even limited to grout, but showed up inside the formatter too.
> 
> <https://lists.gnu.org/archive/html/groff/2024-02/msg00027.html>:
>
>>> If I'm debugging using troff and dump the string/macro list, then I
>>> envision it being disheartening to see something like this.
>>>
>>> .pm
>>> PDFLB   9
>>> pdfswitchtopage 32
>>> pdfnote 380
>>> pdf:note-T      57
>>> pdfpause        29
>>> PDFBOOKMARK.VIEW        21
>>> pdf:look(0073007500700065007200630061006c006900660072006100670069006c0069007300740069006300650078007000690061006c00690064006f00630069006f007500732602) 41
>>> pdfmark 31
>>> pdftransition   58
>>> pdfbackground   40
>>> pdfpagenumbering        37
>>> pdfbookmark     1677

Well, I had to laugh at this; apologies if it was not intended as humour. 
First you ignore my statement that I had a version which does not "pollute" 
grout, and then you point to a .pm listing showing a very long but perfectly 
valid register name which, to you, conveys no meaning. Then you present your 
slow replacement, which uses two registers for the same purpose, pdf:look.id!1 
and pdf:look.content!1, both of which convey no meaning. (NB When Branden 
committed his linear search the naming convention had changed: pdf:look.id!1 
becomes pdf:bm1.tag and pdf:look.content!1 becomes pdf:bm1.val.) At least with 
the hex it is relatively simple to decode the name:-

$ ./hex.pl 
0073007500700065007200630061006c006900660072006100670069006c0069007300740069006300650078007000690061006c00690064006f00630069006f007500732602
supercalifragilisticexpialidocious☂
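
The hex.pl script itself is not shown here. As a stand-in, a minimal sketch of such a decoder follows, assuming (as the sample above suggests, with 0073 for "s" and 2602 for the umbrella) that the name is a run of four-digit UTF-16 big-endian code units. The function names are made up for illustration.

```python
def decode_hex_name(hex_digits):
    # Turn a string of hex digits back into text, reading it as
    # UTF-16BE code units (an assumption based on the sample above).
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def encode_hex_name(name):
    # Inverse operation: text to lowercase hex digits of its UTF-16BE form.
    return name.encode("utf-16-be").hex()


if __name__ == "__main__":
    import sys

    for arg in sys.argv[1:]:
        print(decode_hex_name(arg))
```

Passing the long name from the .pm listing as an argument should give back the readable bookmark name, which is all the "simple decode script" needs to do.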

And you give an example of using .tm

.tm \*[pdf:look.content!1]
supercalifragilisticexpialidocious\[u2602]

Say, somewhere in a document with a hundred named bookmarks:-

.pdfhref M -N supercalifragilisticexpialidocious☂ Mary Poppins

With your method, in a document with 100 bookmarks (!1 to !100), how do you 
know which number is the one for "supercalifragilisticexpialidocious☂"? I have 
a little chuckle imagining you .tm ing each one until you find the one you 
want. :-)

We are talking about the output of .pm, since this is what concerns you!!

> I regarded the foregoing naming convention as an uncomfortable barrier
> to observability.  Do you disagree?

I'm not sure. At least with the hex you have a chance of understanding the 
internal state (using a simple decode script), since you can deduce which 
string you wish to .tm; with just a number, no internal state can be inferred. 
Do you agree?

> > instead you decided to introduce a substandard solution during my
> > sabbatical. When I told you that one particular document was now
> > taking over 13 minutes to produce, you did say you would need to do
> > something about it.
> 
> I haven't been able to reproduce it or anything like it.  I haven't
> experienced any noticeable rendering time degradation at all, and I use
> bleeding-edge groff pretty much daily.  I also haven't heard complaints
> from Alex Colomar, who produces documents even more gigantic than
> groff-man-pages.pdf.

Oh, you've forgotten, I did tell you that the program I wrote for Alex to 
create gigantic documents avoids the need to do your linear lookups:-

==================================================================++======

> Also I rewrote prepare.pl to not use calls to .MR, to make it faster,
> which is why you notice no slow down in the run since Branden released
> code to pdf.tmac, which affected the speed of .MR.

I want to keep an eye on this.  As soon as I observe/reproduce a major
performance hit (with _any_ man page collection), I mean to do something
about it.

==========================================================================

In fact, Alex did ask me to give him a version using .MR calls. After running 
it he reported disappointing results to us, and I don't think his current git 
is using the .MR version.

Bleeding edge groff includes my fixes and speedups to the code you committed. 
Unfortunately, my improvements to your code will stop working when you 
implement this in an.tmac:-

.\" TODO: Construct a hierarchical tag name for (sub)section headings
.\" based on the page identifier (and for subsections, the parent
.\" section).

The speedup I implemented was to stop adding bookmarks which are not "named" 
(no -T parameter) to the linear list: if a bookmark is not a specifically 
named destination, you won't have any links to it, so it does not need to be 
in the list. When the TODO above is done, this will have no effect in speeding 
the loop up, because all bookmarks will be named destinations. So running this 
test is a valid indicator of what the speed will be after the above is 
implemented. My speedup is particularly aimed at man page booklets, since 
there only the base level bookmarks are named destinations; all hierarchical 
bookmarks below this are unnamed, until the above is implemented. Other 
document types, particularly mom, have named subheadings.

> Can you send me an exhibit that reproduces the problem?

I have an example for you, download file http://chuzzlewit.co.uk/loop.tar.bz
which contains a file containing Alex's man pages using calls to .MR (rather 
than my customisation which avoids any lookup).

The first step is to checkout the git version immediately prior to my speedup 
of your code and apply the small patch (this is just so PRINTSTYLE=1 works 
using your pdf:lookup macro and corrects the flaw in your code which allowed 
duplicate entries for different bookmarks):-

git checkout 55c59df79
patch -p1 < pdf.patch
cd build
make -j
make check
sudo make install

Then run:-

time ( pdfmom --roff -man -petk LMB.prep  > LMB-linear.pdf ) 

Should be enough time for a couple of tinnies, on my system : real  
10m55.999s. (Used to be 13m before I upgraded my desktop).

Then try:-

time ( pdfmom --roff -man -dPRINTSTYLE=1 -petk LMB.prep  > LMB-hash.pdf )  

(This works because the code you committed did not work with mom so you left 
it using the hashing algorithm if PRINTSTYLE was set, so it is useful for 
timing the two methods in your code.)

On my system : real  0m16.623s

You should notice a significant difference in the two run times, and I'd 
welcome judgement on whether a 39x slowdown in producing a real world document 
is acceptable. It would be interesting to see other people's timings for the 
two runs, just in case the disparity is unique to my particular rig. 
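
For anyone comparing their own timings, the ratio can be worked out from the two "real" values as below. This is just arithmetic on the figures quoted above; the helper name is made up.

```python
import re


def parse_real_time(text):
    # Convert a time(1)-style value such as "10m55.999s" into seconds.
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError("unrecognised time format: %r" % text)
    return int(match.group(1)) * 60 + float(match.group(2))


linear = parse_real_time("10m55.999s")  # linear search run
hashed = parse_real_time("0m16.623s")   # PRINTSTYLE=1 (hash lookup) run
slowdown = linear / hashed

print("%.3fs / %.3fs = %.1fx" % (linear, hashed, slowdown))
```

On the figures above this comes out a little over 39, which is where the 39x comes from.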

In case you may be thinking this real world example is not truly 
representative: it has over 3000 pages, which is large, but only 14613 
hotspot links, each one a link to another page in the book. It also has 
1758 external links, which are particularly expensive because each lookup has 
to loop over the entire list of 2676 discrete destinations before the link can 
be classified as external. By comparison, my car owner's handbook PDF has 194 
pages with 17129 hotspot links to other pages. So the density of links per 
page is not particularly impressive for the Shadow Man Book.
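
To put a number on why the external links hurt, here is a toy model of the two lookup strategies. It is not the pdf.tmac code, only a sketch: the destination names and the example URL are invented, and the only figures taken from above are the 2676 destinations and 1758 external links.

```python
def linear_lookup(names, wanted):
    # Scan front to back; return (index, comparisons made).
    # A miss returns index -1 after comparing against every entry.
    for i, name in enumerate(names):
        if name == wanted:
            return i, i + 1
    return -1, len(names)


# Invented destination names standing in for the real bookmark tags.
names = ["dest%d" % i for i in range(2676)]

# Hash-style lookup: a dictionary keyed on the name.
index_of = {name: i for i, name in enumerate(names)}

# An external link is a miss, so the linear scan walks the whole list.
_, miss_cost = linear_lookup(names, "https://example.org/")
external_cost = 1758 * miss_cost

print("comparisons per external link:", miss_cost)
print("comparisons for all external links:", external_cost)
print("dictionary probes for the same links:", 1758)
```

So the misses alone account for about 4.7 million string comparisons in the linear version, before counting any of the internal links, against one keyed lookup per link with a hash.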

If you try to build against master, to see the current speedup my changes 
make, you will also notice there are 60 extra pages produced. This is because 
the margin sizes were changed in September/October; there's now a notably 
large gap after the title line. I don't remember seeing any discussion on 
this change to the rendering of man pages. The NEWS item explains it is "to 
align more closely with the traditional implementations of these packages.  
Per man(7) in the AT&T Unix System III manual (June 1980)". Given the 
"alignment" is not exact, I fail to see the advantage in changing "different" 
to a different "different"!

> Also, can you tell me what apart from your complaint about performance
> renders my solution substandard?  Equivalently, assuming I can make my
> solution performant again, would you regard it as standard?  If not,
> why not?

Nothing! If the hash lookup code (which has always been there since Keith 
wrote pdfmark.tmac and I used it in pdf.tmac) is considered the "standard", 
replacing it with code which can take 39x longer could be considered 
substandard. Also there was an error in incrementing the numeric index and a 
missed use case (which is why it did not work with mom).

If you can produce code which performs as well as (or better than) the code 
you replaced, of course it would be welcomed. Remember the purpose of 
stringhex was to allow non-ASCII to be used in PDF meta-data (e.g. bookmarks).

Cheers

Deri

> Regards,
> Branden