
Re: PDF outline not capturing Cyrillic text


From: Robin Haberkorn
Subject: Re: PDF outline not capturing Cyrillic text
Date: Sun, 4 Feb 2024 06:57:22 +0300

Regarding Cyrillic characters in PDF outlines, I think I gained a few
insights today.

It turns out that the pdfmarks in the PostScript code are "text strings"
according to the PDF spec, that is, they are encoded either in
PDFDocEncoding or in UTF-16BE with a leading byte-order mark
(cf. PDF Reference 1.7).
PDFDocEncoding is basically Latin-1, it seems.
This explains why the current code in MOM works with western European
languages.
Now, in order to include Cyrillic, you will have to re-encode the text
from whatever encoding groff uses and passes to the postprocessor (and
which subsequently ends up in the PostScript code) to UTF-16BE.
The result then needs to be hex-encoded and enclosed in angle
brackets (<FEFF....>).
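
To illustrate the target format, here is a minimal Python sketch of that
encoding step. The function name pdf_hex_text_string is made up for this
example; it only shows what a UTF-16BE hex string with a leading byte-order
mark looks like, not how groff or gropdf produce it.

```python
def pdf_hex_text_string(text: str) -> str:
    """Return text as a PDF hex string: <FEFF...> in UTF-16BE.

    Hypothetical helper for illustration only.
    """
    # FE FF is the byte-order mark that tells a PDF reader the string
    # is UTF-16BE rather than PDFDocEncoding.
    payload = b"\xfe\xff" + text.encode("utf-16-be")
    # Hex-encode and wrap in angle brackets, as PostScript/PDF expect.
    return "<" + payload.hex().upper() + ">"


if __name__ == "__main__":
    print(pdf_hex_text_string("Я"))
```

Plain ASCII text goes through the same path, so there is no need to
special-case Latin titles.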

In the hackiest case, this could be done by a script run on the
PostScript code generated by `pdfroff --emit-ps`. As a proof of concept,
here's an incomplete but somewhat working version in SciTECO:

    sciteco -e "16,0ED @EB/document.ps/ <@S|/Title (|; -D @I|<FEFF| .(@S|) /OUT|6R).@EC{iconv -f KOI8-R -t UTF-16BE | hexdump -e '1/1 \"%02X\"'} @I/>/ D> @EW//"

This assumes that the groff encoding is KOI8-R, which I chose as an
intermediate format in order to enable Russian hyphenation
(which unfortunately does not work).
It should be rewritten as a Python or Perl script using some
iconv wrapper, or ideally pdfroff itself could do it.
The script could even interpret groff Unicode escapes generated by preconv
and convert them back to plain Unicode before writing out everything in
UTF-16BE.
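
A rough Python sketch of what such a script could look like follows. It
makes several assumptions that I have not checked against real pdfroff
output: that every outline entry appears as "/Title (...)" with no nested
unescaped parentheses, that the bytes inside are KOI8-R, and that any
preconv escapes show up literally as \[uXXXX] once the PostScript string
has been unescaped. All function names are made up for the example.

```python
import re

# Assumed shape of an outline entry: "/Title (" ... ")".
_TITLE = re.compile(rb"/Title \(((?:\\.|[^\\()])*)\)", re.DOTALL)
# PostScript string escapes: octal \ddd, or a backslash before one byte.
_PS_ESCAPE = re.compile(rb"\\([0-7]{1,3}|.)", re.DOTALL)
# groff/preconv escapes such as \[u041F] or \[u0065_0301].
_GROFF_UNICODE = re.compile(r"\\\[u([0-9A-Fa-f]{4,6}(?:_[0-9A-Fa-f]{4,6})*)\]")


def _ps_unescape(raw: bytes) -> bytes:
    """Undo octal and literal escapes (\\n, \\t etc. are not handled)."""
    def repl(m):
        tok = m.group(1)
        if tok[:1] in b"01234567":
            return bytes([int(tok.decode("ascii"), 8) & 0xFF])
        return tok  # \( \) \\ become ( ) \
    return _PS_ESCAPE.sub(repl, raw)


def decode_groff_escapes(text: str) -> str:
    """Replace \\[uXXXX] escapes with the characters they name."""
    def repl(m):
        return "".join(chr(int(h, 16)) for h in m.group(1).split("_"))
    return _GROFF_UNICODE.sub(repl, text)


def reencode_titles(ps: bytes, source_encoding: str = "koi8_r") -> bytes:
    """Rewrite every /Title (...) as a UTF-16BE hex string <FEFF...>."""
    def repl(m):
        text = _ps_unescape(m.group(1)).decode(source_encoding)
        text = decode_groff_escapes(text)
        payload = b"\xfe\xff" + text.encode("utf-16-be")
        return b"/Title <" + payload.hex().upper().encode("ascii") + b">"
    return _TITLE.sub(repl, ps)


if __name__ == "__main__":
    import sys
    # Usage (hypothetical): python reencode.py < document.ps > fixed.ps
    sys.stdout.buffer.write(reencode_titles(sys.stdin.buffer.read()))
```

Working on bytes rather than decoded text keeps the rest of the
PostScript file untouched, whatever encoding it happens to contain.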

I will probably just use such a hack for my purposes.

What's the status of pdfroff anyway? I read that it is more or less
deprecated and that we should all use `groff -Tpdf` instead.
pdfmom should actually work with ms as well: it uses gropdf and
should perform the necessary multipass processing for pdfhref
forward references to work.
I will try this next!

Best regards,
Robin


