Re: PDF outline not capturing Cyrillic text


From: Deri
Subject: Re: PDF outline not capturing Cyrillic text
Date: Tue, 06 Feb 2024 13:39:51 +0000

On Sunday, 4 February 2024 03:57:22 GMT Robin Haberkorn wrote:
> Regarding Cyrillic characters in PDF outlines, I think I got a few
> insights today.
> 
> It turns out that the pdfmarks in the PostScript code are "text strings"
> according to the PDF spec, that is, encoded either in PDFDocEncoding or in
> UTF-16BE with a leading byte-order mark (cf. PDF Reference 1.7).
> PDFDocEncoding is basically Latin-1, it seems.
> This explains why the current code in MOM works with western European
> languages.
> Now, in order to include Cyrillic, you will have to re-encode the text from
> whatever encoding Groff uses and passes to the postprocessor (which
> subsequently ends up in the PostScript code) to UTF-16BE.
> Everything needs to be hex-encoded and enclosed in angle
> brackets (<FEFF....>).
> 
> In the most hacky case, this could be done by a script run on the
> PostScript code generated by `pdfroff --emit-ps`. As a proof of concept,
> here's an incomplete but somewhat working version in SciTECO:
> 
>     sciteco -e "16,0ED @EB/document.ps/ <@S|/Title (|; -D @I|<FEFF| .(@S|)
> /OUT|6R).@EC{iconv -f KOI8-R -t UTF-16BE | hexdump -e '1/1 \"%02X\"'} @I/>/
> D> @EW//"
> 
> This assumes that the Groff encoding is KOI8-R, which I chose as an
> intermediate format in order to enable Russian hyphenation
> (but that does not work unfortunately).
> It should be rewritten into a Python or Perl script using some
> iconv wrapper or ideally pdfroff itself could do it.
> The script could even interpret Groff Unicode escapes generated by preconv
> and convert them back to plain Unicode before writing out everything in
> UTF-16.
> 
> I will probably just use such a hack for my purposes.
> 
> What's the status of pdfroff anyway? I read that it is more or less
> deprecated and we should all use `groff -Tpdf` instead.
> Actually, pdfmom should work with ms as well; it uses
> gropdf and should perform the necessary multipass processing
> for pdfhref forward references to work.
> Will try this next!
> 
> Best regards,
> Robin
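
The re-encoding described in the quoted message (UTF-16BE with a leading
byte-order mark, hex-encoded inside angle brackets) can be sketched in
Python as follows. This is only an illustration under stated assumptions,
not what gropdf or pdfroff actually do: the function names are made up
for the example, and it assumes the title text is already a Python str,
possibly still containing \[uXXXX] escapes of the kind preconv emits.

```python
import re

# Matches groff Unicode escapes such as \[u0414], including the
# composite form \[u0438_0306] (base character plus combining marks).
_GROFF_ESCAPE = re.compile(r"\\\[u([0-9A-Fa-f]{4,6}(?:_[0-9A-Fa-f]{4,6})*)\]")


def decode_groff_escapes(text: str) -> str:
    """Replace \\[uXXXX] escapes with the characters they name.

    Hypothetical helper for this sketch; anything that is not a
    \\[u...] escape is passed through unchanged.
    """
    def repl(match: re.Match) -> str:
        return "".join(chr(int(cp, 16)) for cp in match.group(1).split("_"))

    return _GROFF_ESCAPE.sub(repl, text)


def pdf_text_string(text: str) -> str:
    """Encode text as a hex PDF text string: <FEFF...> in UTF-16BE.

    Hypothetical helper for this sketch. The FEFF prefix is the
    byte-order mark that marks the string as UTF-16BE.
    """
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


if __name__ == "__main__":
    title = decode_groff_escapes(r"\[u0414]\[u0430]")
    print(pdf_text_string(title))  # prints <FEFF04140430>
```

A post-processing script along the lines Robin suggests would still need
to locate each /Title (...) entry in the PostScript output and substitute
the converted string; that part is left out here.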

Hi Robin,

The current gropdf (in the master branch) does support UTF-16BE for PDF
outlines (see attached PDF), but Branden has not released the other parts
needed to make it work! If you can compile and install the current git, then
applying the attached patch should give you what you want.

To apply the patch, cd into the git groff directory and run
"patch -p1 < path-to-patch-file", then run make and install as usual.

I would be very interested in how you get on, and whether it gives you what
you need. Note that I am assuming you are feeding groff a UTF-8 file and
passing the -k flag. I can see some hyphenation happening, but I don't know
if it is correct.

Cheers 

Deri

Attachment: master.patch
Description: Text Data

Attachment: Rus2.pdf
Description: Adobe PDF document

Attachment: Rus2.trf
Description: Text document

