groff
From: T. Kurt Bond
Subject: Re: Why do Unicode Characters in the PDF Outline show up as, for example, [u1FOA1]?
Date: Mon, 10 Aug 2020 09:39:03 -0400

Got it, thanks!

On Mon, Aug 10, 2020 at 8:24 AM Deri <deri@chuzzlewit.myzen.co.uk> wrote:

> On Sunday, 9 August 2020 05:58:15 BST T. Kurt Bond wrote:
>
> > Anyway, in the output file (attached to this e-mail) the Unicode
> > characters show up fine in the body text, but in the PDF Outline the
> > characters show up as [uXXXX] text instead of the actual character. Does
> > anybody know why this is? I know that if I do something similar for
> > Heirloom troff the PDF Outline *does* contain the Unicode characters.
>
> In the PDF Reference, text strings are defined as:
>
> =============================================================================
>
> 3.8.1 Text Strings
>
> Certain strings contain information that is intended to be human-readable,
> such as text annotations, bookmark names, article names, document
> information, and so forth. Such strings are referred to as text strings.
> Text strings are encoded in either PDFDocEncoding or Unicode character
> encoding. PDFDocEncoding is a superset of the ISO Latin 1 encoding and is
> documented in Appendix D. Unicode is described in the Unicode Standard by
> the Unicode Consortium (see the Bibliography).
>
> For text strings encoded in Unicode, the first two bytes must be 254
> followed by 255, representing the Unicode byte order marker, U+FEFF. (This
> sequence conflicts with the PDFDocEncoding character sequence thorn
> ydieresis, which is unlikely to be a meaningful beginning of a word or
> phrase.) The remainder of the string consists of Unicode character codes,
> according to the UTF-16 encoding specified in the Unicode standard, version
> 2.0. Commonly used Unicode values are represented as 2 bytes per character,
> with the high-order byte appearing first in the string.
>
> =============================================================================
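
As a rough illustration of the rule quoted above (not taken from the PDF
Reference or from groff), a Unicode text string could be built in Python by
putting the byte order mark in front of the UTF-16 big-endian bytes. The
function name pdf_text_string is made up for this sketch.

```python
import codecs


def pdf_text_string(text: str) -> bytes:
    """Encode text as a PDF Unicode text string: bytes 254, 255, then UTF-16BE."""
    # codecs.BOM_UTF16_BE is b"\xfe\xff", i.e. the bytes 254 and 255.
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


if __name__ == "__main__":
    # U+2640 fits in one 16-bit unit, high-order byte first.
    print(pdf_text_string("\u2640").hex())
    # U+1F0A1 lies outside the Basic Multilingual Plane, so UTF-16
    # represents it as a surrogate pair (two 16-bit units).
    print(pdf_text_string("\U0001F0A1").hex())
```

A character above U+FFFF takes four bytes after the marker, which is why the
quoted text only says that "commonly used" values take 2 bytes.
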
>
> Since groff works internally with ASCII, the \[uXXXX] form of input is
> converted to a separate node which is a named glyph in the appropriate
> font. In the groff_out format this can be seen as "Cu2640", for example,
> which tells the output driver to look for the named glyph in a particular
> font.
>
> This is only true for text which is destined for the output stream;
> parameters to .pdfhref are just treated as ASCII, i.e. PDFDocEncoding.
>
> Cheers
>
> Deri
>
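
To make the distinction concrete: a string that still contains literal
\[uXXXX] text, such as a .pdfhref parameter, would have to be decoded into
real characters before it could be written out as a Unicode text string. The
sketch below shows one way that decoding could look in Python. It is not how
groff or gropdf handle it; the function name is hypothetical, and it assumes
only the single code point form \[uXXXX] with 4 to 6 uppercase hex digits.

```python
import re

# Matches the literal text \[uXXXX] with 4 to 6 uppercase hex digits.
# Assumption: composite forms such as \[u0041_0300] are left untouched.
_ESCAPE = re.compile(r"\\\[u([0-9A-F]{4,6})\]")


def _replace(match: re.Match) -> str:
    code = int(match.group(1), 16)
    # Leave the escape as-is if it is beyond the Unicode range.
    return chr(code) if code <= 0x10FFFF else match.group(0)


def decode_unicode_escapes(text: str) -> str:
    """Replace each \\[uXXXX] escape in text with the character it names."""
    return _ESCAPE.sub(_replace, text)


if __name__ == "__main__":
    title = r"Ace of spades \[u1F0A1]"
    # ascii() keeps the output printable on any terminal encoding.
    print(ascii(decode_unicode_escapes(title)))
```

The decoded string could then be encoded as UTF-16 with the leading 254, 255
bytes described in the PDF Reference excerpt.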


-- 
T. Kurt Bond, tkurtbond@gmail.com, https://tkurtbond.github.io

