Re: [ft] FW: Getting the charcode Value when the Glyph ID is known


From: mpsuzuki
Subject: Re: [ft] FW: Getting the charcode Value when the Glyph ID is known
Date: Wed, 4 May 2011 21:55:57 +0900

On Mon, 2 May 2011 09:19:37 +0000
"Balraj Balakrishnan, Integra-PDY, IN"
<address@hidden> wrote:
>As am new to freetype and all these font stuffs, I couldn't rather
>frame my requirement in a right manner. I shall be making an another
>attempt to bring about much more clarity in what I really want from
>freetype:

OK. I don't think there is any special manner specific to
this list; clarifying the input, process and output is
important on any open-source mailing list.

>1.  The scenario here is, we are trying to convert the source PDF into
>an HTML, while doing this there are many fonts in the PDF which are
>extracted or mapped to a wrong character.

I see. What software are you using to translate from PDF
to HTML? Could you post (or upload to any web site) a
sample PDF that shows the issue?

Basically, an elementary font object in PDF (a data
segment which you splice from the PDF and pass to
FT_New_Face()) is not expected to hold an interface
to character encoding. For the relationship between
glyph index (or glyph name) and the character code,
you should look at the /Encoding or /ToUnicode elements
in the wrapping font object in the PDF (which refers to
its elementary font object via the /BaseFont object).
The referrer's /Encoding dictionary can override the
built-in encoding info in the referred font.
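
To illustrate the override, here is a minimal sketch in plain C
(no FreeType calls). It assumes the referrer's overrides have
already been parsed into (code, glyph name) pairs, in the manner
of a /Differences array inside /Encoding. The names
struct encoding, struct difference, build_effective_encoding()
and code_for_glyph_name() are hypothetical, made up for this
example; they are not part of FreeType or of any PDF library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CODE_COUNT 256

/* Hypothetical table: glyph name for each single-byte character code
 * (NULL means the code is unmapped). */
struct encoding {
    const char *glyph_name[CODE_COUNT];
};

/* Hypothetical parsed override entry: this code now selects this glyph. */
struct difference {
    unsigned char code;
    const char *glyph_name;
};

/* Start from the font's built-in encoding, then let the referrer's
 * overrides replace the entries they mention. */
static void build_effective_encoding(const struct encoding *builtin,
                                     const struct difference *diffs,
                                     size_t ndiffs,
                                     struct encoding *effective)
{
    assert(builtin != NULL && effective != NULL);
    *effective = *builtin;
    for (size_t i = 0; i < ndiffs; i++)
        effective->glyph_name[diffs[i].code] = diffs[i].glyph_name;
}

/* Reverse lookup: return the first character code that selects the
 * given glyph name, or -1 if no code maps to it. */
static int code_for_glyph_name(const struct encoding *effective,
                               const char *glyph_name)
{
    for (int code = 0; code < CODE_COUNT; code++) {
        const char *name = effective->glyph_name[code];
        if (name != NULL && strcmp(name, glyph_name) == 0)
            return code;
    }
    return -1;
}
```

For example, if the built-in table has "quotesingle" at 0x27 and
the referrer overrides 0x27 with "quoteright", then looking up
"quoteright" in the effective table gives 0x27, and "quotesingle"
is no longer reachable. Note that this only recovers the character
code used in the PDF content stream; getting a Unicode value from
it would still need /ToUnicode or a glyph-name-to-Unicode table.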

I think there is existing software, like pdftohtml,
which does such work reasonably well.

>So we are extracting the font files from the PDF, to
>convert glyph's (Symbols, Unicode) in the font file
>as an image and replace the wrongly extracted characters
>/Symbols/Unicode in the HTML file with the image.

As I wrote above, an extracted font file is an insufficient
resource for guessing the code points of the glyphs.

>In the above mentioned scenario the image should maintain
>its position in the outline in order place it in an HTML
>file. If you look at the image below the fonts Quote
>right and the Comma is differentiated based on its position
>in a given line.

Are you saying that your program (at present) cannot detect
the character code points for the single quote glyph and
the comma glyph from the PDF, and so you want to guess the
code points by looking inside the font itself?
Can Adobe Acrobat extract the text from your PDF?


