[pdf-devel] Re: Comments to the Encoded Text API
From: jemarch
Subject: [pdf-devel] Re: Comments to the Encoded Text API
Date: Sun, 20 Jan 2008 17:13:47 +0100
User-agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (Shijō) APEL/10.6 Emacs/23.0.50 (i686-pc-linux-gnu) MULE/5.0 (SAKAKI)
> 5. The pdf_text_get_best_encoding function will need specific system
> functions to get the range of Unicode covered by each host encoding;
> if no such function is available on a given operating system, a
> default Unicode encoding will be returned.
>
> Remember that this function should return an encoding _actually
> supported_ by the host. If the host supports a Unicode encoding, that
> will always be the best encoding available. If it does not, the
> function cannot return a Unicode encoding.
>
> I think it would be good to investigate the availability of the
> functions you need to determine the range covered by a given host
> encoding on Unix, GNU, Mac OS X and Windows (we need to determine the
> allowed values for pdf_text_host_encoding_t anyway. An email about
> this follows).
I don't really understand why someone would want to create a
pdf_text_t in a host encoding different from the one the user/system
is currently using. Is this really needed? I am talking about
functions such as `pdf_text_new_from_host' and
`pdf_text_get_best_encoding', where a specific host encoding is passed
to, or returned by, the function. Shouldn't the function detect which
host encoding the system is using and just use it? AFAIK, the host
encoding exists only to receive strings from the user and send strings
back to the user (not to store anything in the PDF file, at least not
in the user's encoding), and the user expects strings in a single host
encoding, which could even be detected once. Am I right?
Note that we are using Unicode encodings to store the text strings
internally. That means we cover the entire 31-bit Unicode space.
Now suppose you are using a GNU system. GNU systems support Unicode
encodings, but right now your current locale uses a Latin-1
encoding. If your pdf_text_t variable contains Chinese text, for
example, you will need a host encoding able to represent Chinese
characters, if one is available.
The easiest way to handle these host encoding conversions in GNU/Linux
is the wchar_t type and the multi-byte functions. The problem is that
there is no way to get conversions to/from encodings other than the
one specified in the user's locale. To get those other conversions,
either the locale would have to be changed at runtime (not a good
idea) or other utilities such as GNU libiconv would have to be used
explicitly.
Hmm. It is not a problem to use GNU libiconv if we are running on a
GNU system, but it would be a problem when running in a POSIX
environment without the GNU libraries installed.
Why not just detect the host encoding once, when the program starts,
and use that single encoding in all the operations involving host
encodings (get/set)? That would make it possible to use wchar_t and
the multi-byte functions, with no need to call iconv.
The approach given in the Text Encoding API is quite similar to the
way things are done on Windows, where you first have to ask for the
specific ANSI code page in use (GetACP) and then pass that identifier
to the MultiByteToWideChar or WideCharToMultiByte functions. The
equivalent two-step approach in GNU/Linux would use nl_langinfo (to
get the name of the encoding set in the locale) and GNU libiconv for
the conversions, so it is possible, but I am not sure it is really
needed.
So the problem remains for POSIX systems, doesn't it?
What do people think about this?