From: Hadrien Dussuel
Subject: Re: [bug-gnu-libiconv] Need help on UTF-8 to asian conversion (cp92, 936, 949, 950)
Date: Tue, 3 Mar 2015 16:59:47 +0100

I would like to thank you for the explanation. That was clear!
I can't integrate calls to iconv into the DLL, as that would require shipping extra DLLs to all users. However, I have discovered that my problem wasn't caused by the iconv function but by the fact that the decoding is hardcoded in the executable. The problem is now solved.

Thank you for your support!
Sincerely,
Hadrien

2015-02-26 9:57 GMT+01:00 Daiki Ueno <address@hidden>:
Hadrien Dussuel <address@hidden> writes:

> Now, we are implementing Asian language support, but I'm quite lost
> with the conversion function. Asian characters are often encoded
> on 3 bytes, and the game will read each byte as a char. I'm trying to
> adapt the iconv function to gather the three chars (to make a wchar)
> and guess the Asian char.
>
> Let's take an example with a Korean string: 아브라함 (UTF-8).
> The game reads ì•„ë¸Œë ¼í•¨.
> The corresponding bytes are:
> 아 : EC 95 84
> 브 : EB B8 8C
> 라 : EB 9D BC
> 함 : ED 95 A8
>
> The string is read as follows: ì•„ë¸Œë ¼í•¨
> ì : EC
> • : 95
> „ : 84
> ë : EB
> ¸ : B8
> Œ : 8C
> ë : EB
> 9D (unprintable)
> ¼ : BC
> í : ED
> • : 95
> ¨ : A8
>
> Now, here is the original iconv function:
> static int
> cp949_mbtowc (conv_t conv, ucs4_t *pwc, const unsigned char *s, int n)

[snip the code]

> It only expects 2 chars/bytes and we have 3, so I don't understand
> what I should do to process the multibyte sequence into a wchar. For
> the example, EC 95 84, how should the conversion be handled? The first
> byte is still important, as EB 95 84 and ED 95 84 are also characters...

There seems to be some confusion between a Coded Character Set (CCS)
and a Character Encoding Scheme (CES):
http://tools.ietf.org/html/rfc2978#section-1.4

Let's take "아" as an example.  The character is given a code point
value C544 (in hex) in ISO 10646 aka Unicode, which is a CCS.  The ISO
10646 code point can be encoded as a three-byte sequence EC 95 84 in
UTF-8, which is a CES.
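
To make the CCS/CES distinction concrete, here is a minimal sketch in C of
how the three UTF-8 bytes EC 95 84 map back to the code point C544.  The
helper name utf8_decode3 is made up for this illustration and is not part
of libiconv; it handles only the three-byte form and is not a general
UTF-8 decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of libiconv): decode one three-byte
   UTF-8 sequence of the form 1110xxxx 10xxxxxx 10xxxxxx.
   Returns the code point, or UINT32_MAX if the bytes are not a valid
   three-byte sequence. */
uint32_t utf8_decode3(const unsigned char *s)
{
    assert(s != NULL);

    /* Check the lead byte and both continuation bytes. */
    if ((s[0] & 0xF0) != 0xE0 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
        return UINT32_MAX;

    /* 4 payload bits from the lead byte, 6 from each continuation byte. */
    uint32_t cp = ((uint32_t)(s[0] & 0x0F) << 12)
                | ((uint32_t)(s[1] & 0x3F) << 6)
                |  (uint32_t)(s[2] & 0x3F);

    /* Reject overlong forms and UTF-16 surrogates. */
    if (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))
        return UINT32_MAX;
    return cp;
}
```

Calling utf8_decode3 on the bytes EC 95 84 yields 0xC544.  The lead byte
contributes the top four bits, which is why EB 95 84 and ED 95 84 decode
to different characters.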

On the other hand, there are other CCS/CES combinations which can
represent the character.  The same character is given a code point value
3E46 in KS X 1001, which is a CCS.  The KS X 1001 code point is encoded
as a two-byte sequence BE C6 in CP949, which is a CES capable of
encoding KS X 1001 code points.
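
As a rough sketch of the encoding step for this second pair, a KS X 1001
code point maps to its two CP949 bytes by setting the high bit of each
byte, which turns 3E46 into BE C6.  The function name ksx1001_to_cp949 is
invented for this example, and it covers only code points in the KS X 1001
range (both bytes in 21..7E); the additional characters that CP949 defines
outside KS X 1001 are out of scope here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of libiconv): encode a KS X 1001 code
   point as two CP949 bytes.  Returns 0 on success, or -1 if either byte
   of the code point lies outside the range 0x21..0x7E. */
int ksx1001_to_cp949(uint16_t code, unsigned char *out)
{
    assert(out != NULL);

    unsigned char hi = (unsigned char)(code >> 8);
    unsigned char lo = (unsigned char)(code & 0xFF);

    if (hi < 0x21 || hi > 0x7E || lo < 0x21 || lo > 0x7E)
        return -1;

    /* Set the high bit of each byte: 0x3E -> 0xBE, 0x46 -> 0xC6. */
    out[0] = hi | 0x80;
    out[1] = lo | 0x80;
    return 0;
}
```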

In order to convert UTF-8 encoded Korean characters into CP949, one
would decode them to ISO 10646 code points, then convert those code
points into KS X 1001 code points, and finally encode them in CP949.

However, with the iconv functions, you normally don't need to worry
about it.  You can just create an iconv_t object with:

  iconv_open ("CP949", "UTF-8");

and feed the byte sequence to the iconv() function.
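
A fuller sketch of that call sequence might look like the following.  It
is only an illustration: the wrapper name utf8_to_cp949 and its buffer
conventions are assumptions, error handling is reduced to a single failure
value, and it assumes the installed iconv knows the name "CP949" and
declares the input buffer argument as char ** (some older libiconv
versions use const char **).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper: convert in_len bytes of UTF-8 from `in` into
   CP949 in `out` (capacity out_cap).  Returns the number of bytes
   written, or (size_t)-1 if the conversion could not be completed. */
size_t utf8_to_cp949(const char *in, size_t in_len, char *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);

    iconv_t cd = iconv_open("CP949", "UTF-8");   /* to, from */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;      /* iconv() advances these pointers */
    char *outp = out;
    size_t inleft = in_len;
    size_t outleft = out_cap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);   /* flush state */

    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - outleft;
}
```

With this wrapper, passing the three bytes EC 95 84 should produce the two
bytes BE C6, with no need to call cp949_mbtowc or any other internal
function directly.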

Hope this helps,
--
Daiki Ueno

