From: Hadrien Dussuel
Subject: Re: [bug-gnu-libiconv] Need help on UTF-8 to asian conversion (cp92, 936, 949, 950)
Date: Tue, 3 Mar 2015 16:59:47 +0100
[snip the code]
Hadrien Dussuel <address@hidden> writes:
> Now, we are implementing Asian language support, but I'm quite lost
> with the conversion function. Asian characters are often encoded
> in 3 bytes, and the game will read each byte as a char. I'm trying to
> adapt the iconv function to gather the three chars (to make a wchar)
> and guess the Asian char.
>
> Let's take an example with a Korean string: 아브라함 (UTF-8).
> The game reads ì•„ë¸Œë ¼í•¨.
> The corresponding bytes are:
> 아 : EC 95 84
> 브 : EB B8 8C
> 라 : EB 9D BC
> 함 : ED 95 A8
>
> The string is read as follows: ì•„ë¸Œë ¼í•¨
> ì : EC
> • : 95
> „ : 84
> ë : EB
> ¸ : B8
> Œ : 8C
> ë : EB
> 9D (unprintable)
> ¼ : BC
> í : ED
> • : 95
> ¨ : A8
>
> Now, here is the original iconv function:
> static int
> cp949_mbtowc (conv_t conv, ucs4_t *pwc, const unsigned char *s, int n)
> It only expects 2 chars/bytes and we have 3, so I don't understand
> what I should do to process the multibyte sequence into a wchar. For
> the example, EC 95 84, how should I handle the conversion? The first
> byte still matters, as EB 95 84 and ED 95 84 are also characters...
There seems to be some confusion between a Coded Character Set (CCS)
and a Character Encoding Scheme (CES):
http://tools.ietf.org/html/rfc2978#section-1.4
Let's take "아" as an example. The character is given the code point
value C544 (in hex) in ISO 10646 aka Unicode, which is a CCS. The ISO
10646 code point can be encoded as the three-byte sequence EC 95 84 in
UTF-8, which is a CES.
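
To make the CES step concrete, here is a rough sketch of how the three
UTF-8 bytes unpack into an ISO 10646 code point. The function name
decode_utf8_3byte is made up for this illustration; it is not part of
libiconv, and it only handles the three-byte form.

```c
#include <stdint.h>

/* Hypothetical helper, not part of libiconv: decode one 3-byte UTF-8
 * sequence of the form 1110xxxx 10xxxxxx 10xxxxxx into a code point.
 * Returns -1 if the bytes do not have that shape, or if the result is
 * an overlong encoding or a surrogate. */
int32_t decode_utf8_3byte(const unsigned char *s)
{
    if ((s[0] & 0xF0) != 0xE0)
        return -1;                      /* not a 3-byte lead byte */
    if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
        return -1;                      /* not continuation bytes */

    /* 4 payload bits from the lead byte, 6 from each continuation byte. */
    int32_t cp = ((int32_t)(s[0] & 0x0F) << 12)
               | ((int32_t)(s[1] & 0x3F) << 6)
               |  (int32_t)(s[2] & 0x3F);

    if (cp < 0x800)
        return -1;                      /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return -1;                      /* surrogates are not characters */
    return cp;
}
```

Fed the byte sequences from the quoted message, this gives C544 for
EC 95 84, BE0C for EB B8 8C, B77C for EB 9D BC and D568 for ED 95 A8.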
On the other hand, there are other CCS/CES combinations which can
represent the character. The same character is given a code point value
3E46 in KS X 1001, which is a CCS. The KS X 1001 code point is encoded
as a two-byte sequence BE C6 in CP949, which is a CES capable of
encoding KS X 1001 code points.
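
The encoding step on the CP949 side can be sketched in the same way.
For code points that come from KS X 1001, the two bytes are the row and
column bytes with the high bit set. The function name ksx1001_to_cp949
is invented for this illustration, and the sketch deliberately ignores
the CP949 extension characters that are not in KS X 1001.

```c
/* Hypothetical helper, not part of libiconv: encode a KS X 1001 code
 * point (two 7-bit bytes, each in 0x21..0x7E) as two CP949 bytes by
 * setting the high bit of each byte.
 * Returns 0 on success, -1 if the code point is outside the 94x94 grid. */
int ksx1001_to_cp949(unsigned int code, unsigned char out[2])
{
    unsigned int row = (code >> 8) & 0xFF;
    unsigned int col = code & 0xFF;

    if (code > 0xFFFF
        || row < 0x21 || row > 0x7E
        || col < 0x21 || col > 0x7E)
        return -1;

    out[0] = (unsigned char)(row | 0x80);   /* 3E -> BE */
    out[1] = (unsigned char)(col | 0x80);   /* 46 -> C6 */
    return 0;
}
```

For 3E46 this yields the bytes BE C6. The step in between, mapping an
ISO 10646 code point such as C544 to 3E46, has no such formula and
needs a lookup table.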
In order to convert UTF-8 encoded Korean characters into CP949, one
would decode them to ISO 10646 code points, then convert those code
points into KS X 1001 code points, and finally encode them in CP949.
However, with the iconv functions, you normally don't need to worry
about any of this. You can just create an iconv_t object with:
iconv_open ("CP949", "UTF-8");
and feed it a byte sequence with the iconv() function.
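
A minimal sketch of that usage follows. The wrapper name utf8_to_cp949
and its buffer handling are only an illustration; it assumes the
installed iconv knows the "CP949" name and that the whole input fits in
the output buffer in one call.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper: convert in_len bytes of UTF-8 at `in` into
 * CP949 at `out` (capacity out_cap). Returns the number of bytes
 * written, or (size_t)-1 on any error. */
size_t utf8_to_cp949(const char *in, size_t in_len,
                     char *out, size_t out_cap)
{
    /* Note the argument order: target encoding first, then source. */
    iconv_t cd = iconv_open("CP949", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* iconv() takes char **, but it only reads from the input buffer. */
    char *src = (char *)in;
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    if (rc != (size_t)-1)
        /* Flush any pending shift state; a no-op for stateless encodings. */
        rc = iconv(cd, NULL, NULL, &dst, &dst_left);

    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - dst_left;
}
```

Calling it with the three bytes EC 95 84 should write the two bytes
BE C6. With GNU libiconv installed as a separate library, the program
may need to be linked with -liconv.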
Hope this helps,
--
Daiki Ueno