From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] Trouble converting to Japanese charsets
Date: Fri, 6 Nov 2009 10:25:12 +0100
User-agent: KMail/1.9.9

Hi,

Jeff Diehl wrote:
> I am having trouble converting the string "配信リスト名テスト①" from UTF-8 to
> SHIFT-JIS, EUC-JP and ISO-2022-JP using libiconv (version 1.13.1).
> Here is a hex representation of the source string:
> 
> $ xxd utf8.txt
> 0000000: e985 8de4 bfa1 e383 aae3 82b9 e383 88e5  ................
> 0000010: 908d e383 86e3 82b9 e383 88e2 91a0       ..............

Or, to reproduce it:
  $ printf '\xe9\x85\x8d\xe4\xbf\xa1\xe3\x83\xaa\xe3\x82\xb9\xe3\x83\x88\xe5\x90\x8d\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88\xe2\x91\xa0' > utf8.txt

> The problem seems to be the "circled digit one" character (Unicode
> 0x2460). Can you please explain why these conversions fail?

This fails because the character U+2460 is not in the target encoding.
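
To confirm that the last three bytes of utf8.txt (e2 91 a0) really are
U+2460, the UTF-8 sequence can be decoded by hand. What follows is only a
minimal sketch in POSIX shell arithmetic. The function name
utf8_3byte_to_codepoint is made up for this illustration, and it handles
only three-byte sequences, without any validation.

```shell
# Hypothetical helper (not part of libiconv): decode one three-byte UTF-8
# sequence, given as three hex byte values, into a Unicode code point.
# Bit layout of a three-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
utf8_3byte_to_codepoint() {
  b1=$(( $1 )); b2=$(( $2 )); b3=$(( $3 ))
  # Keep the 4 + 6 + 6 payload bits and concatenate them.
  cp=$(( ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F) ))
  printf 'U+%04X\n' "$cp"
}

# The final bytes of utf8.txt are e2 91 a0.
utf8_3byte_to_codepoint 0xE2 0x91 0xA0   # prints U+2460
```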

For a reference to the various Japanese encodings, please refer to
http://www.haible.de/bruno/charsets/conversion-tables/Japanese.html

> I was expecting to see libiconv generate the following strings:
> 
> $ xxd sjis.txt
> 0000000: 947a 904d 838a 8358 8367 96bc 8365 8358  .z.M...X.g...e.X
> 0000010: 8367 8740                                .g.@

0x8740 is not in the Shift_JIS range: In Shift_JIS there are no
characters between 0x84BE and 0x889F.

Probably you mean the CP932 encoding, which is the Shift_JIS-like
encoding used by Windows. libiconv supports it:

$ iconv -f UTF-8 -t CP932 < utf8.txt | hd
000000  94 7A 90 4D 83 8A 83 58 83 67 96 BC 83 65 83 58  .z.M...X.g...e.X
000010  83 67 87 40                                      .g.@

> $ xxd euc-jp.txt
> 0000000: c7db bfae a5ea a5b9 a5c8 ccbe a5c6 a5b9  ................
> 0000010: a5c8 ada1                                ....

0xada1 is not in the EUC-JP range: In EUC-JP there are no characters
between 0xA8C0 and 0xB0A1.

I don't know which of the many EUC-JP variants you were expecting.
I would recommend sticking with plain standardized EUC-JP if you
value interoperability and don't like loss of data.
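
The range argument used above for Shift_JIS and EUC-JP can be checked
mechanically. The following is a minimal sketch in POSIX shell. The name
in_gap is a hypothetical helper, and it assumes the bounds quoted in this
message are exclusive, i.e. that the boundary codes themselves are
assigned characters.

```shell
# Hypothetical helper: succeed if a two-byte code lies strictly inside a
# range of unassigned codes. Arguments: code, lower bound, upper bound.
# Assumption: both bounds are themselves valid characters (exclusive gap).
in_gap() {
  code=$(( $1 )); lo=$(( $2 )); hi=$(( $3 ))
  [ "$code" -gt "$lo" ] && [ "$code" -lt "$hi" ]
}

# Shift_JIS: no characters between 0x84BE and 0x889F.
in_gap 0x8740 0x84BE 0x889F && echo "0x8740 is not plain Shift_JIS"
# EUC-JP: no characters between 0xA8C0 and 0xB0A1.
in_gap 0xADA1 0xA8C0 0xB0A1 && echo "0xADA1 is not plain EUC-JP"
```

Both checks succeed for the trailing byte pairs of sjis.txt and
euc-jp.txt, so the expected output relies on a vendor extension in each
case.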

> $ xxd 2022.txt
> 0000000: 1b24 4247 5b3f 2e25 6a25 3925 484c 3e25  .$BG[?.%j%9%HL>%
> 0000010: 4625 3925 482d 211b 284a                 F%9%H-!.(J

ISO-2022-JP is not well suited for Japanese: It does not even contain
the halfwidth Katakana characters. I don't know why you would want to
use this encoding. Nobody uses it.

An encoding similar to ISO-2022-JP that is still sometimes used for
email or web pages is ISO-2022-JP-2. libiconv supports it:

$ iconv -f UTF-8 -t ISO-2022-JP-2 < utf8.txt | hd
000000  1B 24 42 47 5B 3F 2E 25 6A 25 39 25 48 4C 3E 25  .$BG[?.%j%9%HL>%
000010  46 25 39 25 48 1B 24 41 22 59 1B 28 42           F%9%H.$A"Y.(B

Your 2022.txt is not valid in any known encoding.
$ printf '\x1b\x24\x42\x47\x5b\x3f\x2e\x25\x6a\x25\x39\x25\x48\x4c\x3e\x25\x46\x25\x39\x25\x48\x2d\x21\x1b\x28\x4a' > 2022.txt
$ iconv -f ISO-2022-JP -t UTF-8 < 2022.txt > /dev/null
/arch/x86-linux/gnu-inst-libiconv/1.13/bin/iconv: (stdin):1:18: cannot convert
$ iconv -f ISO-2022-JP-1 -t UTF-8 < 2022.txt > /dev/null
/arch/x86-linux/gnu-inst-libiconv/1.13/bin/iconv: (stdin):1:18: cannot convert
$ iconv -f ISO-2022-JP-2 -t UTF-8 < 2022.txt > /dev/null
/arch/x86-linux/gnu-inst-libiconv/1.13/bin/iconv: (stdin):1:18: cannot convert
$ iconv -f ISO-2022-CN -t UTF-8 < 2022.txt > /dev/null
/arch/x86-linux/gnu-inst-libiconv/1.13/bin/iconv: (stdin):1:0: cannot convert
$ iconv -f ISO-2022-CN-EXT -t UTF-8 < 2022.txt > /dev/null
/arch/x86-linux/gnu-inst-libiconv/1.13/bin/iconv: (stdin):1:0: cannot convert
$ iconv -f ISO-2022-KR -t UTF-8 < 2022.txt > /dev/null
/arch/x86-linux/gnu-inst-libiconv/1.13/bin/iconv: (stdin):1:0: cannot convert
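
The same series of checks can be written as a loop. This is only a sketch:
accepted_encodings is a hypothetical helper name, it relies solely on the
exit status of iconv, and other iconv implementations (glibc, for
instance) may accept or reject different input than libiconv does.

```shell
# Hypothetical helper: print each candidate encoding under which iconv can
# decode the given file to UTF-8. Arguments: file, then encoding names.
accepted_encodings() {
  file=$1; shift
  for enc in "$@"; do
    # Only the exit status matters; all output and diagnostics are dropped.
    if { iconv -f "$enc" -t UTF-8 < "$file"; } > /dev/null 2>&1; then
      echo "$enc"
    fi
  done
}

# Recreate 2022.txt (26 bytes); \033 is the ESC byte 0x1b.
printf '\033%s\033%s' '$BG[?.%j%9%HL>%F%9%H-!' '(J' > 2022.txt

result=$(accepted_encodings 2022.txt ISO-2022-JP ISO-2022-JP-1 \
  ISO-2022-JP-2 ISO-2022-CN ISO-2022-CN-EXT ISO-2022-KR)
if [ -n "$result" ]; then
  echo "$result"
else
  echo "no candidate encoding accepts 2022.txt"
fi
```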

Bruno



