Re: [bug-gnu-libiconv] iconv in terminal and cpp differs


From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] iconv in terminal and cpp differs
Date: Wed, 04 Oct 2017 11:53:59 +0200
User-agent: KMail/5.1.3 (Linux/4.4.0-96-generic; KDE/5.18.0; x86_64; ; )

Hi,

To investigate these issues, it is useful to display the byte sequences in
hexadecimal form. You could use 'od -t x1' to do this; I prefer 'hd',
implemented as
=====================================================================
#!/bin/sh
hexdump -e '"%06.6_ax  " 16/1 "%02X "' -e '"  " 16/1 "%_p" "\n"' "$@"
=====================================================================
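
If hexdump is not installed, the 'od' alternative mentioned above can be
sketched as follows. The four sample bytes (CE D2 B0 AE, the start of the
GB18030 text discussed below) and the variable name 'hex' are only
illustrative choices, not part of the original report.

```shell
#!/bin/sh
# Emit four sample bytes using octal escapes (printf '\xNN' is not
# portable in POSIX sh) and dump them in hex. -An drops od's offset column.
printf '\316\322\260\256' | od -An -t x1

# Same dump, squeezed into one compact string, handy for comparisons in scripts.
hex=$(printf '\316\322\260\256' | od -An -t x1 | tr -d ' \n')
echo "$hex"
```

The compact form loses the 16-bytes-per-line layout, so 'hd' remains nicer
for reading longer dumps by eye.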

> When we run this as command line as shown below
> 
> iconv -f gb18030 -t utf-8 GB18030.txt > utf-8.txt
> 
> utf-8.txt has - 我爱北京天安门，天安门上太阳升

The original bytes that you input in this conversion were:

$ echo '我爱北京天安门，天安门上太阳升' | iconv -f UTF-8 -t GB18030 | hd
000000  CE D2 B0 AE B1 B1 BE A9 CC EC B0 B2 C3 C5 A3 AC  ................
000010  CC EC B0 B2 C3 C5 C9 CF CC AB D1 F4 C9 FD 0A     ...............

> but when we run the attached source code’s binary by inputting like below
> in same terminal,
> 
> [Encoder  GB18030_String “FROM”  “TO”]
> Encoder ÎÒ°®±±¾©Ìì°²ÃÅ£¬Ìì°²ÃÅÉÏÌ«ÑôÉý “GB18030” “UTF-8”
> 
> We are getting
> 
> 脦脪掳庐卤卤戮漏脤矛掳虏脙脜拢卢脤矛掳虏脙脜脡脧脤芦脩么脡媒

The bytes that you gave as input in this conversion were:

$ echo '脦脪掳庐卤卤戮漏脤矛掳虏脙脜拢卢脤矛掳虏脙脜脡脧脤芦脩么脡媒' | iconv -f UTF-8 -t GB18030 | hd
000000  C3 8E C3 92 C2 B0 C2 AE C2 B1 C2 B1 C2 BE C2 A9  ................
000010  C3 8C C3 AC C2 B0 C2 B2 C3 83 C3 85 C2 A3 C2 AC  ................
000020  C3 8C C3 AC C2 B0 C2 B2 C3 83 C3 85 C3 89 C3 8F  ................
000030  C3 8C C2 AB C3 91 C3 B4 C3 89 C3 BD 0A           .............

As you can see, there are roughly twice as many bytes here. More precisely,
the input you gave here is UTF-8 encoded: each byte above 0x7F has become a
two-byte sequence. Look at this:

$ echo '脦脪掳庐卤卤戮漏脤矛掳虏脙脜拢卢脤矛掳虏脙脜脡脧脤芦脩么脡媒' | iconv -f UTF-8 -t GB18030 | iconv -f UTF-8 -t ISO-8859-1 | hd
000000  CE D2 B0 AE B1 B1 BE A9 CC EC B0 B2 C3 C5 A3 AC  ................
000010  CC EC B0 B2 C3 C5 C9 CF CC AB D1 F4 C9 FD 0A     ...............

Here we find your original input again!

So, there was an undesired conversion from ISO-8859-1 to UTF-8 on your input.
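
The effect can be reproduced without any terminal involved. The following
is a minimal sketch, assuming an iconv that knows GB18030 and ISO-8859-1;
it uses only the first two characters of the sample (bytes CE D2 B0 AE),
and the variable names are made up for illustration.

```shell
#!/bin/sh
# First two characters of the sample text as GB18030 bytes: CE D2 B0 AE.
# Written as octal escapes for printf, since '\xNN' is not portable.
orig='\316\322\260\256'

# Step 1: the unwanted conversion. The bytes are misread as ISO-8859-1
#         and re-encoded as UTF-8, turning each byte into two.
# Step 2: the program then decodes those UTF-8 bytes as GB18030.
garbled=$(printf "$orig" | iconv -f ISO-8859-1 -t UTF-8 | iconv -f GB18030 -t UTF-8)

# Undo both steps in reverse order to get the original bytes back,
# then decode them as GB18030, which is what they really are.
recovered=$(printf '%s' "$garbled" \
  | iconv -f UTF-8 -t GB18030 \
  | iconv -f UTF-8 -t ISO-8859-1 \
  | iconv -f GB18030 -t UTF-8)

printf '%s\n%s\n' "$garbled" "$recovered"
```

On a UTF-8 terminal this should show the first four garbled characters from
your output on the first line and the first two characters of the intended
text on the second.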

I would guess that you are on Linux, and this conversion happened when you did a
copy&paste of the snippet, from a file into a terminal window.

Bruno
