bug-gnu-libiconv

Re: 答复: [bug-gnu-libiconv] iconv Bug report


From: Bruno Haible
Subject: Re: 答复: [bug-gnu-libiconv] iconv Bug report
Date: Wed, 20 Jun 2007 01:20:31 +0200
User-agent: KMail/1.5.4

Dear 刘军民,

> 1. The byte sequence 0xa3 0xa0 is not valid GBK, and have many similar byte
> sequence, example: 0xa140 0xfe65 etc.
> I wrote a PHP script to print iconv unable convert valid GBK byte sequence:
> <?php
> for($a = 0x81;$a<=0xfe;$a++)
> {
>         for($b=0x40;$b<=0xfe;$b++)
>         {
>                 if($b==0x7f) continue;

Here you assume that every byte sequence with
   first byte = 0x81..0xFE,
   second byte = 0x40..0x7E, 0x80..0xFE
is a valid GBK character.

How can you assert this? Do you have a paper copy of GB 13000.1-1993?

The information I have on this point is unclear:
  - Ken Lunde's book, p. 170, says that GBK has user-defined areas at
    0x{A1..A7}{40..A0}, 0x{AA..AF}{A1..FE}, 0x{F8..FE}{A1..FE}.
  - But as you can see on
      http://www.haible.de/bruno/charsets/conversion-tables/GB2312.html
    some conversion tables don't have these user-defined areas.
  - In particular, the original CP936 table on ftp.unicode.org
      http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT
    does not have these user-defined areas.
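One way to probe the assumption against a concrete converter is to enumerate the whole two-byte range and record which pairs the converter rejects. The following is only a minimal sketch in Python. It exercises Python's built-in "gbk" and "gb18030" codecs, whose tables are independent of libiconv's and need not agree with them. The helper name scan_two_byte_range is made up for illustration.

```python
def scan_two_byte_range(codec):
    """Try every pair with lead 0x81..0xFE and trail 0x40..0x7E, 0x80..0xFE.

    Returns two lists of 2-byte sequences: those the codec decodes and
    those it rejects.
    """
    accepted, rejected = [], []
    for lead in range(0x81, 0xFF):
        for trail in range(0x40, 0xFF):
            if trail == 0x7F:
                continue  # 0x7F is never a trail byte
            pair = bytes([lead, trail])
            try:
                pair.decode(codec)
            except UnicodeDecodeError:
                rejected.append(pair)
            else:
                accepted.append(pair)
    return accepted, rejected


if __name__ == "__main__":
    for name in ("gbk", "gb18030"):
        accepted, rejected = scan_two_byte_range(name)
        print(name, "accepted:", len(accepted), "rejected:", len(rejected))
```

The counts say only what one particular table contains. They do not settle what the standard itself defines.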

> 2. Though the byte sequence 0xa3 0xa0 and other is not valid GBK, but it
> often appear in Chinese system.

What does this byte sequence mean, then? Which word do the people mean
when they type it?

I would guess that the computers on which you can type this byte sequence
are not using GBK, but either CP936 (if it's Windows) or GB18030 (if it's
Unix).

> 3. If texts contain undefined GBK byte sequence, use iconv convert it will
> get error text. The reason is GBK charset is double byte charset, but iconv
> only ignored first byte of double undefined GBK byte sequence,

This is the safest strategy for getting into a reasonable state at some
point. Recall that GBK is not a "double byte" encoding, but a mixed
single-byte / double-byte encoding.
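
To illustrate why skipping a single byte is the safer choice, here is a small Python sketch that compares the two recovery policies. It is a toy loop around Python's "gbk" codec, not libiconv's actual code. The function name decode_with_policy and the sample input (the lead byte 0xA3 followed directly by a newline) are assumptions chosen for illustration.

```python
IDEOGRAPHIC_SPACE = "\u3000"


def decode_with_policy(data, skip):
    """Decode GBK bytes; on an invalid sequence emit U+3000 and skip `skip` bytes."""
    out = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # Single-byte (ASCII) character.
            out.append(chr(data[i]))
            i += 1
            continue
        try:
            # Candidate double-byte character: lead byte plus one trail byte.
            out.append(data[i:i + 2].decode("gbk"))
            i += 2
        except UnicodeDecodeError:
            out.append(IDEOGRAPHIC_SPACE)
            i += skip
    return "".join(out)


if __name__ == "__main__":
    wo = "我".encode("gbk")
    sample = wo + b"\xa3\n" + wo  # stray lead byte followed by ASCII newline
    print(repr(decode_with_policy(sample, skip=1)))
    print(repr(decode_with_policy(sample, skip=2)))
```

With skip=1 the newline after the stray byte survives. With skip=2 it is swallowed together with the bad byte, which is the kind of damage the one-byte strategy avoids.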

>       The attach file test.txt will be convert error.
> Both  "iconv -f gbk -t utf-8 test.txt" and "iconv -f gb18030 -t utf-8
> test.txt" error.

Are you sure you are using the newest libiconv release (1.11)? I get

$ iconv -f GB18030 -t UTF-8 < test.txt > /dev/null 
$ echo $?
0
$ iconv -f GB18030 -t UTF-8 < test.txt | hd
000000  E6 88 91 EE 97 A5 E6 88 91 0A                    ..........

> 4. My idea is convert undefined GBK byte sequence to double byte space
> (U+3000). Attach file iconv-gbk.patch is a simple patch by I.

The question is: where should conversion error handling (insertion of
replacement characters etc.) take place?

Such replacements are of course application-dependent. But the POSIX standard
for iconv does not provide for customization of the error handling of iconv().
So it really has to be done in the application program, not in the iconv()
function. (Remember that iconv() is usually in libc. You cannot hack it
there.)

- At the program level, GNU libiconv's iconv program has a --byte-subst
  option, which you can use to specify a U+3000 replacement:

    $ iconv -f GBK -t UTF-8 --byte-subst=' ' < test.txt
    我 犖 

- At the C programmer level, the gnulib module 'striconveh' provides
  error handling options.
    http://cvs.sv.gnu.org/viewvc/*checkout*/gnulib/lib/striconveh.h?root=gnulib&content-type=text/plain
  (It does not allow U+3000 as replacement, only the question mark.)

- I don't know about PHP's iconv binding, but it would make sense if it
  had customizations of the error handling as well.
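
As an illustration of what such application-level customization can look like in a language binding, here is a hedged sketch using Python's codec machinery (not iconv, and not PHP). codecs.register_error is a real standard-library function. The handler name "ideographic-space" and the helper gbk_to_utf8 are invented for this example.

```python
import codecs


def ideographic_space_handler(exc):
    """Replace one undecodable byte with U+3000 and resume at the next byte."""
    if isinstance(exc, UnicodeDecodeError):
        # Resume one byte after the start of the bad sequence, regardless of
        # how many bytes the codec reported as invalid.
        return ("\u3000", exc.start + 1)
    raise exc


# The handler name is arbitrary; it only has to match the errors= argument.
codecs.register_error("ideographic-space", ideographic_space_handler)


def gbk_to_utf8(data):
    """Convert GBK bytes to UTF-8, substituting U+3000 for invalid bytes."""
    return data.decode("gbk", errors="ideographic-space").encode("utf-8")


if __name__ == "__main__":
    wo = "我".encode("gbk")
    print(gbk_to_utf8(wo + b"\xa3\n" + wo))
```

The replacement policy lives entirely in the calling program, so a different application can choose a question mark or an exception instead.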

Best regards,

                Bruno




