Re: [bug-gnu-libiconv] Big5-HKSCS


From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] Big5-HKSCS
Date: Thu, 25 Nov 2010 00:40:38 +0100
User-agent: KMail/1.9.9

oCameLo wrote:
> These're different between CP950 and Big5-HKSCS. I'm not sure which
> one is correct, but CP950 is more likely, because 0xA244, 0xA246,
> 0xA247 shouldn't map to single-width characters. Also, \uFF0F and
> \uFF3C are very common characters, so Big5-HKSCS in libiconv might not
> be able to work with many CP951 files.
> 
> Could you please tell me Big5-HKSCS in libiconv base on which kind of
> Big5, why not CP950?
> 
> Thanks for your work, very much.
> 
> 
> Big5        CP950_TO_UCS        HKSCS_TO_HKSCS
> 0xA145        0x2027 [‧]        0x2022 [•]
> 0xA14E        0xFE51 [﹑]        0xFF64 [､]
> 0xA15A        0x2574 [╴]        nil
> 0xA1C2        0x00AF [¯]        0x203E [‾]
> 0xA1C3        0xFFE3 [￣]        nil
> 0xA1C5        0x02CD [ˍ]        nil
> 0xA1E3        0xFF5E [～]        0x223C [∼]
> 0xA1F2        0x2295 [⊕]        0x2641 [♁]
> 0xA1F3        0x2299 [⊙]        0x2609 [☉]
> 0xA1FE        0xFF0F [／]        nil
> 0xA240        0xFF3C [＼]        nil
> 0xA241        0x2215 [∕]        0xFF0F [／]
> 0xA242        0xFE68 [﹨]        0xFF3C [＼]
> 0xA244        0xFFE5 [￥]        0x00A5 [¥]
> 0xA246        0xFFE0 [￠]        0x00A2 [¢]
> 0xA247        0xFFE1 [￡]        0x00A3 [£]
> 0xA2CC        0x5341 [十]        nil
> 0xA2CE        0x5345 [卅]        nil
> 0xA3E1        0x20AC [€]        nil
> 0xF9FE        0x2593 [▓]        0xFFED [￭]

The conversion table in libiconv is based on the Big5 conversion table that
was found on ftp.unicode.org in 1999 / 2000.

You're saying "CP950 is more likely", but the justification you give is very
weak. There are many, many variants of Big5; see
  http://www.haible.de/bruno/charsets/conversion-tables/Big5.html
  http://www.haible.de/bruno/charsets/conversion-tables/BIG5-HKSCS.html

Also, where did you get your column "HKSCS_TO_HKSCS" from?

The file e_hkscs_2008.pdf that can be downloaded from
http://www.ogcio.gov.hk/ccli/eng/hkscs/document.html
does not explicitly state which version of Big5 is meant to be the base.
The only indication I can find is the table in section 3.4, which gives
the expected number of characters in three blocks. When I compare this
with the character counts in the various libiconv mapping tables, I get this:

Range A140..A3BF, expect 408 characters.

$ LC_ALL=C grep -c '^0x\(A[1-2]\|A3[4-B]\)' BIG*.TXT CP950.TXT | grep -v :0'$'
BIG5-2003.TXT:408
BIG5-HKSCS-1999.TXT:401
BIG5-HKSCS-2001.TXT:401
BIG5-HKSCS-2004.TXT:401
BIG5-HKSCS-2008.TXT:401
BIG5.TXT:401
CP950.TXT:408

Range A440..C67E, expect 5401 characters.

$ LC_ALL=C grep -c '^0x\(A[4-F]\|B\|C[0-5]\|C6[4-7]\)' BIG*.TXT CP950.TXT | grep -v :0'$'
BIG5-2003.TXT:5401
BIG5-HKSCS-1999.TXT:5401
BIG5-HKSCS-2001.TXT:5401
BIG5-HKSCS-2004.TXT:5401
BIG5-HKSCS-2008.TXT:5401
BIG5.TXT:5401
CP950.TXT:5401

Range C940..F9D5, expect 7652 characters.

$ LC_ALL=C grep -c '^0x\(C[9-F]\|[DE]\|F[0-8]\|F9[4-C]\|F9D[0-5]\)' BIG*.TXT CP950.TXT | grep -v :.'$'
BIG5-2003.TXT:7652
BIG5-HKSCS-1999.TXT:7652
BIG5-HKSCS-2001.TXT:7652
BIG5-HKSCS-2004.TXT:7652
BIG5-HKSCS-2008.TXT:7652
BIG5.TXT:7652
CP950.TXT:7652
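
The prefix-based grep patterns above can also be written as an explicit
range check. This is only a minimal sketch. It assumes the mapping tables
list one entry per line, with the Big5 code in the first column written as
0x plus four uppercase hex digits, and that unmapped codes are simply absent
from the file. count_range is a hypothetical helper name, not part of
libiconv.

```shell
# count_range LO HI FILE
# Print the number of entries in FILE whose first column lies in LO..HI.
# Assumption: codes are written as 0x + four uppercase hex digits, so plain
# string comparison orders them the same way numeric comparison would.
count_range() {
  LC_ALL=C awk -v lo="$1" -v hi="$2" '
    # Only look at lines that start with a four-digit hex code.
    $1 ~ /^0x[0-9A-F][0-9A-F][0-9A-F][0-9A-F]$/ {
      # Concatenating "" forces a string comparison, not a numeric one.
      code = $1 ""
      if (code >= (lo "") && code <= (hi "")) n++
    }
    END { print n + 0 }
  ' "$3"
}
```

For example, `count_range 0xA140 0xA3BF CP950.TXT` would count the first
block. Stating the range end points directly makes it easier to check the
pattern against the table in section 3.4.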

In the first block, only CP950 and BIG5-2003 reach the expected 408
characters, so these two are the most likely candidates. But they differ
from each other as well:

$ ./table-diff /tmp/CP950.TXT /tmp/BIG5-2003.TXT
***************
*** 22,24 ****
  0xA155        0xFF5C  #       FULLWIDTH VERTICAL LINE
! 0xA156        0x2013  #       EN DASH
  0xA157        0xFE31  #       PRESENTATION FORM FOR VERTICAL EM DASH
--- 22,24 ----
  0xA155        0xFF5C  #       FULLWIDTH VERTICAL LINE
! 0xA156        0x2015  #       HORIZONTAL BAR
  0xA157        0xFE31  #       PRESENTATION FORM FOR VERTICAL EM DASH
***************
*** 96,98 ****
  0xA1C1        0x2105  #       CARE OF
! 0xA1C2        0x00AF  #       MACRON
  0xA1C3        0xFFE3  #       FULLWIDTH MACRON
--- 96,98 ----
  0xA1C1        0x2105  #       CARE OF
! 0xA1C2        0x203E  #       OVERLINE
  0xA1C3        0xFFE3  #       FULLWIDTH MACRON
***************
*** 223,228 ****
  0xA2A3        0x256F  #       BOX DRAWINGS LIGHT ARC UP AND LEFT
! 0xA2A4        0x2550  #       BOX DRAWINGS DOUBLE HORIZONTAL
! 0xA2A5        0x255E  #       BOX DRAWINGS VERTICAL SINGLE AND RIGHT DOUBLE
! 0xA2A6        0x256A  #       BOX DRAWINGS VERTICAL SINGLE AND HORIZONTAL DOUBLE
! 0xA2A7        0x2561  #       BOX DRAWINGS VERTICAL SINGLE AND LEFT DOUBLE
  0xA2A8        0x25E2  #       BLACK LOWER RIGHT TRIANGLE
--- 223,228 ----
  0xA2A3        0x256F  #       BOX DRAWINGS LIGHT ARC UP AND LEFT
! 0xA2A4        0x2501  #       BOX DRAWINGS HEAVY HORIZONTAL
! 0xA2A5        0x251D  #       BOX DRAWINGS VERTICAL LIGHT AND RIGHT HEAVY
! 0xA2A6        0x253F  #       BOX DRAWINGS VERTICAL LIGHT AND HORIZONTAL HEAVY
! 0xA2A7        0x2525  #       BOX DRAWINGS VERTICAL LIGHT AND LEFT HEAVY
  0xA2A8        0x25E2  #       BLACK LOWER RIGHT TRIANGLE
***************
*** 263,267 ****
  0xA2CB        0x3029  #       HANGZHOU NUMERAL NINE
! 0xA2CC        0x5341  #       <CJK Ideograph>
! 0xA2CD        0x5344  #       <CJK Ideograph>
! 0xA2CE        0x5345  #       <CJK Ideograph>
  0xA2CF        0xFF21  #       FULLWIDTH LATIN CAPITAL LETTER A
--- 263,267 ----
  0xA2CB        0x3029  #       HANGZHOU NUMERAL NINE
! 0xA2CC        0x3038  #       HANGZHOU NUMERAL TEN
! 0xA2CD        0x3039  #       HANGZHOU NUMERAL TWENTY
! 0xA2CE        0x303A  #       HANGZHOU NUMERAL THIRTY
  0xA2CF        0xFF21  #       FULLWIDTH LATIN CAPITAL LETTER A

So, really, it's ambiguous.
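
A comparison of this kind can be approximated with awk. The sketch below is
not the table-diff script used above. It assumes both files have the Big5
code in the first column and the Unicode value in the second, and it reports
only codes that both tables define but map differently. diff_mappings is a
hypothetical helper name.

```shell
# diff_mappings FILE_A FILE_B
# Print "CODE UCS_A UCS_B" for every code that both tables define
# but map to different Unicode values, in the order of FILE_B.
diff_mappings() {
  LC_ALL=C awk '
    $1 !~ /^0x[0-9A-F]+$/ { next }                  # skip comments, blanks
    FILENAME == ARGV[1]   { a[$1] = $2 ""; next }   # first file: remember
    ($1 in a) && (a[$1] "") != ($2 "") { print $1, a[$1], $2 }
  ' "$1" "$2"
}
```

For example, `diff_mappings /tmp/CP950.TXT /tmp/BIG5-2003.TXT` would print
one line per differing code, without the surrounding context lines.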

I won't make a backward-incompatible change to libiconv and glibc until there
is _clear_ evidence which variant of BIG5 is meant.

Bruno


