bug-gnu-libiconv

Re: [bug-gnu-libiconv] Possible CP932 conversions bug


From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] Possible CP932 conversions bug
Date: Tue, 13 Dec 2016 17:58:14 +0100
User-agent: KMail/4.8.5 (Linux/3.8.0-44-generic; KDE/4.8.5; x86_64; ; )

Hello Maxim,

> While testing codepage conversions, I came across the following discrepancy:
> when converting from CP932 to UTF-16 certain characters get converted into
> different unicode on Linux (using iconv) and Mac (using libiconv). Looking at
> some CP932 to Unicode tables online it appears that the Linux conversions are
> consistent with those tables, while the libiconv uses visually similar
> characters, but with different codes from the ones found in the
> aforementioned tables.
> 
> As far as I can tell the issue happens with following characters:
> 
> CP932  0x8160 -> Output [0x301C], Expected [0xFF5E]  // Wavy dash
> CP932  0x8161 -> Output [0x2016], Expected [0x2225]  // Vertical double line
> CP932  0x817C -> Output [0x2212], Expected [0xFF0D]  // A dash
> CP932  0x8191 -> Output [0x00A2], Expected [0xFFE0]  // Cent sign
> CP932  0x8192 -> Output [0x00A3], Expected [0xFFE1]  // Pound sign
> CP932  0x81CA -> Output [0x00AC], Expected [0xFFE2]  // Logical "not" sign

I confirm: These differences exist. With the tools from [1] and the tables
from [2], I get

$ ./table-diff glibc-2.23-iconv/CP932.TXT libiconv-1.14/CP932.TXT
***************
*** 160,163 ****
  0x815F        0xFF3C  #       FULLWIDTH REVERSE SOLIDUS
! 0x8160        0xFF5E  #       FULLWIDTH TILDE
! 0x8161        0x2225  #       PARALLEL TO
  0x8162        0xFF5C  #       FULLWIDTH VERTICAL LINE
--- 160,163 ----
  0x815F        0xFF3C  #       FULLWIDTH REVERSE SOLIDUS
! 0x8160        0x301C  #       WAVE DASH
! 0x8161        0x2016  #       DOUBLE VERTICAL LINE
  0x8162        0xFF5C  #       FULLWIDTH VERTICAL LINE
***************
*** 188,190 ****
  0x817B        0xFF0B  #       FULLWIDTH PLUS SIGN
! 0x817C        0xFF0D  #       FULLWIDTH HYPHEN-MINUS
  0x817D        0x00B1  #       PLUS-MINUS SIGN
--- 188,190 ----
  0x817B        0xFF0B  #       FULLWIDTH PLUS SIGN
! 0x817C        0x2212  #       MINUS SIGN
  0x817D        0x00B1  #       PLUS-MINUS SIGN
***************
*** 208,211 ****
  0x8190        0xFF04  #       FULLWIDTH DOLLAR SIGN
! 0x8191        0xFFE0  #       FULLWIDTH CENT SIGN
! 0x8192        0xFFE1  #       FULLWIDTH POUND SIGN
  0x8193        0xFF05  #       FULLWIDTH PERCENT SIGN
--- 208,211 ----
  0x8190        0xFF04  #       FULLWIDTH DOLLAR SIGN
! 0x8191        0x00A2  #       CENT SIGN
! 0x8192        0x00A3  #       POUND SIGN
  0x8193        0xFF05  #       FULLWIDTH PERCENT SIGN
***************
*** 246,248 ****
  0x81C9        0x2228  #       LOGICAL OR
! 0x81CA        0xFFE2  #       FULLWIDTH NOT SIGN
  0x81CB        0x21D2  #       RIGHTWARDS DOUBLE ARROW
--- 246,248 ----
  0x81C9        0x2228  #       LOGICAL OR
! 0x81CA        0x00AC  #       NOT SIGN
  0x81CB        0x21D2  #       RIGHTWARDS DOUBLE ARROW
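
For readers without the tools from [1], a comparison of this kind can be
approximated in a few lines of Python. This is only a minimal sketch, not the
table-diff program used above: the names parse_table and diff_tables are
hypothetical, and the two sample tables are just the first hunk of the diff
output, copied by hand.

```python
def parse_table(text):
    """Parse lines like '0x8160  0xFF5E  # NAME' into {byte_code: code_point}."""
    table = {}
    for line in text.splitlines():
        # Drop the trailing '# NAME' comment, then split the two hex columns.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def diff_tables(a, b):
    """Return sorted (code, a_value, b_value) for codes that a and b map differently."""
    return [(code, a.get(code), b.get(code))
            for code in sorted(a.keys() | b.keys())
            if a.get(code) != b.get(code)]


# Sample data copied from the first hunk of the diff above (not the full tables).
GLIBC_SAMPLE = """\
0x815F  0xFF3C  # FULLWIDTH REVERSE SOLIDUS
0x8160  0xFF5E  # FULLWIDTH TILDE
0x8161  0x2225  # PARALLEL TO
0x8162  0xFF5C  # FULLWIDTH VERTICAL LINE
"""

LIBICONV_SAMPLE = """\
0x815F  0xFF3C  # FULLWIDTH REVERSE SOLIDUS
0x8160  0x301C  # WAVE DASH
0x8161  0x2016  # DOUBLE VERTICAL LINE
0x8162  0xFF5C  # FULLWIDTH VERTICAL LINE
"""

for code, glibc_cp, libiconv_cp in diff_tables(parse_table(GLIBC_SAMPLE),
                                               parse_table(LIBICONV_SAMPLE)):
    print(f"0x{code:04X}: glibc U+{glibc_cp:04X}, libiconv U+{libiconv_cp:04X}")
```

Keying the tables by byte code rather than comparing line by line means the
result does not depend on the two files listing their entries in the same
order.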

It seems like the glibc variant is more closely based on the tables
published by Microsoft
  unicode.org-mappings/VENDORS/MICSFT/WINDOWS/CP932.TXT
  microsoft-2005/CP932.TXT
whereas the libiconv variant is more closely based on the JISX0208 standard
  unicode.org-mappings/EASTASIA/JIS/SHIFTJIS.TXT
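
The same split between a Microsoft-style and a JIS-style mapping can be seen
outside of glibc and libiconv. The sketch below uses Python's built-in "cp932"
and "shift_jis" codecs purely as stand-ins; they are independent
implementations, and the assumption here is only that they follow the
Microsoft and the JIS X 0208 mappings, respectively, for the six byte
sequences from the report. The compare helper is a hypothetical name.

```python
# The six CP932 byte sequences from the original report.
SAMPLES = [b"\x81\x60", b"\x81\x61", b"\x81\x7c",
           b"\x81\x91", b"\x81\x92", b"\x81\xca"]


def compare(raw):
    """Decode one two-byte sequence with both codecs; return both code points."""
    microsoft_style = raw.decode("cp932")      # assumed: Microsoft-style mapping
    jis_style = raw.decode("shift_jis")        # assumed: JIS X 0208-style mapping
    return ord(microsoft_style), ord(jis_style)


for raw in SAMPLES:
    ms_cp, jis_cp = compare(raw)
    print(f"0x{raw.hex().upper()}: cp932 U+{ms_cp:04X}, shift_jis U+{jis_cp:04X}")
```

Code that exchanges data between systems using the two variants may need to
check these six characters explicitly, since a round trip through the other
mapping does not necessarily return the original bytes.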

It's hard to say which of the two is "better" today...

Bruno

[1] http://haible.de/bruno/charsets/conversion-tables/tools.html
[2] http://haible.de/bruno/charsets/conversion-tables/Shift_JIS.html



