From: Maxim Kouznetsov
Subject: [bug-gnu-libiconv] Possible CP932 conversions bug
Date: Mon, 12 Dec 2016 23:44:36 +0000

Hello,

While testing codepage conversions, I came across the following discrepancy: when converting from CP932 to UTF-16, certain characters are converted to different Unicode code points on Linux (using iconv) and on Mac (using libiconv). Looking at some CP932 to Unicode tables online, it appears that the Linux conversions are consistent with those tables, while libiconv produces visually similar characters whose code points differ from the ones found in those tables.

As far as I can tell, the issue happens with the following characters:

CP932 0x8160 -> Output [0x301C], Expected [0xFF5E] // Wavy dash
CP932 0x8161 -> Output [0x2016], Expected [0x2225] // Vertical double line
CP932 0x817C -> Output [0x2212], Expected [0xFF0D] // A dash
CP932 0x8191 -> Output [0x00A2], Expected [0xFFE0] // Cent sign
CP932 0x8192 -> Output [0x00A3], Expected [0xFFE1] // Pound sign
CP932 0x81CA -> Output [0x00AC], Expected [0xFFE2] // Logical "not" sign

To easily reproduce the problem for individual characters I used the following test program (exactly the same cpp file was compiled and run on Ubuntu 16.10 and El Capitan, but produced two different outputs):

#include <stdio.h>
#include <iconv.h>
#include <errno.h>

void printerror()
{
    switch(errno)
    {
        case EILSEQ:
            printf("Invalid multibyte sequence in input\n");
            break;
        case EINVAL:
            printf("Incomplete multibyte sequence in input\n");
            break;
        case E2BIG:
            printf("Output buffer is out of room\n");
            break;
        default:
            printf("Generic Error\n");
            break;
    }
}

//
// Testing conversion of a CP932 character (0x8160) to UTF16 (Little Endian)
//
int main()
{
    const int SRCBYTES = 3;
    const int OUTBYTES = 4;

    iconv_t conv = iconv_open("UTF-16LE", "CP932");

    char a_source[SRCBYTES] = {};
    char a_output[OUTBYTES] = {};
    size_t i_sourcelen = SRCBYTES;
    size_t i_outputlen = OUTBYTES;

    //CP932 character 0x8160 - FULLWIDTH TILDE
    a_source[0] = (char)0x81;
    a_source[1] = 0x60;
    a_source[2] = 0;

    char* p_source = &a_source[0];
    char* p_output = &a_output[0];
    int* p_codepoint = (int*)p_output;

    // iconv() returns size_t and signals failure with (size_t)-1
    size_t ret = iconv(conv, &p_source, &i_sourcelen, &p_output, &i_outputlen);
    if(ret != (size_t)-1)
    {
        printf("srcbytes [%zu]\noutbytes [%zu]\n", i_sourcelen, i_outputlen);
        // Final output should be 0xFF5E according to various tables, such as
        // http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
        printf("UTF16 code point [0x%X]\n", *p_codepoint);
    }
    else
    {
        printerror();
    }

    iconv_close(conv);
    return 0;
}

Please let me know if this is expected behavior due to some factors I’m not aware of.

Thank you for your consideration.

Maxim Kouznetsov
Computer Scientist | Simba Technologies Inc.
A Magnitude Software Company