[bug-gnu-libiconv] Possible CP932 conversions bug

Hello,

While testing codepage conversions, I came across the following discrepancy: when converting from CP932 to UTF-16, certain characters are converted to different Unicode code points on Linux (using iconv) and on Mac (using libiconv). Judging by the CP932-to-Unicode tables available online, the Linux conversions are consistent with those tables, while libiconv produces visually similar characters whose code points differ from the ones in the tables.

As far as I can tell, the issue happens with the following characters:

CP932 0x8160 -> Output [0x301C], Expected [0xFF5E] // Wavy dash
CP932 0x8161 -> Output [0x2016], Expected [0x2225] // Vertical double line
CP932 0x817C -> Output [0x2212], Expected [0xFF0D] // A dash
CP932 0x8191 -> Output [0x00A2], Expected [0xFFE0] // Cent sign
CP932 0x8192 -> Output [0x00A3], Expected [0xFFE1] // Pound sign
CP932 0x81CA -> Output [0x00AC], Expected [0xFFE2] // Logical "not" sign

To easily reproduce the problem for individual characters I used the following test program:

(Exactly the same cpp file was compiled and run on Ubuntu 16.10 and on El Capitan, but produced two different outputs.)

#include <stdio.h>
#include <iconv.h>
#include <errno.h>

// Print a description of the errno value set by a failed iconv() call.
void printerror()
{
    switch(errno)
    {
        case EILSEQ:
            printf("Invalid multibyte sequence in input\n");
            break;
        case EINVAL:
            printf("Incomplete multibyte sequence in input\n");
            break;
        case E2BIG:
            printf("Output buffer is out of room\n");
            break;
        default:
            printf("Generic Error\n");
            break;
    }
}

// Testing conversion of a CP932 character (0x8160) to UTF16 (Little Endian)
int main()
{
    const int SRCBYTES = 3;
    const int OUTBYTES = 4;
    iconv_t conv = iconv_open("UTF-16LE", "CP932");
    if(conv == (iconv_t)-1)
    {
        printf("iconv_open failed\n");
        return 1;
    }
    char a_source[SRCBYTES] = {};
    char a_output[OUTBYTES] = {};
    size_t i_sourcelen = SRCBYTES;
    size_t i_outputlen = OUTBYTES;

    //CP932 character 0x8160 - FULLWIDTH TILDE
    a_source[0] = (char)0x81;
    a_source[1] = 0x60;
    a_source[2] = 0;  // the trailing NUL is converted too, hence 4 output bytes
    char* p_source = &a_source[0];
    char* p_output = &a_output[0];

    // iconv() returns size_t; failure is reported as (size_t)-1
    size_t ret = iconv(conv, &p_source, &i_sourcelen, &p_output, &i_outputlen);
    if(ret != (size_t)-1)
    {
        printf("srcbytes [%zu]\noutbytes [%zu]\n", i_sourcelen, i_outputlen);
        // Final output should be 0xFF5E according to various tables, such as
        // http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
        // Assemble the first UTF-16LE code unit from its two bytes
        unsigned int codeunit = (unsigned char)a_output[0] | ((unsigned char)a_output[1] << 8);
        printf("UTF16 code point [0x%X]\n", codeunit);
    }
    else
    {
        printerror();
    }
    iconv_close(conv);
    return 0;
}

Please let me know if this is expected behavior due to some factors I’m not aware of.

Thank you for your consideration.

Maxim Kouznetsov

Computer Scientist | Simba Technologies Inc. | A Magnitude Software Company
address@hidden

From:	Maxim Kouznetsov
Subject:	[bug-gnu-libiconv] Possible CP932 conversions bug
Date:	Mon, 12 Dec 2016 23:44:36 +0000