bug-gnu-libiconv


From: Mingye Wang (Arthur2e5)
Subject: Re: [bug-gnu-libiconv] cp936, cp950, cp1252, etc. does not behave like their windows counterparts
Date: Thu, 24 Nov 2016 16:33:22 -0500
User-agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.7.2

Hi,

Bruno Haible wrote:
There are two implementations of 'iconv' in GNU, one in glibc, and one
in libiconv. Here you are writing about glibc behaviour, for which you
can report bugs in the glibc bugzilla. But I can give you some background
anyway.

Hmm. I should find some time to forward some parts of this report (like cp936) to glibc.

>> cp936
You can see from http://haible.de/bruno/charsets/conversion-tables/sources.html
that the site considers Windows versions up to October 2016.

Great collection... again.

libiconv includes the '0x80 U+20AC' mapping for CP936; glibc doesn't.
Maybe because this Euro sign is not contained in the GBK 1.0 standard
(see https://en.wikipedia.org/wiki/GBK#GBK_1.0). Maybe because U+20AC
is mapped to a different codepoint in GB18030 and GB18030 is meant to
be an extension of GBK.

I guess these tables should be kept 'like Windows' as long as they are referring to a Windows Code Page. It seems that glibc simply did an alias... well.

Also, confirmed working in Cygwin where `iconv` actually comes from GNU libiconv 1.14.
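For anyone without libiconv at hand, the disagreement over the Euro sign can be probed with a few lines of Python. This is only a sketch: Python's codecs are independent of both libiconv and glibc, so whatever it reports for 0x80 under `cp936` or `gbk` reflects Python's own tables, not either iconv. The helper names `try_decode` and `try_encode` are made up for this example.

```python
EURO = "\u20ac"  # the Euro sign discussed above


def try_decode(data: bytes, encoding: str):
    """Return the decoded string, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


def try_encode(text: str, encoding: str):
    """Return the encoded bytes, or None if the codec has no mapping."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    # Compare how each codec treats byte 0x80 and the Euro sign.
    for name in ("cp936", "gbk", "gb18030", "cp1252"):
        decoded = try_decode(b"\x80", name)
        encoded = try_encode(EURO, name)
        print(f"{name}: 0x80 -> {decoded!r}, U+20AC -> {encoded!r}")
```

The same two probes, run through `iconv -f CP936` on a libiconv and a glibc system, would show the difference described in this thread.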

cp950 has no mappings for HKSCS
-------------------------------

Reference:
http://haible.de/bruno/charsets/conversion-tables/Big5.html

It is not a good idea to propagate arbitrary modifications of existing
encodings, because it causes interoperability problems. You are actually
calling it a "hack".

I am calling it a hack because MS is pushing a separate code page number (951) to mask 950 in their Windows XP support package. Not quite the case for later Windows releases...

As you can see from http://haible.de/bruno/charsets/conversion-tables/Big5.html
(search for windows-2016/CP950.TXT), on Windows 10, CP950 does *not* contain
HKSCS extension mappings.

It seems that libiconv *does* have private area mappings for Big5's user-defined blocks in cp950. glibc aliased CP950 to Big5, so it's going to be their fault. Another false alarm.

On Cygwin, iconv seems to accept \x87\x40 and give \ue000, but it does not convert the other way around.

Likewise for the official mapping tables provided by Microsoft:
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt

*todo: bug glibc for cp950 EUDC too*


Since it's a bidirectional
conversion, this assignment is not part of "best fit" behavior per [4].

You're mistaken. The point of the "best fit" converters in Microsoft is that
they also document the conversions that go only in one direction, i.e. those
that don't round-trip.

I thought these round-trip parts are not doing "best fit" and should be considered (somehow) normative.
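To make the distinction concrete, here is a minimal sketch of how the round-trip subset could be separated from the one-way entries, assuming the two directions of a best-fit file have already been parsed into dictionaries. The function names and the toy data are illustrative only and are not copied from any real mapping file.

```python
def round_trip_subset(to_unicode: dict, from_unicode: dict) -> dict:
    """Keep byte -> code point entries that the reverse table maps straight back."""
    return {
        byte: cp
        for byte, cp in to_unicode.items()
        if from_unicode.get(cp) == byte
    }


def one_way_entries(to_unicode: dict, from_unicode: dict) -> dict:
    """Return code point -> byte entries that do not survive a round trip."""
    return {
        cp: byte
        for cp, byte in from_unicode.items()
        if to_unicode.get(byte) != cp
    }


# Toy tables (assumed for illustration): two bytes that round-trip, plus one
# extra code point that is folded onto an existing byte in one direction only.
TO_UNICODE = {0x41: 0x0041, 0x80: 0x20AC}
FROM_UNICODE = {0x0041: 0x41, 0x20AC: 0x80, 0x0100: 0x41}

if __name__ == "__main__":
    print("round-trip:", round_trip_subset(TO_UNICODE, FROM_UNICODE))
    print("one-way:   ", one_way_entries(TO_UNICODE, FROM_UNICODE))
```

Under this reading, the "best fit" file carries both kinds of entries, and only the second function's output is the lossy, non-normative part.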

0x81 and 0x8d for cp1252, etc.
------------------------------
Reference:
http://haible.de/bruno/charsets/conversion-tables/CP1252.html

See the tables provided by Microsoft:
https://msdn.microsoft.com/en-us/library/cc195054.aspx
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
As you can see, 0x81 and 0x8D are not mapped.
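A quick way to see such holes in a single-byte table is to try every byte value and record the ones a decoder rejects. The sketch below uses Python's built-in `cp1252` codec as a stand-in; that codec is neither libiconv nor the Windows converter, so it only illustrates the technique. `undefined_bytes` is a hypothetical helper name.

```python
def undefined_bytes(encoding: str) -> list:
    """Return the single-byte values that the given codec refuses to decode."""
    holes = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            holes.append(value)
    return holes


if __name__ == "__main__":
    # Print the unmapped byte values in hexadecimal.
    print([hex(b) for b in undefined_bytes("cp1252")])
```

Running the same loop against a converter that follows Windows' own behaviour would be one way to compare the two tables byte by byte.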

This actually brings up why I keep going straight to the round-trip subset of these "best fit" mappings...

MS used to serve its cp950 mapping at their "go global" site[1], which now redirects to a "not found" page at [2]. This page points you to the best fit mappings that, in turn, look like what I have on current versions of Windows. As a result, I actually thought these "WINDOWS" mappings were somehow obsolete.

  [1]: https://web.archive.org/web/20110807111716/http://msdn.microsoft.com/en-us/goglobal/cc305155
  [2]: https://msdn.microsoft.com/en-us/globalization/mt767590

Yes the Windows converter does it differently...

In summary, choosing the right conversion table is a tricky choice. Don't
think that what a converter on Windows does is always the right or best option!

I still expect Windows Code Pages to be defined by Windows itself...

--
Regards,

Arthur2e5

Attachment: signature.asc
Description: OpenPGP digital signature

