bug-gnu-libiconv

Re: [bug-gnu-libiconv] Updating iconv tables


From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] Updating iconv tables
Date: Fri, 13 Jun 2008 01:59:16 +0200
User-agent: KMail/1.5.4

Dear Jim Breen,

> > You have a misconception of what EUC-JP is. EUC-JP is a character encoding
> > scheme based on three standards: ASCII, JIS X 0208, and JIS X 0212. These
> > are standards issued by Japanese authorities, and carved in stone. Anyone
> > who thinks that EUC-JP tables have to be "kept up-to-date", is asking for
> > deviation from standards, and is asking for interoperability problems!
> 
> You are out-of-date there. EUC-JP also includes JIS X 0213 ...

Wrong. The ultimate reference (standard) for character sets and their
definitions, as far as practical use is concerned, is the IANA character set
registry:
  http://www.iana.org/assignments/character-sets

It says that EUC-JP is composed of

               code set 0: US-ASCII (a single 7-bit byte set)
               code set 1: JIS X0208-1990 (a double 8-bit byte set)
                           restricted to A0-FF in both bytes
               code set 2: Half Width Katakana (a single 7-bit byte set)
                           requiring SS2 as the character prefix
               code set 3: JIS X0212-1990 (a double 7-bit byte set)
                           restricted to A0-FF in both bytes
                           requiring SS3 as the character prefix
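
To make the byte structure above concrete, here is a minimal Python sketch
that splits an EUC-JP byte string into its code sets. It only looks at the
byte layout and does not map anything to Unicode. The function name
split_code_sets is made up for illustration; SS2 = 0x8E and SS3 = 0x8F are
the usual single-shift values, which the IANA text does not spell out. The
sketch accepts the full A0-FF range quoted above, which is more lenient
than real conversion tables.

```python
SS2 = 0x8E  # single shift 2: prefix for code set 2 (half-width katakana)
SS3 = 0x8F  # single shift 3: prefix for code set 3 (JIS X 0212)


def is_high(byte: int) -> bool:
    """True if the byte lies in the A0-FF range named in the IANA entry."""
    return 0xA0 <= byte <= 0xFF


def split_code_sets(data: bytes):
    """Return a list of (code_set, raw_bytes) pairs for an EUC-JP byte string.

    Raises ValueError on a truncated or malformed sequence.
    """
    result = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            code_set, width = 0, 1      # US-ASCII, one byte
        elif lead == SS2:
            code_set, width = 2, 2      # SS2 + one byte
        elif lead == SS3:
            code_set, width = 3, 3      # SS3 + two bytes
        elif is_high(lead):
            code_set, width = 1, 2      # JIS X 0208, two bytes
        else:
            raise ValueError(f"unexpected lead byte 0x{lead:02X} at offset {i}")
        chunk = data[i:i + width]
        # Every byte after the lead byte must be in the high range.
        if len(chunk) != width or not all(is_high(b) for b in chunk[1:]):
            raise ValueError(f"malformed sequence at offset {i}")
        result.append((code_set, chunk))
        i += width
    return result


if __name__ == "__main__":
    sample = b"A\xa4\xa2\x8e\xb1\x8f\xb0\xa1"
    for code_set, chunk in split_code_sets(sample):
        print(code_set, chunk.hex())
```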

> The codepoint I raised arrived in JIS X 0213. 
> See: http://en.wikipedia.org/wiki/JIS_X_0213 for an overview.

This page refers to http://en.wikipedia.org/wiki/EUC
which says that the encoding that looks like EUC-JP but uses JIS X 0213
is called EUC-JISX0213.

And indeed the character that you meant to show me (bytes 0xAD 0xEA)
in EUC-JISX0213 is U+3231. In EUC-JISX0213, but not in EUC-JP.
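
You can check this yourself. The following sketch uses the codecs that ship
with Python ("euc_jp" and "euc_jisx0213" are Python's codec names, not
libiconv's, and Python's tables are not guaranteed to match libiconv's in
every detail). The helper decode_or_none is a made-up name. It decodes the
two bytes under each encoding and reports a failure instead of raising.

```python
def decode_or_none(raw: bytes, encoding: str):
    """Decode raw with the given codec; return None if the bytes are invalid."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    raw = bytes([0xAD, 0xEA])
    for encoding in ("euc_jp", "euc_jisx0213"):
        text = decode_or_none(raw, encoding)
        if text is None:
            print(f"{encoding}: not a valid sequence")
        else:
            points = " ".join(f"U+{ord(ch):04X}" for ch in text)
            print(f"{encoding}: {points}")
```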

> You can think of JIS X 0213 as an enhancement/replacement for JIS X 0208.

In the same sense, you can "think of" EUC-JISX0213 as an enhancement of
EUC-JP. But this "enhancement" has two caveats:

  1) Compared to EUC-JP, EUC-JISX0213 removes 6068 code points, and adds
     4355 code points instead. It is by no means an "enhancement" to drop
     more than 1000 characters!

  2) EUC-JISX0213 can be used via 'iconv', but cannot be used as a locale
     encoding in glibc based systems. This is because glibc has chosen to
     use Unicode characters as 'wchar_t' representation, and there are some
     characters in JISX0213 which don't map 1:1 to Unicode (rather 1:2,
     requiring the use of combining Unicode characters).
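
The 1:2 mapping in point 2 can be seen with a short Python sketch. It
assumes Python's bundled "euc_jisx0213" codec as a stand-in for the iconv
converter, and the helper name code_points is made up. The bytes 0xA4 0xF7
are one example of a single JIS X 0213 character that has no single Unicode
code point: it comes out as a base letter followed by a combining mark.

```python
def code_points(raw: bytes, encoding: str = "euc_jisx0213"):
    """Decode raw and return the Unicode code points as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in raw.decode(encoding)]


if __name__ == "__main__":
    # One two-byte EUC-JISX0213 character: hiragana KA with a semi-voiced mark.
    raw = bytes([0xA4, 0xF7])
    points = code_points(raw)
    print(f"{len(raw)} bytes, 1 character -> {len(points)} code points: {points}")
```

Since a 'wchar_t' in glibc holds exactly one Unicode code point, such a
character cannot be stored in a single 'wchar_t'.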

> Of course EUC-JP tables need to be kept up-to-date.

There is nothing to keep up-to-date. EUC-JP is based on JIS X 0208 and
JIS X 0212. JIS X 0213 is not a new version of JIS X 0208 or JIS X 0212; it
is a new and *different* standard. Therefore in glibc we call the
corresponding encoding EUC-JISX0213.

> > Take a look at
> >   http://www.haible.de/bruno/charsets/conversion-tables/EUC-JP.html
> > to see how many variants of EUC-JP already exist!
> 
> Sadly your WWW page omits any mention of JIS X 0213.

That's because EUC-JISX0213 is not even remotely backward compatible with
EUC-JP.

Look at
  http://www.haible.de/bruno/charsets/conversion-tables/Japanese.html

> Sun has simply kept up with the developments in Japanese coding. These are
> *not* vendor extensions.

I don't know what Sun did. But if they were providing EUC-JISX0213 under the
name "EUC-JP", that would be a very bad move, because it is not standards
compliant.

> In case you think I am talking through my hat, I must point out that I am
> one of only a handful of non-Japanese people who have participated in the
> development of the Japanese standards.

Oh, you are arguing by intimidation? Then I have to point out that I have
contributed implementations of EUC-JISX0213 and SHIFT_JISX0213 to GNU libc
and GNU libiconv in 2002, before any other vendor's iconv had it.

> I am happy to work with you in getting the full set of current Japanese
> codes into iconv. As it stands at the moment, the GNU issue does not
> adequately handle all the standard Japanese codes.

As it stands at the moment, GNU libc and GNU libiconv have all the standard
Japanese encodings; only you confused the names.

Bruno

PS: I have no idea in which encoding your EDICT dictionary now actually is.
    If you started out writing it in EUC-JP and at some point switched to
    using EUC-JISX0213, you may have dozens of entries which are correct in
    EUC-JP but wrong in EUC-JISX0213, and dozens of entries for which it is
    the opposite.
