bug-gnu-libiconv

Re: [bug-gnu-libiconv] Bug: Codeset to wchar_t fails unexpectedly on Woe32


From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] Bug: Codeset to wchar_t fails unexpectedly on Woe32
Date: Tue, 24 Apr 2007 00:00:10 +0200
User-agent: KMail/1.5.4

Keith Marshall wrote:
> I've built libiconv-1.11 on woe32, using the MinGW build of
> gcc-3.4.5, and the MSYS build tools from the MinGW project.
> 
> The good news is that it builds OOTB, and `make check' appears
> to complete successfully
> ...
> The bad news is that the implementation appears to be broken
> WRT codeset to wchar_t conversions, which incorrectly report
> EILSEQ errors when codeset != active system code page.
> 
> What follows is a fairly extensive, (and quite long), analysis
> of the problem.

Thanks for the detailed analysis.

There are two issues:

1) The approach used by libiconv for converting from/to wchar_t. Since
   the ISO C 99 standard does not "define" the representation of a wchar_t,
   the default approach is to convert through the locale encoding:
     wchar_t <--> char = locale encoding <--> other encoding.
   When one knows that the wchar_t encoding is Unicode, libiconv can convert
   directly. I didn't think about this case for Woe32 (since the main
   porting targets are Unix systems). I'm now applying the appended patch.
   It's low risk (except that it would be useful to know whether the Woe32
   wchar_t[] encoding is really UCS-2 or UTF-16).

2) The fact that on your system, the locale encoding for a Slovenian locale
   is CP1252.
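
As an illustration of issue 1), here is a minimal sketch of the decision between the direct path and the detour through the locale encoding. It is not libiconv's code: the enum and the function name are hypothetical, and the 2-byte branch is an assumption, since the appended patch only shows the `sizeof(wchar_t) == 4` case. Only the preprocessor condition is taken from the patch.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical labels for this sketch; they are not libiconv identifiers. */
enum wchar_strategy {
  WCHAR_DIRECT_UCS4,  /* wchar_t holds 32-bit Unicode code points */
  WCHAR_DIRECT_UCS2,  /* wchar_t holds 16-bit Unicode units (assumed branch) */
  WCHAR_VIA_LOCALE    /* representation unknown: go through the locale encoding */
};

/* Decide how a "wchar_t" conversion could be carried out on this platform. */
enum wchar_strategy choose_wchar_strategy(void)
{
  /* Same condition as in the patch: either the compiler promises that
     wchar_t is ISO 10646, or the target is native Woe32 (not Cygwin). */
#if __STDC_ISO_10646__ || ((defined _WIN32 || defined __WIN32__) && !defined __CYGWIN__)
  if (sizeof(wchar_t) == 4)
    return WCHAR_DIRECT_UCS4;
  if (sizeof(wchar_t) == 2)
    return WCHAR_DIRECT_UCS2;
#endif
  /* Default: wchar_t <--> char = locale encoding <--> other encoding. */
  return WCHAR_VIA_LOCALE;
}
```

With the direct strategies, the locale encoding (CP1252 in the report) plays no role in the conversion; with `WCHAR_VIA_LOCALE`, every character has to be representable in it.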

> The language is Slovenian, (although that choice is arbitrary),
> the codeset is ISO-8859-2, and my woe32 box is configured with a
> system code page, (which I don't have authority to change), of
> CP1252. ...
>   `locale_charset' does
> 
>         #elif defined WIN32_NATIVE
>  
>           static char buf[2 + 10 + 1];
>  
>           /* Woe32 has a function returning the locale's
>            codepage as a number.  */
>           sprintf (buf, "CP%u", GetACP ());
>           codeset = buf;
>  
>    which results in `tocode' being reassigned as `CP1252'; this
>    seems somehow perverse

   Indeed. I don't think you will get very far in such a locale. The use
   of "char *" to denote strings in locale-dependent encodings is pervasive
   in Unix and GNU software.

   I believe the installation of (proprietary) "language packs", such as
   for Hungarian, will allow you to get a locale with GetACP() = CP1250.

>    begs a couple of questions:--
> 
>    1a) If neither the `fromcode' nor the `tocode' is related to
>        the current locale, why do we care what codeset is used
>        in this locale?  What is the rationale for this change
>        of `tocode' to the codeset mapped for `GetACP'?

In general, the wchar_t representation is locale dependent. Examples are
Solaris and FreeBSD.

>    1b) Since `mbrtowc' functions in the context of the process'
>        active LC_CTYPE, which doesn't even necessarily match the
>        codeset from `GetACP', (it is more likely to simply be the
>        "C" locale's portable character set), what is the rationale
>        for even considering its use in this conversion context?
>        Surely, it is unlikely to be appropriate.

There is a fundamental assumption between mbrtowc and locale_charset():
the "char *" strings that are the input to mbrtowc are supposed to be
encoded in locale_charset(). On Woe32, the MSVCRT library's implementation
of mbtowc uses MultiByteToWideChar(__lc_codepage,...), where __lc_codepage
is set by setlocale().
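
To make the dependency on setlocale() concrete, here is a small sketch using only standard C functions. The function name `decode_in_locale` is hypothetical. It selects an LC_CTYPE locale and then decodes the first multibyte character of a string with mbrtowc; what a non-ASCII byte decodes to (or whether it is rejected) depends on the locale that was selected, which is the assumption described above.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Decode the first multibyte character of s under the given LC_CTYPE locale.
   Returns the number of bytes consumed (0 for an empty string),
   (size_t)-2 for an incomplete character, or (size_t)-1 if either the
   locale cannot be selected or the bytes are invalid in that locale. */
size_t decode_in_locale(const char *locale_name, const char *s, wchar_t *out)
{
  mbstate_t state;

  /* mbrtowc interprets its input according to the current LC_CTYPE. */
  if (setlocale(LC_CTYPE, locale_name) == NULL)
    return (size_t)-1;

  memset(&state, 0, sizeof state);  /* start in the initial shift state */
  return mbrtowc(out, s, strlen(s) + 1, &state);
}
```

For example, `decode_in_locale("C", "A", &wc)` consumes one byte and stores `L'A'`; passing `""` as the locale name selects whatever locale the environment specifies.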

>    2b) ... and is followed by
> 
>           outcount = cd->ofuncs.xxx_wctomb(cd,outptr,wc,outleft);
>           if (outcount != RET_ILUNI)
>                goto outcount_ok;
> 
>        which invokes `cp1252_wctomb', on the code returned from
>        `iso8859_2_mbtowc'; in this case, the return value is not
>        RET_ILUNI

Yes, you cannot get very far when you try to use Slovenian strings in a
locale whose encoding is CP1252.

> Now, observing that my GNU/Linux implementation of GCC *does* define
> `__STDC_ISO_10646__', whereas the MinGW implementation *does* *not*,
> suggests a possible work around for the failing conversion on woe32;
> by arranging to have this symbol defined, with any non-zero value

Yes, this provides a workaround, limited to libiconv. I prefer to not
define __STDC_ISO_10646__, because 'wchar_t' is only 16 bits, and ISO-10646
consists of many more than 65536 characters.
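
The size limitation can be seen in a short sketch of the standard UTF-16 encoding rule. The function name `utf16_encode` is hypothetical, and the code is not part of libiconv. A code point above U+FFFF does not fit into a single 16-bit unit and has to be split into a surrogate pair, so a 16-bit wchar_t cannot hold every ISO 10646 character in one element.

```c
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
   Returns the number of 16-bit units written to out (1 or 2),
   or 0 if cp is a surrogate value or lies beyond U+10FFFF. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;                       /* surrogates are not characters */
  if (cp > 0x10FFFF)
    return 0;                       /* outside the Unicode range */
  if (cp < 0x10000) {
    out[0] = (uint16_t)cp;          /* fits into one 16-bit unit */
    return 1;
  }
  cp -= 0x10000;                    /* 20 bits remain */
  out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
  out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
  return 2;
}
```

U+010D (LATIN SMALL LETTER C WITH CARON, used in Slovenian) needs one unit, while U+1D11E (MUSICAL SYMBOL G CLEF) needs the two units 0xD834 0xDD1E.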

> I'm less certain in the DJGPP case

DJGPP has an entirely different libc. It doesn't have wchar_t functions at
all, IIRC. Don't waste your brain cycles on it: DJGPP is no longer a porting
target.

Bruno


2007-04-23  Bruno Haible  <address@hidden>

        * lib/iconv.c (iconv_open, iconv_canonicalize): Treat native Woe32
        systems like those which define __STDC_ISO_10646__.
        Reported by Keith Marshall <address@hidden>.

*** lib/iconv.c 23 Jan 2006 13:25:49 -0000      1.24
--- lib/iconv.c 23 Apr 2007 21:07:28 -0000
***************
*** 1,5 ****
  /*
!  * Copyright (C) 1999-2006 Free Software Foundation, Inc.
   * This file is part of the GNU LIBICONV Library.
   *
   * The GNU LIBICONV Library is free software; you can redistribute it
--- 1,5 ----
  /*
!  * Copyright (C) 1999-2007 Free Software Foundation, Inc.
   * This file is part of the GNU LIBICONV Library.
   *
   * The GNU LIBICONV Library is free software; you can redistribute it
***************
*** 271,277 ****
        continue;
      }
      if (ap->encoding_index == ei_local_wchar_t) {
! #if __STDC_ISO_10646__
        if (sizeof(wchar_t) == 4) {
          to_index = ei_ucs4internal;
          break;
--- 271,277 ----
        continue;
      }
      if (ap->encoding_index == ei_local_wchar_t) {
! #if __STDC_ISO_10646__ || ((defined _WIN32 || defined __WIN32__) && !defined __CYGWIN__)
        if (sizeof(wchar_t) == 4) {
          to_index = ei_ucs4internal;
          break;
***************
*** 345,351 ****
        continue;
      }
      if (ap->encoding_index == ei_local_wchar_t) {
! #if __STDC_ISO_10646__
        if (sizeof(wchar_t) == 4) {
          from_index = ei_ucs4internal;
          break;
--- 345,351 ----
        continue;
      }
      if (ap->encoding_index == ei_local_wchar_t) {
! #if __STDC_ISO_10646__ || ((defined _WIN32 || defined __WIN32__) && !defined __CYGWIN__)
        if (sizeof(wchar_t) == 4) {
          from_index = ei_ucs4internal;
          break;
***************
*** 683,689 ****
        continue;
      }
      if (ap->encoding_index == ei_local_wchar_t) {
! #if __STDC_ISO_10646__
        if (sizeof(wchar_t) == 4) {
          index = ei_ucs4internal;
          break;
--- 683,689 ----
        continue;
      }
      if (ap->encoding_index == ei_local_wchar_t) {
! #if __STDC_ISO_10646__ || ((defined _WIN32 || defined __WIN32__) && !defined __CYGWIN__)
        if (sizeof(wchar_t) == 4) {
          index = ei_ucs4internal;
          break;




