Re: [bug-gnu-libiconv] Bug: Codeset to wchar_t fails unexpectedly on Woe32


From: Keith Marshall
Subject: Re: [bug-gnu-libiconv] Bug: Codeset to wchar_t fails unexpectedly on Woe32
Date: Tue, 24 Apr 2007 19:53:01 +0100
User-agent: KMail/1.8.2

On Monday 23 April 2007 23:00, Bruno Haible wrote:
> There are two issues:
>
> 1) The approach used by libiconv for converting from/to wchar_t.
> Since the ISO C 99 standard does not "define" the representation of a
> wchar_t, the default approach is to convert through the locale
> encoding: wchar_t <--> char = locale encoding <--> other encoding.
> When one knows that the wchar_t encoding is Unicode, libiconv can
> convert directly. I didn't think about this case for Woe32 (since the
> main porting targets are Unix systems). I'm now applying the appended
> patch. It's low risk (except that it would be useful to know whether
> the Woe32 wchar_t[] encoding is really UCS-2 or UTF-16).

I can't find any definitive statement from Microsnot on this; I did find 
some presentations and blogs on microsoft.com, which *suggest* that 
some versions of Woe32 are UCS-2, and some (newer) ones are UTF-16, but 
they lack consistency, and none says explicitly which versions of 
MSVCRT implement which standard for wchar_t.  It's probably safest to 
assume UCS-2, for the base version preferred by MinGW.
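
To make the distinction concrete: the two encodings agree for every
code point in the Basic Multilingual Plane, and differ only above
U+FFFF, where UTF-16 uses a surrogate pair and UCS-2 has no
representation at all.  A minimal sketch in portable C follows; the
names `utf16_encode' and `ucs2_representable' are made up for
illustration, and are not part of libiconv or MSVCRT.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 code units.
   Returns the number of units stored in OUT (1 or 2), or 0 if CP is
   not a valid Unicode scalar value.  */
size_t
utf16_encode (uint32_t cp, uint16_t out[2])
{
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;
  if (cp < 0x10000)
    {
      /* BMP: identical to the UCS-2 representation.  */
      out[0] = (uint16_t) cp;
      return 1;
    }
  /* Supplementary planes: split the 20-bit offset into a high and a
     low surrogate.  */
  cp -= 0x10000;
  out[0] = (uint16_t) (0xD800 | (cp >> 10));
  out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));
  return 2;
}

/* UCS-2 has no surrogate mechanism, so only non-surrogate BMP code
   points fit in a single 16-bit unit.  */
int
ucs2_representable (uint32_t cp)
{
  return cp < 0x10000 && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

Every character in the ISO-8859-2 repertoire lies in the BMP, so for
the conversion under discussion either assumption yields the same
wchar_t values.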

> 2) The fact that on your system, the locale encoding for a Slovenian
> locale is CP1252.

No, I think you've misunderstood; on my system, the locale encoding is 
CP1252, for "English_United Kingdom".  I offered Slovenian as one 
example, of many I could have chosen, to explore and illustrate the 
problem I had observed.

> > The language is Slovenian, (although that choice is arbitrary),
> > the codeset is ISO-8859-2, and my woe32 box is configured with a
> > system code page, (which I don't have authority to change), of
> > CP1252. ...
> >   `locale_charset' does
> >
> >         #elif defined WIN32_NATIVE
> >  
> >           static char buf[2 + 10 + 1];
> >  
> >           /* Woe32 has a function returning the locale's
> >            codepage as a number.  */
> >           sprintf (buf, "CP%u", GetACP ());
> >           codeset = buf;
> >  
> >    which results in `tocode' being reassigned as `CP1252'; this
> >    seems somehow perverse
>
>    Indeed. I don't think you will get very far in such a locale. The
> use of "char *" to denote strings in locale-dependent encodings is
> pervasive in Unix and GNU software.

All I'm trying to do is parse an arbitrary byte stream, to step over 
multibyte groupings in any arbitrary input encoding, exactly as Ulrich 
Drepper does, in his `gencat' implementation to accompany the glibc 
implementation of `catgets'.  I'm using libiconv to achieve this, even 
for input codesets for which the system lacks a prepared code page.
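
As a rough illustration of the kind of stepping I mean (this is not
taken from gencat, and not my actual code), a helper can offer iconv a
growing prefix of the input until exactly one character converts.
The name `mbgroup_len' is hypothetical; the sketch assumes a stateless
input encoding, a conversion descriptor whose target can represent
every input character (e.g. UTF-8), and the `char **' form of the
iconv() prototype.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Return the length, in bytes, of the first multibyte group in the
   N bytes at S, as judged by the conversion descriptor CD.
   Returns 0 if the input is invalid, or ends in an incomplete group.  */
size_t
mbgroup_len (iconv_t cd, const char *s, size_t n)
{
  size_t len;

  for (len = 1; len <= n; len++)
    {
      char out[16];
      char *inptr = (char *) s;
      char *outptr = out;
      size_t inleft = len;
      size_t outleft = sizeof out;

      /* Return CD to its initial state before each attempt.  */
      iconv (cd, NULL, NULL, NULL, NULL);

      if (iconv (cd, &inptr, &inleft, &outptr, &outleft) != (size_t) -1)
        return len;             /* one complete character converted */
      if (errno != EINVAL)
        return 0;               /* EILSEQ or similar: invalid input */
      /* EINVAL: the group is incomplete; offer one more byte.  */
    }
  return 0;
}
```

With a descriptor from iconv_open ("UTF-8", "ISO-8859-2"), every group
would be a single byte; the loop only matters for genuinely multibyte
input encodings.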

>    I believe the installation of (proprietary) "language packs", such
> as for Hungarian, will allow you to get a locale with GetACP() =
> CP1250.

And this is precisely what I'm trying to avoid.

> >    begs a couple of questions:--
> >
> >    1a) If neither the `fromcode' nor the `tocode' is related to
> >        the current locale, why do we care what codeset is used
> >        in this locale?  What is the rationale for this change
> >        of `tocode' to the codeset mapped for `GetACP'?
>
> In general, the wchar_t representation is locale dependent. Examples
> are Solaris and FreeBSD.

Ok, understood.

> >    1b) Since `mbrtowc' functions in the context of the process'
> >        active LC_CTYPE, which doesn't even necessarily match the
> >        codeset from `GetACP', (it is more likely to simply be the
> >        "C" locale's portable character set), what is the rationale
> >        for even considering its use in this conversion context?
> >        Surely, it is unlikely to be appropriate.
>
> There is a fundamental assumption between mbrtowc and
> locale_charset(): the "char *" strings that are the input to mbrtowc
> are supposed to be encoded in locale_charset(). On Woe32, the MSVCRT
> library's implementation of mbtowc uses
> MultiByteToWideChar(__lc_codepage,...), where __lc_codepage is set by
> setlocale().

Yes, and it seems that what is set by setlocale() may not be reflected 
in what is returned by GetACP().

> >    2b) ... and is followed by
> >
> >           outcount = cd->ofuncs.xxx_wctomb(cd,outptr,wc,outleft);
> >           if (outcount != RET_ILUNI)
> >                goto outcount_ok;
> >
> >        which invokes `cp1252_wctomb', on the code returned from
> >        `iso8859_2_mbtowc'; in this case, the return value is not
> >        RET_ILUNI
>
> Yes, you cannot get very far when you try to use Slovenian strings in
> a locale whose encoding is CP1252.

Yes, but for my purposes, I can get far enough after applying the patch 
under discussion.

> > Now, observing that my GNU/Linux implementation of GCC *does*
> > define `__STDC_ISO_10646__', whereas the MinGW implementation
> > *does* *not*, suggests a possible workaround for the failing
> > conversion on woe32; by arranging to have this symbol defined, with
> > any non-zero value
>
> Yes, this provides a workaround, limited to libiconv. I prefer to not
> define __STDC_ISO_10646__, because 'wchar_t' is only 16 bits, and
> ISO-10646 consists of many more than 65536 characters.

Ok.  I suggested setting __STDC_ISO_10646__, simply because it seemed 
less invasive than the alternative.  I'd also considered something very 
similar to what you've implemented.  Either achieves exactly the same 
effect, so I'm happy to adopt your preferred format.  I'll roll a new 
mingwPORT, with that included.
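
For reference, the two properties at issue can be checked separately:
whether the implementation declares `__STDC_ISO_10646__', and whether
a single wchar_t is wide enough for every ISO 10646 code point.  A
minimal sketch; the function names are invented for illustration, and
this is not the test used in the patch.

```c
#include <stdint.h>
#include <wchar.h>

/* Nonzero if the implementation declares that wchar_t values are
   ISO 10646 code points.  */
int
wchar_is_declared_iso10646 (void)
{
#ifdef __STDC_ISO_10646__
  return 1;
#else
  return 0;
#endif
}

/* Nonzero if one wchar_t can hold any code point up to U+10FFFF.
   This is false for a 16-bit wchar_t, which is the objection to
   defining __STDC_ISO_10646__ by hand on such a platform.  */
int
wchar_holds_all_unicode (void)
{
  return WCHAR_MAX >= 0x10FFFF;
}
```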

> > I'm less certain in the DJGPP case
>
> DJGPP has an entirely different libc. It doesn't have wchar_t
> functions at all, IIRC. Don't waste your brain cycles on it:

Oh, I wasn't planning to; I just wanted to point out that I don't have 
any experience with DJGPP, so wasn't prepared to attest to expected 
behaviour on that platform, even though I'd simply copied and pasted a 
conditional test incorporating it, from elsewhere in the file.

> DJGPP is not a porting target any more nowadays.

Understood.

Thanks,
Keith.



