
[bug-gnu-libiconv] Bug: Codeset to wchar_t fails unexpectedly on Woe32


From: Keith Marshall
Subject: [bug-gnu-libiconv] Bug: Codeset to wchar_t fails unexpectedly on Woe32
Date: Sun, 22 Apr 2007 14:52:19 +0100
User-agent: KMail/1.8.2

[Report to bug-gnu-libiconv; copy to MinGW-users for info]

I've built libiconv-1.11 on woe32, using the MinGW build of
gcc-3.4.5, and the MSYS build tools from the MinGW project.

The good news is that it builds OOTB, and `make check' appears
to complete successfully, (although it would be nice if the
result of each test was confirmed by printing `ok' for each
successful outcome).

The bad news is that the implementation appears to be broken
WRT codeset to wchar_t conversions, which incorrectly report
EILSEQ errors when codeset != active system code page.

What follows is a fairly extensive, (and quite long), analysis
of the problem.  I believe I have identified a possible
workaround, although not a definitive solution, and would
welcome comments.

Here's an example, just one of many, taken from a parse of
the message catalogue sources provided with man-1.6; (I'm not
familiar with the language here; it just happens to be the
snippet around the first point of failure, in the residual
intermediate file left over from a successful build of the
entire set of available catalogues, on my GNU/Linux box).

   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>
   #include <sys/types.h>
   #include <iconv.h>
   #include <errno.h>

   #ifndef ICONV_CONST
   #define ICONV_CONST
   #endif
   #define ICONV_CAST ICONV_CONST char **

   int main()
   {
     char *inptr;
     iconv_t mc = iconv_open( "wchar_t", "iso-8859-2" );
     /* the sixth byte is 0xe8; it displays as `è' under CP1252 */
     char *input_string = "ni mo\xe8 odpreti\\n";
     int inlen = strlen( input_string );
     wchar_t conv;
     size_t skip;

     if( mc == (iconv_t)(-1) )
     {
       perror( "iconv_open" );
       return 1;
     }

     for( inptr = input_string; inlen > 0; inptr += skip )
     {
       char *ptr = inptr;
       size_t convlen = sizeof( conv );
       wchar_t *convptr = &conv;
       size_t probe = 0;

       /* offer 1, 2, ... input bytes, until one wide char converts */
       do { size_t try = ++probe;
            skip = iconv( mc, (ICONV_CAST)(&ptr), &try,
                          (char **)(&convptr), &convlen );
          }
       while( (skip == (size_t)(-1)) && (errno == EINVAL)
                    && (probe < (size_t)(inlen)) );

       if( skip == (size_t)(-1) )
         perror( "iconv" );

       /* step over whatever was consumed, or one byte on failure */
       skip = (ptr == inptr) ? (size_t)(1) : ptr - inptr;
       inlen -= (int)(skip);
     }
     iconv_close( mc );
     return 0;
   }

The language is Slovenian, (although that choice is arbitrary),
the codeset is ISO-8859-2, and my woe32 box is configured with a
system code page, (which I don't have authority to change), of
CP1252.  The sample text, defined as `input_string' appears
near the end of the second message defined in the `mess.sl'
file, in the man-1.6 distribution, and the fault occurs at the
sixth byte in that string.
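
As a quick cross-check of that byte position, here is a minimal
sketch which locates the first byte outside the ASCII range in a
string.  The helper name `first_non_ascii_byte' is hypothetical; it
is not part of libiconv or of the test case above.

```c
#include <assert.h>
#include <stddef.h>

/* Return the 1-based position of the first byte outside the ASCII
   range, or 0 if every byte is ASCII.  (Hypothetical helper, for
   illustration only.)  */
size_t first_non_ascii_byte(const char *s)
{
  size_t pos;

  for (pos = 0; s[pos] != '\0'; pos++)
    if ((unsigned char) s[pos] >= 0x80)
      return pos + 1;
  return 0;
}
```

Called on the sample text, with the 0xe8 byte written as an explicit
`\xe8' escape, this should report position 6, which is where the
conversion fails.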

I've built libiconv with CFLAGS='-g -O0', and compiled the above
sample code with

  gcc -g -O0 -otestcase -DICONV_CONST=const testcase.c -liconv

so I can trace it effectively in GDB, where I observe:--

1) In `iconv_open', the `tocode' is initially (correctly)
   identified as `wchar_t'; this causes the invocation of

        #if HAVE_MBRTOWC
              to_wchar = 1;
              tocode = locale_charset();
              continue;
        #endif

   and `locale_charset' does

        #elif defined WIN32_NATIVE

          static char buf[2 + 10 + 1];

          /* Woe32 has a function returning the locale's
             codepage as a number.  */
          sprintf (buf, "CP%u", GetACP ());
          codeset = buf;

   which results in `tocode' being reassigned as `CP1252'; this
   seems somehow perverse, and raises a couple of questions:--

   1a) If neither the `fromcode' nor the `tocode' is related to
       the current locale, why do we care what codeset is used
       in this locale?  What is the rationale for this change
       of `tocode' to the codeset mapped for `GetACP'?

   1b) Since `mbrtowc' functions in the context of the process'
       active LC_CTYPE, which doesn't even necessarily match the
       codeset from `GetACP', (it is more likely to simply be the
       "C" locale's portable character set), what is the rationale
       for even considering its use in this conversion context?
       Surely, it is unlikely to be appropriate.

2) In `iconv', (actually `libiconv'), for each of the first five
   bytes of the sample text, a successful conversion is obtained,
   with the result being the zero extended wchar_t representation,
   with identical numeric value to the original byte; conversion
   is correctly achieved in `iso8859_2_mbtowc' invoked indirectly
   by `unicode_loop_convert' via `wchar_to_loop_convert'.

   On return from `iso8859_2_mbtowc', in `unicode_loop_convert',
   I then see:

   2a) A check, to confirm that the conversion has not overrun
       an internal buffer; this succeeds...

   2b) ... and is followed by

          outcount = cd->ofuncs.xxx_wctomb(cd,outptr,wc,outleft);
          if (outcount != RET_ILUNI)
               goto outcount_ok;

       which invokes `cp1252_wctomb', on the code returned from
       `iso8859_2_mbtowc'; in this case, the return value is not
       RET_ILUNI, and control transfers to `outcount_ok', but to
       call `cp1252_wctomb' in this context does seem somewhat
       dubious, for the reasons given in (3b) and (3c) below.

   2c) Following the jump to `outcount_ok', control returns to
       `wchar_to_loop_convert', where, after a validity check,
       I see:

          /* Successful conversion. */
          size_t bufcount = bufptr-buf; /* = BUF_SIZE-bufleft */
          mbstate_t state = wcd->state;
          wchar_t wc;
          res = mbrtowc(&wc,buf,bufcount,&state);
          if (res == (size_t)(-2)) ...

       which also seems questionable.  It clearly is trying to
       check if the current input byte is a possible lead byte
       in a multibyte sequence, but by using `mbrtowc', it is
       doing so WRT a codeset which may not match the input,
       (and does not, in the case in question); thus, is the
       result valid, or in any way useful?

3) Action (2) repeats, successfully converting each of the first
   five bytes of `input_string', before arriving at the sixth byte,
   (input code `0xe8'; this is `č' in ISO-8859-2, although it
   displays as the accented `è' under CP1252), which is subjected
   to the same sequence of conversions as above, with the results:

   3a) `iso8859_2_mbtowc' returns a wchar_t conversion, with a
       value of `0x10d'; AFAICT, this is the correct value, and
       it is completely valid, in the input codeset.

   3b) As in (2b), `unicode_loop_convert' passes this `0x10d'
       code to `cp1252_wctomb', which recoils in horror, can't
       find a suitable representation in CP1252, and returns
       RET_ILUNI.

   3c) `unicode_loop_convert' now inspects the return code from
       `cp1252_wctomb', sees it was RET_ILUNI, and (incorrectly)
       decrees that the input byte was invalid; (it wasn't, but it
       definitely seems that the test performed on it was).  At
       this point, `unicode_loop_convert' gives up, sets `errno'
       to EILSEQ, immediately returns (size_t)(-1), and it's
       "Goodnight Vienna".

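The failure described in (2) and (3) can be reproduced in miniature.
What follows is only a sketch, under stated assumptions: the two
functions are hypothetical, deliberately partial stand-ins for
`iso8859_2_mbtowc' and `cp1252_wctomb', (they are not the libiconv
implementations), and the SKETCH_* return codes are invented for
this illustration.

```c
#include <assert.h>

#define SKETCH_ILUNI  (-1)  /* no representation in the target codeset */
#define SKETCH_ILSEQ  (-3)  /* input byte not covered by this sketch */

/* Partial ISO-8859-2 decoder: bytes 0x00..0x9F map to the code point
   of the same value, and 0xE8 maps to U+010D; all other bytes are
   left out of this sketch.  */
long sketch_iso8859_2_to_ucs(unsigned char b)
{
  if (b < 0xA0)
    return b;
  if (b == 0xE8)
    return 0x010D;
  return SKETCH_ILSEQ;
}

/* Partial CP1252 encoder: ASCII and U+00A0..U+00FF map to the byte
   of the same value, and U+20AC maps to 0x80.  The remaining CP1252
   characters in 0x80..0x9F are omitted, so the sketch may reject
   some code points which real CP1252 can represent.  */
long sketch_ucs_to_cp1252(long wc)
{
  if (wc >= 0 && wc < 0x80)
    return wc;
  if (wc >= 0xA0 && wc <= 0xFF)
    return wc;
  if (wc == 0x20AC)
    return 0x80;
  return SKETCH_ILUNI;
}

/* Mimic the questionable chain: decode from the input codeset, then
   insist that the result is also representable in the locale
   codeset.  */
long sketch_via_locale_codeset(unsigned char b)
{
  long wc = sketch_iso8859_2_to_ucs(b);

  if (wc < 0)
    return wc;
  return sketch_ucs_to_cp1252(wc);
}
```

With these stand-ins, the ASCII bytes pass straight through, while
0xE8 decodes to U+010D and is then rejected by the CP1252 step, even
though nothing was wrong with the input.
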
Now, if I repeat all of the above, but on my GNU/Linux box, (Ubuntu
6.06 with GCC-4.0.3), I see completely different, and entirely more
reasonable behaviour.  This system defines `__STDC_ISO_10646__', and
this code fragment, appearing in `iconv_open' immediately before the
fragment shown in (1)...

   #if __STDC_ISO_10646__
      if (sizeof(wchar_t) == 4) {
        to_index = ei_ucs4internal;
        break;
      }
      if (sizeof(wchar_t) == 2) {
        to_index = ei_ucs2internal;
        break;
      }
      if (sizeof(wchar_t) == 1) {
        to_index = ei_iso8859_1;
        break;
      }
   #endif

... prevents control from ever reaching those former questionable
statements; consequently, the converter control struct is configured
differently, and instead of the sequence described in (2) and (3),
I now see:--

4) Instead of invoking `unicode_loop_convert' indirectly, by way
   of a call to `wchar_to_loop_convert', `iconv' now passes control
   directly to `unicode_loop_convert', where:--

   4a) `iso8859_2_mbtowc' is again called, to get the wchar_t code
       for the input byte.

   4b) A similar check to that of (2a) is performed, then...

   4c) ... we again progress to the

          outcount = cd->ofuncs.xxx_wctomb(cd,outptr,wc,outleft);
          if (outcount != RET_ILUNI)
               goto outcount_ok;

       step; however, on this occasion `cd->ofuncs.xxx_wctomb' is
       mapped, not to anything associated with the current locale,
       but to `ucs4internal_wctomb'.  This has no problem with the
       wide character code generated by (4a), even for the case
       which is the analogue of (3b), and all is well.
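
To illustrate why the output step in (4c) has no difficulty, here is
a minimal sketch of a "store the code point as a native-endian unit"
output function.  It is not the libiconv source; the function name,
the `width' parameter, and the SKETCH_* return codes are assumptions
made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ILUNI     (-1)  /* code point does not fit the unit */
#define SKETCH_TOOSMALL  (-2)  /* not enough room in the output */

/* Store `wc' at `out' as one native-endian unit of `width' bytes,
   (2 or 4), and return the number of bytes written, or a negative
   SKETCH_* code.  No locale or code page is consulted.  */
int sketch_store_internal(unsigned char *out, size_t outleft,
                          unsigned long wc, size_t width)
{
  if (width == 4) {
    uint32_t unit;

    if (wc > 0x7FFFFFFFUL)
      return SKETCH_ILUNI;
    if (outleft < sizeof unit)
      return SKETCH_TOOSMALL;
    unit = (uint32_t) wc;
    memcpy(out, &unit, sizeof unit);
    return (int) sizeof unit;
  }
  if (width == 2) {
    uint16_t unit;

    /* a single 16-bit unit cannot hold surrogates or anything
       beyond the BMP */
    if (wc > 0xFFFFUL || (wc >= 0xD800UL && wc < 0xE000UL))
      return SKETCH_ILUNI;
    if (outleft < sizeof unit)
      return SKETCH_TOOSMALL;
    unit = (uint16_t) wc;
    memcpy(out, &unit, sizeof unit);
    return (int) sizeof unit;
  }
  return SKETCH_ILUNI;
}
```

In this sketch, U+010D is stored without complaint at either width,
because the only constraint is the size of the unit, not the
repertoire of any particular code page.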

Now, the observation that my GNU/Linux implementation of GCC *does*
define `__STDC_ISO_10646__', whereas the MinGW implementation *does*
*not*, suggests a possible workaround for the failing conversion on
woe32: arrange to have this symbol defined, with any non-zero value,
either by patching MinGW's own `_mingw.h', (which works around the
problem only for MinGW builds), or, (for a slightly more general woe32
or MS-DOS solution), by an `#ifdef _WIN32' guarded conditional
definition within the libiconv source, e.g.

--- old/libiconv-1.11/lib/iconv.c   2006-01-23 13:16:12.000000000 +0000
+++ new/libiconv-1.11/lib/iconv.c   2007-04-22 14:05:09.000000000 +0100
@@ -18,6 +18,13 @@
  * Fifth Floor, Boston, MA 02110-1301, USA.
  */

+#if !defined(__STDC_ISO_10646__) \
+ && ((defined(_WIN32) && (defined(_MSC_VER) || defined(__MINGW32__))) \
+     || defined(__DJGPP__) \
+    )
+#     define __STDC_ISO_10646__   200009L
+#endif
+
 #include <iconv.h>

 #include <stdlib.h>

This causes the behaviour on woe32 to follow the GNU/Linux behaviour
much more closely, with `ucs2internal_wctomb' substituted for the
`ucs4internal_wctomb' call of the GNU/Linux case; the valid code
sequence of the above example, and those of the many other examples
found in the same set of man-1.6 message catalogue sources, then
pass correctly through the converter.
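
To see which branch a given toolchain would take, the selection
logic quoted from `iconv_open' can be approximated by a small probe.
This is a sketch only: the function name and the returned strings
are hypothetical labels, not identifiers taken from libiconv.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Report which target a build would select for "wchar_t", following
   the shape of the `iconv_open' fragment: a fixed Unicode form when
   `__STDC_ISO_10646__' is defined and sizeof(wchar_t) is 4, 2 or 1,
   otherwise the fall back to the locale charset.  */
const char *sketch_wchar_target(void)
{
#if defined(__STDC_ISO_10646__)
  if (sizeof(wchar_t) == 4)
    return "ucs4internal";
  if (sizeof(wchar_t) == 2)
    return "ucs2internal";
  if (sizeof(wchar_t) == 1)
    return "iso8859_1";
#endif
  return "locale charset";
}
```

Printing the result of `sketch_wchar_target ()' from a program built
with and without the symbol defined should show whether the locale
charset fall back is still reachable for that build.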

I'm less certain in the DJGPP case, but I think this is a reasonable
workaround for the woe32 cases, since AIUI the `wchar_t' of woe32,
for versions up to W2K, is UCS-2, and for WinXP and later it is UTF-16,
both of which are conformant with ISO-10646.  Of course, it doesn't
help in any more general case, where the potential reference to the
locale charset established in (1) still seems dubious.
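
On the point raised in (1b), ISO C specifies that a program starts
in the "C" locale, and stays there until it calls `setlocale'; that
is the context in which `mbrtowc' would then operate.  A minimal
sketch of a check for this follows; the helper name is hypothetical.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Return non-zero when the LC_CTYPE category of the process is
   still the "C" locale.  Passing NULL to setlocale only queries the
   current setting; it does not change it.  */
int sketch_ctype_is_c_locale(void)
{
  const char *name = setlocale(LC_CTYPE, NULL);

  return name != NULL && strcmp(name, "C") == 0;
}
```

In a program such as the test case above, which never calls
`setlocale', this should return non-zero, regardless of the system
code page reported by `GetACP'.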

Regards,
Keith.



