Re: [bug-gnu-libiconv] Skipping over EILSEQ & EINVAL errors?


From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] Skipping over EILSEQ & EINVAL errors?
Date: Tue, 16 Sep 2008 01:49:10 +0200
User-agent: KMail/1.5.4

Hi,

address@hidden wrote:
> Can anyone explain why the iconv binary can successfully skip over bad
> characters in the input (EILSEQ & EINVAL errors), but the libiconv
> conversion function cannot?

Good question ;-) The answer is: POSIX specified the iconv program in this
way [1], and it specified the iconv function in this way [2].

It is correct that you cannot easily implement the skipping over bad
input, as required for the iconv program, with the iconv() function.
GNU libc has a different internal API that allows this (gconv), and GNU
libiconv another internal API (iconvctl).
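
As a rough illustration of the libiconv-specific route, the sketch below turns on discarding of illegal sequences through iconvctl() with the ICONV_SET_DISCARD_ILSEQ request. The helper name enable_discard_ilseq is made up for this example. The code assumes that <iconv.h> comes from GNU libiconv, which is detected through the _LIBICONV_VERSION macro; with any other iconv implementation it only reports that the feature is unavailable.

```c
#include <iconv.h>

/* Hypothetical helper, not part of any library: ask GNU libiconv to
   silently discard illegal sequences on the descriptor cd.
   Returns 0 on success and -1 if the request failed or if the iconv
   implementation in use is not GNU libiconv.  */
int enable_discard_ilseq(iconv_t cd)
{
#ifdef _LIBICONV_VERSION
    int one = 1;
    /* iconvctl() is a GNU libiconv extension, not a POSIX function.  */
    return iconvctl(cd, ICONV_SET_DISCARD_ILSEQ, &one) == 0 ? 0 : -1;
#else
    (void) cd;          /* no equivalent request in plain POSIX iconv */
    return -1;
#endif
}
```

Code that has to run on other iconv implementations as well cannot rely on this call, which is where a portable wrapper becomes useful.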

But the gnulib module 'striconveh' [3] contains portable code for error
handling with iconv. It supports three error handlers:

/* Handling of unconvertible characters.  */
enum iconv_ilseq_handler
{
  iconveh_error,                /* return and set errno = EILSEQ */
  iconveh_question_mark,        /* use one '?' per unconvertible character */
  iconveh_escape_sequence       /* use escape sequence \uxxxx or \Uxxxxxxxx */
};

> I've been trying to use libiconv to convert CJK files into UTF-8.
> 
> I've noticed that when I run something like this (using the iconv
> binary, from a command line):
> 
> /usr/bin/iconv -f gb2312 [chinese language file encoded as gb2312]

This error often happens when the input is not in GB2312, but in related
encodings such as GBK or GB18030 [4].

> In the code example
> (http://www.gnu.org/software/libc/manual/html_node/iconv-Examples.html),
> hitting an EILSEQ or EINVAL error is cause for stopping processing.

That's only because it's meant to be a _simple_ example :-)

> In my own code, I've tried to lseek forward if I get either of those
> errors, but the iconv function gives no indication of how large the bad
> input is, or where the next "clean" byte is.

Yes, this error handling can be tricky. In particular, skipping just 1 byte
in UTF-16 or UTF-32 encoded input is probably a bad idea.
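
To make the point about skip width concrete, here is a minimal sketch of a skip-and-continue loop that uses only the POSIX iconv() function. The function convert_skipping and its unit parameter are invented for this example. It assumes a stateless source encoding and an ASCII-compatible target such as UTF-8, so that a literal '?' byte is valid output. The caller picks unit as the code unit size of the input: 1 for byte-oriented encodings, 2 for UTF-16, 4 for UTF-32. Skipping one unit is only a heuristic; the conversion may stay out of sync for a few characters afterwards.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert in[0..inlen) through cd into
   out[0..outcap).  On EILSEQ, skip `unit` input bytes, emit one '?'
   and carry on.  On EINVAL (truncated sequence at the end of the
   input), drop the tail.  *skipped counts both kinds of event.
   Returns the number of bytes written, or (size_t)-1 when the output
   buffer is too small or another error occurs.  */
size_t convert_skipping(iconv_t cd, const char *in, size_t inlen,
                        char *out, size_t outcap,
                        size_t unit, size_t *skipped)
{
    char *ip = (char *) in;     /* iconv() takes char **, but only reads */
    char *op = out;
    size_t ileft = inlen, oleft = outcap;

    assert(unit > 0);
    *skipped = 0;
    while (ileft > 0) {
        if (iconv(cd, &ip, &ileft, &op, &oleft) != (size_t) -1)
            break;                      /* everything was converted */
        if (errno == EILSEQ) {
            /* ip now points at the start of the offending sequence. */
            size_t n = unit < ileft ? unit : ileft;
            if (oleft == 0)
                return (size_t) -1;
            *op++ = '?';
            oleft--;
            ip += n;
            ileft -= n;
            ++*skipped;
        } else if (errno == EINVAL) {
            ++*skipped;                 /* incomplete trailing sequence */
            break;
        } else {
            return (size_t) -1;         /* E2BIG or unexpected error */
        }
    }
    return (size_t) (op - out);
}
```

For UTF-8 input a caller would pass unit = 1, for UTF-16 input unit = 2, so that the loop resumes on a code unit boundary.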

Bruno

[1] http://www.opengroup.org/susv3/utilities/iconv.html
[2] http://www.opengroup.org/susv3/functions/iconv.html
[3] http://www.gnu.org/software/gnulib/MODULES.html#module=striconveh
[4] http://www.haible.de/bruno/charsets/conversion-tables/Chinese.html
