[bug-gnu-libiconv] Invalid byte sequences and multibyte encodings

From:	Victor Stinner
Subject:	[bug-gnu-libiconv] Invalid byte sequences and multibyte encodings
Date:	Mon, 09 May 2011 17:47:13 +0200

Hi,

Someone opened an issue in the Python bug tracker asking to change how
invalid multibyte sequences are handled:
http://bugs.python.org/issue12016

b'\xffabc'.decode('gb2312', 'replace') gives "�bc". The 'a' byte is
treated as the second byte of a 2-byte multibyte character. Because
{0xFF, 0x61} is invalid in GB2312, both bytes are replaced by a single
U+FFFD.

Is this the "right" thing to do? Or should we ignore/replace only 0xFF
and restart the decoder at 'a', giving "�abc"?
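
To make the two options concrete, here is a minimal sketch of both
resynchronization policies. It is not how Python's or libiconv's decoders
are implemented: decode_two_byte and its skip_pair flag are hypothetical
names used only for illustration, and the sketch assumes a pure 2-byte
encoding where every byte below 0x80 is ASCII. It only uses the standard
gb2312 codec in strict mode to check whether a byte pair is valid.

```python
REPLACEMENT = "\ufffd"


def decode_two_byte(data: bytes, skip_pair: bool, encoding: str = "gb2312") -> str:
    """Toy decoder (hypothetical helper) comparing two error policies.

    skip_pair=True : an invalid pair consumes both bytes.
    skip_pair=False: only the first byte is consumed, then decoding restarts.
    """
    out = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # Assumption: bytes below 0x80 are plain ASCII.
            out.append(chr(data[i]))
            i += 1
            continue
        pair = data[i:i + 2]
        try:
            # Strict decoding raises on an invalid or truncated pair.
            ch = pair.decode(encoding) if len(pair) == 2 else None
        except UnicodeDecodeError:
            ch = None
        if ch is not None and len(ch) == 1:
            out.append(ch)
            i += 2
        else:
            out.append(REPLACEMENT)
            i += 2 if skip_pair else 1
    return "".join(out)


print(ascii(decode_two_byte(b"\xffabc", skip_pair=True)))   # '\ufffdbc'
print(ascii(decode_two_byte(b"\xffabc", skip_pair=False)))  # '\ufffdabc'
```

With skip_pair=True the 'a' is lost together with 0xFF; with
skip_pair=False only the offending byte is replaced and the ASCII text
that follows survives.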

The UTF-8 decoder was recently changed to replace only the invalid
bytes and restart the decoder right after them, so
'\xF1\x80\x41\x42\x43' is now decoded as "�ABC" instead of "�C".
Should we do the same for all encodings? Or at least for Asian encodings
(gb2312, gbk, gb18030, big5 family, ISO 2022 family, JIS family, EUC_KR,
CP949, Big5, CP950, ...)?
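
For reference, the UTF-8 behaviour described above can be checked with a
short snippet. This is only a sketch and assumes a recent CPython 3; the
variable names are mine.

```python
# 0xF1 starts a 4-byte UTF-8 sequence and 0x80 is a valid continuation
# byte, but 0x41 ('A') is not, so the sequence is cut short.
data = b"\xF1\x80\x41\x42\x43"

replaced = data.decode("utf-8", "replace")
ignored = data.decode("utf-8", "ignore")

print(ascii(replaced))  # '\ufffdABC'
print(ascii(ignored))   # 'ABC'
```

The truncated sequence 0xF1 0x80 is replaced by a single U+FFFD and the
decoder resumes at 'A', so no valid ASCII byte is swallowed.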

I hope this question is not too off-topic for your mailing list.

Victor Stinner
PS: Can you please CC me on your answers?

Current Thread

- [bug-gnu-libiconv] Invalid byte sequences and multibyte encodings, Victor Stinner
- Re: [bug-gnu-libiconv] Invalid byte sequences and multibyte encodings, Bruno Haible, 2011/05/10