
[bug-gnu-libiconv] Invalid byte sequences and multibyte encodings


From: Victor Stinner
Subject: [bug-gnu-libiconv] Invalid byte sequences and multibyte encodings
Date: Mon, 09 May 2011 17:47:13 +0200

Hi,

Someone opened an issue in the Python bug tracker asking to change how
invalid multibyte sequences are handled.
http://bugs.python.org/issue12016

b'\xffabc'.decode('gb2312', 'replace') gives "�bc". The 'a' character is
seen as part of a multibyte character of 2 bytes. Because {0xFF, 0x61}
is invalid in GB2312, the two bytes are replaced by U+FFFD.

Is this the "right" thing to do? Or should we ignore/replace only 0xFF
and restart the decoder at 'a', giving "�abc"?
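
To make the two options concrete, here is a minimal sketch of a toy
decoder for a 2-byte encoding. It is not Python's or libiconv's actual
gb2312 code: decode_toy, TABLE and the restart_after_lead flag are names
made up for this illustration, and the table holds a single entry
(0xB0 0xA1, which I believe is the first hanzi in GB2312).

```python
REPLACEMENT = "\ufffd"

# Hypothetical one-entry mapping table, for illustration only.
TABLE = {b"\xb0\xa1": "\u554a"}


def decode_toy(data: bytes, table: dict, restart_after_lead: bool) -> str:
    """Decode a toy encoding: bytes < 0x80 are ASCII, any other byte is a
    lead byte that expects exactly one trail byte."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte.
            out.append(chr(b))
            i += 1
            continue
        pair = data[i:i + 2]
        if len(pair) == 2 and pair in table:
            # Valid 2-byte character.
            out.append(table[pair])
            i += 2
            continue
        # Invalid or truncated sequence: emit one U+FFFD, then either
        # skip only the lead byte or skip the whole 2-byte unit.
        out.append(REPLACEMENT)
        i += 1 if restart_after_lead else 2
    return "".join(out)


# Behaviour described in the issue: the 'a' is swallowed.
print(ascii(decode_toy(b"\xffabc", TABLE, restart_after_lead=False)))
# Proposed behaviour: only 0xFF is replaced, decoding restarts at 'a'.
print(ascii(decode_toy(b"\xffabc", TABLE, restart_after_lead=True)))
```

The only difference between the two policies is how far the input
position advances after an invalid sequence.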

The UTF-8 decoder was recently changed to ignore a single byte and
restart the decoder, so '\xF1\x80\x41\x42\x43' is now decoded as "�ABC"
instead of "�C". Should we do the same for all encodings? Or at least
for Asian encodings (gb2312, gbk, gb18030, the Big5 family, the ISO 2022
family, the JIS family, EUC_KR, CP949, Big5, CP950, ...)?
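
For reference, the UTF-8 case can be checked with the short snippet
below. It assumes a recent CPython 3 release; older versions may give
the previous "�C" result.

```python
# 0xF1 0x80 is the start of a 4-byte UTF-8 sequence, but 0x41 ('A') is
# not a continuation byte, so the sequence is invalid.
data = b"\xF1\x80\x41\x42\x43"

replaced = data.decode("utf-8", "replace")
ignored = data.decode("utf-8", "ignore")

print(ascii(replaced))  # the ASCII bytes 'ABC' survive after one U+FFFD
print(ascii(ignored))   # only 'ABC' is left
```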

I hope this question is not too off-topic for your mailing list.

Victor Stinner
PS: Could you please CC me on your answers?



