From: Seikoh NISHITA
Subject: [bug-gnu-libiconv] iconv incorrectly converts escape characters 0x1b from UTF-8 to ISO-2022-JP
Date: Tue, 24 Mar 2015 11:45:05 +0900

ISO-2022-JP is one of the popular character encoding schemes for email
texts in Japan.
I would like to report an incorrect conversion by iconv involving ISO-2022-JP.

iconv converts the byte value 0x1b (ESC) in UTF-8 text to the same byte
value in ISO-2022-JP output.
This conversion does not follow the specification of ISO-2022-JP.
As a result, round-trip conversion between UTF-8 and ISO-2022-JP
is broken.
Although escape sequences in UTF-8 text look strange, such text might
be generated by software
that unexpectedly accepts escape sequences as user input and
concatenates them with an embedded character sequence.


The following is what I tried with iconv version 1.11 and a terminal
emulator on Mac OS X.

  $ echo -en "\x1b" > a.txt
  $ od -tx1 a.txt
  0000000    1b
  0000001
  $ iconv -f UTF-8 -t ISO-2022-JP a.txt >b.txt
  $ od -tx1 b.txt
  0000000    1b
  0000001
  $
  (the byte value 0x1b in UTF-8 text is converted to the same byte
value in ISO-2022-JP by iconv.)

  $ echo -en "\x1b\x24\x42\x46\x7c" > x.txt
  $ cat x.txt
  BF|
  $ iconv -f UTF-8 -t ISO-2022-JP x.txt > y.txt
  $ iconv -f ISO-2022-JP -t UTF-8 y.txt > z.txt
  $ cat z.txt
  日
  (the round-trip conversion between UTF-8 and ISO-2022-JP fails in this case.)

The last character is the Japanese kanji character Nichi, which can be
found on the following Web page:
   http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=65E5

In fact, x.txt and y.txt contain the same byte sequence, 0x1b 24 42 46 7c.
But the sequence is interpreted differently in UTF-8 and in ISO-2022-JP.

   UTF-8 interpretation:
      ESC(1b)  $(24)  B(42)  F(46)  |(7c)

   ISO-2022-JP interpretation:
      escape sequence (1b 24 42)   Japanese character Nichi (46 7c)
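
The two readings above can be reproduced outside iconv. The following is
only a minimal sketch, assuming Python 3 and its bundled "utf-8" and
"iso2022_jp" codecs; it is not part of the original test and it shows
Python's decoders, not iconv's behaviour. The variable names are made up
for illustration.

```python
# The same five bytes that x.txt and y.txt contain in the report.
data = bytes([0x1B, 0x24, 0x42, 0x46, 0x7C])

# Read as UTF-8 (all bytes are ASCII here): five separate characters,
# the first of which is ESC.
as_utf8 = data.decode("utf-8")

# Read as ISO-2022-JP: ESC $ B switches to the double-byte set,
# and the pair 46 7c is then decoded as a single kanji.
as_iso2022jp = data.decode("iso2022_jp")

print([hex(ord(ch)) for ch in as_utf8])
print([hex(ord(ch)) for ch in as_iso2022jp])
```

The first list has five entries starting with 0x1b; the second has the
single code point U+65E5 (Nichi).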


According to RFC 1468, which defines ISO-2022-JP, the escape character is
used only as the first character of an escape sequence that switches
character sets.

  [Quoted from the "Formal Syntax" section of RFC 1468]

   single-byte-seq     = ESC "(" ( "B" / "J" )

   double-byte-seq     = ESC "$" ( "@" / "B" )

   single-byte-char    = <any 7BIT, including bare CR & bare LF, but NOT
                          including CRLF, and not including ESC, SI, SO>
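
As an illustration of the single-byte-char rule quoted above, the following
sketch lists the code points in a text that the rule excludes (ESC, SI, SO).
It assumes Python 3; the helper name find_forbidden_controls and the table
FORBIDDEN_CONTROLS are hypothetical and are not part of iconv or of any
library. It is meant as one possible pre-conversion check, not as a
description of how iconv should be fixed.

```python
# Control characters that RFC 1468 excludes from single-byte-char.
FORBIDDEN_CONTROLS = {0x1B: "ESC", 0x0E: "SO", 0x0F: "SI"}


def find_forbidden_controls(text):
    """Return (index, name) for each ESC, SO, or SI code point in text."""
    return [
        (index, FORBIDDEN_CONTROLS[ord(ch)])
        for index, ch in enumerate(text)
        if ord(ch) in FORBIDDEN_CONTROLS
    ]


# The contents of x.txt from the report, decoded as UTF-8.
sample = b"\x1b\x24\x42\x46\x7c".decode("utf-8")
print(find_forbidden_controls(sample))
```

For the contents of x.txt this reports one ESC at index 0, which is the
character that cannot be carried over as a single-byte-char.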



-- 
------------------------------------------------------
Seikoh Nishita
Department of Computer Science,
Faculty of Engineering, Takushoku University
815-1, Tate-machi
Hachioji city, Tokyo
193-0985, Japan
Tel: +81-42-665-8529, +81-42-665-1441 (ex. 5308)
Fax: +81-42-665-1519
E-Mail: address@hidden

西田 誠幸 (にした せいこう)
〒193-0985 東京都八王子市館町815-1
拓殖大学工学部情報工学科
Tel: 042-665-8529, 042-665-1441 (ex. 5308)
Fax: 042-665-1519
E-Mail: address@hidden


