bug#13362: tr does not work with UTF-8 locales
From: Urs Thuermann
Subject: bug#13362: tr does not work with UTF-8 locales
Date: 05 Jan 2013 12:53:00 +0100
User-agent: Gnus/5.0808 (Gnus v5.8.8) Emacs/20.7
The tr utility from coreutils-8.20 does not handle multi-byte
characters in UTF-8 correctly. It seems the arguments and standard
input are read byte-by-byte instead of character-by-character.
Here are two examples, using the following UTF-8 characters (which are
also available in latin1, since this is what my mail software still
uses):
ä (c3 a4), ö (c3 b6), ü (c3 bc), ¼ (c2 bc), ½ (c2 bd)
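The byte sequences listed above can be double-checked with a short Python sketch. Python is not part of the original report; it is used here only as a neutral way to show the encodings, and the helper names utf8_hex and latin1_hex are made up for this illustration.

```python
# Code points of the five characters used in the report: ä ö ü ¼ ½
CHARS = ["\u00e4", "\u00f6", "\u00fc", "\u00bc", "\u00bd"]

def utf8_hex(ch):
    """Return the UTF-8 encoding of ch as space-separated hex bytes."""
    return ch.encode("utf-8").hex(" ")

def latin1_hex(ch):
    """Return the single latin1 byte of ch as hex."""
    return ch.encode("latin-1").hex()

for ch in CHARS:
    # Print the code point rather than the character itself, so the
    # output does not depend on the terminal encoding.
    print(f"U+{ord(ch):04X}  utf-8: {utf8_hex(ch)}  latin1: {latin1_hex(ch)}")
```

Each character is a single byte in latin1 but two bytes in UTF-8, which is why byte-wise processing only goes wrong in the UTF-8 case.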
1. A call to tr -d ü does not delete that two-byte sequence from the
input but deletes every occurrence of the byte c3 or the byte bc:
address@hidden:~/coreutils-8.20$ locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=
address@hidden:~/coreutils-8.20$ echo äöü¼|od -tx1
0000000 c3 a4 c3 b6 c3 bc c2 bc 0a
0000011
address@hidden:~/coreutils-8.20$ echo äöü¼|tr -d ü|od -tx1
0000000 a4 b6 c2 0a
0000004
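The output above is consistent with the argument being treated as a set of two independent bytes. The following Python sketch models that reading next to a character-wise deletion. It is only an illustration of the suspected behaviour, not the tr source code, and the function names delete_bytewise and delete_charwise are hypothetical.

```python
def delete_bytewise(text, victim):
    """Delete every byte that occurs in the UTF-8 encoding of victim.

    This models the behaviour observed in the report (an assumption
    about what tr does, not its actual implementation).
    """
    data = text.encode("utf-8")
    return data.translate(None, victim.encode("utf-8"))

def delete_charwise(text, victim):
    """Delete victim as a whole character, then encode as UTF-8."""
    return text.replace(victim, "").encode("utf-8")

line = "\u00e4\u00f6\u00fc\u00bc\n"  # the input of the example: ä ö ü ¼ newline

print(delete_bytewise(line, "\u00fc").hex(" "))  # a4 b6 c2 0a
print(delete_charwise(line, "\u00fc").hex(" "))  # c3 a4 c3 b6 c2 bc 0a
```

The first line reproduces the od output shown above; the second is what a character-aware tr -d would be expected to produce.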
2. Replacing the single character ü (c3 bc) by the single character ½
(c2 bd) instead replaces each c3 by c2 and each bc by bd:
address@hidden:~/coreutils-8.20$ echo äöü¼|tr ü ½|od -tx1
0000000 c2 a4 c2 b6 c2 bd c2 bd 0a
0000011
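The same byte-wise reading explains this result: the two arguments behave like the byte sets {c3, bc} and {c2, bd}, mapped position by position. A minimal Python sketch of that model follows, again as an assumption about the observed behaviour rather than the actual tr code; translate_bytewise and translate_charwise are hypothetical names. The byte-wise helper relies on both arguments encoding to the same number of bytes, which holds for ü and ½.

```python
def translate_bytewise(text, src, dst):
    """Map each byte of src to the byte at the same position in dst.

    Requires src and dst to have UTF-8 encodings of equal length
    (bytes.maketrans raises ValueError otherwise).
    """
    table = bytes.maketrans(src.encode("utf-8"), dst.encode("utf-8"))
    return text.encode("utf-8").translate(table)

def translate_charwise(text, src, dst):
    """Map whole characters of src to dst, then encode as UTF-8."""
    return text.translate(str.maketrans(src, dst)).encode("utf-8")

line = "\u00e4\u00f6\u00fc\u00bc\n"  # ä ö ü ¼ newline

print(translate_bytewise(line, "\u00fc", "\u00bd").hex(" "))
# c2 a4 c2 b6 c2 bd c2 bd 0a
print(translate_charwise(line, "\u00fc", "\u00bd").hex(" "))
# c3 a4 c3 b6 c2 bd c2 bc 0a
```

The byte-wise result matches the od output above, including the corrupted ä and ö and the unintended change of ¼ into ½. The character-wise result changes only the ü.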
urs