bug-coreutils

bug#13362: tr does not work with UTF-8 locales


From: Urs Thuermann
Subject: bug#13362: tr does not work with UTF-8 locales
Date: 05 Jan 2013 12:53:00 +0100
User-agent: Gnus/5.0808 (Gnus v5.8.8) Emacs/20.7

The tr utility from coreutils-8.20 does not handle multi-byte
characters in UTF-8 correctly.  It seems the arguments and standard
input are read byte-by-byte instead of character-by-character.

Here are two examples, using the following UTF-8 characters (which are
also available in latin1, since this is what my mail software still
uses):

        ä (c3 a4), ö (c3 b6), ü (c3 bc), ¼ (c2 bc), ½ (c2 bd)

1. A call to tr -d ü does not delete that two-byte sequence from the
   input; instead it deletes every occurrence of c3 or bc:

    address@hidden:~/coreutils-8.20$ locale
    LANG=C.UTF-8
    LANGUAGE=
    LC_CTYPE="C.UTF-8"
    LC_NUMERIC="C.UTF-8"
    LC_TIME="C.UTF-8"
    LC_COLLATE="C.UTF-8"
    LC_MONETARY="C.UTF-8"
    LC_MESSAGES="C.UTF-8"
    LC_PAPER="C.UTF-8"
    LC_NAME="C.UTF-8"
    LC_ADDRESS="C.UTF-8"
    LC_TELEPHONE="C.UTF-8"
    LC_MEASUREMENT="C.UTF-8"
    LC_IDENTIFICATION="C.UTF-8"
    LC_ALL=
    address@hidden:~/coreutils-8.20$ echo äöü¼|od -tx1
    0000000 c3 a4 c3 b6 c3 bc c2 bc 0a
    0000011
    address@hidden:~/coreutils-8.20$ echo äöü¼|tr -d ü|od -tx1
    0000000 a4 b6 c2 0a
    0000004

2. Replacing the single character ü (c3 bc) with the single character ½
   (c2 bd) instead replaces each c3 with c2 and each bc with bd:

    address@hidden:~/coreutils-8.20$ echo äöü¼|tr ü ½|od -tx1
    0000000 c2 a4 c2 b6 c2 bd c2 bd 0a
    0000011
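
To make the difference concrete, here is a minimal Python sketch that
contrasts byte-wise processing (which reproduces both outputs above) with
the character-wise processing I would expect in a UTF-8 locale.  The
function names are hypothetical and this is not the coreutils
implementation; it only assumes that tr treats each argument as a set of
single bytes, which is what the observed output suggests.

```python
# Sketch only: hypothetical helpers, not taken from the coreutils source.

def tr_delete_bytes(data: bytes, set1: bytes) -> bytes:
    """Delete every byte of data that occurs in set1 (byte-wise 'tr -d')."""
    return bytes(b for b in data if b not in set1)

def tr_translate_bytes(data: bytes, set1: bytes, set2: bytes) -> bytes:
    """Map the i-th byte of set1 to the i-th byte of set2 (equal lengths assumed)."""
    return data.translate(bytes.maketrans(set1, set2))

def tr_delete_chars(text: str, set1: str) -> str:
    """Delete every character of text that occurs in set1 (character-wise 'tr -d')."""
    return "".join(ch for ch in text if ch not in set1)

def tr_translate_chars(text: str, set1: str, set2: str) -> str:
    """Map the i-th character of set1 to the i-th character of set2."""
    return text.translate(str.maketrans(set1, set2))

def hexdump(data: bytes) -> str:
    """Format bytes the way 'od -tx1' shows them, without the offset column."""
    return " ".join(f"{b:02x}" for b in data)

# Escapes are used so the example does not depend on the source encoding.
A_UML, O_UML, U_UML = "\u00e4", "\u00f6", "\u00fc"   # ä, ö, ü
QUARTER, HALF = "\u00bc", "\u00bd"                   # ¼, ½

line = A_UML + O_UML + U_UML + QUARTER + "\n"        # what 'echo äöü¼' emits
raw = line.encode("utf-8")

# Byte-wise: matches the output reported in examples 1 and 2.
print(hexdump(tr_delete_bytes(raw, U_UML.encode("utf-8"))))
# a4 b6 c2 0a
print(hexdump(tr_translate_bytes(raw, U_UML.encode("utf-8"), HALF.encode("utf-8"))))
# c2 a4 c2 b6 c2 bd c2 bd 0a

# Character-wise: the expected result in a UTF-8 locale.
print(hexdump(tr_delete_chars(line, U_UML).encode("utf-8")))
# c3 a4 c3 b6 c2 bc 0a
print(hexdump(tr_translate_chars(line, U_UML, HALF).encode("utf-8")))
# c3 a4 c3 b6 c2 bd c2 bc 0a
```

The byte-wise variants damage ä, ö and ¼ as well, because those characters
share the bytes c3 or bc with ü.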

urs




