bug-coreutils

Re: tr broken with accented chars


From: Eric Blake
Subject: Re: tr broken with accented chars
Date: Fri, 21 Apr 2006 15:57:24 +0000

> I have a problem with tr, version 5.94 :

> I'm using debian with a 100% utf-8 system. It is not an x-term related
> problem (this also occurs in a vt). Quoting the arguments (tr "é" "e")
> does not help.

Thanks for the report.  However, upstream coreutils does not yet
support multi-byte characters.  The TODO file documents the need
for a nice patch that handles multibyte characters cleanly, while
not penalizing speed of strict single-byte locales; and so far, while
several vendors have provided add-on patches that attempt
this, none of them have been considered clean enough to apply
upstream.

> address@hidden:~$ echo hello | tr o a # no problem here
> hella

Even in utf-8, all these characters are single bytes.

> 
> address@hidden:~$ echo hé | tr é e # why do I get 2 'e' ?
> hee

In utf-8, é occupies 2 bytes, but e occupies one, and single-byte
translation is occurring, so this bit from the info pages is relevant:
   "On the other hand, making SET1 longer than SET2 is not portable;
POSIX says that the result is undefined.  In this situation, BSD `tr'
pads SET2 to the length of SET1 by repeating the last character of SET2
as many times as necessary.  System V `tr' truncates SET1 to the length
of SET2."
Thus, both utf-8 bytes of é are being translated into the
expanded SET2 of ee.
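
To make the padding concrete, here is a small Python model of byte-wise
translation.  It is only a sketch: bytewise_tr is a hypothetical helper,
not coreutils code, and it assumes the behavior quoted above of padding
SET2 with its last byte.

```python
def bytewise_tr(data: bytes, set1: bytes, set2: bytes) -> bytes:
    """Model single-byte tr: pad SET2 with its last byte, then map byte by byte."""
    if len(set2) < len(set1):
        set2 = set2 + set2[-1:] * (len(set1) - len(set2))
    table = bytes.maketrans(set1, set2[:len(set1)])
    return data.translate(table)


set1 = "\u00e9".encode("utf-8")  # é -> b'\xc3\xa9', two bytes
set2 = "e".encode("utf-8")       # e -> b'e', one byte

# Equivalent of: echo hé | tr é e
result = bytewise_tr("h\u00e9\n".encode("utf-8"), set1, set2)
print(result)  # b'hee\n'
```

SET2 is padded from "e" to "ee", so 0xC3 and 0xA9 each map to "e".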

> 
> address@hidden:~$ echo hé | tr à a # here tr should do nothing...
> ha(c)
> 

Again, é and à are multibyte, and share a common byte (the lead
byte 0xC3), so with single-byte translation, the common byte is
translated to a, and the remaining byte (0xA9) is passed through
unchanged but now forms an invalid utf-8 sequence.
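
The byte values make this visible.  The following Python sketch is an
illustration rather than tr itself; the translation table is built by
hand, assuming SET2 is padded to "aa" as the info-page text describes.

```python
e_acute = "\u00e9".encode("utf-8")  # é -> b'\xc3\xa9'
a_grave = "\u00e0".encode("utf-8")  # à -> b'\xc3\xa0'
shared = set(e_acute) & set(a_grave)  # bytes common to both encodings

# Byte-wise table for `tr à a`: 0xC3 -> 'a' and 0xA0 -> 'a'.
table = bytes.maketrans(a_grave, b"aa")
out = "h\u00e9".encode("utf-8").translate(table)
print(out)  # b'ha\xa9'

try:
    out.decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False  # 0xA9 is a continuation byte with no lead byte before it
print(valid)  # False

# Read as Latin-1, the stray byte 0xA9 is the copyright sign.
as_latin1 = out.decode("latin-1")
```

The leftover 0xA9 is the copyright sign in Latin-1, which would explain
the "(c)" shown in the reported output.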

-- 
Eric Blake



