Cut not working with multi-byte UTF-8 characters

bug-coreutils

From:	Patrik Hirvinen
Subject:	Cut not working with multi-byte UTF-8 characters
Date:	Sun, 09 Jul 2006 17:48:00 +0300
User-agent:	Mozilla Thunderbird 1.0.8 (X11/20060502)

Hi,

This bug was found on an Ubuntu 5.10 GNU/Linux x86 system using cut version 5.2.1. The locale used was en_US.UTF-8.

When fed text that includes multi-byte characters, cut assumes that one byte corresponds to one character, even though the locale clearly indicates otherwise.

Attached is an example file containing, in UTF-8, the character 'ä' (Unicode code point U+00E4) followed by a newline, or in hexadecimal, "0xc3a40a". "cut -c 1 example.bin" should thus produce 'ä', yet its output is identical to that of "cut -b 1 example.bin" (the lone byte 0xc3), not to that of "cut -b 1-2 example.bin" (both bytes of the character) as it should be.
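
The difference between selecting bytes and selecting characters can be illustrated with a minimal Python sketch. It is not cut's implementation: the helper names cut_bytes and cut_chars are hypothetical, and the input is assumed to be the three bytes described above.

```python
# Assumed contents of example.bin as described in the report:
# U+00E4 encoded in UTF-8 (0xc3 0xa4), followed by a newline (0x0a).
data = b"\xc3\xa4\n"


def cut_bytes(line: bytes, start: int, end: int) -> bytes:
    """Select bytes start..end (1-based, inclusive), in the spirit of `cut -b`."""
    return line[start - 1:end]


def cut_chars(line: bytes, start: int, end: int, encoding: str = "utf-8") -> bytes:
    """Select characters start..end (1-based, inclusive) after decoding the line.

    This is the behaviour the report expects from `cut -c` in a UTF-8 locale.
    """
    text = line.decode(encoding)
    return text[start - 1:end].encode(encoding)


line = data.rstrip(b"\n")          # work on the line without its terminator
first_byte = cut_bytes(line, 1, 1)  # only the lead byte of the sequence
first_char = cut_chars(line, 1, 1)  # the complete two-byte character

print(first_byte.hex())  # c3
print(first_char.hex())  # c3a4
```

Under these assumptions, selecting the first byte yields an incomplete UTF-8 sequence, while selecting the first character yields both bytes of 'ä'.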


Thanks

Patrik Hirvinen
address@hidden
+358-(0)40-7186320

example.bin
Description: Binary data
