From: Patrik Hirvinen
Subject: Cut not working with multi-byte UTF-8 characters
Date: Sun, 09 Jul 2006 17:48:00 +0300
User-agent: Mozilla Thunderbird 1.0.8 (X11/20060502)
Hi,

This bug was found on an Ubuntu 5.10 GNU/Linux x86 system using cut version 5.2.1. The locale used was en_US.UTF-8.
When fed text that includes multi-byte characters, cut assumes that one byte corresponds to one character, even though the locale clearly indicates otherwise.
Attached is an example file containing, in UTF-8, the character U+00E4 ('ä') followed by a newline, or in hexadecimal, "0xc3a40a". "cut -c 1 example.bin" should thus produce 'ä', yet its output is identical to that of "cut -b 1 example.bin", not "cut -b 1-2 example.bin" as it should be.
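To make the difference concrete, here is a minimal Python sketch of byte-based versus character-based selection on the same three bytes as example.bin. The helpers cut_bytes and cut_chars are hypothetical names introduced only for illustration; they are not how cut is implemented, and the sketch assumes the input is valid UTF-8.

```python
# Same contents as example.bin: U+00E4 encoded in UTF-8, then a newline.
data = b"\xc3\xa4\n"


def cut_bytes(line: bytes, start: int, end: int) -> bytes:
    """Hypothetical helper: select bytes start..end (1-based, inclusive)."""
    return line[start - 1:end]


def cut_chars(line: bytes, start: int, end: int, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: select characters start..end (1-based, inclusive).

    The line is decoded first, so a multi-byte character counts once.
    """
    text = line.decode(encoding)
    return text[start - 1:end].encode(encoding)


line = data.rstrip(b"\n")           # process the line without its newline
first_byte = cut_bytes(line, 1, 1)  # only the lead byte of the character
first_char = cut_chars(line, 1, 1)  # the whole two-byte character

print(first_byte)  # b'\xc3'
print(first_char)  # b'\xc3\xa4'
```

Selecting the first byte yields only 0xc3, an incomplete UTF-8 sequence, while selecting the first character yields both bytes, 0xc3 0xa4.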
Thanks,
Patrik Hirvinen
address@hidden
+358-(0)40-7186320
example.bin
Description: Binary data