bug-coreutils

bug#38627: uniq -c gets wrong count with non-ascii strings


From: Paul Eggert
Subject: bug#38627: uniq -c gets wrong count with non-ascii strings
Date: Mon, 16 Dec 2019 01:41:13 -0800

On 12/15/19 11:40 AM, Roy Smith wrote:
> With the following input:
> 
>> $ cat x
>> "ⁿᵘˡˡ"
>> "ܥܝܪܐܩ"
> 
> 
> Running "uniq -c" says there's two copies of the same line!
> 
>> $ uniq -c x
>>       2 "ⁿᵘˡˡ"

Thanks for the bug report. I expect this is because GNU 'uniq' uses the
equivalent of strcoll (locale-dependent comparison) to compare lines, whereas
macOS 'uniq' uses the equivalent of strcmp (byte comparison). Since the two
lines compare equal in your locale, GNU 'uniq' says there's just one line.

The GNU 'uniq' behavior appears to be a consequence of this commit:

commit 545c2323d493c7ed9c770d9b8e45a15db6f615bc
Author: Jim Meyering <address@hidden>
Date:   Fri Aug 2 14:42:37 2002 +0000

with a change noted this way in NEWS:

* uniq now obeys the LC_COLLATE locale, as per POSIX 1003.1-2001 TC1.

However, the 2016 edition of POSIX removed mention of LC_COLLATE from 'uniq',
and I expect this means that the 2002 commit should be reverted so that GNU
'uniq' behaves like macOS 'uniq' (a behavior that I think makes more sense 
anyway).

I'll CC: this email to Jim Meyering to see whether he has an opinion about this.

In the meantime you can work around the problem by using 'LC_ALL=C uniq' instead
of plain 'uniq' in your shell script.
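
For illustration, here is a minimal sketch of that workaround applied to the
two-line input from the report. The helper name 'count_lines_bytewise' and the
temporary file are only assumptions for the example, not part of coreutils.
With LC_ALL=C, lines are compared byte by byte, so each of the two lines
should be reported with a count of 1 (the exact padding of the count column
may differ between implementations).

```shell
# Hypothetical helper: count adjacent duplicate lines using byte comparison,
# regardless of the locale the caller is running in.
count_lines_bytewise() {
    LC_ALL=C uniq -c -- "$1"
}

# Recreate the two-line input from the bug report in a temporary file.
tmpfile=$(mktemp)
printf '%s\n' '"ⁿᵘˡˡ"' '"ܥܝܪܐܩ"' > "$tmpfile"

# Each distinct line is expected to appear with its own count.
count_lines_bytewise "$tmpfile"

rm -f "$tmpfile"
```

Setting LC_ALL only on the 'uniq' command keeps the rest of the script in the
user's locale, which is usually what you want.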




