bug-coreutils
bug#38627: uniq -c gets wrong count with non-ascii strings


From: Jim Meyering
Subject: bug#38627: uniq -c gets wrong count with non-ascii strings
Date: Tue, 17 Dec 2019 15:10:33 -0800

On Mon, Dec 16, 2019 at 1:41 AM Paul Eggert <address@hidden> wrote:
> On 12/15/19 11:40 AM, Roy Smith wrote:
> > With the following input:
> >
> >> $ cat x
> >> "ⁿᵘˡˡ"
> >> "ܥܝܪܐܩ"
> >
> >
> > Running "uniq -c" says there's two copies of the same line!
> >
> >> $ uniq -c x
> >>       2 "ⁿᵘˡˡ"
>
> Thanks for the bug report. I expect this is because GNU 'uniq' uses the
> equivalent of strcoll (locale-dependent comparison) to compare lines, whereas
> macOS 'uniq' uses the equivalent of strcmp (byte comparison). Since the two
> lines compare equal in your locale, GNU 'uniq' says there's just one line.
>
> The GNU 'uniq' behavior appears to be a consequence of this commit:
>
> commit 545c2323d493c7ed9c770d9b8e45a15db6f615bc
> Author: Jim Meyering <address@hidden>
> Date:   Fri Aug 2 14:42:37 2002 +0000
>
> with a change noted this way in NEWS:
>
> * uniq now obeys the LC_COLLATE locale, as per POSIX 1003.1-2001 TC1.
>
> However, the 2016 edition of POSIX removed mention of LC_COLLATE from 'uniq',
> and I expect this means that the 2002 commit should be reverted so that GNU
> 'uniq' behaves like macOS 'uniq' (a behavior that I think makes more sense 
> anyway).
>
> I'll CC: this email to Jim Meyering to see whether he has an opinion about 
> this.
>
> In the meantime you can work around the problem by using 'LC_ALL=C uniq'
> instead of plain 'uniq' in your shell script.

Thanks for the report, Roy, and thanks Paul for diving in.
I confess I haven't done more than look at that old diff, but this
sure sounds like a bug we must fix, to get in line with the much
more recent POSIX spec.
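
To make the strcoll-vs-strcmp distinction above concrete, here is a minimal
sketch in Python rather than the C that 'uniq' is written in. It is not the
coreutils code: count_runs, same_bytes and same_collation are hypothetical
names chosen for illustration, and locale.strcoll merely stands in for the
locale-dependent comparison Paul describes. Whether the two sample lines
collate as equal depends on the locale and libc in use, so no output is
claimed for that case.

```python
import locale


def count_runs(lines, same):
    """Collapse adjacent lines that `same` treats as equal, like `uniq -c`.

    Returns a list of (count, first_line_of_run) tuples.
    """
    runs = []
    for line in lines:
        if runs and same(runs[-1][1], line):
            # Same run: bump the count, keep the first line as representative.
            runs[-1] = (runs[-1][0] + 1, runs[-1][1])
        else:
            runs.append((1, line))
    return runs


def same_bytes(a, b):
    """strcmp-style equality: compare the encoded bytes only."""
    return a.encode("utf-8") == b.encode("utf-8")


def same_collation(a, b):
    """strcoll-style equality: defer to the current LC_COLLATE locale."""
    return locale.strcoll(a, b) == 0


if __name__ == "__main__":
    try:
        # Adopt the user's locale, as a locale-aware tool would.
        locale.setlocale(locale.LC_COLLATE, "")
    except locale.Error:
        pass  # Unsupported locale: stay in the default "C" locale.

    lines = ['"ⁿᵘˡˡ"', '"ܥܝܪܐܩ"']
    print("byte comparison:  ", count_runs(lines, same_bytes))
    print("locale comparison:", count_runs(lines, same_collation))
```

The byte comparison always reports two distinct lines. The locale comparison
reports whatever the collation tables say, which is where a count of 2 for a
single line can come from.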