bug#38627: uniq -c gets wrong count with non-ascii strings

bug-coreutils

From:	Roy Smith
Subject:	bug#38627: uniq -c gets wrong count with non-ascii strings
Date:	Mon, 16 Dec 2019 19:46:39 -0500

Yup, this does depend on the locale.  In my original example, I had 
LANG=en_US.UTF-8.  Setting it to C.UTF-8 gets me the right result:

> $ LANG=C.UTF-8 uniq -c x
>       1 "ⁿᵘˡˡ"
>       1 "ܥܝܪܐܩ"


But that doesn't fully explain what's going on.  I find it difficult to 
believe that there's any collation sequence in the world under which those 
two strings should compare equal.  I've been playing around with the ICU 
string compare demo 
<http://demo.icu-project.org/icu-bin/locexp?_=en_US&d_=en&x=col> 
and can't reproduce this there.  Possibly I just haven't hit upon the right 
combination of options, but it seems far-fetched that any such combination 
could legitimately make those two strings compare equal.
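
One way to take locale collation out of the picture is to force bytewise 
comparison.  The sketch below assumes GNU coreutils and a POSIX shell; the 
temporary file is a hypothetical stand-in for the file "x" above, and 
"distinct" is just an illustrative variable name.  LC_ALL is used rather than 
LANG because it overrides LANG and every LC_* category.

```shell
# Sketch only: a temporary file stands in for "x" from the report.
sample=$(mktemp)
printf '%s\n' '"ⁿᵘˡˡ"' '"ܥܝܪܐܩ"' > "$sample"

# In the C locale, lines are compared byte by byte, so locale
# collation rules cannot make two different lines look equal.
LC_ALL=C uniq -c "$sample"

# Count how many lines survive de-duplication under bytewise comparison.
distinct=$(LC_ALL=C uniq "$sample" | wc -l)
echo "distinct lines: $distinct"

rm -f "$sample"
```

If the two lines are reported separately here but merged under 
LANG=en_US.UTF-8, the difference would come from that locale's collation 
data rather than from the bytes in the file.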
