bug#38627: uniq -c gets wrong count with non-ascii strings
From: Roy Smith
Subject: bug#38627: uniq -c gets wrong count with non-ascii strings
Date: Mon, 16 Dec 2019 19:46:39 -0500

Yup, this does depend on the locale. In my original example, I had
LANG=en_US.UTF-8. Setting it to C.UTF-8 gets me the right result:
> $ LANG=C.UTF-8 uniq -c x
> 1 "ⁿᵘˡˡ"
> 1 "ܥܝܪܐܩ"
But that doesn't fully explain what's going on. I find it difficult to
believe that there's any collation sequence in the world where those two
strings should compare the same. I've been playing around with the ICU string
compare demo <http://demo.icu-project.org/icu-bin/locexp?_=en_US&d_=en&x=col>
and can't reproduce this there. Possibly I just haven't hit upon the right
combination of options to set, but I think it's far-fetched that there's any
such combination for which those two strings comparing equal is legitimate.
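
For anyone who wants to check the locale dependence on their own system, here is a minimal sketch. It assumes the input file x holds the two quoted strings, one per line, as the output above suggests. count_groups is a hypothetical helper name, not part of coreutils, and whether en_US.UTF-8 merges the two lines depends on the locale data installed locally.

```shell
# Hypothetical helper: print how many groups `uniq -c` reports for a file
# when run under the given locale.
count_groups() {
    LC_ALL="$1" uniq -c "$2" | wc -l | tr -d ' '
}

# Recreate the two-line input (assumed contents, including the quotes).
printf '%s\n' '"ⁿᵘˡˡ"' '"ܥܝܪܐܩ"' > x

# Plain byte comparison: the two lines differ, so two groups are reported.
count_groups C x

# Locale-aware comparison: the result depends on the system's collation data
# for en_US.UTF-8 (if that locale is not installed, uniq falls back to C).
count_groups en_US.UTF-8 x
```

If the second call prints 1 while the first prints 2, the lines are being treated as equal by the locale's collation rules rather than by their bytes.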