bug#38627: uniq -c gets wrong count with non-ascii strings
From: Roy Smith
Subject: bug#38627: uniq -c gets wrong count with non-ascii strings
Date: Tue, 17 Dec 2019 12:25:54 -0500
I stopped short of actually building uniq.c from source (bootstrap,
prerequisites, ...), but from reading the code, the call chain appears to be:
  different()
    xmemcoll()
      memcoll()
        strcoll()
so I tried a little test at the strcoll() level:
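#include <stdio.h>
#include <string.h>

int
main (int argc, char **argv)
{
  /* UTF-8 bytes of "ⁿᵘˡˡ" (superscript letters spelling "null"). */
  unsigned char null[] = {
    0342, 0201, 0277, 0341, 0265, 0230, 0313, 0241, 0313, 0241, 0
  };
  /* UTF-8 bytes of "ܥܝܪܐܩ" (Syriac). */
  unsigned char iraq[] = {
    0334, 0245, 0334, 0235, 0334, 0252, 0334, 0220, 0334, 0251, 0
  };

  printf ("%s\n", null);
  printf ("%s\n", iraq);

  /* strcoll() takes const char *, hence the casts.  Note that there is no
     setlocale() call, so the comparison runs in the "C" locale. */
  int m = strcoll ((const char *) null, (const char *) iraq);
  printf ("m = %d\n", m);
  return 0;
}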
That correctly says the strings are different:
$ LANG=en_US.UTF-8 ./a.out
ⁿᵘˡˡ
ܥܝܪܐܩ
m = 6
> On Dec 16, 2019, at 7:46 PM, Roy Smith <address@hidden> wrote:
>
> Yup, this does depend on the locale. In my original example, I had
> LANG=en_US.UTF-8. Setting it to C.UTF-8 gets me the right result:
>
>> $ LANG=C.UTF-8 uniq -c x
>> 1 "ⁿᵘˡˡ"
>> 1 "ܥܝܪܐܩ"
>
>
> But, that doesn't fully explain what's going on. I find it difficult to
> believe that there's any collation sequence in the world where those two
> strings should compare the same. I've been playing around with the ICU
> string compare demo
> <http://demo.icu-project.org/icu-bin/locexp?_=en_US&d_=en&x=col> and can't
> reproduce this there. Possibly I just haven't hit upon the right combination
> of options to set, but I think it's far-fetched that there's any such
> combination for which those two strings comparing equal is legitimate.
>