I stopped short of actually building uniq.c from source (bootstrap,
prerequisites, ...), but from reading the code, the call chain appears to be:
  different()
    xmemcoll()
      memcoll()
        strcoll()
so I tried a little test at the strcoll() level:
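#include <stdio.h>
#include <string.h>

int
main (int argc, char **argv)
{
  /* UTF-8 bytes (in octal) of the two test strings. */
  unsigned char null[] = {
    0342, 0201, 0277, 0341, 0265, 0230, 0313, 0241, 0313, 0241, 0
  };
  unsigned char iraq[] = {
    0334, 0245, 0334, 0235, 0334, 0252, 0334, 0220, 0334, 0251, 0
  };

  printf ("%s\n", (char *) null);
  printf ("%s\n", (char *) iraq);

  /* Note: there is no setlocale() call, so this strcoll() runs in the
     default "C" locale whatever LANG says. */
  int m = strcoll ((char *) null, (char *) iraq);
  printf ("m = %d\n", m);
  return 0;
}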
That correctly says the strings are different:
$ LANG=en_US.UTF-8 ./a.out
ⁿᵘˡˡ
ܥܝܪܐܩ
m = 6
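
One caveat about the test above: a C program starts in the "C" locale and
only picks up LANG after calling setlocale(), which the test never does. So
strcoll() there most likely fell back to a plain byte comparison; m = 6 is
consistent with that, since it equals the difference of the first bytes
(0342 - 0334). A minimal sketch of a helper that selects a locale before
comparing is below. The name collate_in_locale is made up for illustration,
and what it reports for en_US.UTF-8 depends on the installed locale data.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compare a and b with strcoll() under the named
   locale.  Pass "" to take the locale from the environment (LANG,
   LC_ALL, ...).  Returns 0 and stores the strcoll() result in *result
   on success, or -1 if the locale could not be selected.  */
static int
collate_in_locale (const char *locale, const char *a, const char *b,
                   int *result)
{
  if (setlocale (LC_COLLATE, locale) == NULL)
    return -1;
  *result = strcoll (a, b);
  return 0;
}
```

Calling collate_in_locale ("", (char *) null, (char *) iraq, &m) from the
test program and running it with LANG=en_US.UTF-8 would exercise the same
collation tables that uniq sees.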
On Dec 16, 2019, at 7:46 PM, Roy Smith <address@hidden> wrote:
Yup, this does depend on the locale. In my original example, I had
LANG=en_US.UTF-8. Setting it to C.UTF-8 gets me the right result:
$ LANG=C.UTF-8 uniq -c x
1 "ⁿᵘˡˡ"
1 "ܥܝܪܐܩ"
But, that doesn't fully explain what's going on. I find it difficult to believe that there's
any collation sequence in the world where those two strings should compare the same. I've
been playing around with the ICU string compare demo
<http://demo.icu-project.org/icu-bin/locexp?_=en_US&d_=en&x=col> and can't
reproduce this there. Possibly I just haven't hit upon the right combination of options,
but it seems far-fetched that any combination would legitimately make those two strings
compare equal.
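
One way to check whether a given locale really assigns both strings the same
collation weight is to compare their strxfrm() keys: the C standard defines
strcmp() on two transformed strings to agree in sign with strcoll() on the
originals, so identical keys mean strcoll() returns 0. The sketch below
assumes the caller has already selected the locale with setlocale(); the
helper name same_collation_key is hypothetical.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if a and b have identical strxfrm()
   keys in the current LC_COLLATE locale, 0 if the keys differ, and -1
   if memory could not be allocated.  */
static int
same_collation_key (const char *a, const char *b)
{
  /* With a size of 0, strxfrm() only reports the key length.  */
  size_t na = strxfrm (NULL, a, 0) + 1;
  size_t nb = strxfrm (NULL, b, 0) + 1;
  char *ka = malloc (na);
  char *kb = malloc (nb);
  int same = -1;

  if (ka != NULL && kb != NULL)
    {
      strxfrm (ka, a, na);
      strxfrm (kb, b, nb);
      same = strcmp (ka, kb) == 0;
    }

  free (ka);
  free (kb);
  return same;
}
```

If this returns 1 for the two strings under en_US.UTF-8, the locale's
collation data itself gives them equal keys, which would put the problem in
the locale definition rather than in uniq.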