bug-coreutils
bug#38627: uniq -c gets wrong count with non-ascii strings


From: Roy Smith
Subject: bug#38627: uniq -c gets wrong count with non-ascii strings
Date: Tue, 17 Dec 2019 12:25:54 -0500

I stopped short of actually building uniq.c from source (bootstrap, 
prerequisites, ...), but from reading the code, the call chain appears to be:

different()
xmemcoll()
memcoll()
strcoll()

so I tried a little test at the strcoll() level:

#include <stdio.h>
#include <string.h>

int
main (void)
{
  /* UTF-8 bytes (octal) of the two strings from the report.  */
  unsigned char null[] = {
    0342, 0201, 0277, 0341, 0265, 0230, 0313, 0241, 0313, 0241, 0
  };
  unsigned char iraq[] = {
    0334, 0245, 0334, 0235, 0334, 0252, 0334, 0220, 0334, 0251, 0
  };

  printf("%s\n", (const char *) null);
  printf("%s\n", (const char *) iraq);

  /* No setlocale() call here, so strcoll() compares in the "C" locale.  */
  int m = strcoll((const char *) null, (const char *) iraq);
  printf("m = %d\n", m);
  return 0;
}

That correctly says the strings are different:

$ LANG=en_US.UTF-8 ./a.out
ⁿᵘˡˡ
ܥܝܪܐܩ
m = 6






> On Dec 16, 2019, at 7:46 PM, Roy Smith <address@hidden> wrote:
> 
> Yup, this does depend on the locale.  In my original example, I had 
> LANG=en_US.UTF-8.  Setting it to C.UTF-8 gets me the right result:
> 
>> $ LANG=C.UTF-8 uniq -c x
>>       1 "ⁿᵘˡˡ"
>>       1 "ܥܝܪܐܩ"
> 
> 
> But, that doesn't fully explain what's going on.  I find it difficult to 
> believe that there's any collation sequence in the world where those two 
> strings should compare the same.  I've been playing around with the ICU 
> string compare demo 
> <http://demo.icu-project.org/icu-bin/locexp?_=en_US&d_=en&x=col> and can't 
> reproduce this there.  Possibly I just haven't hit upon the right combination 
> of options to set, but I think it's far-fetched that there's any such 
> combination for which those two strings comparing equal is legitimate.
> 


