coreutils
Re: Sort order issue


From: Eric Blake
Subject: Re: Sort order issue
Date: Wed, 13 Jan 2016 08:22:18 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.5.0

On 01/13/2016 07:44 AM, Salvador Girbau wrote:
> Hello,
> (apologies in advance for a possibly repeated bug report -- it is hard
> to check for duplicates on collating sequence bugs.)
> 
> The following bash script illustrates the issue.
> Thanks
> 

> cat << EOF > c          # like 'b', but with the digits 123 instead of 111
> a...123
> b...123
> b1..123
> ZZ..123
> EOF
> 
> echo "Is 'a' sorted?"
> sort a | diff - a       # no differences
> echo "Is 'b' sorted?"
> sort b | diff - b       # no differences
> echo "Is 'c' sorted?"
> sort c | diff - c       # differences! why?

'sort --debug' is (usually) your friend, although in this case it
doesn't show much:

$ sort --debug b
sort: using ‘en_US.UTF-8’ sorting rules
a...111
_______
b...111
_______
b1..111
_______
ZZ..111
_______

$ sort --debug c
sort: using ‘en_US.UTF-8’ sorting rules
a...123
_______
b1..123
_______
b...123
_______
ZZ..123
_______

So let's look deeper:

$ ltrace -e strcoll sort b
sort->strcoll("b1..111", "ZZ..111")              = -24
sort->strcoll("a...111", "b...111")              = -1
sort->strcoll("a...111", "b1..111")              = -1
a...111
sort->strcoll("b...111", "b1..111")              = -1
b...111
b1..111
ZZ..111
+++ exited (status 0) +++

$ ltrace -e strcoll sort c
sort->strcoll("b1..123", "ZZ..123")              = -24
sort->strcoll("a...123", "b...123")              = -1
sort->strcoll("a...123", "b1..123")              = -1
a...123
sort->strcoll("b...123", "b1..123")              = 1
b1..123
sort->strcoll("b...123", "ZZ..123")              = -24
b...123
ZZ..123
+++ exited (status 0) +++


> 
> # issue worked around by exporting LC_ALL=C
> # (which, of course, changes the ordering entirely,
> #  f.i. uppercase ZZ will come before the lowercase words)

Sort is obeying your locale.  In en_US.UTF-8, the collation order is, to
a first approximation, case-insensitive with punctuation ignored.  That
is, in file 'b', you are comparing 'b...111' with 'b1..111', which is the
same comparison as 'b111' with 'b1111' (with the punctuation removed, the
shorter string is a prefix of the longer one, so the longer string
compares later); in file 'c', you are comparing 'b...123' with
'b1..123', which is the same comparison as 'b123' with 'b1123' (now
neither string is a prefix of the other: they first differ at the third
character, and 'b11' compares before 'b12').
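
The reasoning above can be approximated in a few lines of shell.  This is
only a rough model, assuming "ignore punctuation, fold case, then compare
bytewise"; it is not what glibc's strcoll actually does internally, and
the helper names collation_key and model_sort are made up for
illustration.

```shell
# Rough model of the explanation above: drop punctuation, fold case,
# then compare bytewise.  NOT glibc's real collation algorithm.
collation_key() {
    # Delete punctuation and lowercase what is left.
    printf '%s\n' "$1" | LC_ALL=C tr -d '[:punct:]' | LC_ALL=C tr '[:upper:]' '[:lower:]'
}

model_sort() {
    # Prefix each input line with its key, sort on the key in the C
    # locale, then strip the key again.
    while IFS= read -r line; do
        printf '%s\t%s\n' "$(collation_key "$line")" "$line"
    done | LC_ALL=C sort -t "$(printf '\t')" -k1,1 | cut -f2-
}

printf '%s\n' 'a...123' 'b...123' 'b1..123' 'ZZ..123' | model_sort
```

The keys are 'a123', 'b123', 'b1123' and 'zz123', so the model prints
a...123, b1..123, b...123, ZZ..123, the same order that 'sort c' produced
in en_US.UTF-8.  With 111 in place of 123 the keys 'b111' and 'b1111'
share a prefix and the input order is kept.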

What you are seeing is not a bug, but an artifact of your locale's
collation sequence rules.
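
For contrast, here is a small sketch of the LC_ALL=C workaround you
mentioned, applied to the contents of file 'c'.  The wrapper name
c_locale_sort is hypothetical; the only real ingredient is running sort
with LC_ALL=C so that lines are compared byte by byte.

```shell
# Sort stdin bytewise, regardless of the caller's locale settings.
c_locale_sort() {
    LC_ALL=C sort
}

printf '%s\n' 'a...123' 'b...123' 'b1..123' 'ZZ..123' | c_locale_sort
```

In the C locale 'Z' (byte 90) sorts before 'a' (byte 97), and '.' (byte
46) sorts before '1' (byte 49), so this prints ZZ..123, a...123, b...123,
b1..123: punctuation now counts, but uppercase comes first.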

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
