Re: [coreutils] Bug (?) in sort -R


From: Eric Blake
Subject: Re: [coreutils] Bug (?) in sort -R
Date: Mon, 16 Aug 2010 13:47:15 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.7) Gecko/20100720 Fedora/3.1.1-1.fc13 Lightning/1.0b2pre Mnenhy/0.8.3 Thunderbird/3.1.1

On 08/16/2010 01:22 PM, Jason wrote:
> I can't decide if this is a bug or not. Apologies if this has already been
> discussed; I am pretty new to the list. I'm using the latest git version,
> 8.5.136-6d78c.
> 
> If you do
> 
> sort -R -k 4,4 a > b
> 
> the relative ordering of column 4 is different than if you do
> 
> sort -R -k 4,5 a > b.

Thanks for the report.

First, remember that if you don't use -s, then sort adds an implicit
option of '-k 1' (that is, the entire line is treated as a tie-breaker),
which can affect results.
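
To see that tie-breaker in isolation, here is a minimal sketch that uses a plain (non-random) sort so the result is reproducible. The two input lines are made up for illustration; they share the same field 2, so only the last-resort comparison or -s decides their order.

```shell
# Two hypothetical lines whose key (field 2) is identical.
input=$(printf 'x 1 b\nx 1 a\n')

# Without -s, tied keys fall back to a whole-line comparison.
last_resort=$(printf '%s\n' "$input" | LC_ALL=C sort -k 2,2)

# With -s, tied lines keep their input order.
stable=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k 2,2)

printf 'without -s:\n%s\nwith -s:\n%s\n' "$last_resort" "$stable"
```

The same fallback applies when the key is compared with -R, which is why -s matters when you are judging the effect of the random hash alone.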

Also remember that without -b, the amount of whitespace preceding a
field is significant to some, but not all, of your sort fields, which
will impact the string subjected to the random hashing.
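
As a rough illustration of the whitespace point, again with a plain sort and made-up input so the outcome is deterministic: the first line has two blanks before its second field, the second line has one.

```shell
# Hypothetical input: differing amounts of whitespace before field 2.
input=$(printf 'a  z\na y\n')

# Without -b the keys are "  z" and " y"; the extra blank sorts first.
without_b=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k 2,2)

# With -b the keys are "z" and "y", so the order flips.
with_b=$(printf '%s\n' "$input" | LC_ALL=C sort -s -b -k 2,2)

printf 'without -b:\n%s\nwith -b:\n%s\n' "$without_b" "$with_b"
```

With -R the same leading blanks become part of the string that is hashed, so two keys that look equal can hash differently.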

> 
> (obviously the actual order in the output file is different on every run
> unless you pass in the same random data to get the same ordering)
> 
> It'd seem that the individual columns should be hashed and sorted
> independently in order to maintain the normal ordering of the primary sort
> column.

Nope, sort is based on the key, and if you request a key that spans two
columns, then you are hashing a different value than if you request a
key that spans one column.  If you really want to hash the two columns
independently, then tell that to sort:

sort -s -R -k 4,4 -k 5,5 a
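
One way to check that the columns really are hashed independently is sketched below. It assumes a GNU sort with -R and recreates the five-line sample file in a temporary file. The order differs from run to run, but lines sharing the same column 4 should always end up adjacent, so column 4 should form exactly two runs.

```shell
# Recreate the five-line sample in a temporary file.
a=$(mktemp)
printf '%s\n' 'a b c d e' 'a b c d f' 'a b c d g' 'a b c e e' 'a b c e f' > "$a"

# Hash column 4 and column 5 as separate keys.
shuffled=$(sort -s -b -R -k 4,4 -k 5,5 "$a")

# Count contiguous runs of column 4; equal keys hash equally, so they group.
groups=$(printf '%s\n' "$shuffled" | cut -d' ' -f4 | uniq | wc -l | tr -d ' ')

printf '%s\n' "$shuffled"
echo "runs of column 4: $groups"
rm -f "$a"
```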

> ~/coreutils/coreutils> src/sort --version
> sort (GNU coreutils) 8.5.136-6d78c
> 
> This is also true if you use the -s flag with only one field specified,
> which is a slightly different flavor of the same bug.
> 
> ~/coreutils/coreutils> src/sort -s -R -k 4 a

With only the start of the key specified, you are asking sort to use
everything from that field to the end of the line.  So 'sort -k 4' is
not the same as 'sort -k 4,4': the two hash different strings.
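
The difference is easiest to see without -R. A small sketch with made-up input, where field 2 is identical on both lines but the rest of the line differs:

```shell
# Hypothetical input: same field 2, different trailing field.
input=$(printf 'x 1 b\ny 1 a\n')

# -k 2,2 compares only " 1" on both lines: a tie, so -s keeps input order.
bounded=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k 2,2)

# -k 2 compares " 1 b" against " 1 a": the second line now sorts first.
open_ended=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k 2)

printf -- '-k 2,2:\n%s\n-k 2:\n%s\n' "$bounded" "$open_ended"
```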

So far, I don't think you have managed to pinpoint any bugs in sort, but
only in your usage of it.  The next version of coreutils will include
the --debug option to sort, to make analysis of your input a little
easier to follow:

$ sort --debug -R -k 4,4 a
sort: using `en_US.UTF-8' sorting rules
sort: leading blanks are significant in key 1; consider also specifying `b'
a b c e e
     __
_________
a b c e f
     __
_________
a b c d e
     __
_________
a b c d f
     __
_________
a b c d g
     __
_________

$ LC_ALL=C sort --debug -R -k 4,4 a
sort: using simple byte comparison
sort: leading blanks are significant in key 1; consider also specifying `b'
a b c d e
     __
_________
a b c d f
     __
_________
a b c d g
     __
_________
a b c e e
     __
_________
a b c e f
     __
_________

$ LC_ALL=C sort --debug -s -R -k 4 a
sort: using simple byte comparison
sort: leading blanks are significant in key 1; consider also specifying `b'
a b c e e
     ____
a b c d f
     ____
a b c d g
     ____
a b c d e
     ____
a b c e f
     ____

$ LC_ALL=C sort --debug -s -R -k 4,5 a
sort: using simple byte comparison
sort: leading blanks are significant in key 1; consider also specifying `b'
a b c e f
     ____
a b c e e
     ____
a b c d e
     ____
a b c d f
     ____
a b c d g
     ____

$ LC_ALL=C sort --debug -s -b -R -k 4,4 -k 5,5 a
sort: using simple byte comparison
a b c d g
      _
        _
a b c d e
      _
        _
a b c d f
      _
        _
a b c e e
      _
        _
a b c e f
      _
        _


-- 
Eric Blake   address@hidden    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
