coreutils

Re: question about behavior of sort -n -t,


From: Pádraig Brady
Subject: Re: question about behavior of sort -n -t,
Date: Tue, 08 Oct 2013 23:34:58 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2

On 10/08/2013 10:18 PM, Gabriel Gaster wrote:
> Hello all,
> 
> I have a question about the behavior of sort -n.
> 
> The premise of the question I asked on stackoverflow here 
> (http://stackoverflow.com/questions/19228968/unix-sort-n-t-gives-unexpected-result)
>  
> 
> Evidently, even if a user specifies a field-separator, the entire line is 
> still treated as a key. If the entire key is not numeric, then sort -n does 
> not throw any errors and seems to not do numeric sort and rather does some 
> other sort (the order of which I am unclear on). This strikes me as 
> unexpected behavior -- because the caller can think he's going to get numeric 
> sort and not get numeric sort.
> 
> As far as I can tell, specifying field-separator and calling numeric *should* 
> sort numerically _if_ the key is numeric.  Furthermore -- and I suppose this 
> is the main thing -- if a field-separator is specified, then the key should 
> default to each field and not to the entire line. Why else would one specify 
> a field-separator if not to use it in this way?
> 
> Can someone shed more light into this ? I'm also not sure if there is an 
> existing conversation about this, if it's being changed in a later release, 
> or if this is a known and long debated issue, or whatnot.
> 
> I'm eager to make contributions in this regard, of course. I would mostly 
> like to know the current discussion of these things and what the current 
> thinking is on sort -n -t','.

The main issue here is that your input is ambiguous with respect to numbers.
When comparing numbers, thousands separators are ignored
(even though, for your locale, they are misplaced in this input).

Also note that while some of the sort functionality is awkward,
it's done like that for backwards and cross-platform compatibility reasons.

Also, as you've noticed, you would need to study the info documentation
very carefully to fully understand what's going on.  So we've added the --debug
option to help one figure that out (it probably should have been
called --explain, but anyway...).

So consider the following command, where we specify --debug
to annotate the part of each line being matched as a number.
Also, -s is specified to disable the last-resort comparison, which simplifies
the illustration.

$ sort --debug -s -t, -n t.csv
sort: using ‘en_US.utf8’ sorting rules
12,1.080339685
______________
58,1.49270399401
________________
59,0.00182092924724
___________________
12,13.080339685
_______________


You can see above that the numbers are interpreted as 121... 581... 590... 1213...
and sorted accordingly. If you change to the C locale, where there are no
thousands separators:

$ LANG=C sort --debug -s -t, -n t.csv
sort: using simple byte comparison
12,13.080339685
__
12,1.080339685
__
58,1.49270399401
__
59,0.00182092924724
__
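
The effect of the thousands separator can be seen in isolation with a
two-line input. This is only a small sketch using made-up values (they are
not from the original t.csv). In the C locale the comma ends the number,
so "1,000" compares as 1 and sorts before 999. In a locale where ',' is the
thousands separator, the same input would be expected to sort the other way
round.

```shell
# Made-up input, not taken from the original report.
# In the C locale there is no thousands separator, so -n stops
# reading the number at the comma and "1,000" compares as 1.
out=$(printf '999\n1,000\n' | LC_ALL=C sort -n)
printf '%s\n' "$out"
# prints:
# 1,000
# 999
```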


If for some reason you want to honor locale rules then you might try to add -k1,
but a key given as -k1 runs from field 1 to the end of the line,
so you're warned about the sort spanning multiple fields:

$ sort --debug -s -t, -n -k1 t.csv
/home/padraig/git/coreutils/src/sort: using ‘en_US.utf8’ sorting rules
/home/padraig/git/coreutils/src/sort: key 1 is numeric and spans multiple fields
12,1.080339685
______________
58,1.49270399401
________________
59,0.00182092924724
___________________
12,13.080339685
_______________

So what you really want is to specify single fields like:

$ sort --debug -s -t, -n -k1,1 -k2,2 t.csv
/home/padraig/git/coreutils/src/sort: using ‘en_US.utf8’ sorting rules
12,1.080339685
__
   ___________
12,13.080339685
__
   ____________
58,1.49270399401
__
   _____________
59,0.00182092924724
__
   ________________
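
For anyone wanting to reproduce this without the original file, here is a
minimal sketch. It assumes a temporary file holding the four rows shown in
this thread (the original order of t.csv is a guess), and it sets LC_ALL=C
so that no locale-specific thousands separator is involved. The `n` modifier
is attached to each key instead of using a global -n, which should be
equivalent here.

```shell
# Hypothetical sample file reusing the four rows shown in this thread.
tmp=$(mktemp)
printf '%s\n' \
  '12,13.080339685' \
  '12,1.080339685' \
  '58,1.49270399401' \
  '59,0.00182092924724' > "$tmp"

# -t, splits fields on commas; -k1,1n and -k2,2n confine each
# numeric key to a single field, so ties on field 1 fall to field 2.
sorted=$(LC_ALL=C sort -t, -k1,1n -k2,2n "$tmp")
printf '%s\n' "$sorted"
rm -f "$tmp"
```

With keys limited to one field each, the two rows starting with 12 are
ordered by their second field, so 12,1.080339685 comes before
12,13.080339685.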



