bug#19021: Possible bug in sort
From: Leslie S Satenstein
Subject: bug#19021: Possible bug in sort
Date: Tue, 11 Nov 2014 18:27:49 +0000 (UTC)
Why not use sort -t ',' -k 1n ?
Regards
Leslie
Mr. Leslie Satenstein
Montréal Québec, Canada
From: Eric Blake <address@hidden>
To: Ben Mendis <address@hidden>; address@hidden
Sent: Tuesday, November 11, 2014 12:39 PM
Subject: bug#19021: Possible bug in sort
tag 19021 notabug
thanks
On 11/11/2014 09:39 AM, Ben Mendis wrote:
> http://stackoverflow.com/questions/26869717/why-does-sort-seem-to-sort-a-field-incorrectly-based-on-the-presence-or-absenc
>
> Data is here: https://gist.github.com/anonymous/2a7beb4871b25ae8f8b3
Thanks for the report. Rather than making us chase down links, why not
provide the information inline with your email?
>
> This results in line 7 being sorted incorrectly: sort -t , -k 1n < weird.csv
Try using the --debug option to see what is really happening. The bug
is NOT in sort, which correctly obeyed both your locale rules and the
command line you gave it, but in that command line: you didn't tell sort
where to quit parsing numbers.
I'm going to distill it down to a smaller input that still expresses the
same "swapped" lines:
$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
| sort -t, -k1n --debug
sort: using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
1,73,67,6
_________
_________
2,68,61,7
_________
_________
1,69,55,14
__________
__________
2,71,59,12
__________
__________
See what's happening? The -k1n argument says to start parsing at field
1, but continue parsing until either the input is no longer numeric or
until the end of line is reached (even if it goes into field 2 or
beyond). Since commas are the thousands separator in the en_US.UTF-8
locale and are silently ignored when parsing a number, sort is thus
comparing values such as 268617 (second line) and 1695514 (third line),
and the sort was correct.
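To make the comparison above concrete, here is a rough sketch that mimics the effective key by deleting the commas from each line and sorting on the result. The helper name effective_key is made up for illustration, and this only approximates what sort does in a locale where ',' is the thousands separator; it is not how sort works internally. LC_ALL=C is used so the sketch behaves the same everywhere.

```shell
# Hypothetical helper: prefix each line with the number you get when the
# commas are dropped, which approximates the key that -k1n ends up with
# in a locale whose thousands separator is ','.
effective_key() {
  while IFS= read -r line; do
    printf '%s\t%s\n' "$(printf '%s' "$line" | tr -d ,)" "$line"
  done
}

# Sort on the emulated key (first tab-separated column) only.
printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
  | effective_key | LC_ALL=C sort -k1,1n
```

The emulated keys come out as 173676, 268617, 1695514 and 2715912, which is the same "swapped" order as in the --debug output.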
Now, try telling sort that it must parse a numeric field, but must END
the parse at the end of the first field (if not sooner due to end of
number):
$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
| sort -t, -k1,1n --debug
sort: using ‘en_US.UTF-8’ sorting rules
1,69,55,14
_
__________
1,73,67,6
_
_________
2,68,61,7
_
_________
2,71,59,12
_
__________
Or try using a locale where ',' is NOT part of a valid number:
$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
| LC_ALL=C sort -t, -k1n --debug
sort: using simple byte comparison
sort: key 1 is numeric and spans multiple fields
1,69,55,14
_
__________
1,73,67,6
_
_________
2,68,61,7
_
_________
2,71,59,12
_
__________
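The two fixes can also be combined, so that the result depends neither on the caller's locale nor on where numeric parsing happens to stop. A minimal sketch (the wrapper name sort_by_first_field is only for illustration; the options are the same ones used above):

```shell
# Hypothetical wrapper: limit the numeric key to field 1 and pin the
# locale to C so ',' is never treated as part of a number.
sort_by_first_field() {
  LC_ALL=C sort -t, -k1,1n "$@"
}

printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
  | sort_by_first_field
```

Lines with equal keys are then ordered by sort's last-resort comparison of the whole line, which is why 1,69,55,14 lands before 1,73,67,6.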
>
> This produced the expected results: cut -f , -d 1-3 < weird.csv | sort -t ,
> -k 1n
Actually, you mean 'cut -d, -f 1-3' (you transposed while transferring
from the stackoverflow site to your email). But yeah, when you truncate
to a smaller number, you are comparing different values (17367 is less
than 26861).
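As a rough illustration of why the truncated input appears to work, this sketch keeps the first three fields and drops the commas, which approximates the values compared in a locale where ',' is ignored inside numbers. The function names sample and truncated_keys are made up for this example.

```shell
# Same sample data as in the examples above.
sample() {
  printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n'
}

# Hypothetical helper: keep fields 1-3, then drop the commas to get the
# number that a comma-ignoring numeric parse would see.
truncated_keys() {
  cut -d, -f1-3 | tr -d ,
}

sample | truncated_keys | LC_ALL=C sort -n
```

With only three fields every key has five digits, so the numeric order happens to match the order of the first field.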
>
> Using 'g' instead of 'n' also produces the expected results, but I'm not
> clear on what the difference is between 'g' and 'n'.
-n is specified by POSIX as parsing a numeric string according to the
current locale's definition (digits with an optional radix character and
thousands separators). -g is a GNU extension, which says to parse
floating point numbers. Apparently, in the en_US.UTF-8 locale, parsing
floating point stops at the first comma, while -n parsing does not:
$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
| sort -t, -k1g --debug
sort: using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
1,69,55,14
_
__________
1,73,67,6
_
_________
2,68,61,7
_
_________
2,71,59,12
_
__________
I don't know why libc chose to make strtoll() ignore commas while
strtold() does not, when not in the C locale.
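Another way to see the difference between the two options, independent of the comma question, is exponent notation. A small sketch in the C locale (the function name numbers is just for this example): -n stops parsing at the 'e', while -g reads the whole floating point value.

```shell
# Sample input: 1e3 is 1000 as a floating point number, but its leading
# numeric string is just "1".
numbers() {
  printf '1e3\n5\n20\n'
}

numbers | LC_ALL=C sort -n   # 1e3 is compared as 1, so it sorts first
numbers | LC_ALL=C sort -g   # 1e3 is compared as 1000, so it sorts last
```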
But at any rate, I hope I've demonstrated that the bug was in your usage
and not in sort. So I'm closing this bug, although you should feel free
to add further comments or questions. You may also want to read the FAQ:
https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021
[Hmm - we should update that FAQ to mention the --debug option]
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org