From: Bob Proulx
Subject: bug#7878: "sort" bug--inconsistent single-column sorting influenced by other columns?
Date: Thu, 20 Jan 2011 23:02:11 -0700
User-agent: Mutt/1.5.20 (2009-06-14)
Randall Lewis wrote:
> "sort" does inconsistent sorting.
You are sure about that? :-)
> I'm pretty sure it has NOTHING to do with the following warning,
> although I could be totally wrong.
>
> " *** WARNING ***
> The locale specified by the environment affects sort order.
> Set LC_ALL=C to get the traditional sort order that uses
> native byte values. "
You read this, so you know that sort bases its ordering upon the locale
setting, but you didn't tell us what locale you were using to sort? Shame
on you. Because you *know* I am going to ask you about it! :-)

What locale are you using? C? en_US.UTF-8? Some other? The locale
command will print this information. Here is an example from my system.
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
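
As a rough sketch of how that output is read: for collation, a non-empty
LC_ALL wins, then LC_COLLATE, then LANG. The helper name
effective_collation below is made up for illustration, and the final
"POSIX" fallback is an assumption about the usual implementation default.

```shell
# Hypothetical helper: report which setting decides the collation order.
# Arguments: the values of LC_ALL, LC_COLLATE and LANG, in that order.
effective_collation() {
    if [ -n "$1" ]; then
        echo "$1"        # LC_ALL overrides everything else
    elif [ -n "$2" ]; then
        echo "$2"        # then the category-specific variable
    elif [ -n "$3" ]; then
        echo "$3"        # then LANG
    else
        echo "POSIX"     # assumed default when nothing is set
    fi
}

# The current environment:
effective_collation "${LC_ALL:-}" "${LC_COLLATE:-}" "${LANG:-}"

# The settings from the locale output above (LC_ALL empty, LC_COLLATE=C):
example=$(effective_collation "" "C" "en_US.UTF-8")
echo "$example"
```

So with the settings shown, sort collates in the C locale even though
LANG is en_US.UTF-8.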
> sort test1.txt
> 323|1
> 36|2
> 40|4
> 406|3
> 587|5
> sort test7.txt
> 323|B1
> 36|C2
> 406|B3
> 40|B4
> 587|C5
Looks okay to me for the en_US.UTF-8 locale. But it will of course be
different in the C locale.
$ LC_ALL=en_US.UTF-8 sort test1.txt
323|1
36|2
40|4
406|3
587|5
$ LC_ALL=C sort test1.txt
323|1
36|2
406|3
40|4
587|5
What ordering did you expect there? I assume you are expecting to see
these sorted as in the C locale?
> The rows are in a different order depending on the dataset--and it
> is NOT a numeric sort. I'm not even sure it is ANY type of sort.
It is a character sort. A string sort. It compares each line of
characters from start to finish. But it uses the system's collation
tables based upon the locale. In the en_US.UTF-8 locale punctuation
is ignored and case is folded, so "40|4" compares roughly as "404" and
"406|3" as "4063", which is why 40|4 comes first there. I don't like
it but the powers that be have decreed it.
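
To see why the C locale puts 406|3 ahead of 40|4, it can help to look
at the byte values involved. This is only a small sketch; the variable
names are made up for illustration.

```shell
# The two lines that trade places between locales.
pair=$(printf '40|4\n406|3\n')

# In the C locale sort compares raw byte values.  The lines first
# differ at the third character: '|' in one, '6' in the other.
bar_code=$(printf '%d' "'|")
six_code=$(printf '%d' "'6")
echo "'|' is byte $bar_code, '6' is byte $six_code"

# '6' (54) is less than '|' (124), so "406|3" sorts before "40|4".
c_order=$(printf '%s\n' "$pair" | LC_ALL=C sort)
printf '%s\n' "$c_order"
```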
Please see the FAQ:
http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021
The standards documentation:
http://www.opengroup.org/onlinepubs/009695399/utilities/sort.html
Variables that control localization:
http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap08.html#tag_08_02
> sort -k1 -t "|" test1.txt
Hint: If you ever think you need to use -k POS1 then you almost always
should be using -k POS1,POS2 to specify where you want the sort to
stop comparing. Otherwise it compares all of the way to the end of
the line.
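
A small sketch of the difference, assuming test1.txt contains the five
lines quoted above (the file is recreated in a temporary directory
here) and using the C locale so the result does not depend on which
locales are installed:

```shell
workdir=$(mktemp -d)
printf '323|1\n36|2\n40|4\n406|3\n587|5\n' > "$workdir/test1.txt"

# -k1: the key starts at field 1 and runs to the end of the line,
# so the '|' and the second column take part in the comparison.
whole_line=$(LC_ALL=C sort -t'|' -k1 "$workdir/test1.txt")

# -k1,1: the key starts and stops at field 1.
first_field=$(LC_ALL=C sort -t'|' -k1,1 "$workdir/test1.txt")

echo "with -k1:";   printf '%s\n' "$whole_line"
echo "with -k1,1:"; printf '%s\n' "$first_field"

rm -rf "$workdir"
```

With -k1 the line 406|3 comes before 40|4. With -k1,1 the key "40" is
a prefix of "406" and so 40|4 comes first.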
> But why did it sort inconsistently in the first place based on the
> other contents of the file rather than just focusing on the first
> column--even when I told it to?
You never told it to stop comparing at the end of the first column, so
it continued all of the way to the end of the line. To restrict each
key to a single field, for example this way:
$ sort -t'|' -k1,1n -k2,2n test1.txt
36|2
40|4
323|1
406|3
587|5
That numeric ordering won't help you with join, though, since join
expects its input sorted on the join field in a non-numeric sort
ordering.
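
For join, one approach that should work is to sort both files on the
first field only, and to run sort and join in the same locale. The
sketch below assumes test1.txt and test7.txt contain the lines quoted
earlier; the .sorted file names are made up for illustration.

```shell
workdir=$(mktemp -d)
printf '323|1\n36|2\n40|4\n406|3\n587|5\n' > "$workdir/test1.txt"
printf '323|B1\n36|C2\n406|B3\n40|B4\n587|C5\n' > "$workdir/test7.txt"

# Sort both inputs on field 1 only, as strings, in the C locale.
LC_ALL=C sort -t'|' -k1,1 "$workdir/test1.txt" > "$workdir/test1.sorted"
LC_ALL=C sort -t'|' -k1,1 "$workdir/test7.txt" > "$workdir/test7.sorted"

# Join on field 1 using the same locale that was used for sorting.
joined=$(LC_ALL=C join -t'|' "$workdir/test1.sorted" "$workdir/test7.sorted")
printf '%s\n' "$joined"

rm -rf "$workdir"
```

The important part is that both sort commands and the join agree on
the field and on the collation order.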
> Inconsistent sorting when combined with 'join' provides incorrect
> matches and duplication of records. This is a mess.
Yes. Recent versions of join detect and warn about this. Recent
versions of sort have a --debug option that can help to identify
problem cases.
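
A minimal sketch of the --debug option, assuming a GNU sort new enough
to have it and the same test1.txt contents as quoted above. The exact
diagnostics vary between versions, so no output is reproduced here.

```shell
workdir=$(mktemp -d)
printf '323|1\n36|2\n40|4\n406|3\n587|5\n' > "$workdir/test1.txt"

# --debug is a GNU extension.  It underlines the part of each line
# that is used as a sort key and prints diagnostics to stderr, which
# is captured here together with the normal output.
debug_out=$(LC_ALL=C sort --debug -t'|' -k1,1 "$workdir/test1.txt" 2>&1)
printf '%s\n' "$debug_out"

rm -rf "$workdir"
```

Running the same command with -k1 instead of -k1,1 shows the underline
extending to the end of each line.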
Bob