bug-coreutils



From: Bob Proulx
Subject: bug#7878: "sort" bug--inconsistent single-column sorting influenced by other columns?
Date: Thu, 20 Jan 2011 23:02:11 -0700
User-agent: Mutt/1.5.20 (2009-06-14)

Randall Lewis wrote:
> "sort" does inconsistent sorting.

You are sure about that?  :-)

> I'm pretty sure it has NOTHING to do with the following warning,
> although I could be totally wrong.
> 
> " *** WARNING ***
> The locale specified by the environment affects sort order.
> Set LC_ALL=C to get the traditional sort order that uses
> native byte values. "

You read this, know that sort will base the sorting upon the locale
setting, but didn't tell us what locale you were using to sort?  Shame
on you.  Because you *know* I am going to ask you about it! :-)

What locale are you using?  C?  en_US.UTF-8?  Some other?  The locale
command will print this information.  Here is an example from my system.

  $ locale
  LANG=en_US.UTF-8
  LC_CTYPE="en_US.UTF-8"
  LC_NUMERIC="en_US.UTF-8"
  LC_TIME="en_US.UTF-8"
  LC_COLLATE=C
  LC_MONETARY="en_US.UTF-8"
  LC_MESSAGES="en_US.UTF-8"
  LC_PAPER="en_US.UTF-8"
  LC_NAME="en_US.UTF-8"
  LC_ADDRESS="en_US.UTF-8"
  LC_TELEPHONE="en_US.UTF-8"
  LC_MEASUREMENT="en_US.UTF-8"
  LC_IDENTIFICATION="en_US.UTF-8"
  LC_ALL=

> sort test1.txt
> 323|1
> 36|2
> 40|4
> 406|3
> 587|5

> sort test7.txt
> 323|B1
> 36|C2
> 406|B3
> 40|B4
> 587|C5

Looks okay to me for the en_US.UTF-8 locale.  But it will of course be
different in the C locale.

  $ LC_ALL=en_US.UTF-8 sort test1.txt 
  323|1
  36|2
  40|4
  406|3
  587|5

  $ LC_ALL=C sort test1.txt 
  323|1
  36|2
  406|3
  40|4
  587|5

What ordering did you expect there?  I assume you are expecting to see
these sorted as in the C locale?

> The rows are in a different order depending on the dataset--and it
> is NOT a numeric sort. I'm not even sure it is ANY type of sort.

It is a character sort.  A string sort.  It compares each line of
characters from start to finish, using the system's collation tables
for the current locale.  In the en_US.UTF-8 locale punctuation is
ignored and case is folded, so "40|4" compares like "404" and sorts
before "406|3", while "40|B4" compares like "40B4" and sorts after
"406|B3".  I don't like it but the powers that be have decreed it.

Please see the FAQ:

  
http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

The standards documentation:

  http://www.opengroup.org/onlinepubs/009695399/utilities/sort.html

Variables that control localization:

  
http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap08.html#tag_08_02

> sort -k1 -t "|" test1.txt

Hint: If you ever think you need to use -k POS1 then you almost always
should be using -k POS1,POS2 to specify where you want the sort to
stop comparing.  Otherwise it compares all of the way to the end of
the line.
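
To make the hint concrete, here is a small sketch comparing -k1 with
-k1,1 on the same data as test1.txt.  It assumes LC_ALL=C so that the
result does not depend on which locales are installed, and the
temporary directory and variable names are only for illustration.

```shell
# Sketch: whole-line key (-k1) versus first-field-only key (-k1,1).
# LC_ALL=C is an assumption made here to keep the comparison byte-based.
tmpdir=$(mktemp -d)
printf '%s\n' '323|1' '36|2' '40|4' '406|3' '587|5' > "$tmpdir/test1.txt"

# -k1 starts at field 1 and runs to the end of the line, so the '|'
# separator and the second column take part in the comparison.
to_eol=$(LC_ALL=C sort -t'|' -k1 "$tmpdir/test1.txt")

# -k1,1 stops at the end of field 1, so only the first column is compared.
first_only=$(LC_ALL=C sort -t'|' -k1,1 "$tmpdir/test1.txt")

printf '%s\n' "-k1:" "$to_eol" "-k1,1:" "$first_only"
rm -rf "$tmpdir"
```

With -k1 the line 406|3 sorts before 40|4, because '6' is a smaller
byte than '|'.  With -k1,1 the key "40" is a prefix of "406", so 40|4
comes first.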

> But why did it sort inconsistently in the first place based on the
> other contents of the file rather than just focusing on the first
> column--even when I told it to?

You never told it where to stop, so it continued comparing all of the
way to the end of the line.  For example, this way limits each key to
a single field and compares it numerically:

  $ sort -t'|' -k1,1n -k2,2n test1.txt 
  36|2
  40|4
  323|1
  406|3
  587|5

That numeric ordering won't help you with join, though, since join
expects its input in a non-numeric (string) sort ordering on the join
field.
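
For join, one possible approach is to sort both inputs as strings on
the join field only, and to run sort and join under the same locale.
The sketch below assumes LC_ALL=C for both commands; the file names
left.txt and right.txt are made up for illustration and hold the same
rows as test1.txt and test7.txt.

```shell
# Sketch: prepare two files for join by sorting on field 1 as a string.
# The same locale (here C, an assumption) is used for sort and for join.
tmpdir=$(mktemp -d)
printf '%s\n' '323|1' '36|2' '40|4' '406|3' '587|5' > "$tmpdir/left.txt"
printf '%s\n' '323|B1' '36|C2' '406|B3' '40|B4' '587|C5' > "$tmpdir/right.txt"

# String sort restricted to the first '|'-separated field.
LC_ALL=C sort -t'|' -k1,1 "$tmpdir/left.txt"  > "$tmpdir/left.sorted"
LC_ALL=C sort -t'|' -k1,1 "$tmpdir/right.txt" > "$tmpdir/right.sorted"

# Join on field 1 (the default join field) with the same separator.
joined=$(LC_ALL=C join -t'|' "$tmpdir/left.sorted" "$tmpdir/right.sorted")
printf '%s\n' "$joined"
rm -rf "$tmpdir"
```

Each key then matches exactly once, and the joined rows come out as
key|left-value|right-value.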

> Inconsistent sorting when combined with 'join' provides incorrect
> matches and duplication of records. This is a mess.

Yes.  Recent versions of join detect and warn about this.  Recent
versions of sort have a --debug option that can help to identify
problem cases.
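
If neither of those is available, a rough alternative is the standard
sort -c check, which only reports whether a file is already in order
for a given key and locale.  This is a sketch of that separate
technique, not of --debug itself; LC_ALL=C and the variable names are
assumptions for illustration.

```shell
# Sketch: use "sort -c" to test whether a file is already ordered.
# sort -c exits non-zero when it finds a line out of order.
tmpdir=$(mktemp -d)
printf '%s\n' '323|1' '36|2' '40|4' '406|3' '587|5' > "$tmpdir/test1.txt"

# Whole-line comparison in the C locale: "40|4" before "406|3" is disorder.
if LC_ALL=C sort -c "$tmpdir/test1.txt" 2>/dev/null; then
  whole_line=sorted
else
  whole_line=disorder
fi

# Comparison restricted to the first field: "40" before "406" is in order.
if LC_ALL=C sort -c -t'|' -k1,1 "$tmpdir/test1.txt" 2>/dev/null; then
  first_field=sorted
else
  first_field=disorder
fi

printf '%s\n' "whole line: $whole_line" "first field: $first_field"
rm -rf "$tmpdir"
```

Running the check with the same key and locale that join will see is
a cheap way to catch a mismatch before it produces bad matches.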

Bob




