From: Bob Proulx
Subject: bug#7878: "sort" bug--inconsistent single-column sorting influenced by other columns?
Date: Fri, 21 Jan 2011 02:45:02 -0700
User-agent: Mutt/1.5.20 (2009-06-14)

Hi Randall,

Randall Lewis wrote:
> Wow! So, a couple comments about how I seem to have figured out
> every wrong way to use "sort" when also using "join."

You did have an impressive number of cases examined!

> Who would've thought that 
>
> sort -k1 test1.txt
> 
> would default to sort on the entire line? (I normally would've
> thought that [,POS2] means "optional if you want to have it keep
> going beyond the first field.")

You are not the only one to have had that misconception.  But that is
the way that it has always worked.  Here is the GNU sort documentation.

  `-k POS1[,POS2]'
  `--key=POS1[,POS2]'
       Specify a sort field that consists of the part of the line between
       POS1 and POS2 (or the end of the line, if POS2 is omitted),
       _inclusive_.
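
To make the difference concrete, here is a minimal sketch.  The two
input lines are made up for illustration, and -s (stable sort) is
added only so that lines with equal keys stay in input order instead
of falling back to a whole-line comparison.

```shell
# Two hypothetical lines that share the same first field.
input=$(printf 'a|2\na|1')

# -k1: the key runs from field 1 to the end of the line,
# so "a|1" compares lower than "a|2".
whole=$(printf '%s\n' "$input" | LC_ALL=C sort -s -t '|' -k1)

# -k1,1: the key is field 1 only.  Both keys are "a", and -s
# keeps the lines in their original order.
first_only=$(printf '%s\n' "$input" | LC_ALL=C sort -s -t '|' -k1,1)

printf 'with -k1:\n%s\nwith -k1,1:\n%s\n' "$whole" "$first_only"
```

With -k1 the second column takes part in the comparison; with -k1,1
it does not.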

This end-of-line default goes back at least to Unix v7 and very likely
well before that.  If you were a programmer in the mid-1970s writing a
sorting program and making a simple decision about how to control
sorting with command line arguments, would you have had any idea that
in 2011 we would still be using virtually the same program and
interface?  And you would have been working on the problem for what
amounts to the first time, on a new operating system.
Having done interface design and having been less successful I can't
complain.  :-)  Some of the decisions were less than great.  Other
decisions were excellent and visionary.  On average they were better
than most of us can do on our best days.

> Also, who would've thought that the default "sort" would be
> incompatible with "join" and that you would need to write the
> command like this every time you wanted to use "join"?

When sort and join were written they were compatible.  Back then the
collation sequence was strictly byte ordering.  That is the standard C
locale ordering.
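
As a rough sketch of that compatibility, the following sorts two
inputs and joins them with byte ordering forced for every step.  The
file names and their contents are hypothetical, invented for the
example.

```shell
# Hypothetical two-column files in a scratch directory.
dir=$(mktemp -d)
printf '406|y\n40|x\n' > "$dir/left.txt"
printf '40|a\n587|c\n406|b\n' > "$dir/right.txt"

# Sort both inputs on the join field only, in byte order.
LC_ALL=C sort -t '|' -k1,1 "$dir/left.txt"  > "$dir/left.sorted"
LC_ALL=C sort -t '|' -k1,1 "$dir/right.txt" > "$dir/right.sorted"

# Run join under the same locale so that it agrees with sort
# about what "sorted" means.
joined=$(LC_ALL=C join -t '|' "$dir/left.sorted" "$dir/right.sorted")
printf '%s\n' "$joined"

rm -rf "$dir"
```

The point is only that sort and join see the same collation; the key
with no partner (587) is dropped by join as usual.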

It wasn't until locales such as en_US were introduced, relatively
recently, that the problems appeared.  For reasons unfathomable to me
the powers that be made sort ordering dictionary ordering where case
is folded and punctuation is ignored.  They failed to see how this
would negatively impact almost everything.  Creeping features.
Because punctuation is ignored in the en_US locale it causes a lot of
problems.  You didn't have to say LC_ALL=C for the first thirty years.
Don't get me started.  I have been a rather outspoken critic of this
design decision.
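
For illustration, here is what byte ordering does with the five lines
quoted later in this message.  The input order is shuffled on
purpose, and I am assuming the original test1.txt held these lines
with 40|4.

```shell
# In the C locale "|" is compared as an ordinary byte (0x7C), which
# is higher than any digit, so "406|3" sorts before "40|4".
sorted=$(printf '40|4\n587|5\n323|1\n406|3\n36|2\n' | LC_ALL=C sort)
printf '%s\n' "$sorted"
```

A locale that ignores punctuation would in effect compare 4063
against 404 and put the two lines the other way around.  What you
actually get depends on the locale data installed on your system.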

Personally I have the following set in my shell environment.

  export LANG=en_US.UTF-8
  export LC_COLLATE=C

I want the traditional collation sequence and so set LC_COLLATE.  But
I also want the fancy new characters with umlauts and that requires
(along with a Unicode charset) a UTF-8 capable locale.  The above is a
compromise but for me a good one.
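
One caveat with that setup: LC_ALL, when set to a non-empty value,
overrides LC_COLLATE, which in turn overrides LANG.  A small sketch,
with made-up input, that clears LC_ALL for a single command so the
LC_COLLATE setting is the one that applies:

```shell
# LC_ALL= (empty) is treated as unset, so LC_COLLATE=C decides the
# collation while LANG still names a UTF-8 capable locale.
sorted=$(printf 'b\nA\na\nB\n' | LC_ALL= LANG=en_US.UTF-8 LC_COLLATE=C sort)
printf '%s\n' "$sorted"
```

In byte order the uppercase letters come before the lowercase ones.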

> LC_ALL=C sort test1.txt
> 
> Or that you would need a special type of "pre-sort" on the column
> (which I was executing wrong)?
> 
> sort -k1,1 -t "|" test1.txt

Since you had two fields you probably want to sort on the second field too.

  sort -k1,1 -k2,2 -t "|" test1.txt

That will sort on the first field and then the second field.
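
A short sketch with made-up data, in the C locale so the result does
not depend on the installed locales:

```shell
# Field 1 is the primary key; field 2 only breaks ties between
# lines whose first fields are equal (here the two "40" lines).
sorted=$(printf '40|7\n323|1\n40|4\n36|2\n' | LC_ALL=C sort -t '|' -k1,1 -k2,2)
printf '%s\n' "$sorted"
```

Both keys are compared as strings here, which is why 323 comes
before 36.  If the columns should compare as numbers, add the n
modifier to the keys (-k1,1n -k2,2n).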

> Regardless, here is "locale" (for the record, I'm pretty new to the
> utilities--and love them. I'm not a computer scientist, but rather
> an economist trying to fit in at Yahoo! with the engineers and
> computer scientists). I'm sure there's a good reason why there are
> two, and it's pretty clear that I novice enough that I'll have to
> learn that later.

I didn't follow where the "two" was attached.  Two as in economists
and computer scientists?  Or two as in engineers and computer
scientists?  Full disclosure: I am an electrical engineer. :-)

> Thanks, Bob, for sharing two separate ways that I could get the
> answer the way I need it--two ways I could not have come up with on
> my own.

Just to nudge in a particular direction, there are two other mailing
lists that are good to know about.  The address@hidden mailing list
is for general discussion of the coreutils.  Here on bug-coreutils is
where bug reports are collected, and every new message thread opens a
bug ticket in the bug tracking system.  That is great for bug reports
but not so good for general discussion, since it keeps opening bugs
that need to be triaged.  That is why we have the coreutils mailing
list, which is just a normal list for normal discussion.
Additionally, the address@hidden list is a good resource for general
help.

> P.S. So, the reason why sorting on the column didn't work for me was
> because it was plucking out the delimiter and then doing a string
> sort? 

Correct.

> Then it was string sorting, putting numbers before letters (as
> you might expect it to)?

It would look like this to sort:

  $ sed 's/[[:punct:]]//' test1.txt 
  3231
  362
  4063
  404
  5875

  $ sed 's/[[:punct:]]//' test1.txt | LC_ALL=C sort
  3231
  362
  404
  4063
  5875

> 323|1
> 36|2
> 406|3
> 40|7 <-- Changed from 4 to 7 changed the sort order.
> 587|5

  $ sed 's/[[:punct:]]//' test1.txt | LC_ALL=C sort
  3231
  362
  4063
  407
  5875

Case is folded too, although that didn't come into play here.  And
the locale's collation affects everything that sorts, everywhere on
the system, including the shell.

  echo *
  for f in *; do ...
  ls
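
A minimal sketch of the shell side of this, using a scratch directory
and file names invented for the example:

```shell
# Create a few hypothetical files whose names differ in case.
dir=$(mktemp -d)
touch "$dir/bravo" "$dir/Alpine" "$dir/alpha" "$dir/Beta"

# With LC_ALL=C the glob expands in byte order: uppercase first.
# The assignment happens inside $(...), so it does not leak out.
listing=$(cd "$dir" && LC_ALL=C && export LC_ALL && echo *)
printf '%s\n' "$listing"

rm -rf "$dir"
```

Under a dictionary-style locale the same glob would typically
interleave the upper and lower case names, depending on the shell and
the installed locale data.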

Bob




