bug-coreutils

bug#36674: Sort Suggestion


From: Assaf Gordon
Subject: bug#36674: Sort Suggestion
Date: Mon, 15 Jul 2019 13:23:52 -0600
User-agent: Mutt/1.11.4 (2019-03-13)

tag 36674 notabug
close 36674
stop

Hello,

On Mon, Jul 15, 2019 at 11:42:01AM -0700, Marshall Lake wrote:
> Even though this isn't a bug, I was asked to send the following to this
> email address.

(General suggestions and discussions are better suited to the
address@hidden mailing list; that way the system won't open a new
bug item.)

> 
> Re:  SORT Command from GNU coreutils 8.25
> 
> A suggestion for an additional option to the SORT command is to ignore
> non-alphanumeric characters.
> 
> As an example, in attempting to sort an index ...
> 
> Abbott, William                        259
> 
> sorts before:
> 
> Abbot, William                         099
> 
> If non-alphanumeric characters were ignored then the same two records
> would sort as:
> 
> Abbot, William                         099
> Abbott, William                        259
> 
> 

There's actually something else at play here:
In your case, sort does ignore non-alphanumeric characters,
but it ALSO ignores white space.
That happens because your locale is set to some language
(for example, en_US.UTF8).

Using such a locale makes sort ignore all non-alphanumeric characters,
whitespace, and upper/lower case.
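
To check which locale sort will pick up for collation, here is a minimal
sketch. It assumes the usual POSIX precedence (LC_ALL overrides LC_COLLATE,
which overrides LANG). The variable name effective_collate is only
illustrative, and falling back to "C" when nothing is set is an assumption
about the system default.

```shell
# Work out which locale governs collation for commands such as sort.
# Precedence: LC_ALL, then LC_COLLATE, then LANG.
# An empty value is treated the same as an unset one.
# The final fallback to "C" is an assumption about the system default.
effective_collate=${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}
printf 'collation locale: %s\n' "$effective_collate"
```

If this prints something other than C or POSIX, the locale-aware
comparison rules are in effect.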

In essence, you are comparing "AbbottWilliam" (two 't's) to
"AbbotWilliam" (one 't'): the second 't' is then compared to a 'w',
and is determined to come first.

If you force a POSIX/C locale, then all characters are considered,
and the result will be as you requested.

Observe the following:

  $ printf "%s\n" AbbottWilliam AbbotWilliam | LC_ALL=en_CA.utf8 sort
  AbbottWilliam
  AbbotWilliam

  $ printf "%s\n" "Abbott William" "Abbot William" | LC_ALL=en_CA.utf8 sort
  Abbott William
  Abbot William

  $ printf "%s\n" "Abbott William" "Abbot William" | LC_ALL=C sort
  Abbot William
  Abbott William

  $ printf "%s\n" "Abbott, William" "Abbot, William" | LC_ALL=C sort
  Abbot, William
  Abbott, William
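
Applied to the two index lines from the original report, the same
approach could be sketched as below. The variable names are illustrative
only. Setting LC_ALL=C on the sort command alone leaves the locale of the
rest of the session untouched.

```shell
# Sample data: the two index records quoted in the report.
index_lines='Abbott, William                        259
Abbot, William                         099'

# Prefixing the command with LC_ALL=C applies the C locale to this one
# sort invocation, so lines are compared byte by byte.
sorted=$(printf '%s\n' "$index_lines" | LC_ALL=C sort)
printf '%s\n' "$sorted"
```

The "Abbot, William" line comes out first here, because ',' sorts before
't' in byte order.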

Note that 'sort' already has an option for dictionary style sorting:
   -d, --dictionary-order: consider only blanks and alphanumeric characters.

However, locale rules take precedence over it, so effectively it only
works in the "C" locale:

  $ printf "%s\n" "Ab,,b,,ott William" "Abbot William" | LC_ALL=C sort
  Ab,,b,,ott William
  Abbot William

  $ printf "%s\n" "Ab,,b,,ott William" "Abbot William" | LC_ALL=C sort -d
  Abbot William
  Ab,,b,,ott William


You can read past discussion about the confusion resulting from locale
sorting rules here:
   https://debbugs.gnu.org/11621
   https://debbugs.gnu.org/12783


As such, I'm closing this as "not a bug", but discussion can continue
by replying to this thread.

-assaf





