coreutils

Re: BUG in sort --numeric-sort --unique


From: Kaz Kylheku (Coreutils)
Subject: Re: BUG in sort --numeric-sort --unique
Date: Thu, 13 Feb 2020 15:32:35 -0800
User-agent: Roundcube Webmail/0.9.2

On 2020-02-13 14:00, Stefano Pederzani wrote:
> In fact, separating the parameters:
> # cat controllareARCHIVIO_2020/02/controllare20200213.txt | sort -u |
> sort -n | wc -l
> 1262
> we workaround the bug.

My own experiment confirms that the behavior is reasonable.

When -n and -u are combined, uniqueness is based on numeric
equivalence. Since numeric equivalence is weaker than textual
equivalence, de-duplication based on numeric equivalence can cull
more records than de-duplication based on textual equivalence.

$ printf "0\n00\n000\n" | sort -u
0
00
000
$ printf "0\n00\n000\n" | sort -n
0
00
000
$ printf "0\n00\n000\n" | sort -nu
0
$ printf "0\n00\n000\n" | sort -n | sort -u
0
00
000
$ printf "0\n00\n000\n" | sort -u | sort -n
0
00
000

As you can see, sort -nu is not equivalent to either pipeline
of sort -n and sort -u. sort -nu has de-duplicated a file of
different "spellings" of zero down to a single entry.

Plain sort -u does not de-duplicate these entries because "0"
is textually different from "00".
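
To see how far the two notions of equivalence diverge on a given
input, the line counts can be compared directly. The following is a
minimal sketch; count_unique is a hypothetical helper name, not part
of coreutils, and it assumes mktemp is available.

```shell
# Hypothetical helper: read data on standard input and report how many
# distinct lines remain under textual (-u) and numeric (-nu) equivalence.
count_unique() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    textual=$(sort -u "$tmp" | wc -l)
    numeric=$(sort -nu "$tmp" | wc -l)
    rm -f "$tmp"
    # $((...)) strips any padding some wc implementations emit.
    echo "textual=$((textual)) numeric=$((numeric))"
}

printf '0\n00\n000\n1\n' | count_unique   # prints textual=4 numeric=2
```

A gap between the two counts means some lines differ as text but
compare equal as numbers.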

> Every line is only something like "1.2.3.4".

Unfortunately, "sort -n" will probably not do what you think with
this data.

Please read sort's GNU Info documentation; the man page lacks
detail about what numeric sorting means.

Also see the POSIX standard's description of -n:

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html

In short, what -n does is recognize a *prefix* of each line as a number
according to a pattern that includes optional blanks, an optional sign,
digits, a radix character, and digit group separators.
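
The prefix rule can be approximated with a regular expression. This is
only a sketch, not how sort is implemented: numeric_prefix is a
hypothetical helper, it assumes the C locale (radix character "."),
and it ignores digit group separators entirely.

```shell
# Hypothetical helper: print the leading part of a line that would be
# read as a number under the rule above. Assumes the C locale and
# ignores thousands separators.
numeric_prefix() {
    printf '%s\n' "$1" |
        LC_ALL=C sed -E 's/^[[:blank:]]*(-?[0-9]*(\.[0-9]*)?).*$/\1/'
}

numeric_prefix '1.2.3.4'   # prints 1.2
numeric_prefix '1.2.4.4'   # prints 1.2
numeric_prefix '  -7abc'   # prints -7
```

Everything after the first complete number is simply not part of the
numeric key.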

-n does not deal with compound numeric identifiers like 1.2.3.4.

Basically 1.2.3.4 and 1.2.4.4 both look like the number 1.2.

$ sort -nu
1.2.3.4
1.2.4.4
1.2.5.6
[Ctrl-D][Enter]
1.2.3.4

Oops! This result is correct; under numeric sort (-n), all these lines
are considered to have the key 1.2. And if we de-duplicate based on that,
they are all considered to be duplicates; they de-duplicate down to
a single line.
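
One possible way to handle such data is to split each line on "." and
compare every field numerically. The sketch below assumes each line
has exactly four dot-separated numeric fields; sort_dotted is a
hypothetical wrapper name. It uses only the standard -t and -k options.

```shell
# Hypothetical wrapper: sort four-part dotted identifiers field by field
# and drop lines whose four fields are all numerically equal.
# Assumes exactly four dot-separated numeric fields per line.
sort_dotted() {
    sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n -u
}

printf '1.2.10.4\n1.2.3.4\n1.2.4.4\n1.2.3.4\n' | sort_dotted
# prints:
# 1.2.3.4
# 1.2.4.4
# 1.2.10.4
```

Note that uniqueness is still numeric per field here, so "1.2.03.4" and
"1.2.3.4" would count as duplicates. GNU sort also has a -V (version
sort) option that may suit this kind of data.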






