bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)


From: Pádraig Brady
Subject: bug#51011: [GNU sort] Numerical sort with delimiter may be broken (bug)
Date: Fri, 8 Oct 2021 14:37:42 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Thunderbird/84.0

On 04/10/2021 21:01, Paul Eggert wrote:
On 10/4/21 08:58, Pádraig Brady wrote:
The --debug option points out the issue:

    $ printf '%s\n' 1,a 0,9 | sort --debug -nk1 -t ,
    sort: key 1 is numeric and spans multiple fields
    1,a
    _
    ___
    0,9
    ___
    ___

As Juncheng points out, it is a bit odd that -n and -g disagree here,
even in locales where ',' is not a decimal point. For example:

$ printf '1,a\n0,9\n' | sort -gk1 -t, --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
0,9
_
___
1,a
_
___
$ printf '1,a\n0,9\n' | sort -nk1 -t, --debug
sort: text ordering performed using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
1,a
_
___
0,9
___
___

The difference here is due to ',' being treated as a thousands sep,
not a decimal point. So Juncheng to specifically answer your question,
0,9 is being interpreted as 9, which sorts after 1,a. For e.g. consider:

$ printf '%s\n' 1,a 0,900 | sort -s -k1,1g --debug
0,900
_
1,a
_

$ printf '%s\n' 1,a 0,900 | sort -s -k1,1n --debug
1,a
_
0,900
_____


Given the various groupings possible (depending on locale
one can group in 2, 3, 4, 5 digits) we effectively just
ignore the grouping separator in numeric mode, hence the difference.

Note in locales where , is a decimal point we do get
consistent order between -g and -n as expected:

$ printf '%s\n' '1,a' '0,9' | LC_ALL=fr_FR.utf8 sort -s -k1,1n --debug
sort: tri du texte réalisé en utilisant les règles de tri « fr_FR.utf8 »
0,9
___
1,a
__
$ printf '%s\n' '1,a' '0,9' | LC_ALL=fr_FR.utf8 sort -s -k1,1g --debug
sort: tri du texte réalisé en utilisant les règles de tri « fr_FR.utf8 »
0,9
___
1,a
__

For completeness we do have another issue with grouping separators,
where we don't support multi-byte separators appropriately.
For e.g. fr_FR.utf8 uses "narrow non breaking space" as the separator,
which we don't support:

$ sep=$(LC_ALL=fr_FR.utf8 locale thousands_sep)
$ printf '%s\n' 0800 "0${sep}900" | LC_ALL=fr_FR.utf8 sort -s -k1,1n --debug
sort: tri du texte réalisé en utilisant les règles de tri « fr_FR.utf8 »
0 900
_
0800
____


cheers,
Pádraig





reply via email to

[Prev in Thread] Current Thread [Next in Thread]