bug-coreutils

bug#23012: add option to specific locale to sort


From: John Heidemann
Subject: bug#23012: add option to specific locale to sort
Date: Mon, 14 Mar 2016 13:12:12 -0700

On Mon, 14 Mar 2016 11:39:37 -0700, Paul Eggert wrote: 
>On 03/14/2016 11:02 AM, John Heidemann wrote:
>> A test case that exhibits locale-specific oddness, with current sort:
>>
>> ... LC_COLLATE=C sort ...
>>
>> And the happiness that ensues from control without environment variables:
>>
>> ... sort --locale=C ...
>>
>
>I dunno, these approaches seem about the same to me. And 'LC_ALL=C
>sort' is standardized whereas 'sort --locale=C' is not, which is a
>significant advantage for portable scripts. And if we added the
>--locale option to 'sort', for consistency we'd need to add it to
>uniq, awk, grep, etc., etc., and document all this, and explain why
>there are two ways to specify the same thing and that one overrides
>the other, etc., etc. Is the minor benefit worth all this hassle?

0. I would suggest that sort has a problem, as shown by the comment in
the code and by two large FAQ entries (one for sort and one for ls).
People are confused, and it would be nice to change something to
reduce that confusion.


You're right that there are two questions:

1- is the API with arguments any better?
2- does it need to be uniform across all utilities?


1. About arguments: the environment-variable and CLI approaches are
quite different.


In a shell script, you're right that

LC_COLLATE=C sort

vs

sort --locale=C

are about the same.

Except that one might instead set LC_COLLATE elsewhere in the script:

export LC_COLLATE=C
# 100 lines of shell
sort
# correct behavior now depends on a global variable set 100 lines ago
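
One way to keep the locale next to the sort call today, without any new
option, is a small wrapper.  This is only a sketch: sort_c is a
hypothetical helper name, not part of coreutils.  It uses LC_ALL rather
than LC_COLLATE because LC_ALL takes precedence over the other LC_*
variables and LANG, so a setting exported elsewhere in the script cannot
change the result.

```shell
#!/bin/sh
# Hypothetical helper (not part of coreutils): pin the locale for one
# sort invocation, approximating what "sort --locale=C" would give.
sort_c() {
    # The assignment applies only to this sort process; the caller's
    # environment is left untouched.
    LC_ALL=C sort "$@"
}

# In the C locale, lines are compared byte by byte, so uppercase ASCII
# letters sort before lowercase ones.
printf 'b\nA\na\nB\n' | sort_c
```

With the input above this prints A, B, a, b, one per line, whatever
locale the surrounding script has exported.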


Things look even more different from C, where the choice is between
calling setenv("LC_COLLATE", "C", 1), which changes process-wide state,
and passing an extra argument to execve.


Is CLI *better*?   I suggest slightly better, but not that much better
(by itself).

Where it wins (I think) is that it is more consistent with how sort's
other behavior is controlled, and it would appear in the man page.

(By the way: if you don't take the patch, (a) I encourage you to copy the
text about LC_COLLATE from the info page to the manual page, and (b) it
looks like (in the code) the monetary aspects of the locale also affect
sorting.  That is not mentioned in either info or man.  These changes
might address some of the confusion raised in #0 above.)


2. Does it have to be uniform across all utilities?

Maybe in the fullness of time.  Or maybe not.

For me, sort is particularly important because some apps depend on
its output.  In other tools (like ls, from the FAQ), output order
doesn't usually affect correctness.

The specific use case that led me here is that Hadoop wants sorted
input to the reduce phase.  By default it uses a Java-based sort with
sort(1)-style arguments.  However, it ignores locale.  To be compatible,
one must run GNU sort with LC_COLLATE=C, and figuring that out is not at
all obvious.
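
For concreteness, here is roughly what that looks like.  This is a
sketch under assumptions: the tab-separated "key<TAB>value" record
layout and the helper name sort_by_key are mine, for illustration, and I
use LC_ALL=C rather than LC_COLLATE=C so that an exported LC_ALL cannot
override it.

```shell
#!/bin/sh
# Assumed record layout: key<TAB>value, one record per line.
TAB=$(printf '\t')

# Hypothetical helper: sort records by the first field only, comparing
# keys byte by byte (C locale).
sort_by_key() {
    LC_ALL=C sort -t "$TAB" -k1,1 "$@"
}

printf 'apple\t1\nBanana\t2\n_tmp\t3\nZebra\t4\n' | sort_by_key
```

In byte order the keys come out as Banana, Zebra, _tmp, apple, because
uppercase letters precede "_", which precedes lowercase letters.  Many
non-C locales would order these keys differently, which is the
mismatch described above.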


   -John Heidemann





