bug-coreutils

Re: uniq/sort documentation flaw


From: Pádraig Brady
Subject: Re: uniq/sort documentation flaw
Date: Tue, 5 May 2009 12:13:04 +0100
User-agent: Thunderbird 2.0.0.6 (X11/20071008)

Andries E. Brouwer wrote:
> uniq(1) says
> 
>        Discard all but one of successive identical lines from INPUT
> 
> However, this is very misleading. "Identical" does not mean identical
> but "equal if one ignores differences that LC_COLLATE says should be ignored".
> 
> This man page line should be changed, adding a reference to the locale.
> As it is now, the words locale and LC_COLLATE do not occur on the man page.
> 
> The info file is better and mentions LC_COLLATE.
> But also there the fact that the meanings of "repeated" and "duplicate"
> are modified by LC_COLLATE is not mentioned explicitly.
> 
> Andries

How about the attached?
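
To illustrate the distinction Andries draws between "identical" and "equal under LC_COLLATE", here is a minimal C sketch. The helper names lines_match and lines_identical are hypothetical and this is not the code in src/uniq.c; it only assumes that a locale-aware comparison can be modelled with the standard strcoll() function.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not from src/uniq.c): two lines "match" when
   strcoll() reports that they collate equally in the current locale.
   The result therefore depends on the active LC_COLLATE setting.  */
bool
lines_match (const char *a, const char *b)
{
  assert (a != NULL && b != NULL);
  return strcoll (a, b) == 0;
}

/* Hypothetical helper: byte-for-byte identity, independent of locale.  */
bool
lines_identical (const char *a, const char *b)
{
  assert (a != NULL && b != NULL);
  return strcmp (a, b) == 0;
}
```

A C program starts in the "C" locale, where the two helpers agree. Only after a call such as setlocale (LC_ALL, "") can lines_match report equality for strings that lines_identical considers different, and whether it does depends on the collation rules of the selected locale.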

> (Sorting is an operation done on all kinds of data, not only lines of text.
> I would not mind an option that tells sort to ignore the locale rules for
> sorting because what is sorted is not text. That feels cleaner than
> preceding each invocation with LC_COLLATE=C. And locale-free sort also
> is much faster.)

Well, it is a very common issue.
http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021
I'm not sure there is a better solution than what we have though.
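
For reference, the in-program analogue of prefixing a command with LC_COLLATE=C might look like the sketch below. The names compare_lines and sort_lines_bytewise are hypothetical, and this is not how sort(1) is implemented; it only shows that switching the collation category to "C" makes strcoll() order strings by byte value.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator over an array of strings; the resulting order
   follows whatever LC_COLLATE is active when it is called.  */
int
compare_lines (const void *a, const void *b)
{
  const char *const *sa = a;
  const char *const *sb = b;
  return strcoll (*sa, *sb);
}

/* Hypothetical helper: sort LINES in plain byte order by switching
   only the collation category to "C" first.  Note that this changes
   the collation locale for the whole process and does not restore it.  */
void
sort_lines_bytewise (const char **lines, size_t n)
{
  assert (lines != NULL || n == 0);
  setlocale (LC_COLLATE, "C");
  qsort (lines, n, sizeof *lines, compare_lines);
}
```

With the input "b", "A", "a", "B" this yields "A", "B", "a", "b", since uppercase ASCII letters have lower byte values than lowercase ones.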

cheers,
Pádraig.
From 14d5f083fc6ed571ca0c07e51e7d4365c1ddcd91 Mon Sep 17 00:00:00 2001
From: Pádraig Brady <address@hidden>
Date: Tue, 5 May 2009 12:00:15 +0100
Subject: [PATCH] doc: note the use of LC_COLLATE in comm, join and uniq.

* doc/coreutils.texi (uniq invocation): Simplify the
text to remove the inconsequential mentioning of order,
while implying that LC_COLLATE can alter equality comparisons.
* src/comm.c (usage): Mention LC_COLLATE is significant.
* src/join.c (usage): Ditto.
* src/uniq.c (usage): Ditto. Also improve the summary.
Suggestion from Andries Brouwer.
---
 doc/coreutils.texi |    4 ++--
 src/comm.c         |    4 ++++
 src/join.c         |    1 +
 src/uniq.c         |    7 +++++--
 4 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 918f44e..b96fdb2 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -4406,8 +4406,8 @@ duplicate lines, perhaps you want to use @code{sort -u}.
 @xref{sort invocation}.
 
 @vindex LC_COLLATE
-Comparisons use the character collating sequence specified by the
-@env{LC_COLLATE} locale category.
+Comparisons honor the rules specified by the @env{LC_COLLATE}
+locale category.
 
 If no @var{output} file is specified, @command{uniq} writes to standard
 output.
diff --git a/src/comm.c b/src/comm.c
index c60936f..3c5b09a 100644
--- a/src/comm.c
+++ b/src/comm.c
@@ -129,6 +129,10 @@ and column three contains lines common to both files.\n\
 "), stdout);
       fputs (HELP_OPTION_DESCRIPTION, stdout);
       fputs (VERSION_OPTION_DESCRIPTION, stdout);
+      fputs (_("\
+\n\
+Note, comparisons honor the rules specified by `LC_COLLATE'.\n\
+"), stdout);
       emit_bug_reporting_address ();
     }
   exit (status);
diff --git a/src/join.c b/src/join.c
index 992a357..c716698 100644
--- a/src/join.c
+++ b/src/join.c
@@ -204,6 +204,7 @@ separated by CHAR.\n\
 \n\
 Important: FILE1 and FILE2 must be sorted on the join fields.\n\
 E.g., use `sort -k 1b,1' if `join' has no options.\n\
+Note, comparisons honor the rules specified by `LC_COLLATE'.\n\
 If the input is not sorted and some lines cannot be joined, a\n\
 warning message will be given.\n\
 "), stdout);
diff --git a/src/uniq.c b/src/uniq.c
index a3e0fb7..f9b4342 100644
--- a/src/uniq.c
+++ b/src/uniq.c
@@ -135,8 +135,10 @@ Usage: %s [OPTION]... [INPUT [OUTPUT]]\n\
 "),
              program_name);
       fputs (_("\
-Discard all but one of successive identical lines from INPUT (or\n\
-standard input), writing to OUTPUT (or standard output).\n\
+Filter adjacent matching lines from INPUT (or standard input),\n\
+writing to OUTPUT (or standard output).\n\
+\n\
+With no options, matching lines are merged to the first occurrence.\n\
 \n\
 "), stdout);
      fputs (_("\
@@ -170,6 +172,7 @@ characters.  Fields are skipped before chars.\n\
 \n\
 Note: 'uniq' does not detect repeated lines unless they are adjacent.\n\
 You may want to sort the input first, or use `sort -u' without `uniq'.\n\
+Also, comparisons honor the rules specified by `LC_COLLATE'.\n\
 "), stdout);
       emit_bug_reporting_address ();
     }
-- 
1.5.3.6

