bug-coreutils

Re: problems with 'join' command


From: Eric Blake
Subject: Re: problems with 'join' command
Date: Thu, 31 Jan 2008 19:56:43 -0700
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071031 Thunderbird/2.0.0.9 Mnenhy/0.7.5.666

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

According to Samir Wadhawan on 1/31/2008 2:41 PM:
| Dear Mike Haertel,

Coreutils is maintained by more than just Mike (for that matter, it has
been years since Mike made any contributions, according to the ChangeLog).

| As indicated in the join's manpage, we ensured that the columns on
| which the join was being produced were sorted using these commands before
| the join was conducted:
|
| sort -k 5 file1 > file1.srt
| sort -k 1 file2 > file2.srt
|
| Surprisingly we notice that join proceeds WITHOUT errors when we use this
| variant of sort:
|
| sort -k 5,5 file1 > file1.srt
| sort -k 1,1 file2 > file2.srt

Thanks for the report.  However, this is probably not a bug but a locale
issue.  "sort -k 5 file1" is different from "sort -k 5,5 file1".  The first
uses a key that starts at the fifth field and runs to the end of the line,
while the second uses only the fifth field as the key.  Depending on
your current LC_COLLATE setting, this may be significant.  In both cases,
since your input file1 had repeats in field 5, sort must fall back on
comparing the entire line to order lines whose keys otherwise compare
equal.  Also, since you didn't use -b for sort, the leading blanks count
as part of the key, which may affect which lines compare equal.
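
As a rough illustration of the key-span difference, here is a minimal sketch
using made-up two-field data (not your files), with the key on field 1
rather than field 5 to keep it short.  The -s option is used only to switch
off the whole-line fallback so that the effect of the key alone is visible.

```shell
# Hypothetical sample data; field 1 plays the role of the join key.
tmp=$(mktemp -d)
printf '%s\n' 'b 1' 'a 2' 'a 1' > "$tmp/demo"

# Key runs from field 1 to the end of the line, so the text after
# the first field takes part in the comparison: "a 1" precedes "a 2".
whole=$(LC_ALL=C sort -k 1 "$tmp/demo")

# Key is field 1 only.  With -s (stable) there is no last-resort
# whole-line comparison, so the two "a" lines keep their input order.
only=$(LC_ALL=C sort -s -k 1,1 "$tmp/demo")

printf 'sort -k 1:\n%s\n\nsort -s -k 1,1:\n%s\n' "$whole" "$only"
```

Without -s, the second command would also fall back on the whole line for
the tied "a" lines, as described above.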

|
| Clearly, the only difference between the above two variants of sort
| command is the additional sorting order of the columns following the ones
| on which the sort is being generated. This behaviour puzzles us as the
| join seems to be producing different (inconsistent) outputs, and appears
| to be sensitive to the sorting order of other columns in the file.

Join can only produce consistent output if its inputs are sorted
consistently on the join fields; it appears that your sorting is not
consistent enough for join's purposes under your given locale settings.
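
One way to keep sort and join in agreement is to run both under the same
locale and to limit each sort key to the join field.  The following is only
a sketch: the file contents are invented, and choosing LC_ALL=C and -b is an
assumption about what suits your data.

```shell
# Hypothetical inputs: file1 is joined on field 5, file2 on field 1.
tmp=$(mktemp -d)
printf '%s\n' 'x1 p q r k2' 'x2 p q r k1' > "$tmp/file1"
printf '%s\n' 'k2 beta' 'k1 alpha' > "$tmp/file2"

# Use one locale for every command so the collation order matches.
export LC_ALL=C

# Restrict each key to the join field and ignore leading blanks.
sort -b -k 5,5 "$tmp/file1" > "$tmp/file1.srt"
sort -b -k 1,1 "$tmp/file2" > "$tmp/file2.srt"

# Join field 5 of the first file against field 1 of the second.
joined=$(join -1 5 -2 1 "$tmp/file1.srt" "$tmp/file2.srt")
printf '%s\n' "$joined"
```

Each output line starts with the join field, followed by the remaining
fields of the first file and then those of the second.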

|
| We tried to reproduce this behaviour on an AIX machine, but find that
| both variants of the sorted files produce consistent join results.

Most likely that is because the AIX machine had different locale settings.
See the FAQ entry:
http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021
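
To compare the two machines, it may help to print the locale settings on
each and to see what plain byte ordering looks like.  This is a small sketch
with made-up input; the output of "locale" will of course differ per system.

```shell
# In the C locale, sort compares raw byte values, so uppercase "B"
# (byte 66) comes before lowercase "a" (byte 97).
bytes=$(printf '%s\n' 'a' 'B' | LC_ALL=C sort)
printf '%s\n' "$bytes"

# Print the locale settings in effect for this shell, including LC_COLLATE.
locale
```

In many non-C locales the same two lines would come out in the opposite
order, which is the kind of difference the FAQ entry describes.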

- --
Don't work too hard, make some time for fun as well!

Eric Blake             address@hidden
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHoopr84KuGfSFAYARAjZMAJ46DpbuO5BTE3+ajTQIgGuoahwCFgCeMEn3
KFIq50tdYkD3zkPrhKBu/hg=
=xmsx
-----END PGP SIGNATURE-----



