bug-coreutils

bug#8067: sort fails to sort completely, due to "similar" keys.


From: Bob Harris
Subject: bug#8067: sort fails to sort completely, due to "similar" keys.
Date: Thu, 17 Feb 2011 15:46:16 -0500

Howdy,

(Note: I know I should give you version information with this, but (1) I am not sure that this message will be read by anyone, and (2) I think the problem probably transcends versions. If I get a response and the exact version matters, I will take the time to find it.)

I have a file of genomic short sequence info in which it so happens that two of my sort key values are similar. The two keys are
        HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1
        HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1
As you can see, these are identical if one removes the colons.

My file has something on the order of 4 million lines, and there are roughly a dozen lines with each of these keys. I am using sort with the intent of collecting the lines for each key together. (I don't really care about ordering; I just need to group lines with the same key together to facilitate downstream processing.) Unfortunately, sort considers the two keys equal, so it fails to create the grouping I need.
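
To make the goal concrete, here is a minimal Python sketch of the grouping described above, done without sorting at all. It is only an illustration, not part of sort: the function name group_by_key, the tab separator, the assumption that the key is the first field on each line, and the made-up payloads (read-a, read-b, read-c) are all hypothetical.

```python
from collections import defaultdict


def group_by_key(lines, sep="\t"):
    """Collect lines that share the same first field (hypothetical layout)."""
    groups = defaultdict(list)
    for line in lines:
        # The key is assumed to be everything before the first separator.
        key = line.split(sep, 1)[0]
        groups[key].append(line)
    return groups


# Two of the three sample lines share a key; the payloads are invented.
key_a = "HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1"
key_b = "HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1"
sample = [
    key_a + "\tread-a",
    key_b + "\tread-b",
    key_a + "\tread-c",
]

groups = group_by_key(sample)
for key, members in groups.items():
    print(key, len(members))
```

This holds every line in memory, which may or may not be acceptable for a file of about 4 million lines.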

I have tried several different options, but none seem to work. -d seems to be the default, and it has the behavior indicated above. -n fails completely. -g also fails. Reading the man page, I don't see any other options to control the comparison function. I have also tried massaging my file before piping it into sort, replacing colons with other characters (e.g. underscore or tilde), but with no success.

I understand *why* -d considers these two keys equal. What I don't understand is why there is no option that says "order them lexicographically".
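
As an illustration of what "order them lexicographically" would mean for these two keys, here is a small Python sketch. It relies on Python's built-in string comparison, which compares code points one character at a time; it says nothing about how sort itself is implemented, and the helper strip_colons is a hypothetical name used only here.

```python
key_a = "HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1"
key_b = "HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1"


def strip_colons(s):
    """Drop the colons, mimicking a comparison that ignores them."""
    return s.replace(":", "")


# Ignoring colons makes the two keys indistinguishable...
same_without_colons = strip_colons(key_a) == strip_colons(key_b)

# ...while a plain character-by-character comparison keeps them apart.
distinct_as_strings = key_a != key_b

# Sorting an interleaved list by plain string order groups equal keys together.
ordered = sorted([key_b, key_a, key_b, key_a])

print(same_without_colons, distinct_as_strings)
for key in ordered:
    print(key)
```

The keys first differ right after the shared prefix ending in ":5:2", where one has ":" and the other has "1", so a character-by-character comparison never treats them as equal.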

Is there a hidden sort option that will do what I need?

About the only way I can think of to force sort to actually sort on such a key is to pre-process the file and replace the keys with a hash code (rendered with nothing but A-Z). But this introduces additional issues, such as maintaining a table so I can convert the keys back after sorting, making sure my hash is unique, and so on.

I'm pretty sure I'm not the first person to run into this problem.

Thanks for any help or advice.
Bob H




