coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Request for feature, sort


From: David Mathog
Subject: Re: Request for feature, sort
Date: Mon, 05 Nov 2018 09:34:48 -0800
User-agent: Roundcube Webmail/0.9.5

On 04-Nov-2018 00:45, Pádraig Brady wrote:
I'm not sure how frequent this partially sorted data pattern is,
though given the complexity improvements it seems worthwhile.

It is extremely common in Biology applications. For instance, search a set of DNA sequences (with pre-ordered names like contig_1 -> contig_N, or frequently just 1 -> N) against a target database and the output will (for most applications) have clusters of results for 1, then 2, ... then N.

There are some programs like that which when run multithreaded would emit those same blocks of results but with the block order shuffled over a window corresponding to the number of threads. That is, they just emit each block when it finishes without doing anything to keep the blocks in order. Taking advantage of "block order", but not "absolute order", in sort would require yet another key modifier. While this is a very closely related problem I'm not sure it is common enough to warrant the extra modifier. That said, I think the only modification needed from an "absolute order" implementation to a "block order" one is that the program not consider an out of sorted order key change to be an error.


So 'p' would be a modifier applicable to each key.
When the data in the key changes we could do
the full key comparison to ensure the order is correct,
and if so output all current sorted data for the just presorted key.
'p' modified keys would need to come before any non 'p' keys.

Exactly.

Conceivably there might be situations where the input file was already sorted on keys 1 and 3, but not on key 2. However, retaining the order information for the 3rd key while sorting the 2nd is likely more effort than it is worth. Certainly diminishing returns set in very quickly in that arena - consider how difficult it would be to retain the order information for "-k 1,1 -k 2,2 -k 3,3p" for an input file of a billion lines. Anyway, requiring all "p" keys to precede all non "p" keys is not an onerous restriction.

Regards,

David Mathog
address@hidden
Manager, Sequence Analysis Facility, Biology Division, Caltech



reply via email to

[Prev in Thread] Current Thread [Next in Thread]