coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: how to speed up sort for partially sorted input?


From: Carl Edquist
Subject: Re: how to speed up sort for partially sorted input?
Date: Thu, 12 Aug 2021 16:52:16 -0500 (CDT)
User-agent: Alpine 2.23.2 (DEB 496 2020-07-19)

On Wed, 11 Aug 2021, pengyu.ut@gmail.com wrote:

Yes, this involves many calls of the coreuils' sort, which is not efficient. Would it make sense to add an option in sort so that sort can sort a partially sorted input in one shot.

Might be nice! Not sure how convenient that would be to work into the existing sort code. Or how much of an overall speedup would actually be realized.


Just curious, what size data sets are you working with? (Number of lines in the input file (N), and number of distinct values (M) for column 1?

[for fun i'll pretend to do some "math" here]

Just in terms of complexity, if the sorting work (without the partial sort of column 1) is ~ N * log(N), then for the optimized partial sort it would be ~ N * log(N/M).

That means the "speedup" (in terms of total sorting work) is ~ log(N)/log(N/M).

More generally, if N = M**P, the speedup is P/(P-1).

Or (equivalently) if you have N = 2**N' and M = 2**M', the speedup is N'/(N'-M').

So if you have N = M*M (eg, 64k lines with 256 distinct values for col 1) the theoretical speedup is ~2x. And N = M*M*M (eg, 16M lines and 256 distinct values) gives only a 1.5x speedup.

Though if N and M are relatively closer, eg 1M lines and 64k distinct values (so, chunks of 16 lines each), the speedup is 5x.

Nevertheless, there is probably an overhead cost that's perhaps linear with M. (Though its weight depends on implementation...)


I'll add that, if your "sort by column 1, then column 2" requirement is also satisfied just by sorting whole lines(*), then asking sort to parse the lines into columns is extra work and you may actually make the whole thing slower even if there is less to compare.
(I mean here "sort" vs "sort [-s] -k1,1 -k2,2")

(*) not necessarily the case, depending on how the column separators sort relative to non-separator characters


Anyway i'd be curious how big of an actual speedup (even theoretically, which might be optimistic) you might have in mind here. How many seconds might actually be saved, for how long of a run? And is that projected savings enticing enough to inspire implementing it?


(My guess is if someone wanted to add it though, the interface might be a new option, effectively "assume sorted", which could be applied to the OPTS for the KEYDEF specified for -k/--key ...)


Carl



reply via email to

[Prev in Thread] Current Thread [Next in Thread]