[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: how to speed up sort for partially sorted input?
From: |
Carl Edquist |
Subject: |
Re: how to speed up sort for partially sorted input? |
Date: |
Thu, 12 Aug 2021 16:52:16 -0500 (CDT) |
User-agent: |
Alpine 2.23.2 (DEB 496 2020-07-19) |
On Wed, 11 Aug 2021, pengyu.ut@gmail.com wrote:
Yes, this involves many calls of the coreuils' sort, which is not
efficient. Would it make sense to add an option in sort so that sort can
sort a partially sorted input in one shot.
Might be nice! Not sure how convenient that would be to work into the
existing sort code. Or how much of an overall speedup would actually be
realized.
Just curious, what size data sets are you working with? (Number of lines
in the input file (N), and number of distinct values (M) for column 1?
[for fun i'll pretend to do some "math" here]
Just in terms of complexity, if the sorting work (without the partial sort
of column 1) is ~ N * log(N), then for the optimized partial sort it would
be ~ N * log(N/M).
That means the "speedup" (in terms of total sorting work) is ~
log(N)/log(N/M).
More generally, if N = M**P, the speedup is P/(P-1).
Or (equivalently) if you have N = 2**N' and M = 2**M', the speedup is
N'/(N'-M').
So if you have N = M*M (eg, 64k lines with 256 distinct values for col 1)
the theoretical speedup is ~2x. And N = M*M*M (eg, 16M lines and 256
distinct values) gives only a 1.5x speedup.
Though if N and M are relatively closer, eg 1M lines and 64k distinct
values (so, chunks of 16 lines each), the speedup is 5x.
Nevertheless, there is probably an overhead cost that's perhaps linear
with M. (Though its weight depends on implementation...)
I'll add that, if your "sort by column 1, then column 2" requirement is
also satisfied just by sorting whole lines(*), then asking sort to parse
the lines into columns is extra work and you may actually make the whole
thing slower even if there is less to compare.
(I mean here "sort" vs "sort [-s] -k1,1 -k2,2")
(*) not necessarily the case, depending on how the column separators sort
relative to non-separator characters
Anyway i'd be curious how big of an actual speedup (even theoretically,
which might be optimistic) you might have in mind here. How many seconds
might actually be saved, for how long of a run? And is that projected
savings enticing enough to inspire implementing it?
(My guess is if someone wanted to add it though, the interface might be a
new option, effectively "assume sorted", which could be applied to the
OPTS for the KEYDEF specified for -k/--key ...)
Carl