Re: how to speed up sort for partially sorted input?

coreutils

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: how to speed up sort for partially sorted input?

From:	Carl Edquist
Subject:	Re: how to speed up sort for partially sorted input?
Date:	Thu, 12 Aug 2021 16:52:16 -0500 (CDT)
User-agent:	Alpine 2.23.2 (DEB 496 2020-07-19)

On Wed, 11 Aug 2021, pengyu.ut@gmail.com wrote:

Yes, this involves many calls of the coreuils' sort, which is notefficient. Would it make sense to add an option in sort so that sort cansort a partially sorted input in one shot.

Might be nice! Not sure how convenient that would be to work into theexisting sort code. Or how much of an overall speedup would actually berealized.

Just curious, what size data sets are you working with? (Number of linesin the input file (N), and number of distinct values (M) for column 1?


[for fun i'll pretend to do some "math" here]

Just in terms of complexity, if the sorting work (without the partial sortof column 1) is ~ N * log(N), then for the optimized partial sort it wouldbe ~ N * log(N/M).

That means the "speedup" (in terms of total sorting work) is ~log(N)/log(N/M).


More generally, if N = M**P, the speedup is P/(P-1).

Or (equivalently) if you have N = 2**N' and M = 2**M', the speedup isN'/(N'-M').

So if you have N = M*M (eg, 64k lines with 256 distinct values for col 1)the theoretical speedup is ~2x. And N = M*M*M (eg, 16M lines and 256distinct values) gives only a 1.5x speedup.

Though if N and M are relatively closer, eg 1M lines and 64k distinctvalues (so, chunks of 16 lines each), the speedup is 5x.

Nevertheless, there is probably an overhead cost that's perhaps linearwith M. (Though its weight depends on implementation...)

I'll add that, if your "sort by column 1, then column 2" requirement isalso satisfied just by sorting whole lines(*), then asking sort to parsethe lines into columns is extra work and you may actually make the wholething slower even if there is less to compare.

(I mean here "sort" vs "sort [-s] -k1,1 -k2,2")

(*) not necessarily the case, depending on how the column separators sortrelative to non-separator characters

Anyway i'd be curious how big of an actual speedup (even theoretically,which might be optimistic) you might have in mind here. How many secondsmight actually be saved, for how long of a run? And is that projectedsavings enticing enough to inspire implementing it?

(My guess is if someone wanted to add it though, the interface might be anew option, effectively "assume sorted", which could be applied to theOPTS for the KEYDEF specified for -k/--key ...)



Carl

[Prev in Thread]

Current Thread

[Next in Thread]

Re: how to speed up sort for partially sorted input?, Kaz Kylheku (Coreutils), 2021/08/11
- Re: how to speed up sort for partially sorted input?, Kaz Kylheku (Coreutils), 2021/08/11
- Re: how to speed up sort for partially sorted input?, Carl Edquist, 2021/08/11
  - Re: how to speed up sort for partially sorted input?, Peng Yu, 2021/08/11
    - Re: how to speed up sort for partially sorted input?, Kaz Kylheku (Coreutils), 2021/08/11
    - Re: how to speed up sort for partially sorted input?, Peng Yu, 2021/08/11
    - Re: how to speed up sort for partially sorted input?, Carl Edquist <=
- Re: how to speed up sort for partially sorted input?, Kaz Kylheku (Coreutils), 2021/08/12
- Re: how to speed up sort for partially sorted input?, Carl Edquist, 2021/08/13

Prev by Date: Re: How to get the past Mon/Tue/.. not after the current date?
Next by Date: Re: how to speed up sort for partially sorted input?
Previous by thread: Re: how to speed up sort for partially sorted input?
Next by thread: Re: how to speed up sort for partially sorted input?
Index(es):
- Date
- Thread