Re: how to speed up sort for partially sorted input?
From: Carl Edquist
Subject: Re: how to speed up sort for partially sorted input?
Date: Fri, 13 Aug 2021 05:24:27 -0500 (CDT)
User-agent: Alpine 2.23.2 (DEB 496 2020-07-19)
On Thu, 12 Aug 2021, Kaz Kylheku (Coreutils) wrote:
> The solution doesn't exist today, whereas that Gawk program should run
> even in ten year old installations.
>
> For the solution to be useful, it only has to beat the actual sort
> which you have available today.
Hi Kaz,
The thing is, it seems regular ("unoptimized") sort is still a lot faster
than an "optimized" awk-based solution.
For instance, i'll generate 2M partially-sorted lines:
$ head -c $((6 * 1024**2)) /dev/urandom | xxd -p |
fold -w6 | sed -r 's/(..)(....)/\1 \2/' > unsorted
$ sort -s -k1,1 unsorted > partially-sorted
$ wc -l partially-sorted
2097152 partially-sorted
$ head partially-sorted
00 fcbc
00 52fa
00 8a5a
00 b630
00 a572
00 535d
00 e7d9
00 c987
00 cb53
00 574e
So here we have 2M lines and 256 distinct values for col1, for an average
chunk size of 8K lines.
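(A quick sanity check of that arithmetic: two hex digits give 16^2 = 256
distinct keys, and 2M lines spread evenly over them is 8K lines per key.)

```shell
# 2 hex digits in col1 -> 16^2 = 256 distinct keys;
# 2097152 lines / 256 keys = 8192 lines per chunk on average
echo $(( 16**2 )) $(( 2097152 / 256 ))
```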
Now time a few different sort methods:
(1) sort col1, col2 (the original specification)
$ time sort -k1,1 -k2,2 partially-sorted | md5sum
cbba6a6e9c80ade9dcbe49d26a6d94ce -
real 0m1.363s
user 0m4.817s
sys 0m0.074s
(2) just sort the whole lines
$ time sort partially-sorted | md5sum
cbba6a6e9c80ade9dcbe49d26a6d94ce -
real 0m0.499s
user 0m1.466s
sys 0m0.071s
(3) your (g)awk program
$ time ./kaz.gawk partially-sorted | md5sum
cbba6a6e9c80ade9dcbe49d26a6d94ce -
real 0m7.583s
user 0m7.597s
sys 0m0.136s
(I include md5sum just to show the outputs are identical.)
(2) produces output identical to (1) as long as the column separator
character sorts before all non-separator characters, so it's a clear
improvement whenever that assumption holds. Sadly, the "optimized" awk
program (3) does not help at all; it's significantly worse.
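The separator assumption behind (2) can be illustrated with a toy pair of
lines (LC_ALL=C is forced here so plain byte order applies; in this data
the space (0x20) sorts before the hex digits [0-9a-f], so whole-line byte
comparison agrees with the two-key comparison):

```shell
# '0' is a prefix of '0a', so the keyed sort puts '0 ab' first;
# byte-wise, ' ' < 'a', so the whole-line sort agrees.
printf '%s\n' '0a b' '0 ab' | LC_ALL=C sort
printf '%s\n' '0a b' '0 ab' | LC_ALL=C sort -k1,1 -k2,2
```

Both commands print the lines in the same order.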
...
For comparison, i've also run the same test with a higher partially-sorted
ratio, giving more advantage to the partially-sorted optimization.
Still 2M lines, but 64K distinct values for col1 (instead of 256), leaving
an average chunk size of 32 lines (instead of 8K).
[In terms of sort algorithm complexity (number of comparisons), for
optimized partially-sorted input this is 2.6x better than the above case
that only had 256 distinct values. (Left as an exercise to the reader is
why it's exactly 2.6x.) But there is other overhead involved also, so
there is no expectation that the overall run time will improve that much.]
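(One way to arrive at that figure, assuming a comparison sort costs about
log2(k) comparisons per line for a k-line chunk: the per-line cost drops
from log2(8192) = 13 to log2(32) = 5, and 13/5 = 2.6.)

```shell
# per-line comparison cost for 8K-line chunks vs 32-line chunks,
# and their ratio
awk 'BEGIN { print log(8192)/log(2), log(32)/log(2), log(8192)/log(32) }'
```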
$ head -c $((6 * 1024**2)) /dev/urandom | xxd -p |
fold -w6 | sed -r 's/(....)(..)/\1 \2/' > unsorted42
$ sort -s -k1,1 unsorted42 > partially-sorted42
$ head partially-sorted42
0000 fd
0000 3a
0000 81
0000 96
0000 6a
0000 94
0000 88
0000 92
0000 89
0000 e9
And the times:
$ time sort -s -k1,1 -k2,2 partially-sorted42 | md5sum
efb0303789a4e52c3ab27600fad1dac4 -
real 0m0.777s
user 0m2.499s
sys 0m0.071s
$ time sort partially-sorted42 | md5sum
efb0303789a4e52c3ab27600fad1dac4 -
real 0m0.378s
user 0m1.015s
sys 0m0.071s
$ time ./kaz.awk partially-sorted42 | md5sum
71e7c99d2ef6e67e5ad29115e97872b0 -
real 0m4.541s
user 0m4.663s
sys 0m0.081s
Times improve across the board, but with the same pattern as before.
An interesting observation is that the regular sort command already shows
improved performance on partially sorted input, without being told
explicitly that it's already sorted on col1.

That makes me even more skeptical about how much room is left for an
explicit optimization to improve things further. (Though i'd still be
curious to try it and find out, perhaps at the cost of some sleep.)
Anyway, happy friday!
Carl
PS: for the careful reader who notices that the checksum does not match in
the final instance: Kaz, it appears there is a bug in your awk script.
(You can probably reproduce it by setting up inputs like i have above.)
In particular, comparisons like '$1 != prev_1' will treat things as
numeric where possible, meaning '001' == '1', '00e0' == '0' (scientific
notation), and (surprise!) '00e0' == '00e6' (also scientific notation).
As a result, your awk program treats everything from 00e0 through 00e9 as
the same key, and sorts that entire range according to col2 alone.
You can avoid/fix this by putting '$1 ""' wherever you reference '$1' (to
force interpretation as a string rather than numeric).
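A one-liner demonstration of both the bug and the fix (the field values
here are taken from the examples above):

```shell
# awk compares two fields numerically when both look like numbers, and
# '00e0' .. '00e9' all parse as 0 in scientific notation -- so
# '00e0' == '00e6' is true; '00f0' is not numeric, so that comparison
# falls back to strings and is correctly false:
echo '00e0 00e6 00f0' | awk '{ print ($1 == $2), ($1 == $3) }'

# appending "" forces string comparison, so the first test is now false:
echo '00e0 00e6' | awk '{ print (($1 "") == ($2 "")) }'
```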