
Re: how to speed up sort for partially sorted input?


From: Carl Edquist
Subject: Re: how to speed up sort for partially sorted input?
Date: Fri, 13 Aug 2021 05:24:27 -0500 (CDT)
User-agent: Alpine 2.23.2 (DEB 496 2020-07-19)

On Thu, 12 Aug 2021, Kaz Kylheku (Coreutils) wrote:

> The solution doesn't exist today, whereas that Gawk program should
> run even in ten year old installations.
>
> For the solution to be useful, it only has to beat the actual sort
> which you have available today

Hi Kaz,

The thing is, it seems regular ("unoptimized") sort is still a lot faster than an "optimized" awk-based solution.

For instance, i'll generate 2M partially-sorted lines:

        $ head -c $((6 * 1024**2)) /dev/urandom | xxd -p |
          fold -w6 | sed -r 's/(..)(....)/\1 \2/' > unsorted

        $ sort -s -k1,1 unsorted > partially-sorted

        $ wc -l partially-sorted
        2097152 partially-sorted

        $ head partially-sorted
        00 fcbc
        00 52fa
        00 8a5a
        00 b630
        00 a572
        00 535d
        00 e7d9
        00 c987
        00 cb53
        00 574e

So here we have 2M lines and 256 distinct col1 values, for an average chunk size of 8K lines.
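
(A quick check of the distinct-key count; with 2M lines, all 256 two-hex-digit values are all but guaranteed to show up:)

        $ cut -d' ' -f1 partially-sorted | sort -u | wc -l
        256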

Now time a few different sort methods:

(1) sort col1, col2 (the original specification)

        $ time sort -k1,1 -k2,2 partially-sorted | md5sum
        cbba6a6e9c80ade9dcbe49d26a6d94ce  -

        real    0m1.363s
        user    0m4.817s
        sys     0m0.074s

(2) just sort the whole lines

        $ time sort partially-sorted | md5sum
        cbba6a6e9c80ade9dcbe49d26a6d94ce  -

        real    0m0.499s
        user    0m1.466s
        sys     0m0.071s

(3) your (g)awk program

        $ time ./kaz.gawk partially-sorted | md5sum
        cbba6a6e9c80ade9dcbe49d26a6d94ce  -

        real    0m7.583s
        user    0m7.597s
        sys     0m0.136s

(I include md5sum just to show the outputs are identical.)


(2) will produce output identical to (1) so long as the column separator character sorts before the non-column characters, which holds for this data (the separator is a space, and in the C locale a space sorts before every hex digit); whenever that assumption is valid, plain whole-line sort is a clear improvement. Sadly the "optimized" awk program (3) does not help at all. (It's significantly worse.)
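
(To illustrate the caveat, a tiny counterexample with made-up keys, in the C locale: here a key character '!' (0x21) sorts before the separator ':' (0x3a), so whole-line order and keyed order diverge:)

        $ printf 'a!:x\na:y\n' | LC_ALL=C sort -t: -k1,1 -k2,2
        a:y
        a!:x

        $ printf 'a!:x\na:y\n' | LC_ALL=C sort
        a!:x
        a:y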

...

For comparison, i've also done the same but with a higher partially-sorted ratio, to give more of an advantage to the partially-sorted optimization. Still 2M lines, but 64K distinct values for col1 (instead of 256), leaving an average chunk size of 32 lines (instead of 8K).

[In terms of sort algorithm complexity (number of comparisons), for optimized partially-sorted input this is 2.6x better than the case above that had only 256 distinct values. (Why it's exactly 2.6x is left as an exercise for the reader; see the hint below.) But there is other overhead involved too, so there is no expectation that the overall run time will improve by that much.]
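
(A hint for checking the arithmetic, assuming the per-chunk sorts dominate and comparisons scale as N * log2(chunk size):)

        $ awk 'BEGIN { print log(8192) / log(32) }'   # = log2(8192) / log2(32) = 13/5
        2.6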

        $ head -c $((6 * 1024**2)) /dev/urandom | xxd -p |
          fold -w6 | sed -r 's/(....)(..)/\1 \2/' > unsorted42

        $ sort -s -k1,1 unsorted42 > partially-sorted42

        $ head partially-sorted42
        0000 fd
        0000 3a
        0000 81
        0000 96
        0000 6a
        0000 94
        0000 88
        0000 92
        0000 89
        0000 e9

And the times:

        $ time sort -s -k1,1 -k2,2 partially-sorted42 | md5sum
        efb0303789a4e52c3ab27600fad1dac4  -

        real    0m0.777s
        user    0m2.499s
        sys     0m0.071s

        $ time sort partially-sorted42 | md5sum
        efb0303789a4e52c3ab27600fad1dac4  -

        real    0m0.378s
        user    0m1.015s
        sys     0m0.071s

        $ time ./kaz.gawk partially-sorted42 | md5sum
        71e7c99d2ef6e67e5ad29115e97872b0  -

        real    0m4.541s
        user    0m4.663s
        sys     0m0.081s


Times improve across the board, but with the same pattern as before.

An interesting observation is that the regular sort command itself gets faster on partially sorted input. (That is, without being told explicitly that the input is already sorted on col1.)
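
(Easy to check for yourself by shuffling the same data and timing the difference:)

        $ shuf partially-sorted42 > shuffled42
        $ time sort shuffled42 | md5sum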

That makes me even more skeptical about how much room there is left for an explicit optimization to improve things any further. (Though i'd still be curious to try it and find out, perhaps at the cost of some sleep.)


Anyway, happy friday!

Carl



PS: for the careful reader who notices that the checksum does not match in the final instance: Kaz, it appears there is a bug in your awk script. (You can probably reproduce it by setting up inputs like i have above.)

In particular, comparisons like '$1 != prev_1' will treat things as numeric where possible, meaning '001' == '1', '00e0' == '0' (scientific notation), and (surprise!) '00e0' == '00e6' (also scientific notation: the mantissa is zero in both, so both evaluate to 0).
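
(A one-liner to see the coercion, and the empty-string fix, in action:)

        $ echo '00e0 00e6' | gawk '{ print ($1 == $2), (($1 "") == ($2 "")) }'
        1 0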

As a result, your awk program treats everything from 00e0 through 00e9 as the same key, and sorts that entire range according to col2 alone.

You can avoid/fix this by writing '$1 ""' wherever you reference '$1' (concatenating the empty string forces interpretation as a string rather than a number); see the sketch below.
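
For what it's worth, here is a minimal gawk sketch of the chunk-at-a-time idea with that fix applied. (My own illustration, not your actual script; it leans on gawk's asorti() and assumes the input is already sorted on col1.)

        #!/usr/bin/gawk -f
        # sketch: buffer the lines for each col1 chunk, sort the chunk by
        # col2, print it, and move on to the next chunk.

        ($1 "") != prev { flush(); prev = $1 "" }  # "" forces string comparison
        { chunk[$2 ""] = chunk[$2 ""] $0 "\n" }    # group chunk lines by col2
        END { flush() }

        function flush(   keys, n, i) {
            n = asorti(chunk, keys, "@ind_str_asc")  # sort col2 keys as strings
            for (i = 1; i <= n; i++)
                printf "%s", chunk[keys[i]]
            delete chunk
        }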


