Re: how to speed up sort for partially sorted input?
From: Carl Edquist
Subject: Re: how to speed up sort for partially sorted input?
Date: Fri, 13 Aug 2021 05:24:27 -0500 (CDT)
User-agent: Alpine 2.23.2 (DEB 496 2020-07-19)
On Thu, 12 Aug 2021, Kaz Kylheku (Coreutils) wrote:
> The solution doesn't exist today, whereas that Gawk program should run
> even in ten year old installations.
>
> For the solution to be useful, it only has to beat the actual sort
> which you have available today.
Hi Kaz,
The thing is, it seems regular ("unoptimized") sort is still a lot faster
than an "optimized" awk-based solution.
For instance, i'll generate 2M partially-sorted lines:
$ head -c $((6 * 1024**2)) /dev/urandom | xxd -p |
fold -w6 | sed -r 's/(..)(....)/\1 \2/' > unsorted
$ sort -s -k1,1 unsorted > partially-sorted
$ wc -l partially-sorted
2097152 partially-sorted
$ head partially-sorted
00 fcbc
00 52fa
00 8a5a
00 b630
00 a572
00 535d
00 e7d9
00 c987
00 cb53
00 574e
So here we have 2M lines and 256 distinct values for col1, for an average
chunk size of 8K lines.
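(A quick sanity check of that arithmetic: two hex digits give 16^2 = 256
distinct keys, and 2M lines spread evenly over them is 8K lines per key.)

```shell
# 2 hex digits in col1 -> 16^2 = 256 distinct keys;
# 2097152 lines / 256 keys = 8192 lines per chunk on average
echo $(( 16**2 )) $(( 2097152 / 256 ))
```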
Now time a few different sort methods:
(1) sort col1, col2 (the original specification)
$ time sort -k1,1 -k2,2 partially-sorted | md5sum
cbba6a6e9c80ade9dcbe49d26a6d94ce -
real 0m1.363s
user 0m4.817s
sys 0m0.074s
(2) just sort the whole lines
$ time sort partially-sorted | md5sum
cbba6a6e9c80ade9dcbe49d26a6d94ce -
real 0m0.499s
user 0m1.466s
sys 0m0.071s
(3) your (g)awk program
$ time ./kaz.gawk partially-sorted | md5sum
cbba6a6e9c80ade9dcbe49d26a6d94ce -
real 0m7.583s
user 0m7.597s
sys 0m0.136s
(I include md5sum just to show the outputs are identical.)
(2) produces output identical to (1) as long as the column separator
character sorts before all non-separator characters, so it's a clear
improvement whenever that assumption holds. Sadly, the "optimized" awk
program (3) does not help at all; it's significantly worse.
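The separator assumption behind (2) can be illustrated with a toy pair of
lines (LC_ALL=C is forced here so plain byte order applies; in this data
the space (0x20) sorts before the hex digits [0-9a-f], so whole-line byte
comparison agrees with the two-key comparison):

```shell
# '0' is a prefix of '0a', so the keyed sort puts '0 ab' first;
# byte-wise, ' ' < 'a', so the whole-line sort agrees.
printf '%s\n' '0a b' '0 ab' | LC_ALL=C sort
printf '%s\n' '0a b' '0 ab' | LC_ALL=C sort -k1,1 -k2,2
```

Both commands print the lines in the same order.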
...
For comparison, i've also run the same test with a higher partially-sorted
ratio, giving more advantage to the partially-sorted optimization.
Still 2M lines, but 64K distinct values for col1 (instead of 256), leaving
an average chunk size of 32 lines (instead of 8K).
[In terms of sort algorithm complexity (number of comparisons), for
optimized partially-sorted input this is 2.6x better than the above case
that only had 256 distinct values. (Left as an exercise to the reader is
why it's exactly 2.6x.) But there is other overhead involved also, so
there is no expectation that the overall run time will improve that much.]
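(One way to arrive at that figure, assuming a comparison sort costs about
log2(k) comparisons per line for a k-line chunk: the per-line cost drops
from log2(8192) = 13 to log2(32) = 5, and 13/5 = 2.6.)

```shell
# per-line comparison cost for 8K-line chunks vs 32-line chunks,
# and their ratio
awk 'BEGIN { print log(8192)/log(2), log(32)/log(2), log(8192)/log(32) }'
```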
$ head -c $((6 * 1024**2)) /dev/urandom | xxd -p |
fold -w6 | sed -r 's/(....)(..)/\1 \2/' > unsorted42
$ sort -s -k1,1 unsorted42 > partially-sorted42
$ head partially-sorted42
0000 fd
0000 3a
0000 81
0000 96
0000 6a
0000 94
0000 88
0000 92
0000 89
0000 e9
And the times:
$ time sort -s -k1,1 -k2,2 partially-sorted42 | md5sum
efb0303789a4e52c3ab27600fad1dac4 -
real 0m0.777s
user 0m2.499s
sys 0m0.071s
$ time sort partially-sorted42 | md5sum
efb0303789a4e52c3ab27600fad1dac4 -
real 0m0.378s
user 0m1.015s
sys 0m0.071s
$ time ./kaz.awk partially-sorted42 | md5sum
71e7c99d2ef6e67e5ad29115e97872b0 -
real 0m4.541s
user 0m4.663s
sys 0m0.081s
Times improve across the board, but with the same pattern as before.
An interesting observation is that the regular sort command already shows
improved performance on partially sorted input, without being told
explicitly that it's already sorted on col1.

That makes me even more skeptical about how much room is left for an
explicit optimization to improve things further. (Though i'd still be
curious to try it and find out, perhaps at the cost of some sleep.)
Anyway, happy friday!
Carl
PS: for the careful reader who notices that the checksum does not match in
the final instance: Kaz, it appears there is a bug in your awk script.
(You can probably reproduce it by setting up inputs like i have above.)
In particular, comparisons like '$1 != prev_1' will treat things as
numeric where possible, meaning '001' == '1', '00e0' == '0' (scientific
notation), and (surprise!) '00e0' == '00e6' (also scientific notation).
As a result, your awk program treats everything from 00e0 through 00e9 as
the same key, and sorts that entire range according to col2 alone.
You can avoid/fix this by putting '$1 ""' wherever you reference '$1' (to
force interpretation as a string rather than numeric).
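A one-liner demonstration of both the bug and the fix (the field values
here are taken from the examples above):

```shell
# awk compares two fields numerically when both look like numbers, and
# '00e0' .. '00e9' all parse as 0 in scientific notation -- so
# '00e0' == '00e6' is true; '00f0' is not numeric, so that comparison
# falls back to strings and is correctly false:
echo '00e0 00e6 00f0' | awk '{ print ($1 == $2), ($1 == $3) }'

# appending "" forces string comparison, so the first test is now false:
echo '00e0 00e6' | awk '{ print (($1 "") == ($2 "")) }'
```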