coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: how to speed up sort for partially sorted input?


From: Carl Edquist
Subject: Re: how to speed up sort for partially sorted input?
Date: Wed, 11 Aug 2021 05:29:08 -0500 (CDT)
User-agent: Alpine 2.23.2 (DEB 496 2020-07-19)

On Tue, 10 Aug 2021, Kaz Kylheku (Coreutils) wrote:

On 2021-08-07 17:46, Peng Yu wrote:
 Hi,

 Suppose that I want to sort an input by column 1 and column 2 (column
 1 is of a higher priority than column 2). The input is already sorted
 by column1.

 Is there a way to speed up the sort (compared with not knowing column
 1 is already sorted)? Thanks.

Since you know that colum 1 is sorted, it means that a sequential scan of the data will reveal chunks that have the same colum1 value.

You just have to read and separate these chunks, and sort each one individually by column 2.

Neat observation.

You could do that tersely in awk by piping each chunk to a separate sort process, like:

        awk '
            c1 != $1 { close(sort); c1 = $1 }
            { print | sort }
        ' sort="sort -k2,2" partially-sorted-input.txt

In theory, that would bring the sorting work down from ~ O(n * log(n)) to ~ O(n * log(n/m)) (for a partially-sorted file with n lines and m
column-1 chunks of equal size).

But the overhead of starting a new sort process for each chunk is likely going to outweigh that advantage. In the end, just sorting the whole file at once (despite column 1 already being sorted) is still likely to be faster.

(With just a bit more work, you can do all your sorting in a single awk process too (without piping out to sort), but i think you'll still be disappointed with the performance compared to a single sort command.)


Carl



reply via email to

[Prev in Thread] Current Thread [Next in Thread]