[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: how to speed up sort for partially sorted input?
From: |
Carl Edquist |
Subject: |
Re: how to speed up sort for partially sorted input? |
Date: |
Wed, 11 Aug 2021 05:29:08 -0500 (CDT) |
User-agent: |
Alpine 2.23.2 (DEB 496 2020-07-19) |
On Tue, 10 Aug 2021, Kaz Kylheku (Coreutils) wrote:
On 2021-08-07 17:46, Peng Yu wrote:
Hi,
Suppose that I want to sort an input by column 1 and column 2 (column
1 is of a higher priority than column 2). The input is already sorted
by column1.
Is there a way to speed up the sort (compared with not knowing column
1 is already sorted)? Thanks.
Since you know that colum 1 is sorted, it means that a sequential scan
of the data will reveal chunks that have the same colum1 value.
You just have to read and separate these chunks, and sort each one
individually by column 2.
Neat observation.
You could do that tersely in awk by piping each chunk to a separate sort
process, like:
awk '
c1 != $1 { close(sort); c1 = $1 }
{ print | sort }
' sort="sort -k2,2" partially-sorted-input.txt
In theory, that would bring the sorting work down from ~ O(n * log(n)) to
~ O(n * log(n/m)) (for a partially-sorted file with n lines and m
column-1 chunks of equal size).
But the overhead of starting a new sort process for each chunk is likely
going to outweigh that advantage. In the end, just sorting the whole file
at once (despite column 1 already being sorted) is still likely to be
faster.
(With just a bit more work, you can do all your sorting in a single awk
process too (without piping out to sort), but i think you'll still be
disappointed with the performance compared to a single sort command.)
Carl