coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: how to speed up sort for partially sorted input?


From: Peng Yu
Subject: Re: how to speed up sort for partially sorted input?
Date: Wed, 11 Aug 2021 07:03:29 -0500

On Wed, Aug 11, 2021 at 5:29 AM Carl Edquist <edquist@cs.wisc.edu> wrote:
>
> On Tue, 10 Aug 2021, Kaz Kylheku (Coreutils) wrote:
>
> > On 2021-08-07 17:46, Peng Yu wrote:
> >>  Hi,
> >>
> >>  Suppose that I want to sort an input by column 1 and column 2 (column
> >>  1 is of a higher priority than column 2). The input is already sorted
> >>  by column1.
> >>
> >>  Is there a way to speed up the sort (compared with not knowing column
> >>  1 is already sorted)? Thanks.
> >
> > Since you know that colum 1 is sorted, it means that a sequential scan
> > of the data will reveal chunks that have the same colum1 value.
> >
> > You just have to read and separate these chunks, and sort each one
> > individually by column 2.
>
> Neat observation.
>
> You could do that tersely in awk by piping each chunk to a separate sort
> process, like:
>
>         awk '
>             c1 != $1 { close(sort); c1 = $1 }
>             { print | sort }
>         ' sort="sort -k2,2" partially-sorted-input.txt
>
> In theory, that would bring the sorting work down from ~ O(n * log(n)) to
> ~ O(n * log(n/m)) (for a partially-sorted file with n lines and m
> column-1 chunks of equal size).
>
> But the overhead of starting a new sort process for each chunk is likely
> going to outweigh that advantage.  In the end, just sorting the whole file
> at once (despite column 1 already being sorted) is still likely to be
> faster.
>
> (With just a bit more work, you can do all your sorting in a single awk
> process too (without piping out to sort), but i think you'll still be
> disappointed with the performance compared to a single sort command.)

Yes, this involves many calls of the coreuils' sort, which is not
efficient. Would it make sense to add an option in sort so that sort
can sort a partially sorted input in one shot.

-- 
Regards,
Peng



reply via email to

[Prev in Thread] Current Thread [Next in Thread]