coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: how to speed up sort for partially sorted input?


From: Peng Yu
Subject: Re: how to speed up sort for partially sorted input?
Date: Wed, 11 Aug 2021 13:58:59 -0500

On Wed, Aug 11, 2021 at 1:43 PM Kaz Kylheku (Coreutils)
<962-396-1872@kylheku.com> wrote:
>
> On 2021-08-11 05:03, Peng Yu wrote:
> > On Wed, Aug 11, 2021 at 5:29 AM Carl Edquist <edquist@cs.wisc.edu>
> > wrote:
> >> (With just a bit more work, you can do all your sorting in a single
> >> awk
> >> process too (without piping out to sort), but i think you'll still be
> >> disappointed with the performance compared to a single sort command.)
> >
> > Yes, this involves many calls of the coreuils' sort, which is not
>
> No, not this last remark, which is about "in a single awk process".

I know there is one awk process. I don't understand why you mentioned it.

> > efficient. Would it make sense to add an option in sort so that sort
> > can sort a partially sorted input in one shot.
>
> IF you're willing to use GNU Coreutils instead of Unix, you probably
> have

I don't think using awk is efficient. I am program a number awk
programs for simple transforming the input and tested it, in general,
it is slower than the equivalent python code, let along C code.

You can talk about doing most of the work in awk below. I don't think
that make sense. Having coreutils' sort be able to do a partial sort
is a more reasonable solution.

> GNU Awk also. GNU Awk has a sorting function using which a solution
> could be cobbed together. Maybe something like:
>
>
> function dump_delete_data()
> {
>     n = asorti(data, idx);
>     for (i = 1; i <= n; i++)
>       print data[idx[i]];
>     delete data
>     serial = 0
> }
>
> BEGIN           { serial = 0 }
> $1 != prev_1    { dump_delete_data() }
> NF >= 2         { prev_1 = $1
>                    data[$2 "." serial++] = $0
>                    next }
> 1               { dump_delete_data()
>                    print }
> END             { dump_delete_data() }
>
>
> The asorti function has some features behind it to sort in various ways;
> you have to look into that. It involves manipulating a
> PROCINFO["sorted_in"]
> value.
>
> It's possible to use a custom comparison function.
>
> For more info, see GNU Awk documentation, the Gawk mailing list or
> the comp.lang.awk newsgroup.
>
> The purpose of the serial variable in my above code so that we get
> two entries in data[] if in a given group, there are identical $2
> values.
>
> For instance if $2 is "foo", then the key we use is actually "foo.3" if
> the current value of serial is 3. The sorting is then done on these
> suffixed keys, which works okay for lexicographic sorting.
>
> It is not a stable sort, though! Because foo.123 will be sorted before
> foo.23, even though the 123 serial value comes later. If we padded the
> integer with enough leading zeros for the largest possible group, it
> would then be stable: foo.00023 would come before foo.00123:
>
>     data[sprintf("%s.%08X", $2, serial++)] = $0
>
> kind of thing.
>
> If you don't care about reproducing duplicates, you can remove this
> logic
> entirely.
>
> How the overall program works is that data[] is an array indexed on the
> second column values (plus serial suffixes). The value of each index
> value is the entire record, $0.
>
> asorti sorts the $2 indices, throwing away the $0 values, which
> is why we direct it into a secondary array called idx, preserving
> the data array. The idx array ends up indexed on integer values 1 to N,
> where N is the chunk size. If we iterate over these values, idx[i]
> gives us the $2 column values (with serial suffix) in sorted order.
> We can then use that as the key into data[] to get the corresponding
> records in sorted order.
>
> Cheers ...
>


-- 
Regards,
Peng



reply via email to

[Prev in Thread] Current Thread [Next in Thread]