coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: how to speed up sort for partially sorted input?


From: Kaz Kylheku (Coreutils)
Subject: Re: how to speed up sort for partially sorted input?
Date: Wed, 11 Aug 2021 11:43:40 -0700
User-agent: Roundcube Webmail/0.9.2

On 2021-08-11 05:03, Peng Yu wrote:
On Wed, Aug 11, 2021 at 5:29 AM Carl Edquist <edquist@cs.wisc.edu> wrote:
(With just a bit more work, you can do all your sorting in a single awk
process too (without piping out to sort), but i think you'll still be
disappointed with the performance compared to a single sort command.)

Yes, this involves many calls of the coreuils' sort, which is not

No, not this last remark, which is about "in a single awk process".

efficient. Would it make sense to add an option in sort so that sort
can sort a partially sorted input in one shot.

IF you're willing to use GNU Coreutils instead of Unix, you probably have
GNU Awk also. GNU Awk has a sorting function using which a solution
could be cobbed together. Maybe something like:


function dump_delete_data()
{
   n = asorti(data, idx);
   for (i = 1; i <= n; i++)
     print data[idx[i]];
   delete data
   serial = 0
}

BEGIN           { serial = 0 }
$1 != prev_1    { dump_delete_data() }
NF >= 2         { prev_1 = $1
                  data[$2 "." serial++] = $0
                  next }
1               { dump_delete_data()
                  print }
END             { dump_delete_data() }


The asorti function has some features behind it to sort in various ways;
you have to look into that. It involves manipulating a PROCINFO["sorted_in"]
value.

It's possible to use a custom comparison function.

For more info, see GNU Awk documentation, the Gawk mailing list or
the comp.lang.awk newsgroup.

The purpose of the serial variable in my above code so that we get
two entries in data[] if in a given group, there are identical $2 values.

For instance if $2 is "foo", then the key we use is actually "foo.3" if
the current value of serial is 3. The sorting is then done on these
suffixed keys, which works okay for lexicographic sorting.

It is not a stable sort, though! Because foo.123 will be sorted before
foo.23, even though the 123 serial value comes later. If we padded the
integer with enough leading zeros for the largest possible group, it
would then be stable: foo.00023 would come before foo.00123:

   data[sprintf("%s.%08X", $2, serial++)] = $0

kind of thing.

If you don't care about reproducing duplicates, you can remove this logic
entirely.

How the overall program works is that data[] is an array indexed on the
second column values (plus serial suffixes). The value of each index
value is the entire record, $0.

asorti sorts the $2 indices, throwing away the $0 values, which
is why we direct it into a secondary array called idx, preserving
the data array. The idx array ends up indexed on integer values 1 to N,
where N is the chunk size. If we iterate over these values, idx[i]
gives us the $2 column values (with serial suffix) in sorted order.
We can then use that as the key into data[] to get the corresponding
records in sorted order.

Cheers ...




reply via email to

[Prev in Thread] Current Thread [Next in Thread]