bug#40533: GNU sort could be faster on machines with more cores

bug-coreutils
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#40533: GNU sort could be faster on machines with more cores

From:	Ole Tange
Subject:	bug#40533: GNU sort could be faster on machines with more cores
Date:	Fri, 10 Apr 2020 13:11:31 +0200
$ sort --version
sort (GNU coreutils) 8.28

I am the author of GNU Parallel and my fetish is to try to run more
stuff in parallel. I recently sorted a 2.4 GBytes/100 M lines file:

  export LC_ALL=C
  time sort --parallel=48 bigfile >/dev/null

This takes 87 seconds on my 48 core machine. Everything is done in a
RAM disk. By adjusting 48 I find the optimal is --parallel=46: 80
seconds.

But it is possible to do better than that.

A big part of the final part of the sorting is a single tread. It is
likely merging the results of the other threads, but it leaves the
other 47 cores idle.

So I thought: Can we parallelize that more?

The reasoning is: If you have N-1 cores sitting idle, you really do
not care if you waste some CPU cycles, if you can get the result
faster.

So I tested.

Let us start by looking at an easier case: We do not have 1 big file,
but the bigfile is split into 48 files of similar sizes.

In this case we can do:

  sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort
file1) <(sort file2) ) <(sort -m <(sort file3) <(sort file4) ) )
<(sort -m <(sort -m <(sort file5) <(sort file6) ) <(sort -m <(sort
file7) <(sort file8) ) ) ) <(sort -m <(sort -m <(sort -m <(sort file9)
<(sort file10) ) <(sort -m <(sort file11) <(sort file12) ) ) <(sort -m
<(sort -m <(sort file13) <(sort file14) ) <(sort -m <(sort file15)
<(sort file16) ) ) ) ) <(sort -m <(sort -m <(sort -m <(sort -m <(sort
file17) <(sort file18) ) <(sort -m <(sort file19) <(sort file20) ) )
<(sort -m <(sort -m <(sort file21) <(sort file22) ) <(sort -m <(sort
file23) <(sort file24) ) ) ) <(sort -m <(sort -m <(sort -m <(sort
file25) <(sort file26) ) <(sort -m <(sort file27) <(sort file28) ) )
<(sort -m <(sort -m <(sort file29) <(sort file30) ) <(sort -m <(sort
file31) <(sort file32) ) ) ) ) ) <(sort -m <(sort -m <(sort -m <(sort
-m <(sort -m <(sort file33) <(sort file34) ) <(sort -m <(sort file35)
<(sort file36) ) ) <(sort -m <(sort -m <(sort file37) <(sort file38) )
<(sort -m <(sort file39) <(sort file40) ) ) ) <(sort -m <(sort -m
<(sort -m <(sort file41) <(sort file42) ) <(sort -m <(sort file43)
<(sort file44) ) ) <(sort -m <(sort -m <(sort file45) <(sort file46) )
<(sort -m <(sort file47) <(sort file48) ) ) ) ) )

Or indented:

  sort -m \
     <(sort -m \
            <(sort -m \
                   <(sort -m \
                          <(sort -m \
                                 <(sort -m <(sort file1) <(sort file2) )\
                                 <(sort -m <(sort file3) <(sort file4) ) )\
                          <(sort -m \
                                 <(sort -m <(sort file5) <(sort file6) )\
                                 <(sort -m <(sort file7) <(sort file8) ) ) )\
                   <(sort -m \
                          <(sort -m \
                                 <(sort -m <(sort file9) <(sort file10) )\
                                 <(sort -m <(sort file11) <(sort file12) ) )\
                          <(sort -m \
                                 <(sort -m <(sort file13) <(sort file14) )\
                                 <(sort -m <(sort file15) <(sort
file16) ) ) ) )\
            <(sort -m \
                   <(sort -m \
                          <(sort -m \
                                 <(sort -m <(sort file17) <(sort file18) )\
                                 <(sort -m <(sort file19) <(sort file20) ) )\
                          <(sort -m <(sort -m <(sort file21) <(sort file22) )\
                                 <(sort -m <(sort file23) <(sort file24) ) ) )\
                   <(sort -m \
                          <(sort -m \
                                 <(sort -m <(sort file25) <(sort file26) )\
                                 <(sort -m <(sort file27) <(sort file28) ) )\
                          <(sort -m \
                                 <(sort -m <(sort file29) <(sort file30) )\
                                 <(sort -m <(sort file31) <(sort
file32) ) ) ) ) ) \
     <(sort -m \
            <(sort -m \
                   <(sort -m \
                          <(sort -m \
                                 <(sort -m <(sort file33) <(sort file34) )\
                                 <(sort -m <(sort file35) <(sort file36) ) )\
                          <(sort -m <(sort -m <(sort file37) <(sort file38) )\
                                 <(sort -m <(sort file39) <(sort file40) ) ) )\
                   <(sort -m \
                          <(sort -m \
                                 <(sort -m <(sort file41) <(sort file42) )\
                                 <(sort -m <(sort file43) <(sort file44) ) )\
                          <(sort -m \
                                 <(sort -m <(sort file45) <(sort file46) )\
                                 <(sort -m <(sort file47) <(sort
file48) ) ) ) ) )

So WTF is going on here?

Here we do an initial sort of each of the 48 files. Then we do a merge
sort of a pair of those. Then we do a merge sort of a pair of those.
Then we do a merge sort of a pair of those. Then we do a merge sort of
a pair of those. Then we do a merge sort of a pair of those. And so
on, until we only have a single input. All of this is done in parallel
when possible.

This takes 60 seconds.

But a lot of time is waiting for pipes. This can be lowered by added a
buffer after each merge (mbuffer -m 30M):

  sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort
file1) <(sort file2) | mbuffer -m 30M ) <(sort -m <(sort file3) <(sort
file4) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m
<(sort file5) <(sort file6) | mbuffer -m 30M ) <(sort -m <(sort file7)
<(sort file8) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M )
<(sort -m <(sort -m <(sort -m <(sort file9) <(sort file10) | mbuffer
-m 30M ) <(sort -m <(sort file11) <(sort file12) | mbuffer -m 30M ) |
mbuffer -m 30M ) <(sort -m <(sort -m <(sort file13) <(sort file14) |
mbuffer -m 30M ) <(sort -m <(sort file15) <(sort file16) | mbuffer -m
30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort
-m <(sort -m <(sort -m <(sort -m <(sort file17) <(sort file18) |
mbuffer -m 30M ) <(sort -m <(sort file19) <(sort file20) | mbuffer -m
30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort file21) <(sort
file22) | mbuffer -m 30M ) <(sort -m <(sort file23) <(sort file24) |
mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m
<(sort -m <(sort -m <(sort file25) <(sort file26) | mbuffer -m 30M )
<(sort -m <(sort file27) <(sort file28) | mbuffer -m 30M ) | mbuffer
-m 30M ) <(sort -m <(sort -m <(sort file29) <(sort file30) | mbuffer
-m 30M ) <(sort -m <(sort file31) <(sort file32) | mbuffer -m 30M ) |
mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m
30M ) <(sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort file33)
<(sort file34) | mbuffer -m 30M ) <(sort -m <(sort file35) <(sort
file36) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m
<(sort file37) <(sort file38) | mbuffer -m 30M ) <(sort -m <(sort
file39) <(sort file40) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer
-m 30M ) <(sort -m <(sort -m <(sort -m <(sort file41) <(sort file42) |
mbuffer -m 30M ) <(sort -m <(sort file43) <(sort file44) | mbuffer -m
30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort file45) <(sort
file46) | mbuffer -m 30M ) <(sort -m <(sort file47) <(sort file48) |
mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m
30M ) | mbuffer -m 30M )

This takes 24 seconds.

This is including overhead of starting multiple sorts and mbuffers,
and moving everything through pipes.

If this was in a single process with a single address space, I imagine
this would be even faster: Instead of moving data around, it should be
possible to move only pointers around.

It may also be possible to optimize the merge sort when there are only
two inputs:

  i1 = read(input1)
  i2 = read(input2)
  while(input1 and input2) {
     while(i1 <= i2) {
       print i1
       i1 = read(input1)
     }
     while(i2 <= i1) {
       print i2
       i2 = read(input2)
     }
  }

But I cheated a little: I started with 48 evenly sized files.

I think this could easily be emulated by having a single reader thread
spread blocks of full lines in a round robin style to the 48 threads
that will do the initial sorting. Something like:

  while(read_block()) {
    find_last_newline();
    write_block_to_process(n%48);
    n++;
  }

If we are reading from a disk, then the reading may be slow, so if the
48 initial sorters can start sorting without having the full input,
then we will save some wall clock time.


/Ole
[Prev in Thread]
Current Thread
[Next in Thread]
bug#40533: GNU sort could be faster on machines with more cores, Ole Tange <=
Prev by Date: bug#40530: feature proposal: coreutils -> sort: adding sorting ability for Hebrew numerals
Next by Date: bug#40530: feature proposal: coreutils -> sort: adding sorting ability for Hebrew numerals
Previous by thread: bug#40530: feature proposal: coreutils -> sort: adding sorting ability for Hebrew numerals
Next by thread: bug#40540: Faster sort with locale
Index(es):
- Date
- Thread