bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

fix for inefficient merges in sort.c


From: Lemley James - jlemle
Subject: fix for inefficient merges in sort.c
Date: Thu, 10 Feb 2005 10:57:45 -0600

Hi coreutils hackers, 

 

I see this topic has come up on the list before (searching for NMERGE in the
archives), but in coreutils 5.3.0 sort.c, temporary files are still merged
in a linear-search fashion.  Admittedly I am new to this list and haven't
read every note, but I didn't find a reason for this...  If a patch like
this has been rejected in the past, I am interested to know the reason.

 

I have done quite a lot of testing with sort.c and large files, and the
linear-search merge code can become a cpu bottleneck if NMERGE is turned up
to avoid extra merge passes.  However, for the case where there are a lot of
duplicate keys or the file is already sorted, the next key is very often
found in the same temp file, so a linear search is fastest, and implementing
a binary search for every case would slow down sort on mostly-already-sorted
input (it happens all the time).

 

I've coded a patch to sort.c from coreutils 5.3.0 for your careful
consideration (and my use ;) to perform a binary search in the merge pass,
only if the key was not found in the first merge temp file.  This removes
the CPU bottleneck for large NMERGE values (even 1024 is OK!) and doesn't
have much of a performance impact on NMERGE=16, which is the default... but
even at 16, it's quite better at merging temp-files of random keys and not
more than a percent worse on already-sorted input files, due to the extra
check to see if this is the first file or not.  The case where the input is
already sorted, or there are lots of dupe keys, performs about the same
because the binary search is not performed if the key is found in the first
merge file.  

 

The upshot is you can now sort that 100GB file quite a lot faster with only
one merge pass, if you recompile with large enough NMERGE.  Rah! 

 

Attached is my patch, which I think you will find performs quite well on
random keys and about the same on an already sorted files, and doesn't
affect in-memory sorts at all.  If this is of interest, FSF may have
copyright to this patch in return for a much-coveted line in the THANKS
file. 

 

Thanks very much, 

James Lemley

 

 

 

 

 



**********************************************************************
The information contained in this communication is
confidential, is intended only for the use of the recipient
named above, and may be legally privileged.
If the reader of this message is not the intended
recipient, you are hereby notified that any dissemination, 
distribution, or copying of this communication is strictly
prohibited.
If you have received this communication in error,
please re-send this communication to the sender and
delete the original message or any copy of it from your
computer system. Thank You.

Attachment: sort-5.3.0_merge.patch
Description: Binary data


reply via email to

[Prev in Thread] Current Thread [Next in Thread]