Re: Sort Files Merge

coreutils

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Sort Files Merge

From:	Leslie S Satenstein
Subject:	Re: Sort Files Merge
Date:	Fri, 26 Feb 2016 01:41:35 +0000 (UTC)

 This is how I understand the logic.
In my mind, for large files, there is a requirement for a reserved area in ram, 
a sort area, so that a block of key data (chunk) can be sorted in ram and then 
the key information written to a file at some offset.  What goes into the chunk 
is an array of sorted keys along with some "input record offset" information 
and perhaps "record length" information along with the sorted key.  

After creating the chunks.  suppose there is ram to support a "n way" merge. (n 
= 2,4,8,16...)  If the total number of chunks is less than n, all that is 
necessary is to merge the n chunks to the a routine that reads the original 
input file records and writes each to the output file. after which the sort is 
complete.   

The ram sort routine may look for long chains of insequence records. therefore, 
chunks do not all have to be the same size.

After building up the collection of chunks, the second merge step occurs.
When there are more than n chunks, the idea would be to try and do a up-to "n 
way" merge of the smallest chunks to the next level.  A consideration is that 
the repeated n way merge steps  may, on the final merge operation, just merge 
enough chunks so that the final run is a n way merge.
Eventually there will be n or fewer large chunks, that are merged to an a 
"transfer to output" function. The TtoO function reads the winning sortkey 
record, and copies the input record to the output file.
Here is an example I contrived. A large file is read, sort keys are created and 
in the process 31 chunks are written to a work file.
a 8 way merge is possible,  After ram sorting each chunk there are 31 chunks of 
equal size in a file (I don't think sort uses multiple files)..
step 1   merge 8 chunks yielding 23 remaining + 1 new large chunk  (23 small 
and 1 which is 8x larger than the small chunk)
step 2  merge 8 of the smallest chunks yielding 15 remaining + 2 new large 
chunks  (one additional large chunk)
step 3 merge 8 of the smallest chunks yielding   7 remaining   +3 large 
chunksstep 4 merge 3 of the small chunks       yielding  4  remaining +4 large 
chunksOutput merge 4 chunks and 4 large chunks to the output file.

The records in the input file are never moved to chunks, but only the key 
values and the record location of the corresponding key. 

For records with duplicate keys, the record offset within the input file 
determines which of the duplicates is first and which is second.

Regards 
 Leslie
 Mr. Leslie Satenstein
Montréal Québec, Canada



 
      From: Gil Shomron <address@hidden>
 To: address@hidden 
 Sent: Thursday, February 25, 2016 5:29 PM
 Subject: Sort Files Merge
   
Hi all,

I've been investigating sort functionality when sorting big files 
(larger than RAM size). I'm looking at the system calls and the actual 
read and write related instructions the application uses.
For sort to manage large files, my guess is, it implements some kind of 
external merge sort algorithm. And indeed, I can see sort load chunks 
from the big file, sorts them, and saves them back to disk when sorted.

But looking a bit further there are two things I do not understand:
1) There are plenty of reads to different buffers of different sizes 
(49152, 36864, 28672 for example) - why? I mean, why it doesn't just 
read the whole chunk once and then merge sort it?
2) After splitting and sorting each chunk, sort needs to merge them to 
one sorted list. When I examine the instructions it seems like sort does 
A LOT of accesses to those buffers in order to sort them, almost at the 
same intensity of merging each chunk - why? doesn't it just need to look 
at the beginning of each file, check where is the smallest value, put it 
in an output buffer and then write it to the final output file?

I've been digging in the code, but unfortunately didn't find my answers 
there yet.

I will really appreciate your help with helping me understand how sort 
is really implemented.

Thanks!
    Gil Shomron.

[Prev in Thread]

Current Thread

[Next in Thread]

Sort Files Merge, Gil Shomron, 2016/02/25
- Re: Sort Files Merge, Leslie S Satenstein <=

Prev by Date: Sort Files Merge
Next by Date: Possible enhancement: file count
Previous by thread: Sort Files Merge
Next by thread: Possible enhancement: file count
Index(es):
- Date
- Thread