Re: Sort Files Merge
From: Leslie S Satenstein
Subject: Re: Sort Files Merge
Date: Fri, 26 Feb 2016 01:41:35 +0000 (UTC)
This is how I understand the logic.
In my mind, for large files, there is a requirement for a reserved sort area in
RAM, so that a block of key data (a chunk) can be sorted in RAM and the key
information then written to a work file at some offset. What goes into a chunk
is an array of sorted keys, each paired with some "input record offset"
information and perhaps "record length" information.
After creating the chunks, suppose there is enough RAM to support an "n-way"
merge (n = 2, 4, 8, 16...). If the total number of chunks is n or fewer, all
that is necessary is to merge the chunks through a routine that reads the
original input file's records and writes each to the output file, after which
the sort is complete.
The RAM sort routine may look for long chains of in-sequence records;
therefore, chunks do not all have to be the same size.
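For instance, a pre-pass could detect maximal runs that are already in order
and emit each run as its own chunk. A minimal sketch of the idea (mine, not
sort's code):

    #include <stdio.h>

    /* Emit the boundaries of maximal non-decreasing runs in a[0..n-1].
       A run that is already in order can be written out as a chunk
       as-is, so chunks end up with varying sizes.  Illustrative only. */
    static void find_runs(const int *a, size_t n)
    {
        size_t start = 0;
        for (size_t i = 1; i <= n; i++) {
            if (i == n || a[i] < a[i - 1]) {      /* run ends here */
                printf("run: [%zu, %zu) length %zu\n", start, i, i - start);
                start = i;
            }
        }
    }

    int main(void)
    {
        int a[] = {1, 3, 7, 2, 4, 9, 9, 5};
        find_runs(a, sizeof a / sizeof a[0]);   /* runs of length 3, 4, 1 */
        return 0;
    }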
After the collection of chunks has been built up, the second step, merging,
occurs.
When there are more than n chunks, the idea would be to do an up-to-n-way
merge of the smallest chunks to the next level. One consideration is that the
repeated n-way merge steps should, along the way, merge just enough chunks so
that the final pass is a full n-way merge.
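The arithmetic behind that: merging m chunks replaces them with 1, reducing
the chunk count by m - 1, and a full n-way merge reduces it by n - 1. So one
short merge of ((chunks - n) mod (n - 1)) + 1 chunks somewhere along the way
lets every other pass, including the final one, be a full n-way merge. A small
sketch of that calculation (a standard external-sort trick; I am not claiming
this is exactly what sort does):

    #include <stdio.h>

    /* Size of the one short merge needed so that every other pass is a
       full n-way merge and the final pass merges exactly n chunks.
       Assumes n >= 2.  A result of 1 means no short merge is needed. */
    static unsigned first_merge_size(unsigned chunks, unsigned n)
    {
        if (chunks <= n)
            return chunks;              /* one merge finishes the job */
        return (chunks - n) % (n - 1) + 1;
    }

    int main(void)
    {
        printf("%u\n", first_merge_size(31, 8));   /* prints 3 */
        return 0;
    }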
Eventually there will be n or fewer large chunks, and these are merged into a
"transfer to output" function. The transfer-to-output (TtoO) function reads
the winning sort-key entry and copies the corresponding input record to the
output file.
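Here is a minimal, self-contained sketch of that selection loop in C. It
merges in-memory integer runs instead of file-backed chunks of keys, and uses
a linear scan where a real n-way merge would use a priority queue; all names
are mine, not sort's:

    #include <stdio.h>

    /* One sorted run: a pointer into its data and how much is left. */
    struct run { const int *p; size_t left; };

    /* Repeatedly pick the run with the smallest front element and
       "transfer it to output" (here: print it). */
    static void merge_runs(struct run *runs, size_t nruns)
    {
        for (;;) {
            size_t win = nruns;               /* index of winning run */
            for (size_t i = 0; i < nruns; i++)
                if (runs[i].left &&
                    (win == nruns || *runs[i].p < *runs[win].p))
                    win = i;
            if (win == nruns)
                break;                        /* all runs exhausted */
            printf("%d\n", *runs[win].p);     /* copy winner to output */
            runs[win].p++;
            runs[win].left--;
        }
    }

    int main(void)
    {
        int a[] = {1, 4, 9}, b[] = {2, 3, 8}, c[] = {5, 6, 7};
        struct run runs[] = {{a, 3}, {b, 3}, {c, 3}};
        merge_runs(runs, 3);                  /* prints 1 through 9 */
        return 0;
    }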
Here is an example I contrived. A large file is read, sort keys are created,
and in the process 31 chunks are written to a work file. An 8-way merge is
possible. After RAM-sorting each chunk, there are 31 chunks of equal size in
one file (I don't think sort uses multiple work files).
Step 1: merge 8 chunks, yielding 23 remaining + 1 new large chunk (23 small
and 1 which is 8x larger than a small chunk).
Step 2: merge 8 of the smallest chunks, yielding 15 remaining + 2 large
chunks (one additional large chunk).
Step 3: merge 8 of the smallest chunks, yielding 7 remaining + 3 large chunks.
Step 4: merge 3 of the small chunks, yielding 4 remaining + 4 large chunks.
Output: merge the 4 small chunks and 4 large chunks to the output file.
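The same schedule can be replayed mechanically. This toy program (my sketch,
counts only, ignoring chunk sizes) assumes each pass merges up to 8 chunks
while leaving enough for the final pass to be a full 8-way merge:

    #include <stdio.h>

    int main(void)
    {
        unsigned count = 31, n = 8, step = 1;
        while (count > n) {
            /* Merge m chunks, but never drop below n chunks before
               the final pass. */
            unsigned m = count - n + 1 < n ? count - n + 1 : n;
            count -= m - 1;
            printf("step %u: merge %u chunks, %u chunks remain\n",
                   step++, m, count);
        }
        printf("final: merge the last %u chunks to the output file\n",
               count);
        return 0;
    }

It prints passes of 8, 8, 8, and 3 chunks followed by a final 8-way merge,
matching the steps above (the program counts total chunks, so "24 remain"
corresponds to 23 small plus 1 large).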
The records in the input file are never moved into the chunks; only the key
values and the record location for each key are.
For records with duplicate keys, the record offset within the input file
determines which of the duplicates is first and which is second.
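A comparison function along these lines would keep duplicates stable. It
reuses the hypothetical key_entry layout sketched earlier and is suitable for
qsort; none of this is sort's actual code:

    #include <stddef.h>
    #include <string.h>

    struct key_entry {
        char   key[32];
        size_t rec_offset;
        size_t rec_length;
    };

    /* Tie-break equal keys by input-file offset so the sort is stable:
       the record that appeared first in the input wins. */
    static int compare_entries(const void *va, const void *vb)
    {
        const struct key_entry *a = va, *b = vb;
        int c = memcmp(a->key, b->key, sizeof a->key);
        if (c != 0)
            return c;
        /* Equal keys: the earlier offset sorts first. */
        return (a->rec_offset > b->rec_offset)
             - (a->rec_offset < b->rec_offset);
    }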
Regards
Leslie
Mr. Leslie Satenstein
Montréal Québec, Canada
From: Gil Shomron <address@hidden>
To: address@hidden
Sent: Thursday, February 25, 2016 5:29 PM
Subject: Sort Files Merge
Hi all,
I've been investigating sort's behavior when sorting big files (larger than
RAM). I'm looking at the system calls and the actual read- and write-related
instructions the application uses.
For sort to manage large files, my guess is that it implements some kind of
external merge sort algorithm. And indeed, I can see sort loading chunks of
the big file, sorting them, and saving them back to disk once sorted.
But looking a bit further there are two things I do not understand:
1) There are plenty of reads into different buffers of different sizes
(49152, 36864, 28672, for example) - why? I mean, why doesn't it just read
the whole chunk at once and then merge-sort it?
2) After splitting and sorting each chunk, sort needs to merge them into
one sorted list. When I examine the instructions, it seems like sort does
A LOT of accesses to those buffers during the merge, almost at the same
intensity as when sorting each chunk - why? Doesn't it just need to look
at the beginning of each chunk, check which has the smallest value, put it
in an output buffer, and then write it to the final output file?
I've been digging in the code, but unfortunately didn't find my answers
there yet.
I would really appreciate your help in understanding how sort is really
implemented.
Thanks!
Gil Shomron.