[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: how to speed up sort for partially sorted input?
From: |
Carl Edquist |
Subject: |
Re: how to speed up sort for partially sorted input? |
Date: |
Wed, 25 Aug 2021 08:09:26 -0500 (CDT) |
User-agent: |
Alpine 2.23.2 (DEB 496 2020-07-19) |
On Wed, 25 Aug 2021, arthur200126@gmail.com wrote:
The current sort algorithm used by coreutils sort(1) is a mergesort,
which unfortunately does not pick up cues on partially-sorted input. A
relatively well-known algorithm that picks up partial sort order is
timsort [...]
[1]: https://en.wikipedia.org/wiki/Timsort
If i read correctly, timsort takes advantage of input sections that are
already sorted ("runs"), by merge-sorting consecutive runs.
This is a bit different than the partially sorted input originally
described in this thread, where you are sorting on (in this case) column
1, then column 2, but it is known that the input is already sorted
according to column 1.
In the timsort case, if you have two consecutive runs, A and B, then you
have
i < j --> A[i] < A[j]
and
i < j --> B[i] < B[j]
but there is no guarantee that
A[i] < B[j]
On the other hand, the scenario for this thread is the opposite. If you
have two consecutive chuncks, A (column 1 = A), and B (column 1 = B), then
you always have
A[i] < B[j]
but within chunk A the lines are unsorted, as within chunk B.
...
So, while a timsort algorithm might be an interesting alternative, it
doesn't seem it will improve things for this particular case.
Carl