coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Shuf reservoir sampling


From: Assaf Gordon
Subject: Re: Shuf reservoir sampling
Date: Thu, 23 Jan 2014 12:38:43 -0500
User-agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0

Hello,

(reviving an old thread, sorry for the delayed response).

>> On 12/28/2013 03:36 PM, Jyothis V wrote:
<...>
Hi, thanks for the reply. I understand why something like reservoir
sampling is needed. But in shuf.c, shuffling is done in two steps:
1) using reservoir sampling, an array of length head_length is obtained.
At this stage, the array is not completely shuffled because the
first head_length elements were copied verbatim. 2) This array is
then randomly permuted while writing. My question is whether these
two steps could be clubbed together, just as shown in the second
algorithm given in the wikipedia page you mentioned.

I didn't have a look into the Maths behind yet, nor was I involved
during that last improvement. Further improvement is maybe possible,
and the best way to push this is providing code. Are you willing
to propose such a patch?


Regarding the shuffle correctness:
Yes, the data is first read into the array, and only later permuted.
I believe the implementation is correct (ie it does randomly shuffles the input), and if this is not the case, it's a bug and should be fixed.

Regarding the implementation:
In "shuf.c" there's an intricate interplay between reading the input and writing the output - notice that the input is closed explicitly half-way through "main()", before any output is ever written.

The initial patch was written to maintain this behavior, and minimally disrupt the existing code flow:
http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=20d7bce0f7e57d9a98f0ee811e31c757e9fedfff

This is not to say a better implementation is not possible, just that there are few technical details to note before changing 'shuf'.

There are certainly ways to improve the code.

HTH,
 -Gordon




reply via email to

[Prev in Thread] Current Thread [Next in Thread]