[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Shuf reservoir sampling
From: |
Pádraig Brady |
Subject: |
Re: Shuf reservoir sampling |
Date: |
Sun, 29 Dec 2013 13:09:58 +0000 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2 |
On 12/28/2013 05:13 PM, Jyothis V wrote:
> Hi, thanks for the reply. I understand why something like reservoir sampling
> is needed. But in shuf.c, shuffling is done in two steps: 1) using reservoir
> sampling, an array of length head_length is obtained. At this stage, the
> array is not completely shuffled because the first head_length elements were
> copied verbatim. 2) This array is then randomly permuted while writing. My
> question is whether these two steps could be clubbed together, just as shown
> in the second algorithm given in the wikipedia page you mentioned.
Yes there are more improvements possible in this area.
Also we should avoid the more CPU intensive reservoir sampling
in the common case of reading small inputs from stdin:
http://lists.gnu.org/archive/html/coreutils/2013-03/msg00057.html
http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/shuf.c;h=456140#l272
Also reservoir sampling is currently not used with --repeat.
But it does seem possible to use reservoir-sampling with replacement:
https://www.siam.org/proceedings/datamining/2004/dm04_053parkb.pdf
thanks,
Pádraig.