Re: Shuf reservoir sampling

coreutils

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Shuf reservoir sampling

From:	Pádraig Brady
Subject:	Re: Shuf reservoir sampling
Date:	Sun, 29 Dec 2013 13:09:58 +0000
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2

On 12/28/2013 05:13 PM, Jyothis V wrote:
> Hi, thanks for the reply. I understand why something like reservoir sampling 
> is needed. But in shuf.c, shuffling is done in two steps: 1) using reservoir 
> sampling, an array of length head_length is obtained. At this stage, the 
> array is not completely shuffled because the first head_length elements were 
> copied verbatim. 2) This array is then randomly permuted while writing. My 
> question is whether these two steps could be clubbed together, just as shown 
> in the second algorithm given in the wikipedia page you mentioned.

Yes there are more improvements possible in this area.

Also we should avoid the more CPU intensive reservoir sampling
in the common case of reading small inputs from stdin:
http://lists.gnu.org/archive/html/coreutils/2013-03/msg00057.html
http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/shuf.c;h=456140#l272

Also reservoir sampling is currently not used with --repeat.
But it does seem possible to use reservoir-sampling with replacement:
https://www.siam.org/proceedings/datamining/2004/dm04_053parkb.pdf

thanks,
Pádraig.

[Prev in Thread]

Current Thread

[Next in Thread]

Shuf reservoir sampling, Jyothis V, 2013/12/28
- Re: Shuf reservoir sampling, Bernhard Voelker, 2013/12/28
  - Re: Shuf reservoir sampling, Jyothis V, 2013/12/28
    - Re: Shuf reservoir sampling, Pádraig Brady <=
  - Message not available
    - Re: Shuf reservoir sampling, Bernhard Voelker, 2013/12/28
    - Re: Shuf reservoir sampling, Jyothis V, 2013/12/28

Prev by Date: Re: Proposed uname -r -v FAQ entry
Next by Date: [PATCH] tests: initial SMACK tests
Previous by thread: Re: Shuf reservoir sampling
Next by thread: Re: Shuf reservoir sampling
Index(es):
- Date
- Thread