Re: [PATCH] shuf: use reservoir-sampling when possible

coreutils

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] shuf: use reservoir-sampling when possible

From:	Assaf Gordon
Subject:	Re: [PATCH] shuf: use reservoir-sampling when possible
Date:	Thu, 07 Mar 2013 14:32:25 -0500
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:10.0.4) Gecko/20120510 Icedove/10.0.4

Hello,

Attached is an updated version.

Pádraig Brady wrote, On 03/06/2013 08:24 PM:
> On 03/06/2013 11:50 PM, Assaf Gordon wrote:
>> Attached is a suggestion to implement reservoir-sampling in shuf:
>> When the expected output of lines is known, it will not load the entire file 
>> into memory - allowing shuffling very large inputs.
>>

Regarding comments:

>>  {"-debug", no_argument, NULL, DEV_DEBUG_OPTION},
> no need to keep this, for final commit.

Yes, I'll remove this once the code is acceptable.

>> prepare_shuf_lines (struct linebuffer *in_lines, size_t n, char ***out_lines,
> 
> I've not looked into the details, but it would
> be nice to avoid the memcpy/conversion here
> 

I've removed the conversion function, and instead added a new function to 
output the lines directly.

>> static size_t
>> read_input_reservoir_sampling (FILE *in, char eolbyte, char ***pline, size_t 
>> k,
>>                                struct randint_source *s)
<...>
>>   struct linebuffer *rsrv = XCALLOC (k, struct linebuffer); /* init 
>> reservoir*/
> 
> Since this change is mainly about efficient mem usage we should probably 
> handle
> the case where we have small inputs but large k.  This will allocate (and 
> zero)
> memory up front. The zeroing will defeat any memory overcommit configured on 
> the
> system, but it's probably better to avoid the large initial commit and realloc
> as required (not per line, but per 1K lines maybe).
> 

I'm not quite sure about this:
The "reservoir-sampling" path can only be used when the user explicitly ask to 
limit output lines.
I would naively assume that if a user explicitly asked to limit the output to 
1,000,000 lines, he/she expects large input as well.
And so the (edge?) case of asking for a large number of output lines, but 
supplying very small number of input lines is rare.
Wouldn't you agree? or is there a different typical usage case?

Also, the allocation only allocates an array of "struct linebuffer" (on 64bit 
systems, 24 bytes).
So even asking for 1M lines will allocate 24MB of RAM - not "too much" on 
modern machines.

----

The second attached patch is experimental - it tries to assess the randomness 
of 'shuf' output by running it 1,000 times and checking if the output is (very 
roughly) uniformly distributed.
I don't know if there were attempts in the past to unit-test randomness (and 
then decided not to do it) - or if this was just never considered worth-while 
(or too error prone).

Comments are welcomed,
 -gordon

0001-shuf-add-expensive-test-for-randomness.patch
Description: Text Data

0002-shuf-use-reservoir-sampling-when-possible.patch
Description: Text Data

[Prev in Thread]

Current Thread

[Next in Thread]

[PATCH] shuf: use reservoir-sampling when possible, Assaf Gordon, 2013/03/06
- Re: [PATCH] shuf: use reservoir-sampling when possible, Pádraig Brady, 2013/03/06
  - Re: [PATCH] shuf: use reservoir-sampling when possible, Assaf Gordon <=
    - Re: [PATCH] shuf: use reservoir-sampling when possible, Pádraig Brady, 2013/03/07
    - Re: [PATCH] shuf: use reservoir-sampling when possible, Assaf Gordon, 2013/03/11
    - Re: [PATCH] shuf: use reservoir-sampling when possible, Pádraig Brady, 2013/03/25
    - Re: [PATCH] shuf: use reservoir-sampling when possible, Assaf Gordon, 2013/03/25
    - Re: [PATCH] shuf: use reservoir-sampling when possible, Pádraig Brady, 2013/03/25

Prev by Date: date opposite case flag
Next by Date: [PATCH] csplit: new option --suppress-matched
Previous by thread: Re: [PATCH] shuf: use reservoir-sampling when possible
Next by thread: Re: [PATCH] shuf: use reservoir-sampling when possible
Index(es):
- Date
- Thread