coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] shuf: use reservoir-sampling when possible


From: Assaf Gordon
Subject: Re: [PATCH] shuf: use reservoir-sampling when possible
Date: Mon, 11 Mar 2013 17:10:39 -0400
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.4) Gecko/20120510 Icedove/10.0.4

Hello,

Pádraig Brady wrote, On 03/07/2013 06:26 PM:
> On 03/07/2013 07:32 PM, Assaf Gordon wrote:
>> Pádraig Brady wrote, On 03/06/2013 08:24 PM:
>>> On 03/06/2013 11:50 PM, Assaf Gordon wrote:
>>>> Attached is a suggestion to implement reservoir-sampling in shuf:
>>>> When the expected output of lines is known, it will not load the entire 
>>>> file into memory - allowing shuffling very large inputs.
>>>>
>>
>>
>>>> static size_t
>>>> read_input_reservoir_sampling (FILE *in, char eolbyte, char ***pline, 
>>>> size_t k,
>>>>                                struct randint_source *s)
>> <...>
>>>>   struct linebuffer *rsrv = XCALLOC (k, struct linebuffer); /* init 
>>>> reservoir*/
>>>
>>> Since this change is mainly about efficient mem usage we should probably 
>>> handle
>>> the case where we have small inputs but large k.  This will allocate (and 
>>> zero)
>>> memory up front. The zeroing will defeat any memory overcommit configured 
>>> on the
>>> system, but it's probably better to avoid the large initial commit and 
>>> realloc
>>> as required (not per line, but per 1K lines maybe).
>>>
>>

Attached is an updated version (mostly a re-write of the memory allocation 
part), as per the comment above.
Also includes a very_expensive valgrind test to exercise the new code.
(and the other patch is the uniform-distribution randomness test).

-gordon

Attachment: 0001-shuf-add-expensive-test-for-randomness.patch
Description: Text Data

Attachment: 0002-shuf-use-reservoir-sampling-when-possible.patch
Description: Text Data


reply via email to

[Prev in Thread] Current Thread [Next in Thread]