
Re: new coreutil? shuffle - randomize file contents


From: James Youngman
Subject: Re: new coreutil? shuffle - randomize file contents
Date: Tue, 24 May 2005 09:38:31 +0100
User-agent: Mutt/1.3.28i

On Mon, May 23, 2005 at 08:02:19PM +0000, Davis Houlton wrote:
> On Monday 23 May 2005 16:35, you wrote:
> > So, I think that "shuffle" is a good idea.
> Great!  As I wasn't sure if this was a good idea or not, right now the 
> functionality is quite minimal.  I agree that it needs to be expanded, and 
> will do what I can to do so.

Of course, I'm not one of the coreutils maintainers, so my opinion has
no weight.

> > Does it produce all possible outputs with equal probability?

> I would say so. The randomizing function is that old standby
>  return 1+(rand()%line_count);

You have a problem where a file has more than RAND_MAX lines, and
even below that, rand() % line_count is biased toward smaller values
whenever line_count does not evenly divide RAND_MAX + 1.  See below.
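
A common fix is to draw more random bits than needed and reject
out-of-range values; chaining rand() calls also lifts the RAND_MAX
ceiling.  A minimal sketch (random_index is a name I've made up, and
overflow of the running limit for astronomically large counts is
ignored):

  #include <stdlib.h>

  /* Return a uniform value in [0, n), for n >= 1.  Chains rand()
     calls so n may exceed RAND_MAX, and rejects out-of-range draws
     so no index is more likely than another.  */
  static unsigned long
  random_index (unsigned long n)
  {
    unsigned long r, limit;

    do
      {
        r = 0;
        limit = 1;
        while (limit < n)
          {
            /* Append another base-(RAND_MAX+1) digit.  */
            r = r * ((unsigned long) RAND_MAX + 1)
                + (unsigned long) rand ();
            limit *= (unsigned long) RAND_MAX + 1;
          }
      }
    while (r >= limit - limit % n);   /* reject the biased tail */

    return r % n;
  }

The quoted one-based expression then becomes
1 + random_index (line_count).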

[...]
> redundancy--I will make those (right now I'm using a linked list to
> store the file in memory. This really should be an array of
> pointers to strings, of course).  It's fast enough for me on 2GHz
> hardware, but might as well go the extra mile.
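
On the data-structure point: a growable array of line pointers is
straightforward.  A rough sketch, with error handling elided
(getline() is a GNU extension, and read_lines is a name made up for
the example):

  #define _GNU_SOURCE   /* for getline */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Read every line of FP into a growable array of malloc'd strings,
     doubling the array as it fills; store the count in *N_LINES.  */
  static char **
  read_lines (FILE *fp, size_t *n_lines)
  {
    char **lines = NULL;
    size_t used = 0, alloc = 0;
    char *buf = NULL;
    size_t bufsize = 0;

    while (getline (&buf, &bufsize, fp) != -1)
      {
        if (used == alloc)
          {
            alloc = alloc ? 2 * alloc : 64;
            lines = realloc (lines, alloc * sizeof *lines);
          }
        lines[used++] = strdup (buf);   /* keep our own copy */
      }
    free (buf);
    *n_lines = used;
    return lines;
  }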
[...]
> I also realized I need to add a caveat section--this utility is
> memory bound (currently). If you have 4GB of file to randomize and
> only 1GB of memory, well, thems the bricks. The obvious way around
> this is to first scan and create an index of delimiter locations,
> then fseek through a random walk of said delimiters--but I wasn't
> sure if it was really necessary at this point--can't imagine it
> would be very fast, and then there's the issue of stdin. Probably
> need to add this as a --use-index option.

The GNU project likes to avoid arbitrary limits.  Take a look at the
way that 'sort' handles the same situation.

Rather than doing that many random reads, it is probably more
efficient to do something like this (the shuffle-specific pieces are
sketched in code after the list):

  1.    Read N lines from the input file(s), where N is chosen
        such that the data fits into memory.

  2.    Shuffle each chunk of N lines, and write the shuffled data to 
        a temporary file.   Continue until all the input has been
        read and sent to a temporary file (if the input all fits
        into memory, skip to step 4).

  3.    Read the temporary files back, selecting the next 
        line from a randomly selected temporary file until 
        all the temporary files have been completely read.

  4.    On the final merge pass (if any) send the data to the 
        output file instead of a temporary file.  Delete all the 
        temporary files (you need to do this manually, since the
        resource limit on the number of temporary files prevents 
        you using tmpfile()).
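
The shuffle-specific pieces of that, roughly sketched in C (this
reuses the random_index picker from above; the key point, not spelled
out in step 3, is that drawing from each temporary file with
probability proportional to its remaining line count, rather than
picking files with equal probability, is what keeps the final output
a uniform shuffle):

  #include <stdlib.h>

  /* The unbiased picker sketched earlier.  */
  extern unsigned long random_index (unsigned long n);

  /* Step 2: uniform in-place shuffle of one chunk (Fisher-Yates).  */
  static void
  shuffle_lines (char **lines, size_t n)
  {
    if (n < 2)
      return;
    for (size_t i = n - 1; i > 0; i--)
      {
        size_t j = random_index (i + 1);   /* uniform in [0, i] */
        char *tmp = lines[i];
        lines[i] = lines[j];
        lines[j] = tmp;
      }
  }

  /* Step 3: choose which temporary file supplies the next output
     line.  REMAINING[i] is the count of unread lines in file i and
     TOTAL is their sum.  */
  static size_t
  pick_source (const size_t *remaining, size_t n_files, size_t total)
  {
    size_t r = random_index (total);   /* uniform in [0, total) */
    for (size_t i = 0; i < n_files; i++)
      {
        if (r < remaining[i])
          return i;
        r -= remaining[i];
      }
    abort ();   /* unreachable if TOTAL matches REMAINING */
  }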

This is quite complex to implement, since you can only merge M
temporary files in each merge pass, where M < RAND_MAX and M < the
RLIMIT_NOFILE resource limit.  Given that complexity, you might want
to get some feedback from the coreutils maintainers before making
such a significant change.
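
For what it's worth, the RLIMIT_NOFILE half of that bound can be read
at run time with getrlimit(); a minimal sketch:

  #include <stdio.h>
  #include <sys/resource.h>

  int
  main (void)
  {
    struct rlimit rl;

    /* RLIMIT_NOFILE caps open descriptors, so the merge fan-in is
       roughly rl.rlim_cur minus stdin/stdout/stderr and the output
       file itself.  (rlim_cur may also be RLIM_INFINITY.)  */
    if (getrlimit (RLIMIT_NOFILE, &rl) == 0)
      printf ("open file limit: %lu\n", (unsigned long) rl.rlim_cur);
    return 0;
  }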

Regards,
James Youngman.



