From: Davis Houlton
Subject: Re: new coreutil? shuffle - randomize file contents
Date: Tue, 24 May 2005 21:46:13 +0000
User-agent: KMail/1.7.2

On Tuesday 24 May 2005 15:33, Frederik Eaton wrote:
> reason to expand the functionality of 'sort'. But in my opinion a more
> important reason is that the set of commands that one runs on a unix
> system comprise a language, which is a very important language from a
> user's perspective, and if people think that it should make sense
> intuitively to use a certain command to do a certain thing, then that
> command should be extended to do that thing, for the sake of the
> conciseness and clarity and usability of this language. 

I agree that the coreutils represent a toolbox for programming.  Without 
them, /bin/sh would be an empty place indeed!

So, in the same way we have echo / printf, expand / unexpand, and head / 
tail, I like to think that there is also room for both sort and shuffle.

In my mind, from a functional operator's perspective, I find it very 
intuitive that sort does as it says--sort lines ACB into ABC.  Sorting by 
dates and numbers is a logical extension of the sort metaphor, and again, I 
intuitively look to sort to do these things.  But when I went to look for a 
shuffle command--as a user--I did not even think to look at the sort man 
page.  I'm not sure if that is typical--sort is certainly a logical place to 
look--but that was my experience.

A command like shuffle, in my mind, has a different set of functional 
requirements--to take lines ACB and produce one of the six permutations 
ABC ACB BAC BCA CAB CBA at random.  As a user, not necessarily aware of the 
technical implementation, sort and shuffle make sense as separate commands, 
in the same way I use different methods to sort and shuffle a deck of cards.

Reading these notes has given me pause to challenge my current line of 
thinking regarding large files.  Glossing over some gritty implementation 
details, my revised thought process is that each shuffle execution has two 
possible paths: a computationally fast one that manipulates text buffers in 
memory, and one that works only with delimiter locations and line sizes.  
Shuffle would pick the optimum execution path based on input set size: 
small and average cases use text buffers, while larger cases use delimiter 
locations paged to disk.
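
To make the fast path concrete, here is a minimal sketch of the in-memory 
case, assuming the whole input fits comfortably in core: slurp the lines, 
Fisher-Yates them, and print.  The use of rand()/getline is my own 
placeholder, not a proposal for the real implementation, and error handling 
is omitted.

#define _GNU_SOURCE        /* for getline on glibc */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Fisher-Yates shuffle of an array of line pointers.  rand () is only a
   stand-in; the bias of the % reduction is ignored for this sketch.  */
static void
shuffle_lines (char **line, size_t n)
{
  size_t i, j;
  char *t;

  for (i = n; i > 1; i--)
    {
      j = (size_t) rand () % i;       /* pick among the first i entries */
      t = line[i - 1];
      line[i - 1] = line[j];
      line[j] = t;
    }
}

int
main (void)
{
  char **line = NULL;
  size_t used = 0, alloc = 0, i;
  char *buf = NULL;
  size_t buflen = 0;

  srand ((unsigned int) time (NULL));

  while (getline (&buf, &buflen, stdin) != -1)
    {
      if (used == alloc)
        {
          alloc = alloc ? 2 * alloc : 64;
          line = realloc (line, alloc * sizeof *line);
        }
      line[used++] = strdup (buf);
    }

  shuffle_lines (line, used);
  for (i = 0; i < used; i++)
    fputs (line[i], stdout);
  return 0;
}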

That is a good point about RAND_MAX.  For a case where we have more than 
RAND_MAX lines, we need to combine multiple random calls until the combined 
range is large enough to encapsulate the entire input set.  For example, 
suppose we used dice to locate a random line.  With 20 lines, we could roll 
two dice and read them as base-6 digits, giving 36 equally likely outcomes, 
and reroll anything above 20.  (Simply summing rolls--say 4d6-3 for a 1-21 
range--covers the range but is not uniform: middle values turn up far more 
often than the extremes, so middle lines would be picked more often.)
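
In code, that digits-plus-reroll idea might look like the sketch below.  
uniform_index is a made-up name; the sketch assumes n >= 1 and that the 
digit product fits in unsigned long long.

#include <stdlib.h>

/* Return a uniform random index in [0, n), even when n exceeds RAND_MAX,
   by treating successive rand () calls as digits in base RAND_MAX + 1 and
   rerolling any draw past the largest multiple of n.  */
static unsigned long long
uniform_index (unsigned long long n)
{
  unsigned long long base = (unsigned long long) RAND_MAX + 1;
  unsigned long long range = 1, limit, r;
  int digits = 0, i;

  while (range < n)                /* how many digits cover n? */
    {
      range *= base;
      digits++;
    }

  limit = range - range % n;       /* largest multiple of n <= range */
  do
    {
      r = 0;
      for (i = 0; i < digits; i++)
        r = r * base + (unsigned long long) rand ();
    }
  while (r >= limit);              /* reroll to avoid modulo bias */

  return r % n;
}

For shuffle, the same loop could feed the Fisher-Yates swap above once the 
line count tops RAND_MAX.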

I currently do not have time (tonight) to think through a scheme that pages 
delimiter locations out to disk, but I suspect it would be implemented 
differently from sort's?  Not sure yet--or whether there would be any 
benefit.

Some considerations--are file sizes beyond what fseek's long offset (or 
fsetpos's fpos_t) can represent something that needs to be considered?

What is the maximum average footprint considered acceptable for a reduced 
memory environment?  1MB?  4MB?  16k?  2k?  As Jay pointed out, the general 
algorithm will involve splitting the delimiter locations into manageable 
pages of memory and randomizing each page.  Then we select a page at random 
and walk down that page (a sketch follows below).  The question is then--
what is the maximum page size?
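
To make that concrete, here is one way the selection step could look.  
struct page and emit_shuffled are hypothetical names, and the offset pages 
sit in plain arrays here rather than being faulted in from disk.  One tweak 
over the plain description: the random page pick is weighted by how many 
lines each page still holds, since an unweighted pick would favor lines in 
nearly-empty pages, while weighting keeps every remaining line equally 
likely to come out next.

#include <stdio.h>
#include <stdlib.h>

/* One page of shuffled line offsets.  In the real scheme the offsets
   would be paged in from disk; plain arrays keep the sketch focused on
   the selection logic.  */
struct page
{
  long *offset;   /* delimiter locations, already shuffled within the page */
  size_t left;    /* lines not yet emitted from this page */
  size_t next;    /* index of the next offset to emit */
};

/* Emit every line once, in shuffled order.  Each step picks a page with
   probability proportional to the lines it still holds, then emits that
   page's next pre-shuffled offset.  */
static void
emit_shuffled (struct page *page, size_t npages, FILE *in)
{
  size_t remaining = 0, pick, i;
  int c;

  for (i = 0; i < npages; i++)
    remaining += page[i].left;

  while (remaining > 0)
    {
      pick = (size_t) rand () % remaining;   /* biased; sketch only */
      for (i = 0; pick >= page[i].left; i++)
        pick -= page[i].left;

      fseek (in, page[i].offset[page[i].next++], SEEK_SET);
      while ((c = getc (in)) != EOF && c != '\n')
        putchar (c);
      putchar ('\n');

      page[i].left--;
      remaining--;
    }
}

With this shape, only the picked page's offsets need to be resident at any 
one time, plus a small per-page count table, so the page-size question 
above becomes a straight memory-versus-seek trade-off.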

Thanks,
  Davis
