bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: new coreutil? shuffle - randomize file contents


From: Frederik Eaton
Subject: Re: new coreutil? shuffle - randomize file contents
Date: Sun, 29 May 2005 22:27:49 -0700
User-agent: Mutt/1.5.9i

I'm not following exactly - in part I think it is premature to discuss
implementation details now. And as for the idea to put "shuffle"
functionality in a separate command, this and other issues were
discussed at length in the previous thread which starts here:

http://lists.gnu.org/archive/html/bug-coreutils/2005-01/msg00145.html

Basically, sometimes you want to be able to sort on one key and
shuffle on another, which means that your hypothetical 'shuffle' would
have to have a superset of the 'sort' functionality. Not an ideal
situation. Not only would the implementations be very similar, but,
more importantly, the APIs would have a lot of overlap as well.

Also, just because some users might look for "shuffle" functionality
first in a "shuffle" command doesn't mean that we should put it there.
You only have to learn that the functionality is provided by "sort"
once, and it doesn't make sense to sacrifice too much usability or
maintainability to try to captivate the small minority of users who
are first-time users (as commercial software vendors tend to do).
(Now, it also doesn't make sense to have to work out a line-long awk
script every time you need to shuffle something)

Frederik

On Tue, May 24, 2005 at 09:46:13PM +0000, Davis Houlton wrote:
> On Tuesday 24 May 2005 15:33, Frederik Eaton wrote:
> > reason to expand the functionality of 'sort'. But in my opinion a more
> > important reason is that the set of commands that one runs on a unix
> > system comprise a language, which is a very important language from a
> > user's perspective, and if people think that it should make sense
> > intuitively to use a certain command to do a certain thing, then that
> > command should be extended to do that thing, for the sake of the
> > conciseness and clarity and usability of this language. 
> 
> I agree that the coreutils represent a toolbox for programing.  Without 
> them, /bin/sh would be an empty place indeed!
> 
> So, in the same way we have echo / printf, expand / unexpand,  head / tail, I 
> like to think that there is also room for both sort and shuffle.   
> 
> In my mind, from an functional operator's perspective, I find it very 
> intuitive that sort does as it says--sort lines ACB into ABC.  Sorting by 
> dates and numbers are a logical extension of the sort metaphor, and again, I 
> intuitively look to sort to do these things. But when I went to look for a 
> shuffle command--as a user--I did not even think to look at the sort man 
> page. I'm not sure if that is typical -- sort is certainly a logical place to 
> look--but that was my experience.
> 
> A command like shuffle, in my mind, has a different set of functional 
> requirements--to take lines ACB and produce one of ABC ACB BAC BCA CAB CBA, 
> etc.  As a user, not necessarily aware of technical implementation, sort and 
> shuffle make sense as separate commands, in the same way I use different 
> methods to sort and shuffle a deck of cards.
> 
> Reading these notes has given me pause to challenge my current line of 
> thinking regarding large files. Glossing over some gritty implementation 
> details, my revised thought process is that for each shuffle execution, there 
> are two paths: a computationally fast one involving manipulation of text 
> buffers, and one that involves only delimiter locations & line sizes.  
> Shuffle would pick the optimum execution path based on input set size; small, 
> average cases use text buffers. Larger cases use delimiter locations paged to 
> disk.
> 
> That is a good point about RAND_MAX. For a case where we have more than 
> RAND_MAX lines, we need to make multiple random calls until our net RAND_MAX 
> is large enough to encapsulate the entire input set.  For example, suppose we 
> used a die to locate a random line. If we had 20 lines, we could use 4d6-3 to 
> give us the range we need (1-21).  For 30 lines, 6d6-5 (1-31), etc.  
> 
> I currently do not have time (tonight) to think through a scheme that uses 
> delimiter locations paged out to disk, but I suspect it would be implemented 
> differently from sort?  Not sure yet--or if there would be any benefit.
> 
> Some considerations--are file sizes > fseek's fpos_t something that need to 
> be 
> considered? 
> 
> What is the maximum average footprint considered acceptable for a reduced 
> memory environment? 1MB? 4MB? 16k? 2k?  Like Jay pointed out, the general 
> algorithm will involve splitting delimiter locations into manageable pages of 
> memory, and randomizing each page. Then we select a page at random, and walk 
> down that page.  The question is then--what is the maximum page size?
> 
> Thanks,
>   Davis
> 
> 
> 
> _______________________________________________
> Bug-coreutils mailing list
> address@hidden
> http://lists.gnu.org/mailman/listinfo/bug-coreutils
> 

-- 
http://ofb.net/~frederik/




reply via email to

[Prev in Thread] Current Thread [Next in Thread]