[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: new coreutil? shuffle - randomize file contents
From: |
Davis Houlton |
Subject: |
Re: new coreutil? shuffle - randomize file contents |
Date: |
Tue, 24 May 2005 21:46:13 +0000 |
User-agent: |
KMail/1.7.2 |
On Tuesday 24 May 2005 15:33, Frederik Eaton wrote:
> reason to expand the functionality of 'sort'. But in my opinion a more
> important reason is that the set of commands that one runs on a unix
> system comprise a language, which is a very important language from a
> user's perspective, and if people think that it should make sense
> intuitively to use a certain command to do a certain thing, then that
> command should be extended to do that thing, for the sake of the
> conciseness and clarity and usability of this language.
I agree that the coreutils represent a toolbox for programing. Without
them, /bin/sh would be an empty place indeed!
So, in the same way we have echo / printf, expand / unexpand, head / tail, I
like to think that there is also room for both sort and shuffle.
In my mind, from an functional operator's perspective, I find it very
intuitive that sort does as it says--sort lines ACB into ABC. Sorting by
dates and numbers are a logical extension of the sort metaphor, and again, I
intuitively look to sort to do these things. But when I went to look for a
shuffle command--as a user--I did not even think to look at the sort man
page. I'm not sure if that is typical -- sort is certainly a logical place to
look--but that was my experience.
A command like shuffle, in my mind, has a different set of functional
requirements--to take lines ACB and produce one of ABC ACB BAC BCA CAB CBA,
etc. As a user, not necessarily aware of technical implementation, sort and
shuffle make sense as separate commands, in the same way I use different
methods to sort and shuffle a deck of cards.
Reading these notes has given me pause to challenge my current line of
thinking regarding large files. Glossing over some gritty implementation
details, my revised thought process is that for each shuffle execution, there
are two paths: a computationally fast one involving manipulation of text
buffers, and one that involves only delimiter locations & line sizes.
Shuffle would pick the optimum execution path based on input set size; small,
average cases use text buffers. Larger cases use delimiter locations paged to
disk.
That is a good point about RAND_MAX. For a case where we have more than
RAND_MAX lines, we need to make multiple random calls until our net RAND_MAX
is large enough to encapsulate the entire input set. For example, suppose we
used a die to locate a random line. If we had 20 lines, we could use 4d6-3 to
give us the range we need (1-21). For 30 lines, 6d6-5 (1-31), etc.
I currently do not have time (tonight) to think through a scheme that uses
delimiter locations paged out to disk, but I suspect it would be implemented
differently from sort? Not sure yet--or if there would be any benefit.
Some considerations--are file sizes > fseek's fpos_t something that need to be
considered?
What is the maximum average footprint considered acceptable for a reduced
memory environment? 1MB? 4MB? 16k? 2k? Like Jay pointed out, the general
algorithm will involve splitting delimiter locations into manageable pages of
memory, and randomizing each page. Then we select a page at random, and walk
down that page. The question is then--what is the maximum page size?
Thanks,
Davis
- new coreutil? shuffle - randomize file contents, Davis Houlton, 2005/05/21
- Re: new coreutil? shuffle - randomize file contents, James Youngman, 2005/05/23
- Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/05/25
- Re: new coreutil? shuffle - randomize file contents,
Davis Houlton <=
- Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/05/30
- Re: new coreutil? shuffle - randomize file contents, Davis Houlton, 2005/05/30
- Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/05/30
- Re: new coreutil? shuffle - randomize file contents, Davis Houlton, 2005/05/31