discussion seeds
From: Andrew D Jewell
Subject: discussion seeds
Date: Tue, 19 Feb 2002 12:09:25 -0500
At Alexa, we have huge amounts of data (hundreds of terabytes) on a
network of cheap UNIX machines (somewhere around 1000 such machines).
The standard textutils distribution needs some changes to be
maximally useful to us in this environment. I would like to describe
some of the changes we've made to textutils with the hope of
generating some discussion about what parts might be appropriate to
fold back into the main distribution, as well as discussion of the
strategies themselves.
For no good reason, we name these tools with an av_ prefix; so when I
mention av_sort (or whatever) I mean our version of sort.
Also please insert "when appropriate", "when possible" and such
throughout the discussion below.
1) Distributed Computing and Named Pipes
One of our primary methods involves running things on a
bunch of machines and combining the results through named pipes. The
two general rules that emerge are:
a) read from all the files at once, rather than reading each
completely in turn.
b) open all the files before reading from any of them.
For example, "sort -m" already does a), and requires very little
effort to enforce b) as well.
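
To make the two rules concrete, here is a minimal sketch in Python (not
the av_ tools' actual C code). The function name merge_sorted_files is
hypothetical, and the comparison is plain string order rather than
anything locale-aware. It opens every input first, then merges them one
line at a time.

```python
import heapq
from contextlib import ExitStack


def merge_sorted_files(paths, out):
    """Merge already-sorted text files line by line, roughly like `sort -m`."""
    with ExitStack() as stack:
        # Rule b): open every input before reading from any of them.
        # Opening a named pipe for reading blocks until its writer shows
        # up, so after this line every producer has been unblocked.
        files = [stack.enter_context(open(p, "r")) for p in paths]
        # Rule a): heapq.merge pulls lines lazily from all inputs at once
        # instead of draining each file in turn.
        for line in heapq.merge(*files):
            out.write(line)
```

Each input is assumed to be sorted already and to end with a newline.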
av_cat reads what is available from each file, producing output with
all the right lines in it, but merged in a non-deterministic order.
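
The behaviour described for av_cat could be sketched as follows, again
in Python and again only as an illustration. The name interleave_files
and the queue size are assumptions. One thread per input (as mentioned
under "threads" below) pushes whole lines onto a shared queue, so lines
stay intact while their relative order depends on arrival time.

```python
import queue
import threading


def interleave_files(paths, out):
    """Copy every line of every input to out, in whatever order lines arrive."""
    lines = queue.Queue(maxsize=1024)
    done = object()  # sentinel: each reader posts it exactly once

    def reader(path):
        try:
            with open(path, "r") as f:
                for line in f:
                    lines.put(line)  # whole lines only, never fragments
        finally:
            lines.put(done)

    threads = [threading.Thread(target=reader, args=(p,)) for p in paths]
    for t in threads:
        t.start()

    remaining = len(threads)
    while remaining:
        item = lines.get()
        if item is done:
            remaining -= 1
        else:
            out.write(item)
    for t in threads:
        t.join()
```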
2) gzip
Rather than buying 2 or 3 times as many machines, we gzip almost
everything. The Alexa versions of textutils replace stdio with the
zlib stdio-like interface, and thus can work on compressed or
uncompressed files willy-nilly. (We also have a special way of
zipping that lets you binary search (and otherwise randomly access) a
zipped file, while still letting unmodified gunzip do the right
thing, but that's not really on topic).
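
zlib's gzread passes non-gzip input through unchanged, which is what
makes the "willy-nilly" part work in C. A rough Python equivalent,
assuming UTF-8 text and using the hypothetical name read_lines, peeks at
the two-byte gzip magic number without consuming it, so it also works
on pipes.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_lines(path):
    """Yield text lines from path, whether it is gzip-compressed or plain."""
    with open(path, "rb") as raw:
        # peek() looks at buffered bytes without advancing the stream.
        if raw.peek(2)[:2] == GZIP_MAGIC:
            stream = gzip.GzipFile(fileobj=raw)
        else:
            stream = raw
        with io.TextIOWrapper(stream, encoding="utf-8") as text:
            yield from text
```

This does not attempt the seekable zipping scheme mentioned above; it
only covers transparent sequential reading.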
3) threads
For both performance and named pipe use, many tools end up
being threaded. av_cat and av_split have one thread per file. av_sort
has three threads, one thread each for reading, writing and sorting.
I'm guessing threads as part of the standard textutils is not an
option.
4) sort
Sorting hundreds of gigabytes can take a while. av_sort.c is
rather dramatically different from sort.c, even though their output
is identical. In addition to the threads mentioned above, we allow
merges of arbitrary arity (instead of fixed at 16), use a custom sort
(based on qsort) for the usual non-stable case, and set a much larger
default memory allocation, just to name a few.
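
The post does not say how av_sort divides work among its three threads,
so the following Python sketch is a guess at one plausible arrangement,
not a description of av_sort.c. The name threaded_sort and the
chunk_size parameter are hypothetical. A reader thread cuts the input
into chunks, a sorter thread sorts each chunk, and a writer thread
spills sorted runs to temporary files; the final merge then takes all
runs at once rather than a fixed number at a time.

```python
import heapq
import itertools
import queue
import tempfile
import threading


def threaded_sort(lines, out, chunk_size=100_000):
    """Sort newline-terminated lines using reader, sorter and writer threads."""
    to_sort = queue.Queue(maxsize=2)   # small queues bound memory use
    to_write = queue.Queue(maxsize=2)
    runs = []                          # sorted runs spilled to temp files

    def reader():
        it = iter(lines)
        while True:
            chunk = list(itertools.islice(it, chunk_size))
            if not chunk:
                break
            to_sort.put(chunk)
        to_sort.put(None)              # end-of-input marker

    def sorter():
        while (chunk := to_sort.get()) is not None:
            chunk.sort()
            to_write.put(chunk)
        to_write.put(None)

    def writer():
        while (chunk := to_write.get()) is not None:
            run = tempfile.TemporaryFile("w+")
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)

    threads = [threading.Thread(target=f) for f in (reader, sorter, writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    try:
        # Merge every run in a single pass: the arity is simply len(runs).
        out.writelines(heapq.merge(*runs))
    finally:
        for run in runs:
            run.close()
```

In Python the threads mainly overlap I/O with sorting; error handling
inside the threads is omitted to keep the sketch short.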
5) big
Some tools can have problems with huge files; for example, the
join patch I submitted last Thursday.
6) sorted order
Only slightly off topic: several textutils tools operate on
sorted files. Unfortunately, they all seem to have a different
interface for expressing the sort order, and different capabilities
for sorting. Thus it isn't always possible to join against what you
have just sorted. I'm toying with a shared module that interprets
'-k' parameters and handles the comparisons. Has anyone else seen
this need? Has anyone else come up with a solution?
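
As a starting point for discussion, a shared key module might look
something like this Python sketch. Everything here is hypothetical: the
name make_key, the tab default separator, and the reduced spec syntax
(a start field, an optional end field, and an optional 'n' flag), which
is only a small subset of what sort accepts. The point is that one
parser produces one comparison key that both the sorting tool and the
joining tool can use.

```python
import re

# Simplified key spec: START[n][,END[n]], with fields numbered from 1.
_KEY_RE = re.compile(r"^(\d+)(n?)(?:,(\d+)(n?))?$")


def make_key(spec, sep="\t"):
    """Turn a key spec such as "2,2" or "3n" into a key function for lines."""
    m = _KEY_RE.match(spec)
    if not m:
        raise ValueError("bad key spec: %r" % spec)
    start = int(m.group(1))
    end = int(m.group(3)) if m.group(3) else None  # None means end of line
    numeric = "n" in (m.group(2), m.group(4))
    if start < 1 or (end is not None and end < start):
        raise ValueError("bad field range in key spec: %r" % spec)

    def key(line):
        fields = line.rstrip("\n").split(sep)
        text = sep.join(fields[start - 1:end])
        if numeric:
            try:
                return float(text)
            except ValueError:
                return 0.0  # simplification: non-numeric keys count as zero
        return text

    return key
```

A sorter would call sorted(lines, key=make_key(spec)), and a joiner
would compare make_key(spec)(a) with make_key(spec)(b), so both agree
on the order by construction.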
Anyway, as I said, I'm hoping for two things:
1) some spirited discussion
2) some indication as to what changes should be submitted as patches
Andy Jewell
address@hidden