bug-textutils
discussion seeds


From: Andrew D Jewell
Subject: discussion seeds
Date: Tue, 19 Feb 2002 12:09:25 -0500

At Alexa, we have huge amounts of data (hundreds of terabytes) on a network of cheap UNIX machines (somewhere around 1000 such machines).

The standard textutils distribution needs some changes to be maximally useful to us in this environment. I would like to describe some of the changes we've made to textutils with the hope of generating some discussion about what parts might be appropriate to fold back into the main distribution, as well as discussion of the strategies themselves.

For no good reason, we name these tools with an av_ prefix; so when I mention av_sort (or whatever) I mean our version of sort.

Also please insert "when appropriate", "when possible" and such throughout the discussion below.


1) Distributed Computing and Named Pipes
One of our primary methodologies involves running things on a bunch of machines and combining the results through named pipes. Two general rules emerge:
a) read from all the files at once, rather than reading each completely in turn.
b) open all the files before reading from any of them.

For example, "sort -m" already does a), and it would take very little effort to make it do b) as well.

av_cat reads what is available from each file, producing output with all the right lines in it, but merged in a non-deterministic order.
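
To make the two rules concrete, here is a minimal Python sketch of the idea behind av_cat. It is not the av_cat source: the function name cat_available and the chunk_size parameter are made up for illustration, and it assumes the inputs are pipes or FIFOs (the selectors module may refuse regular files on some platforms). Every descriptor is registered before the first read, and each read takes only what is currently available.

```python
import os
import selectors

def cat_available(fds, chunk_size=65536):
    """Yield complete lines from several descriptors as data arrives.

    Hypothetical helper. Line order across inputs is non-deterministic,
    but each yielded item is a whole line from exactly one input.
    """
    sel = selectors.DefaultSelector()
    pending = {}                      # fd -> bytes of an unfinished line
    for fd in fds:                    # rule b): register everything first
        sel.register(fd, selectors.EVENT_READ)
        pending[fd] = b""
    while pending:
        for key, _ in sel.select():   # rule a): serve whichever input is ready
            fd = key.fd
            data = os.read(fd, chunk_size)
            if data:
                *lines, pending[fd] = (pending[fd] + data).split(b"\n")
                for line in lines:
                    yield line + b"\n"
            else:                     # end of file on this input
                if pending[fd]:
                    yield pending[fd]
                sel.unregister(fd)
                del pending[fd]
    sel.close()
```

With named pipes, the descriptors would come from something like [os.open(p, os.O_RDONLY) for p in paths], which completes all the opens before any read is attempted.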


2) gzip
Rather than buying 2 or 3 times as many machines, we gzip almost everything. The Alexa versions of textutils replace stdio with the zlib stdio-like interface, and thus can work on compressed or uncompressed files willy-nilly. (We also have a special way of zipping that lets you binary search (and otherwise randomly access) a zipped file, while still letting unmodified gunzip do the right thing, but that's not really on topic.)
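
As an illustration of "compressed or uncompressed, willy-nilly", the sketch below does the equivalent in Python rather than through zlib's C interface. The helper name open_maybe_gzip is hypothetical, and the sketch assumes the input is a regular file that can be opened twice (it would not work as written on a pipe). It checks for the two-byte gzip magic number and picks the reader accordingly.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def open_maybe_gzip(path):
    """Open path for binary reading, decompressing if it looks gzipped.

    Hypothetical helper: callers read lines the same way in both cases.
    """
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rb")
    return open(path, "rb")
```

A tool built on this would call open_maybe_gzip(name) wherever it previously called open(name, "rb") and leave the rest of its logic unchanged.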


3) threads
For both performance and named pipe use, many tools end up being threaded. av_cat and av_split have one thread per file. av_sort has three threads: one each for reading, writing and sorting. I'm guessing that threads in the standard textutils are not an option.
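
A rough sketch of the three-thread shape described for av_sort, in Python. This is an assumed structure, not the av_sort code: the names pipeline_sort and chunk_lines are invented, the "writer" collects sorted runs in memory where a real tool would write temporary files, and error handling is omitted.

```python
import heapq
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream

def pipeline_sort(lines, chunk_lines=1000):
    """Sort lines using separate reader, sorter and writer threads."""
    to_sort = queue.Queue(maxsize=4)   # bounded, so the reader cannot run far ahead
    to_write = queue.Queue(maxsize=4)
    runs = []

    def reader():
        chunk = []
        for line in lines:
            chunk.append(line)
            if len(chunk) >= chunk_lines:
                to_sort.put(chunk)
                chunk = []
        if chunk:
            to_sort.put(chunk)
        to_sort.put(_DONE)

    def sorter():
        while (chunk := to_sort.get()) is not _DONE:
            chunk.sort()
            to_write.put(chunk)
        to_write.put(_DONE)

    def writer():
        while (run := to_write.get()) is not _DONE:
            runs.append(run)           # stand-in for writing a temp file

    threads = [threading.Thread(target=f) for f in (reader, sorter, writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(heapq.merge(*runs))    # final merge of the sorted runs
```

The bounded queues let reading, sorting and writing overlap while keeping only a few chunks in flight at once.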

4) sort
Sorting hundreds of gigabytes can take a while. av_sort.c is rather dramatically different from sort.c, even though their output is identical. In addition to the threads mentioned above, we allow merges of arbitrary arity (instead of fixed at 16), use a custom sort (based on qsort) for the usual non-stable case, and have a much larger default memory allocation, just to name a few changes.
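
To show why merge arity matters, here is a small Python sketch that merges sorted runs in groups of a configurable size and counts the passes. The function merge_with_arity is hypothetical and uses in-memory lists in place of temporary files; it only illustrates the effect of the arity limit, not how av_sort implements it.

```python
import heapq

def merge_with_arity(runs, arity=16):
    """Merge sorted runs, at most `arity` at a time; return (result, passes)."""
    if arity < 2:
        raise ValueError("arity must be at least 2")
    runs = [list(r) for r in runs]
    passes = 0
    while len(runs) > 1:
        # One pass: every group of `arity` runs becomes a single longer run.
        runs = [list(heapq.merge(*runs[i:i + arity]))
                for i in range(0, len(runs), arity)]
        passes += 1
    return (runs[0] if runs else []), passes
```

With 20 runs, an arity of 16 needs two passes over the data, while an arity of 20 or more needs one. On hundreds of gigabytes, each extra pass is a full extra read and write.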

5) big
Some tools can have problems with huge files; for example, the join patch I submitted last Thursday.

6) sorted order
Only slightly off topic: several textutils tools operate on sorted files. Unfortunately, they all seem to have a different interface for expressing the sort order, and different capabilities for sorting. Thus it isn't always possible to join against what you have just sorted. I'm toying with a shared module that interprets '--k' parameters and handles the comparisons. Has anyone else seen this need? Has anyone else come up with a solution?
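
One way to picture the shared module is a single key-spec interpreter that both the sorter and the joiner call. The Python sketch below assumes a deliberately tiny, made-up spec syntax (a 1-based field number with an optional trailing "n" for numeric comparison); it is not the syntax of any existing tool, and make_key and merge_join are hypothetical names.

```python
def make_key(spec, sep=None):
    """Turn a spec such as "1" or "2n" into a key function for lines."""
    numeric = spec.endswith("n")
    field = int(spec.rstrip("n")) - 1       # specs count fields from 1
    def key(line):
        fields = line.split(sep)
        value = fields[field] if field < len(fields) else ""
        return float(value or 0) if numeric else value
    return key

def merge_join(left, right, key):
    """Join two inputs that are already sorted by the same key function."""
    right = list(right)
    j = 0
    for l in left:
        kl = key(l)
        while j < len(right) and key(right[j]) < kl:
            j += 1                           # skip right lines with smaller keys
        k = j
        while k < len(right) and key(right[k]) == kl:
            yield l, right[k]                # emit every matching pair
            k += 1
```

Because the sort and the join share one key function, data sorted with make_key("1n") can be joined with the same make_key("1n"), and the two tools cannot disagree about the order.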


Anyway, as I said, I'm hoping for two things:
1) some spirited discussion
2) some indication as to what changes should be submitted as patches


Andy Jewell
address@hidden
