discussion seeds

bug-textutils

From:	Andrew D Jewell
Subject:	discussion seeds
Date:	Tue, 19 Feb 2002 12:09:25 -0500

At Alexa, we have huge amounts of data (hundreds of terabytes) on a network of cheap UNIX machines (somewhere around 1000 such machines).

The standard textutils distribution needs some changes to be maximally useful to us in this environment. I would like to describe some of the changes we've made to textutils with the hope of generating some discussion about what parts might be appropriate to fold back into the main distribution, as well as discussion of the strategies themselves.

For no good reason, we name these tools with an av_ prefix; so when I mention av_sort (or whatever) I mean our version of sort.

Also, please insert "when appropriate", "when possible" and such throughout the discussion below.



1) Distributed Computing and Named Pipes

One of our primary methodologies involves running things on a bunch of machines and combining the results through named pipes. The two general rules that appear are:

a) read from all the files at once, rather than reading each completely in turn.

b) open all the files before reading from any of them.

For example, "sort -m" already does a), and requires very little effort to enforce b) as well.

av_cat reads what is available from each file, producing output with all the right lines in it, but merged in a non-deterministic order.
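
To make rules a) and b) concrete, here is a minimal Python sketch of an av_cat-like reader. It is not the av_cat source; the function name interleave_lines and the one-thread-per-input queue design are assumptions chosen for illustration. It opens every input first, then forwards whole lines in whatever order they arrive.

```python
import queue
import threading

def interleave_lines(paths):
    """Yield complete lines from all inputs, in arrival order (hypothetical helper)."""
    # Rule b): open every input before reading from any of them.
    # Opening a named pipe for reading blocks until a writer connects.
    handles = [open(p, "rb") for p in paths]

    lines = queue.Queue(maxsize=1024)   # bounded, so slow consumers apply back-pressure
    done = object()                     # sentinel: one reader thread has finished

    def pump(fh):
        # One thread per input: forward whole lines as they become available.
        try:
            for line in fh:
                lines.put(line)
        finally:
            fh.close()
            lines.put(done)

    # Rule a): read from all inputs at once rather than each in turn.
    for fh in handles:
        threading.Thread(target=pump, args=(fh,), daemon=True).start()

    remaining = len(handles)
    while remaining:
        item = lines.get()
        if item is done:
            remaining -= 1
        else:
            yield item
```

Every input line appears exactly once in the output, but the interleaving between inputs depends on timing, which matches the non-deterministic order described above.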



2) gzip

Rather than buying 2 or 3 times as many machines, we gzip almost everything. The Alexa versions of textutils replace stdio with the zlib stdio-like interface, and thus can work on compressed or uncompressed files willy-nilly. (We also have a special way of zipping that lets you binary search (and otherwise randomly access) a zipped file, while still letting unmodified gunzip do the right thing, but that's not really on topic.)
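
As an illustration of reading compressed and uncompressed input through one interface, here is a small Python sketch. It is an analogue, not the zlib-based C code described above; the helper name open_maybe_gzip is hypothetical, and it assumes that checking the two-byte gzip magic number is a good enough test.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of a gzip stream

def open_maybe_gzip(path):
    """Open path for binary reading, decompressing if it looks gzipped."""
    fh = open(path, "rb")      # buffered reader, which supports peek()
    # Look at the first bytes without consuming them. peek() may return
    # fewer bytes than requested; this sketch then treats the file as plain.
    head = fh.peek(2)[:2]
    if head == GZIP_MAGIC:
        # Note: closing the GzipFile does not close the underlying fh.
        return gzip.GzipFile(fileobj=fh, mode="rb")
    return fh
```

Callers read lines from the returned object the same way in both cases, so a tool built on it does not need to know whether its input was compressed.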



3) threads

Both for performance and for named pipe use, many tools end up being threaded. av_cat and av_split have one thread per file. av_sort has three threads, one thread each for reading, writing and sorting. I'm guessing threads as part of the standard textutils is not an option.


4) sort

Sorting hundreds of gigabytes can take a while. av_sort.c is rather dramatically different from sort.c, even though their output is identical. In addition to the threads mentioned above, we allow merges of arbitrary arity (instead of fixed at 16), use a custom sort (based on qsort) for the usual non-stable case, and have a much larger default memory allocation, just to name a few differences.
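
For readers unfamiliar with the idea, a merge of arbitrary arity can be sketched with a heap holding one pending item per input run. This Python sketch is an illustration under that assumption, not the av_sort.c algorithm; the name merge_runs is hypothetical.

```python
import heapq

_END = object()  # sentinel marking an exhausted run

def merge_runs(runs):
    """Merge any number of already-sorted iterables into one sorted stream."""
    iters = [iter(r) for r in runs]
    heap = []
    for idx, it in enumerate(iters):
        first = next(it, _END)
        if first is not _END:
            # The run index breaks ties, so equal items keep run order.
            heap.append((first, idx))
    heapq.heapify(heap)

    while heap:
        item, idx = heap[0]            # smallest pending item over all runs
        yield item
        following = next(iters[idx], _END)
        if following is _END:
            heapq.heappop(heap)        # this run is finished
        else:
            heapq.heapreplace(heap, (following, idx))
```

Each output item costs O(log k) comparisons for k runs, so raising the arity reduces the number of merge passes over the data at a modest per-item cost.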


5) big

Some tools can have problems with huge files; for example, the join patch I submitted last Thursday.


6) sorted order

Only slightly off topic: several textutils tools operate on sorted files. Unfortunately, they all seem to have a different interface for expressing the sort order, and different capabilities for sorting. Thus it isn't always possible to join against what you have just sorted. I'm toying with a shared module that interprets '--k' parameters and handles the comparisons. Has anyone else seen this need? Has anyone else come up with a solution?
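
One possible shape for such a shared module, sketched in Python purely for discussion. Everything here is an assumption: the class name KeySpec, the simplified "START[,END][n][r]" syntax (no character offsets, no per-position flags), and the explicit field separator are illustrative, not an existing textutils interface.

```python
import re
from functools import cmp_to_key

_KEY_RE = re.compile(r"^(\d+)(?:,(\d+))?([nr]*)$")

class KeySpec:
    """One parsed key: field range plus numeric/reverse flags (hypothetical)."""

    def __init__(self, spec, sep="\t"):
        m = _KEY_RE.match(spec)
        if not m:
            raise ValueError("bad key spec: %r" % spec)
        self.start = int(m.group(1))                         # 1-based first field
        self.end = int(m.group(2)) if m.group(2) else None   # None: to end of line
        if self.start < 1 or (self.end is not None and self.end < self.start):
            raise ValueError("bad field range: %r" % spec)
        self.numeric = "n" in m.group(3)
        self.reverse = "r" in m.group(3)
        self.sep = sep

    def extract(self, line):
        fields = line.rstrip("\n").split(self.sep)
        text = self.sep.join(fields[self.start - 1:self.end])
        if not self.numeric:
            return text
        try:
            return float(text)
        except ValueError:
            return 0.0   # simplification: non-numeric keys compare as zero

    def compare(self, a, b):
        ka, kb = self.extract(a), self.extract(b)
        result = (ka > kb) - (ka < kb)
        return -result if self.reverse else result

def sort_lines(lines, spec, sep="\t"):
    """Sort lines by one key spec, using the shared comparison."""
    return sorted(lines, key=cmp_to_key(KeySpec(spec, sep).compare))
```

The point of the design is that a sort-like tool and a join-like tool would both call KeySpec.compare, so a file sorted with a given spec is by construction in the order the other tool expects.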



Anyway, as I said, I'm hoping for two things:
1) some spirited discussion
2) some indication as to what changes should be submitted as patches


Andy Jewell
address@hidden
