
Re: [PATCH] Add new program: psub


From: Bo Borgerson
Subject: Re: [PATCH] Add new program: psub
Date: Sat, 03 May 2008 12:13:38 -0400
User-agent: Thunderbird 2.0.0.12 (X11/20080227)

Bo Borgerson wrote:
> Hi,
> 
> This program uses the temporary fifo management system that I built
> for zargs to provide generic process substitution for arguments to a
> sub-command.
> 
> This program has some advantages over the process substitution built
> into some shells (bash, zsh, ksh, ???):
> 
> 1. It doesn't rely on having a shell that supports built-in process
> substitution.
> 2. By using descriptively named temporary fifos it allows programs
> that include filenames in output or diagnostic messages to provide
> more useful information than with '/dev/fd/*' inputs.
> 3. It supports `--files0-from=F' style argument passing, as well.
> 
> Also available for fetch at:
> 
> $ git fetch git://repo.or.cz/coreutils/bo.git psub:psub
> 

Hi,

I'd like to share another use for this tool.

As discussed previously, there is a performance penalty when `sort
-m' is given more than a certain number of inputs (NMERGE).  Beyond
that limit temporary files are used, which increases both I/O and CPU
cost.

One way to avoid this extra cost is to increase NMERGE.  Another is to
use tributary processes that each merge a subset of the inputs and
feed into the main merge.  On multi-processor machines this has the
added potential advantage of spreading the workload among processors.
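
To illustrate the shape of the idea, here's a minimal sketch of the
same arrangement using bash's built-in process substitution, with six
hypothetical pre-sorted inputs `a'..`f' (psub sets up equivalent
fifos without requiring such a shell):

----
$ sort -m <(sort -m a b) <(sort -m c d) <(sort -m e f)
----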

In the following example I have 32 inputs (named `0'..`31'), each with
1048576 records.  Each record is a single character, so each
pre-sorted input necessarily contains large contiguous blocks of
identical records.  NMERGE is 16 (the default).
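
(In case anyone wants to reproduce this, here's a sketch of one way to
generate inputs of that shape; not necessarily how I built mine:)

----
$ for i in $(seq 0 31); do for c in a b c d; do yes $c | head -n 262144; done > $i; done
----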

----
$ time sort -m *

real    0m9.107s
user    0m6.380s
sys     0m0.300s

$ time for i in 012 3456 789; do echo $i | sed 's/.*/"<sort -mu *\[&\]"/'; done | xargs psub sort -m

real    0m3.792s
user    0m3.744s
sys     0m0.052s
----
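
For clarity: the loop above just emits one quoted `"<CMD"' argument
per group, so after xargs strips the quotes that invocation is
equivalent to:

----
$ psub sort -m "<sort -mu *[012]" "<sort -mu *[3456]" "<sort -mu *[789]"
----

psub turns each such argument into a named fifo fed by the
sub-command, and the outer `sort -m' merges the three fifos.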

And just to give a sense of how the inputs break down among the three
tributaries (note the descriptively named fifos, per advantage 2 above):

----
$ for i in 012 3456 789; do echo $i | sed 's/.*/"<ls *\[&\]"/'; done | xargs psub wc -l
     11 /tmp/psubsUegiv/ls *[012]
     12 /tmp/psubsUegiv/ls *[3456]
      9 /tmp/psubsUegiv/ls *[789]
     32 total
----

With longer records and no identical records within a given input, the
benefit of spreading the work across processors becomes more apparent.
The following uses 64 files of 262144 records each; each record is
4 characters long.  I ran this on a Core 2 Duo.
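
(Again, a sketch of one way inputs of that shape could be generated;
random 4-character records, deduplicated and pre-sorted:)

----
$ for i in $(seq 0 63); do tr -dc a-z < /dev/urandom | fold -w4 | head -n 600000 | sort -u | head -n 262144 > $i; done
----

The oversized initial draw just makes it likely that at least 262144
unique records survive `sort -u'.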

----
$ time sort -m *

real    0m13.183s
user    0m12.793s
sys     0m0.376s

$ time for i in 01 23 45 67 89; do echo $i | sed 's/.*/"<sort -mu *\[&\]"/'; done | xargs psub sort -m

real    0m6.660s
user    0m12.401s
sys     0m0.168s

$ for i in 01 23 45 67 89; do echo $i | sed 's/.*/"<ls *\[&\]"/'; done | xargs psub wc -l
     14 /tmp/psubG0UkXb/ls *[01]
     14 /tmp/psubG0UkXb/ls *[23]
     12 /tmp/psubG0UkXb/ls *[45]
     12 /tmp/psubG0UkXb/ls *[67]
     12 /tmp/psubG0UkXb/ls *[89]
     64 total
----

Note that in that second psub run the user time is nearly twice the
real time, which shows both cores were kept busy.  The multi-process
benefit should be amplified on machines with more available
processors.  With the current trend toward increasing numbers of
on-die processor cores, I think this sort of easy technique for
taking advantage of concurrency will become more broadly beneficial.
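
In case it's useful, here's a rough (untested) generalization of the
grouping trick as a small bash script, using the same `"<CMD"'
argument style as the examples above.  It splits its pre-sorted input
files round-robin into N tributaries, and doesn't cope with whitespace
in filenames:

----
#!/bin/bash
# Usage: nway-merge N FILE...   (hypothetical helper, sketch only)
# Merge pre-sorted FILEs through N psub tributaries.
ngroups=$1; shift
args=()
for ((g = 0; g < ngroups; g++)); do
    group=()
    for ((i = g + 1; i <= $#; i += ngroups)); do
        group+=("${!i}")            # every ngroups-th input file
    done
    test ${#group[@]} -gt 0 && args+=("<sort -m ${group[*]}")
done
exec psub sort -m "${args[@]}"
----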

Thanks,

Bo



