GNU Parallel Bug Reports feature suggestion: approximate scheduler
From: Håkan Johansson
Subject: GNU Parallel Bug Reports feature suggestion: approximate scheduler
Date: Thu, 3 Sep 2015 00:56:09 +0200
User-agent: Alpine 2.02 (DEB 1266 2009-07-14)
Hi,
we are happily using GNU Parallel to process data files on a diverse set
of workstations. What would be even nicer is if parallel could do some
sort of scheduling to reduce the waiting time for the last few files to
finish processing.
I did not find any options to that effect; please correct me if I am wrong.
Otherwise, here is a description of a typical invocation:
- ~200 files of varying size (between 10 MB and 1 GB)
- the command to execute
- a list of 2 - 15 machines of varying speed (within a factor ~4)
and number of cores
The processing is single-threaded, so the list of machines also tells how
many jobs to run on each box.
We often reanalyse the same set of files, or sets with partial overlap of
files. The command changes a bit, but execution time/work per file is
rather similar. Execution time is roughly proportional to file size, but
depending on the actual content, some files may involve more work.
What I would imagine is something like this:
- an option to request parallel to try to schedule the jobs such that they
end more simultaneously (and thus sooner). Given no other information,
the estimate would be that it takes time proportional to the input {}
file size to complete the job. I.e. start scheduling the largest
files and end with the smallest ones.
- an option to read a machine-performance file, which gives the relative
  performance of the machines. (If a machine is missing from the file,
  assume a default value of e.g. 1.)
  Especially as the end is reached, completion may be quicker
if parallel would not use the slowest machines at all, or only for
small enough files.
- an option to read a file of input-file difficulty corrections, i.e.
  telling (relatively) how 'large' each file effectively is.
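The first option above could be approximated already today by pre-sorting
the argument list by size before handing it to parallel; a minimal sketch
(the function name is my own, not an existing parallel feature):

```python
import os

def largest_first(paths):
    """Return paths sorted by on-disk size, largest first, so that the
    proportional-to-size estimate starts the big jobs early."""
    return sorted(paths, key=os.path.getsize, reverse=True)
```

Printing the sorted list and piping it into parallel then starts the
biggest files first.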
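The machine-performance and difficulty-correction files could share a
simple 'name value' line format; a minimal reader sketch (the format and
names are assumptions on my part), with the default-of-1 fallback for
missing entries:

```python
def read_factors(path, default=1.0):
    """Parse 'name value' lines (a hypothetical format) into a lookup
    function; names not listed in the file get `default`."""
    factors = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                name, value = line.split()
                factors[name] = float(value)

    def lookup(name):
        # Missing machines/files fall back to the default value (e.g. 1).
        return factors.get(name, default)

    return lookup
```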
The reason for having the relative measurements in external files is that
parallel cannot determine these values within one run; one has to learn
between invocations. But as input files and machines may be used in
different ways, i.e. with different commands, it is better to keep them in
files that the user can (and has to) maintain. In order to produce those
two files:
- an option to produce a log file listing for each job:
date/time of job, machine, input-file, size-of-input-file, time used
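Such a log line could come from a small wrapper around each job; a sketch
(the TSV layout and field order are my assumptions):

```python
import os
import socket
import subprocess
import sys
import time

def run_and_log(cmd, input_file, log_path="joblog.tsv"):
    """Run cmd on input_file and append one tab-separated log line:
    date/time, machine, input-file, size-of-input-file, time used."""
    t0 = time.time()
    subprocess.run(cmd + [input_file], check=True)
    elapsed = time.time() - t0
    with open(log_path, "a") as log:
        log.write("%s\t%s\t%s\t%d\t%.2f\n" % (
            time.strftime("%Y-%m-%dT%H:%M:%S"),
            socket.gethostname(),
            input_file,
            os.path.getsize(input_file),
            elapsed))
```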
A separate program could then process (multiple) such files to produce the
machine and input-file correction files. From just one invocation it will
not be possible, but as soon as either the same file has been processed on
multiple machines, those machines can get relative numbers. Or different
files on the same machine, then the files can be assigned effective sizes.
It will not be necessary to have executed all combinations, as the
assignment can be treated as an overdetermined sparse least-squares
problem, with one equation from each job invocation, where one expects:
file_size / time_used = machine_performance / file_difficulty
I.e. effectively measured in bytes/second. After taking logarithms:
log(machine_performance) - log(file_difficulty) = log(file_size) - log(time_used)
Where log(machine_performance) and log(file_difficulty) are the unknowns
that are solved for. One such equation is set up for each completed
machine-file job combination.
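A sketch of setting up and solving that system, using NumPy's
least-squares solver as a stand-in for the octave solve mentioned below
(function and variable names are my own):

```python
import math
import numpy as np

def solve_factors(observations):
    """observations: list of (machine, file, size_bytes, seconds).
    Solve, in the least-squares sense, one equation per completed job:
        log(perf[machine]) - log(diff[file]) = log(size) - log(time)
    The system has a one-parameter freedom (scaling all performances and
    difficulties together); lstsq resolves it via the minimum-norm
    solution, so only the ratios are meaningful."""
    machines = sorted({m for m, _, _, _ in observations})
    files = sorted({f for _, f, _, _ in observations})
    mi = {m: i for i, m in enumerate(machines)}
    fi = {f: len(machines) + i for i, f in enumerate(files)}
    A = np.zeros((len(observations), len(machines) + len(files)))
    b = np.zeros(len(observations))
    for row, (m, f, size, seconds) in enumerate(observations):
        A[row, mi[m]] = 1.0   # +log(machine_performance)
        A[row, fi[f]] = -1.0  # -log(file_difficulty)
        b[row] = math.log(size) - math.log(seconds)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    perf = {m: math.exp(x[mi[m]]) for m in machines}
    diff = {f: math.exp(x[fi[f]]) for f in files}
    return perf, diff
```

With e.g. two machines and two files, where the same file has been run on
both machines, the relative performances come out directly from the ratios.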
I have made a mock-up test of the equation solving by producing the
log-file by wrapping each job in a script that measures the time etc, and
then a small perl script to produce the equation system for solving with
e.g. octave. It produces machine performance and file difficulty values
which look reasonable. If there is interest, I'll be happy to share the
scripts.
Cheers,
Håkan