Hi all,
Reposting here from Stack Overflow upon request. Since the
functionality doesn't currently exist, I guess this could be
described as an enhancement suggestion. Let me describe the
use case, and then the functionality I am proposing.
I mostly use GNU parallel to manage batch processing within
SLURM on an HPC cluster. I know a few others in my community
who do the same, mostly at NERSC because its documentation
suggests it
(https://docs.nersc.gov/jobs/workflow/gnuparallel/).
However, many of the larger academic computing groups have
group-owned machines on the cluster (which are outside of
SLURM control) or have access to several queues. I think it
would be nice to be able to create the parent GNU parallel
process on a machine that you own (so it is always running)
and, when a SLURM allocation is granted on one queue or
another, have the allocated nodes just add their addresses to
the nodelist of the GNU parallel job. This would allow the job
to keep running and make maximal use of fluctuating resources.
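
To make the "nodes add their addresses" step concrete, here is a minimal sketch of what each allocation could run when it starts. It assumes the nodelist is a plain text file with one host per line (for example, the file passed to `--sshloginfile`) on a filesystem shared with the parent process. The name `add_hosts` and the file layout are my own illustration, not part of GNU parallel.

```python
import os
import tempfile


def add_hosts(nodelist_path, hosts):
    """Add hosts to a one-host-per-line nodelist file (hypothetical helper).

    Hosts already listed are skipped. The file is replaced atomically so a
    reader never sees a half-written list.
    """
    merged = []
    if os.path.exists(nodelist_path):
        with open(nodelist_path) as f:
            merged = [line.strip() for line in f if line.strip()]

    for host in hosts:
        if host not in merged:
            merged.append(host)

    # Write to a temporary file in the same directory, then rename over
    # the original; os.replace is atomic on the same filesystem.
    directory = os.path.dirname(os.path.abspath(nodelist_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write("".join(host + "\n" for host in merged))
    os.replace(tmp_path, nodelist_path)
    return merged
```

Inside a SLURM job, the host names could come from `scontrol show hostnames "$SLURM_JOB_NODELIST"`. This sketch does no locking, so two allocations starting at the same moment could overwrite each other's additions; a real version would need a lock around the read and write.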
I think the only "feature" really needed to make this
possible is a flag that changes how frequently the "nodelist"
is checked. Personally, my tasks often take 8h or more, and I
wouldn't want to waste 8h of an allocation waiting for a task
to return to the parent process before it checks the nodelist
again.
I would be interested to hear whether other people have
similar use cases or would benefit, and how hard it would be
to add that functionality.