
Re: GNU Parallel More Frequently Check Nodelist


From: Rob Sargent
Subject: Re: GNU Parallel More Frequently Check Nodelist
Date: Wed, 16 Mar 2022 12:53:07 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.6.1

On 3/15/22 17:41, Saydjari, Andrew wrote:
Hi all,

Reposting here from Stack Overflow upon request. Since the functionality doesn't currently exist, I guess this could be described as an enhancement suggestion. Let me describe the use case, and then the functionality I was proposing.

My GNU parallel use case is mostly to manage batch processing within SLURM on an HPC cluster. I know a few others in my community who also do so, mostly at NERSC because of their documentation suggesting it (https://docs.nersc.gov/jobs/workflow/gnuparallel/). However, a lot of the larger academic computing groups often have group-owned machines on the cluster (which are outside of SLURM control) or have access to multiple different queues. I think it would be nice to be able to create the parent GNU parallel process on a machine that you own (and so it is always running), and when a SLURM allocation is granted on one queue or another, those machines just add their addresses to the nodelist of the GNU parallel job. This allows the job to keep running and make maximal use of fluctuating resources.
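
To make that concrete, the parent process on the always-on machine would look roughly like this (paths, the task script, and the input list are just placeholders, not a recommendation):

    # Rough sketch: parent GNU Parallel running on a machine we own,
    # reading its remote hosts from a shared nodelist file that SLURM
    # allocations can append entries to.
    parallel --slf /shared/project/nodelist.txt \
             --joblog /shared/project/tasks.joblog --resume \
             ./process_one.sh {} :::: /shared/project/tasks.txt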

I think the only "feature" really needed to make this possible is a flag that changes how frequently the "nodelist" is checked. Personally, my tasks are often 8h+ and I wouldn't want to waste 8h of an allocation waiting for the parent process to have a task return before it checks the nodelist again.
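
For reference, the allocation side of this could be as simple as the job script below (the path, the CPU-count prefix, and the sleep to hold the allocation open are all just placeholders); the missing piece is only that the parent process may not look at the nodelist again for many hours:

    #!/bin/bash
    #SBATCH --nodes=4
    #SBATCH --time=24:00:00
    # Sketch: when SLURM grants the allocation, publish the granted
    # hosts (in ncpus/host form) to the shared nodelist, then keep the
    # allocation alive so the parent parallel process can ssh in.
    NODELIST=/shared/project/nodelist.txt
    for host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
        echo "${SLURM_CPUS_ON_NODE}/${host}" >> "$NODELIST"
    done
    sleep infinity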

I'd be interested to hear if other people have similar use cases or would benefit, and how hard it would be to add that functionality.

Thanks,
Andrew Saydjari
Andrew,
You don't describe your individual tasks: are they sufficiently resource intensive to require a full slurm node, or are you trying to get multiple tasks onto any available slurm node (which is occupied by you)?

You would also need to remove the node from parallel's list of hosts after each job finishes.  Quite a bit of churn, not to mention the synchronization necessary.
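
Just the withdrawal half would already need something like this at the end of each allocation (the path is a placeholder, and the locking is hand-waved):

    # Sketch only: before the allocation ends, delete this job's hosts
    # from the shared nodelist.  With several allocations editing the
    # same file, some locking (flock or similar) would be required.
    NODELIST=/shared/project/nodelist.txt
    for host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
        sed -i "/${host}\$/d" "$NODELIST"
    done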

We have several "owner nodes" as well as a general allocation.  We have one type of task that is rather light-weight (single processor, no threading) and typically takes much less than the max wall-clock (3 days here) to run, and we can have thousands of those.  We start a parallel job on an available slurm node and run a list of these tasks (more than will likely get done within the wall-clock time).  (Wall-clock for our owner nodes is 14 days, so we send an even longer list of tasks to those nodes.)  Repeat until all tasks are accomplished.
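
In sketch form (file names made up), each submission looks roughly like this, resubmitted with the same joblog until everything in the list is done:

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --time=3-00:00:00
    # Sketch: one parallel per allocated node, working through a task
    # list far longer than the wall clock allows; --joblog/--resume
    # lets the next submission skip everything already completed.
    parallel --jobs "$SLURM_CPUS_ON_NODE" \
             --joblog tasks.joblog --resume \
             ./run_one_task.sh {} :::: tasks.txt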
