bug-parallel



From: Matthews, Gregory A. (ARC-TN)[InuTeq, LLC]
Subject: GNU Parallel Bug Reports user-specified action on idle remote hosts with no more jobs to run
Date: Tue, 15 Jan 2019 07:04:31 +0000

Hi,

NASA Ames makes GNU parallel available to users of its clusters 
(https://www.nas.nasa.gov/hecc/support/kb/using-gnu-parallel-to-package-multiple-jobs-in-a-single-pbs-job_303.html),
 and we've been considering how GNU parallel could better integrate with our 
resource manager/batch scheduler (PBS) to minimize wasted compute cycles. A 
simple use case is that a user requests some number of hosts for a PBS session 
where GNU parallel is used to distribute jobs to the assigned hosts. When all 
jobs have been distributed (but not yet completed) it would be nice if GNU 
parallel had a mechanism to take a user-specified action on those hosts which 
are no longer running any jobs. In our case we would want to run a particular 
PBS command on the local host that would remove the no-longer-running-jobs 
hosts from the PBS session, returning them to the pool of hosts that other 
users can use in their PBS sessions.

I'd be happy to give a more concrete explanation of the above use case if that 
would help. As I've looked over the tutorial, man page, mailing lists and Perl 
code itself I see how the management of remote hosts has become more 
sophisticated over time, and this seems to me another step toward more dynamic 
management of remote hosts. GNU parallel knows* when a remote host will remain 
idle because there are no more jobs left to distribute; a user-specified action 
at that point would open up a number of possibilities for marking, shedding, or 
otherwise handling those hosts.

* I know there are wrinkles, such as --retries, where it might be best to hang 
on to some number of currently-idle remote hosts so that retried jobs still 
have somewhere to run.

Our current focus is on the SSHLogin structure: we would like a mechanism to 
call the user-specified action on each SSHLogin that transitions to 
permanently idle (again, subject to the footnote above). From what I can tell, 
drain_job_queue() would be the place to scan all SSHLogin instances to find 
those that are already permanently idle (I'm not entirely clear how SSHLogin 
and --sqlmaster/--sqlworker intersect here). Then reaper() would allow 
detection of SSHLogin instances that become permanently idle later, as their 
last jobs finish.
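
To make the idea concrete, here is a minimal sketch of the detection logic in 
Python (not GNU parallel's Perl, and not its actual code). Dispatcher, Host, 
on_idle and reserve are hypothetical names invented for illustration; each 
host is assumed to run one job at a time, and reserve stands in for the 
--retries wrinkle by holding back some hosts. In our case on_idle would invoke 
the PBS command that returns the host to the pool.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List


@dataclass
class Host:
    name: str
    running: int = 0        # jobs currently executing on this host
    released: bool = False  # True once the idle action has fired


class Dispatcher:
    """Toy model: hand out jobs, then fire on_idle for permanently idle hosts."""

    def __init__(self, hosts: List[str], jobs: List[str],
                 on_idle: Callable[[str], None], reserve: int = 0) -> None:
        self.hosts: Dict[str, Host] = {h: Host(h) for h in hosts}
        self.queue: Deque[str] = deque(jobs)
        self.on_idle = on_idle
        self.reserve = reserve  # hosts to hold back (assumed retry headroom)

    def _release_idle_hosts(self) -> None:
        # A host can only be permanently idle once nothing is left to hand out.
        if self.queue:
            return
        for host in self.hosts.values():
            active = sum(1 for h in self.hosts.values() if not h.released)
            if active <= self.reserve:
                break
            if host.running == 0 and not host.released:
                host.released = True
                self.on_idle(host.name)

    def dispatch(self) -> None:
        # Give one queued job to each free host, then scan for idle hosts
        # (roughly the role suggested for drain_job_queue() above).
        for host in self.hosts.values():
            if self.queue and host.running == 0 and not host.released:
                self.queue.popleft()
                host.running += 1
        self._release_idle_hosts()

    def job_finished(self, host_name: str) -> None:
        # Called when a job exits (roughly the role suggested for reaper()).
        self.hosts[host_name].running -= 1
        self.dispatch()


# Example: three hosts, two jobs; the action just records the host name.
demo_released: List[str] = []
demo = Dispatcher(["n1", "n2", "n3"], ["job1", "job2"], demo_released.append)
demo.dispatch()           # n3 never receives a job
demo.job_finished("n1")   # n1 has nothing further to run
print(demo_released)
```

The point of the sketch is only that two events matter: the moment the queue 
empties (scan every host once) and each later job completion (check the one 
host that just finished).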

-Greg Matthews


