Re: GNU Parallel Bug Reports Job failure semantics

bug-parallel

From:	Ole Tange
Subject:	Re: GNU Parallel Bug Reports Job failure semantics
Date:	Tue, 13 Sep 2011 15:40:19 +0200

On Tue, Sep 13, 2011 at 3:15 PM, Andreas Bernauer <address@hidden> wrote:
> On 9/13/11 12:43, Ole Tange wrote:
>>
>> We could implement a probe to test which machines are up when
>> starting. That way you would only have problems with hosts that go
>> offline after starting:
>>
>>   cat .ssh/pre_sshloginfile | parallel -j0 ssh server.example.com echo server.example.com > .ssh/sshloginfile
>>
>> But this would take time and slow down the first connection.
>
> If you keep the first connection, you can reuse it for the second, without
> much slowdown. See settings ControlMaster and ControlPath in ssh_config(5).

See -M (--controlmaster) in the man page for GNU Parallel. I never use
-M myself, so it would be good if someone could test it and give some
feedback.
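
For anyone who wants to experiment outside of -M, here is a minimal
sketch of a probe that also opens the shared connection Andreas
describes, so the first real job could reuse it. The function name
probe_hosts, the SSH_CMD variable, the ControlPath location and the
timeout values are all assumptions made for illustration; none of them
are GNU Parallel settings.

```shell
# Hypothetical helper: print only the hosts from stdin that answer over ssh.
# SSH_CMD is an assumption so the ssh command can be swapped out (e.g. for
# a dry run); it is not something GNU Parallel reads.
SSH_CMD=${SSH_CMD:-ssh}

probe_hosts() {
    while IFS= read -r host; do
        [ -n "$host" ] || continue      # skip blank lines
        # ControlMaster=auto opens a shared connection if none exists yet;
        # ControlPersist keeps it open for a while so a later ssh to the
        # same host can reuse it. stdin comes from /dev/null so ssh does
        # not swallow the rest of the host list.
        if $SSH_CMD -o BatchMode=yes -o ConnectTimeout=5 \
               -o ControlMaster=auto \
               -o ControlPath="$HOME/.ssh/cm-%r@%h:%p" \
               -o ControlPersist=60 \
               "$host" true </dev/null 2>/dev/null; then
            printf '%s\n' "$host"
        fi
    done
}

# Usage: probe_hosts < .ssh/pre_sshloginfile > .ssh/sshloginfile
```

The hosts are probed one at a time here, which keeps the sketch short
but is slower than probing them in parallel.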

> This won't really solve the problem with hosts that went offline after
> starting, but you have this problem anyway (e.g., during script execution on
> the remote host).

How about this idea:

If a job fails on remote.example.com, try "ssh remote.example.com
true". If that fails too, then mark this machine as dead and do not
use this machine again in this run. Retry the job on another host.
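
As a rough sketch of that logic (not how GNU Parallel implements
anything), the idea could look like the shell functions below. All
names here (DEAD_HOSTS, probe_host, run_remote, is_dead,
run_with_failover) are hypothetical, and the ssh options are just
reasonable assumptions for a non-interactive probe.

```shell
# Hypothetical sketch of "mark the host dead and retry elsewhere".
DEAD_HOSTS=""                           # space-separated hosts marked dead in this run

probe_host() {                          # succeeds if "ssh HOST true" works
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true </dev/null 2>/dev/null
}

run_remote() {                          # run_remote HOST CMD...: run the job over ssh
    host=$1; shift
    ssh -o BatchMode=yes "$host" "$@" </dev/null
}

is_dead() {                             # true if HOST was marked dead earlier
    case " $DEAD_HOSTS " in *" $1 "*) return 0 ;; esac
    return 1
}

# run_with_failover "HOST1 HOST2 ..." CMD...
# A failed job is blamed on the host only if the probe fails as well.
run_with_failover() {
    hosts=$1; shift
    for h in $hosts; do
        is_dead "$h" && continue        # never reuse a dead host in this run
        if run_remote "$h" "$@"; then
            return 0
        else
            status=$?
        fi
        if probe_host "$h"; then
            return "$status"            # host is up, so the job itself failed
        fi
        DEAD_HOSTS="$DEAD_HOSTS $h"     # host is down: retry on the next host
    done
    return 1                            # no live host could run the job
}
```

Usage would be something like
run_with_failover "a.example.com b.example.com" hostname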

That would deal with servers going offline after starting. It would
not detect a server that comes back online later. That could be
handled with a timeout on how long a server is considered dead
(probably 1 hour or so):

If a job fails on remote.example.com, try "ssh remote.example.com
true". If that fails too, then mark this machine as dead FOR THE NEXT
HOUR and do not use this machine again DURING THAT TIME.
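
The bookkeeping for such a timeout is small. A minimal sketch,
assuming a `date` that supports +%s (GNU and BSD do) and using
made-up names (DEAD_TTL, DEAD_LIST, now, mark_dead, forget_host,
is_dead):

```shell
# Hypothetical sketch of a dead-host list whose entries expire.
DEAD_TTL=${DEAD_TTL:-3600}              # seconds a host stays dead (1 hour)
DEAD_LIST=""                            # entries of the form host=epoch_seconds

now() { date +%s; }                     # current time in seconds since the epoch

forget_host() {                         # drop any existing entry for HOST
    kept=""
    for entry in $DEAD_LIST; do
        [ "${entry%%=*}" = "$1" ] || kept="$kept $entry"
    done
    DEAD_LIST=$kept
}

mark_dead() {                           # mark_dead HOST: record when it was found dead
    forget_host "$1"
    DEAD_LIST="$DEAD_LIST $1=$(now)"
}

is_dead() {                             # true while HOST's timeout has not expired
    for entry in $DEAD_LIST; do
        if [ "${entry%%=*}" = "$1" ]; then
            [ $(( $(now) - ${entry##*=} )) -lt "$DEAD_TTL" ] && return 0
            forget_host "$1"            # timeout expired: host may be tried again
            return 1
        fi
    done
    return 1
}
```

A scheduler would call mark_dead when both the job and the
"ssh host true" probe fail, and skip any host for which is_dead
succeeds.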

> {On a side note, I'm afraid parallel gets too bloated with many features
> that could be handled by other tools; for example, this very problem could
> be remedied by the script mentioned above and adding the '-Nf' options to
> ssh. Maybe there should be a contrib section for parallel?}

I used to feel that way, but I have come to the conclusion that it is
important to have good defaults - even if these cost some
performance/bloat. E.g. the --group option could be done by a wrapper
script, but then users would have to deal with that.

/Ole
