bug-parallel



From: bernard
Subject: Re: GNU Parallel reports Deep recursion on subroutine "main::get_job_with_sshlogin"
Date: Thu, 20 Jan 2011 16:55:18 +0100 (CET)
User-agent: SquirrelMail/1.4.15

> 2011/1/20  <address@hidden>:
>
>> We have version 20101222
>>
>> I am getting the following error:
>>
>> Deep recursion on subroutine "main::get_job_with_sshlogin" at
>> /usr/bin/parallel line 988, <STDIN> line 64964.
>
> Please show me the command you are running.

Here is a test example. The scripts do not contain the
real computation of course. But as an example it should
do the job.

This is the command:
./generate.sh | parallel --retries 17 --sshloginfile machines -j +0 \
  --progress --nice 19 "echo {} | ./batch_dspath.sh" > output.txt

generate.sh:
#!/bin/bash

for I in {1..800000}; do echo $I;  done


batch_dspath.sh:
#!/bin/bash

sleep 10
HOST=`hostname`
read number
echo "${number} ${HOST}"

machines:
ssh -x arcikam # does not have ./batch_dspath.sh  !!
ssh -x drahokam
ssh -x kamizola
ssh -x camouflage
ssh -x kamasutra
ssh -x camorra
ssh -x kamerun
ssh -x kamisama
ssh -x campfire
ssh -x camellia
ssh -x okamzik
ssh -x cambodia
ssh -x kamzik
ssh -x turing



If I run the command with arcikam commented out of the machines
file, it works as expected:
Computers / CPU cores / Max jobs to run
1:ssh -x cambodia / 4 / 4
2:ssh -x camellia / 2 / 2
3:ssh -x camorra / 4 / 4
4:ssh -x camouflage / 8 / 8
5:ssh -x campfire / 2 / 2
6:ssh -x drahokam / 4 / 4
7:ssh -x kamasutra / 2 / 2
8:ssh -x kamerun / 2 / 2
9:ssh -x kamisama / 4 / 4
10:ssh -x kamizola / 2 / 2
11:ssh -x kamzik / 2 / 2
12:ssh -x okamzik / 2 / 2
13:ssh -x turing / 4 / 4

Computer:jobs running/jobs completed/%of started jobs
1:4/20/10% 2:2/10/5% 3:4/20/10% 4:8/40/20% 5:2/8/4% 6:4/16/8% 7:2/8/4%
8:2/8/4% 9:4/16/8% 10:2/8/4% 11:2/9/4% 12:2/8/4% 13:4/20/10%
Computer:jobs running/jobs completed
1:4/24 2:2/12 3:4/24 4:8/48 5:2/12 6:4/22 7:2/12 8:2/12 9:4/24 10:2/12
11:2/12 12:2/12 13:4/24^C
....


If I leave arcikam in the machines file, I get this output:
Computers / CPU cores / Max jobs to run
1:ssh -x arcikam / 4 / 4
2:ssh -x cambodia / 4 / 4
3:ssh -x camellia / 2 / 2
4:ssh -x camorra / 4 / 4
5:ssh -x camouflage / 8 / 8
6:ssh -x campfire / 2 / 2
7:ssh -x drahokam / 4 / 4
8:ssh -x kamasutra / 2 / 2
9:ssh -x kamerun / 2 / 2
10:ssh -x kamisama / 4 / 4
11:ssh -x kamizola / 2 / 2
12:ssh -x kamzik / 2 / 2
13:ssh -x okamzik / 2 / 2
14:ssh -x turing / 4 / 4

Computer:jobs running/jobs completed/%of started jobs
1:4/97/70% 2:4/0/2% 3:2/0/1% 4:4/0/2% 5:8/0/5% 6:2/0/1% 7:4/0/2% 8:2/0/1%
9:2/0/1% 10:4/0/2% 11:2/0/1% 12:2/0/1% 13:2/0/1% 14:4/0/2%Deep recursion
on subroutine "main::get_job_with_sshlogin" at /usr/bin/parallel line 988,
<STDIN> line 144.
1:4/98/70% 2:4/0/2% 3:2/0/1% 4:4/0/2% 5:8/0/5% 6:2/0/1% 7:4/0/2% 8:2/0/1%
9:2/0/1% 10:4/0/2% 11:2/0/1% 12:2/0/1% 13:2/0/1% 14:4/0/2%Deep recursion
on subroutine "main::get_job_with_sshlogin" at /usr/bin/parallel line 988.
1:4/99/71% 2:4/0/2% 3:2/0/1% 4:4/0/2% 5:8/0/5% 6:2/0/1% 7:4/0/2% 8:2/0/1%
9:2/0/1% 10:4/0/2% 11:2/0/1% 12:2/0/1% 13:2/0/1% 14:4/0/2%Deep recursion
on subroutine "main::get_job_with_sshlogin" at /usr/bin/parallel line 988.
1:4/100/71% 2:4/0/2% 3:2/0/1% 4:4/0/2% 5:8/0/5% 6:2/0/1% 7:4/0/2% 8:2/0/1%
9:2/0/1% 10:4/0/2% 11:2/0/1% 12:2/0/1% 13:2/0/1% 14:4/0/2%Deep recursion
on subroutine "main::get_job_with_sshlogin" at /usr/bin/parallel line 988.
1:4/101/71% 2:4/0/2% 3:2/0/1% 4:4/0/2% 5:8/0/5% 6:2/0/1% 7:4/0/2% 8:2/0/1%
9:2/0/1% 10:4/0/2% 11:2/0/1% 12:2/0/1% 13:2/0/1% 14:4/0/2%Deep recursion
on subroutine "main::get_job_with_sshlogin" at /usr/bin/parallel line 988.
...
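
Since the failing run differs only in that arcikam lacks
./batch_dspath.sh, a pre-flight check of the machines file may help
to narrow this down. The following is only a sketch, not something
parallel itself provides. The function names list_logins and
check_logins are made up for illustration, and it assumes that
trailing "# ..." text in the machines file is a comment and that the
script path is resolved relative to the directory ssh lands in on
each host.

```shell
#!/bin/sh
# Sketch: check that every login in the machines file can see the
# worker script. Function names are hypothetical, chosen for this example.

# Print one login per line; drop "# ..." comments and blank lines.
list_logins() {
    sed -e 's/[[:space:]]*#.*$//' -e '/^[[:space:]]*$/d' "$1"
}

# Use each login line (e.g. "ssh -x arcikam") as a command prefix and
# run "test -x SCRIPT" through it; report the logins where that fails.
check_logins() {
    file=$1
    script=$2
    list_logins "$file" | while IFS= read -r login; do
        # </dev/null stops ssh from swallowing the rest of the login list
        if ! $login test -x "$script" </dev/null; then
            echo "missing: $login"
        fi
    done
}
```

It would be called as: check_logins machines ./batch_dspath.sh
A host is also reported as "missing" when ssh itself cannot reach
it, so the output only says where the script could not be confirmed.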

>> Also, parallel seems to report some jobs
>> as done although they are not done. It might be caused
>> by the previous error.
>
> Please show an example that shows this behaviour.

Maybe I was just confused: parallel shows the jobs as done
even though they actually failed. So in some sense they were "done".

>> One more comment: if I kill parallel, the programs it started
>> over ssh on the remote computers keep running. This
>> might cause trouble if you need to restart the computation,
>> as the computers you want to use are still busy with
>> the previous, now killed, computation.
>
> Please show an example that shows this behaviour.

The provided example exhibits this behaviour, although with
sleep 10 it is not very obvious. With sleep 1000, the processes
on the target machines keep waiting for the sleep to finish
even though the main command has already been killed.
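
To make the leftover processes visible after killing parallel,
something like the following could be run against the same machines
file. Again this is only a sketch: list_logins and report_leftovers
are hypothetical names, trailing "# ..." text in the machines file is
assumed to be a comment, and it assumes pgrep is installed on the
remote hosts and that the worker shows up there under the process
name batch_dspath.sh.

```shell
#!/bin/sh
# Sketch: list worker processes still running on each host.
# Function names are hypothetical, chosen for this example.

# Print one login per line; drop "# ..." comments and blank lines.
list_logins() {
    sed -e 's/[[:space:]]*#.*$//' -e '/^[[:space:]]*$/d' "$1"
}

# Run "pgrep NAME" through each login and print the PIDs found, if any.
report_leftovers() {
    file=$1
    name=$2
    list_logins "$file" | while IFS= read -r login; do
        # paste joins the PIDs (one per line) into a single line
        pids=$($login pgrep "$name" </dev/null | paste -sd ' ' -)
        if [ -n "$pids" ]; then
            echo "$login: $pids"
        fi
    done
}
```

It would be called as: report_leftovers machines batch_dspath.sh
Hosts with no matching process print nothing.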

Best regards,

Bernard

PS: Sorry that I forgot the subject in my previous email.


