
GNU Parallel Bug Reports Signal SIGCHLD received, but no signal handler set


From: Rick Masters
Subject: GNU Parallel Bug Reports Signal SIGCHLD received, but no signal handler set
Date: Tue, 11 Oct 2016 19:41:45 +0000

I believe I have isolated an issue reported here recently
and elsewhere, and I would like to suggest a solution.

 

http://lists.gnu.org/archive/html/parallel/2016-07/msg00011.html

http://stackoverflow.com/questions/39754323/how-to-avoid-sigchld-error-in-bash-script-that-uses-gnu-parallel

 

The problem does indeed appear to be a result of the line:

delete $SIG{CHLD};

 

The line above appears to be functionally equivalent [1] to:

$SIG{CHLD}="DEFAULT";

 

But the first line has a problem while the second does not.

 

--

[1] I ran them under strace and the sigaction calls are the same.
Also, parallel appears to behave the same way with either line.

In contrast, this does not work for me:

$SIG{CHLD}="IGNORE";

When I try that, parallel seems to stall. Also, the strace
sigaction calls look different from the other two variations.
Also, this discussion seems to indicate that IGNORE is an alternative
to doing waitpid manually (and parallel does use waitpid):

http://www.perlmonks.org/?node_id=1047688

--

 

The cause of the problem, which may only affect older versions of perl,
appears to be that deleting the %SIG entry is not resilient to receiving
multiple signals in a short period of time.
Somehow perl can receive a signal at a time when it does not know
what to do with it because the pointer it keeps internally to
the handler (PL_psig_ptr[sig]) is null, so it prints an error and exits.

 

This is from the source code of perl:

 

    if (!PL_psig_ptr[sig]) {
        PerlIO_printf(Perl_error_log, "Signal SIG%s received, but no signal handler set.\n",
                      PL_sig_name[sig]);
        exit(sig);
    }

 

This code was probably added to defend against unsolved signal handling bugs
that would otherwise crash perl, like this report from long ago:

http://markmail.org/thread/da7bde4lmcmh2h3b

 

Recent versions of perl do not seem to be vulnerable to the problem.
I can reproduce this with perl 5.10.1 on CentOS 6.
I cannot reproduce this with perl 5.16.3 on CentOS 7.

 

Unfortunately, I'm stuck supporting the old version for years to come.
It would be greatly appreciated if you could change the line in question
for your next version. I think assigning the handler a specific,
documented value ("DEFAULT") not only fixes the problem, but also improves
clarity and appears to be better supported by perl.

 

 

Here are the details for how this can be reproduced.

 

With parallel, I use the following test command, which is designed to end
precisely when the time advances to the next second. This allows multiple
instances of this script to end roughly at the same time:

 

~bash$ parallel --version

GNU parallel 20160922

<snip>

 

~bash$ cat sleepshort
#!/bin/bash

# Wait a second, then spin until the clock ticks over to the next second.
sleep 1
foo="$(date)"
while [ "$(date)" = "$foo" ]; do
   printf ""
done
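The end-on-the-second-boundary idea can also be sketched with epoch seconds rather than full date strings. This is only an illustrative variant, not the script used in this report: the function name wait_for_next_second is hypothetical, the initial "sleep 1" is left out, and it assumes a date command that supports +%s (GNU coreutils does).

```bash
#!/bin/bash

# Hypothetical helper: busy-wait until the wall clock ticks over to the
# next whole second, so that many concurrent callers return together.
wait_for_next_second() {
    local start
    start=$(date +%s)
    # Spin until the epoch-second counter changes.
    while [ "$(date +%s)" = "$start" ]; do
        :
    done
}

before=$(date +%s)
wait_for_next_second
after=$(date +%s)
echo "crossed a second boundary: $((after - before))s elapsed"
```

Because every instance leaves the loop on the same clock tick, their exits (and the resulting SIGCHLD deliveries to the parent) cluster tightly, which is what makes the race easier to hit.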

 

Then, many of these are launched at the same time:

 

~bash$ parallel --jobs $((20 + $RANDOM % 50)) -D run -v ./sleepshort ::: {1..200}

 

The problem may not happen on the first attempt, so the command can be
repeated in a loop until it fails:

~bash$ while parallel --jobs $((20 + $RANDOM % 50)) -D run -v ./sleepshort ::: {1..200}; do true; done

 

For me, this takes only a few attempts (a few minutes at most) to reproduce the problem.
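To record how many attempts a reproduction takes, the retry loop can be wrapped in a small helper. The following is a minimal sketch, not something from the original report: run_until_failure and flaky are hypothetical names, and flaky is only a stand-in that fails on its third call so the example runs without parallel installed.

```bash
#!/bin/bash

# Hypothetical helper: rerun a command until it fails or a cap is reached.
# Prints the attempt number of the first failure and returns 0;
# returns 1 if the command never failed within max_attempts.
# The command's stdout is discarded; its stderr stays visible.
run_until_failure() {
    local max_attempts=$1
    shift
    local attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if ! "$@" >/dev/null; then
            echo "$attempt"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Stand-in for the parallel invocation: fails on its third call.
# The call count lives in a temp file so it survives subshells.
counter_file=$(mktemp)
echo 0 > "$counter_file"
flaky() {
    local n
    n=$(( $(cat "$counter_file") + 1 ))
    echo "$n" > "$counter_file"
    [ "$n" -lt 3 ]
}

failed_on=$(run_until_failure 10 flaky)
echo "first failure on attempt $failed_on"
rm -f "$counter_file"
```

With parallel, the call might look like "run_until_failure 100 parallel --jobs 40 ./sleepshort ::: {1..200}", where both the cap of 100 and the fixed job count are arbitrary choices.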

 

Also, there is a more direct way of reproducing the underlying perl issue.

 

Run this perl script in one terminal to handle signals, and run the second
script in another terminal to send signals to the first.

The second script depends on the name of this script, so call it "sigtest.pl":

#!/usr/bin/perl

while (1) {
    print "adding handler\n";
    $SIG{CHLD} = sub { print "gotchild\n"; };
    print "deleting handler\n";
    delete $SIG{CHLD};
}

 

 

Run it:

~bash$ ./sigtest.pl

 

Then, run this bash script on another terminal:

#!/bin/bash

pid=$(ps x | grep sigtest | grep -v grep | awk '{print $1}')

while kill -SIGCHLD $pid; do
  true
done

 

This instantly reproduces the problem for me. The first command exits with:

Signal SIGCHLD received, but no signal handler set

But if I replace the delete as suggested, the problem does not occur.
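For reference, the replacement can be exercised from a single terminal with a bounded variant of the test. This is a rough sketch under assumptions rather than the exact test described here: the function name run_fixed_sigtest and both iteration counts are arbitrary, and the perl loop is finite so the run ends on its own.

```bash
#!/bin/bash

# Sketch: bounded variant of sigtest.pl that resets the handler with
# "DEFAULT" instead of delete, while this shell sends it SIGCHLD.
# Both iteration counts are arbitrary assumptions.
run_fixed_sigtest() {
    perl -e '
        for my $i (1 .. 20000) {
            $SIG{CHLD} = sub { };       # install a handler
            $SIG{CHLD} = "DEFAULT";     # reset it explicitly
        }
        print "survived\n";
    ' &
    local pid=$!
    local i=0
    # Send signals until the perl process is gone or the cap is reached.
    while [ "$i" -lt 50000 ]; do
        kill -s CHLD "$pid" 2>/dev/null || break
        i=$((i + 1))
    done
    wait "$pid"
}

run_fixed_sigtest
echo "perl exit status: $?"
```

Swapping the "DEFAULT" assignment back to "delete $SIG{CHLD};" gives a bounded version of the failing case for comparison on an affected perl.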

 

Please let me know if you need any more information, and thank you
for the work you do on parallel.

 

 

Regards,

Rick Masters

F5 Networks

 

