monit-general

From: Christian Hopp
Subject: Re: NFS is going down, et al [was: pidfiles aka. Re: [CVS] unix socket support added]
Date: Tue, 6 Aug 2002 16:14:24 +0200 (CEST)

On 6 Aug 2002, Jan-Henrik Haukeland wrote:

Hi everyone!

> Christian Hopp <address@hidden> writes:
>

(...)

> > Let me cite "man mount" on this:
>
> >        The program accessing a file on a NFS mounted file system
> >        will hang when the  server crashes. The process cannot be
> >        interrupted or killed unless you also specify intr.
>
> Yes, and I believe that alarm is exactly such an interrupt. If a process
> receives the alarm signal it _has_ to act on it. The default behavior
> is to terminate the process unless an alarm handler was installed. So
> alarm (2) should not have any problems jolting monit out of a blocked
> file read.

I have attached a little bit of code to prove that you are wrong.
The program opens an ISO image from my NFS-mounted home dir and reads
it for 5 seconds, then it gets a timeout.  I ran it once normally, and
the second time I pulled out my network cable during the run.  I
"time"ed both runs and you can see...

(address@hidden) ~/compile/monit/tests> time ./nfs_test
Timeout occured!

            Time spent in user mode   (CPU seconds) : 0.030s
            Time spent in kernel mode (CPU seconds) : 0.810s
            Total time                              : 0:05.00s
            CPU utilisation (percentage)            : 16.8%
            Times the process was swapped           : 0
            Times of major page faults              : 62
            Times of minor page faults              : 152
Exit 1
(address@hidden) ~/compile/monit/tests> time ./nfs_test
Timeout occured!

            Time spent in user mode   (CPU seconds) : 0.030s
            Time spent in kernel mode (CPU seconds) : 0.780s
            Total time                              : 0:22.72s
            CPU utilisation (percentage)            : 3.5%
            Times the process was swapped           : 0
            Times of major page faults              : 62
            Times of minor page faults              : 127
Exit 1

The alarm signal is still handled... but later!  The second run took
22.72s in total instead of 5s.  So, as I told you, a process halted on
NFS can't be woken up by anything.

I had even better stuff on my Linux machine last week.  Processes that
tried to access /usr/include on ext3 were halted, and they could not
be killed, not even with KILL.  A reboot helped... to change it to
/usr/share/something.  Monit would just be stuck in that situation.

> > We have unfortunately a very unreliable network right now.
>
> You're in an excellent position to test this then :) I'm betting you a
> bottle of beer that alarm will work.

Our address is on our web page. (-: But for me something
non-alcoholic, please!

(...)

> > If we are in "what if" discussions here are some other things to think
> > about.
> >
> > * Monit checks a server which is defunct, i.e. a zombie.  Is it in
> >   "good health" or not?  Pidfile and pid do match.  I don't know what
> >   its ports do (do they still connect or not?).
>
> They should not accept a connection, but I'm not quite sure since the
> kernel handles socket connections and delivers them to the process.
> Anyway, a zombie process, even if it accepts a connection, should not
> pass the default connection test (the one with select) and of course
> not any protocol test.

I can test it some other day; it's easy to build a zombie test bench. (-:
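
A minimal sketch of such a test bench, assuming a POSIX system; the
helper names spawn_zombie, pid_exists and reap are made up for
illustration.  The child exits immediately and the parent does not
wait for it, so the child stays a zombie until it is reaped.  A pid
check with kill(pid, 0) still succeeds for a zombie, which is exactly
the "pidfile and pid do match" case.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits at once.  Since the parent does not wait
 * for it here, the child remains a zombie.  Returns the child's pid,
 * or -1 if fork() failed. */
pid_t spawn_zombie(void)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(0);               /* child: terminate immediately */
    return pid;
}

/* Simple existence check, as used for pidfile tests: signal 0 only
 * performs the error checking.  Returns 1 if the pid exists. */
int pid_exists(pid_t pid)
{
    assert(pid > 0);            /* pid <= 0 would address process groups */
    return kill(pid, 0) == 0;
}

/* Reap the zombie so that it disappears from the process table.
 * Returns 1 on success. */
int reap(pid_t pid)
{
    return waitpid(pid, NULL, 0) == pid;
}
```

For the port question one would let the child open a listening socket
before it exits and then try to connect from the parent; that part is
left out here.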

> But it's still a valid question, especially for daemons without
> network code, like crond. This could be solved if I ever get around
> to hacking the process status code I was planning to do (see item 6. in
> the next release plan). Maybe you would like to give it a stab?

I think I can do it.  I have already looked at /proc access on Linux
and Solaris.  It's going to be quite OS dependent.  I will give it a
generic framework so that it can be accessed in an OS-independent way
(or can anyone recommend a good lib for this?), but the backends will
differ, especially between Linux and Solaris.
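
To give an idea of the Linux backend, here is a minimal sketch that
reads the state letter from /proc/<pid>/stat.  The function names are
made up for illustration and this is not monit code.  It is Linux
only; as far as I know Solaris exposes a binary psinfo structure
under /proc instead, so that backend would look different.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>             /* getpid() for callers */

/* Return the state letter of a process from /proc/<pid>/stat
 * (for example 'R' running, 'S' sleeping, 'Z' zombie), or '\0' if
 * the file cannot be read or parsed.  Linux only. */
char linux_process_state(pid_t pid)
{
    char path[64];
    char line[1024];
    FILE *fp;
    char *p;

    assert(pid > 0);
    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return '\0';            /* no such process, or no /proc */
    if (fgets(line, sizeof line, fp) == NULL) {
        fclose(fp);
        return '\0';
    }
    fclose(fp);

    /* The line looks like "pid (comm) state ...".  The command name
     * may itself contain spaces or ')', so search for the last ')'. */
    p = strrchr(line, ')');
    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return '\0';
    return p[2];
}

/* A zombie has the state letter 'Z'. */
int linux_process_is_zombie(pid_t pid)
{
    return linux_process_state(pid) == 'Z';
}
```

A generic front end could then offer one function per question (does
the process exist, is it a zombie) and pick the backend at compile
time.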

> > * A start/stop script returns with error, should monit still try to
> >   (re)start/stop the process?
>
> Good one, I thought about this someday when I was looking at the code.
> At least an alert message should be sent if monit cannot start the
> process. Now, only a log entry is made.

I mean that some programs take a lot of CPU/memory when they start, and
they would do so every cycle if monit keeps trying to start them.  Of
course the timeout statement can help with that.  How should this be
dealt with?  By not starting the process again, or by just sending a
mail?
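
One possible answer, as a sketch only: count consecutive failed starts
and give up with an alert after a limit, instead of retrying every
cycle forever.  All names here (start_policy, start_policy_update and
the action values) are hypothetical and do not describe how monit or
its timeout statement is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* What the caller should do after a start attempt. */
enum start_action {
    START_OK,                   /* script succeeded, nothing to do */
    START_RETRY_NEXT_CYCLE,     /* failed, try again next cycle */
    START_ALERT_AND_GIVE_UP     /* failed too often: send mail, stop trying */
};

struct start_policy {
    int max_failures;           /* consecutive failures tolerated */
    int failures;               /* consecutive failures seen so far */
};

/* Feed in the exit status of the start script and get the next
 * action.  A successful start resets the failure counter. */
enum start_action start_policy_update(struct start_policy *p, int exit_status)
{
    assert(p != NULL && p->max_failures > 0);

    if (exit_status == 0) {
        p->failures = 0;
        return START_OK;
    }
    p->failures++;
    if (p->failures >= p->max_failures)
        return START_ALERT_AND_GIVE_UP;
    return START_RETRY_NEXT_CYCLE;
}
```

With max_failures set to 3, the first two failed starts are retried
and the third one leads to an alert and no further attempts.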


Bye,

Christian


-- 
Christian Hopp                                email: address@hidden
Institut für Elektrische Informationstechnik             fon: +49-5323-72-2113
Technische Universität Clausthal                         fax: +49-5323-72-3197
  pgpkey: https://www.iei.tu-clausthal.de/pgp-keys/chopp.key.asc  (2001-11-22)

Attachment: nfs_test.c
Description: Text Data

