monit-general

Re: New user with several major monit problems


From: Martin Pala
Subject: Re: New user with several major monit problems
Date: Sat, 10 Sep 2005 01:44:23 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050802 Debian/1.7.10-1


Jonathan Wheeler wrote:
1st and most major problem.
"monit -g node1 stop all" kills every process on the system, not just
the services in the node1 group, nor even the services described in the
monitrc file... no no, it kills EVERYTHING, even the local console
session on the machine is kicked out, along with monit itself!

This happens with monit running as a standalone daemon, or directly from
init. I've tried this on two completely different machines, one debian,
and one gentoo, with versions 4.5.1 and 4.5.

Monit does not stop or kill any process itself. When you call 'monit stop all', it just calls the stop method of each service (the one defined by its 'stop program' option).

If your system dies, it is probably caused by one of your stop scripts (these are not part of monit).
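To audit the stop scripts, it can help to list every 'stop program' that monit would call and then run the suspicious ones by hand. A minimal sketch, assuming a monitrc in which each service starts with a 'check' line and the stop command is given in double quotes; list_stop_programs is a hypothetical helper name, not part of monit:

```shell
# list_stop_programs MONITRC
# Hypothetical helper: prints "service: stop command" for each service
# that defines a 'stop program' in the given monitrc.
list_stop_programs() {
    awk '
        $1 == "check" { svc = $3 }          # remember the current service name
        $1 == "stop" && $2 == "program" {
            line = $0
            sub(/^[^"]*"/, "", line)        # drop everything up to the opening quote
            sub(/".*$/, "", line)           # drop the closing quote and anything after
            print svc ": " line
        }
    ' "$1"
}
```

For example, 'list_stop_programs /etc/monitrc' (the path is an assumption, use wherever your control file lives) prints one line per service, which makes it easy to spot a stop script that does more than it should.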

Here is an example: two groups, 'ldap' and 'sql', contain one service each. A third service is not part of any group:

--8<--
set daemon  5
set logfile /var/log/monit
set mailserver 127.0.0.1
set alert address@hidden
set httpd port 2812 and
    allow  127.0.0.1
    use address 127.0.0.1

check process slapd with pidfile /var/run/slapd/slapd.pid
   start program = "/etc/init.d/slapd start"
   stop program = "/etc/init.d/slapd stop"
   if failed host 127.0.0.1 port 389 protocol ldap3 then restart
   group ldap

check process mysql with pidfile /var/run/mysqld/mysqld.pid
   start program = "/etc/init.d/mysql start"
   stop program = "/etc/init.d/mysql stop"
   if failed host 127.0.0.1 port 3306 protocol mysql then restart
   group sql

check directory bin path /bin
   start program = "/bin/true"
   stop program = "/bin/true"
   if failed permission 755 then alert
--8<--

1.) monit is running in daemon mode, all services are working:

unicorn:~/cvs/monit# ./monit summary
The monit daemon 4.6 uptime: 8m

System 'unicorn'                    [0.14] [0.15] [0.15]
Process 'slapd'                     running
Process 'mysql'                     running
Directory 'bin'                     accessible

2.) ldap group is stopped (as you can see the system keeps running, slapd was stopped and unmonitored):

unicorn:~/cvs/monit# ./monit -g ldap stop all
unicorn:~/cvs/monit# ./monit summary
The monit daemon 4.6 uptime: 8m

System 'unicorn'                    [0.41] [0.23] [0.17]
Process 'slapd'                     not monitored
Process 'mysql'                     running
Directory 'bin'                     accessible


3.) ldap group started again:

unicorn:~/cvs/monit# ./monit -g ldap start all
unicorn:~/cvs/monit# ./monit summary
The monit daemon 4.6 uptime: 10m

System 'unicorn'                    [0.40] [0.25] [0.18]
Process 'slapd'                     running
Process 'mysql'                     running
Directory 'bin'                     accessible


4.) all services can even be stopped without affecting the system:

unicorn:~/cvs/monit# ./monit stop all
unicorn:~/cvs/monit# ./monit summary

The monit daemon 4.6 uptime: 13m

System 'unicorn'                    [0.30] [0.23] [0.18]
Process 'slapd'                     not monitored
Process 'mysql'                     not monitored
Directory 'bin'                     not monitored


2nd, and related problem.
Groups don't work.
monit -g weeelookat me start, or monit -g abcdefg -V, give exactly the
same results as monit without -g. monit -g node1 status, is also the
same as monit status.

The group (-g) option is supported only by the following arguments:

start
stop
monitor
unmonitor
restart

The 'status' and 'summary' commands really do display the status of all services, regardless of -g. Currently this is a feature; maybe we should change it ...
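Until then, a per-group view can be approximated outside monit by filtering the summary. This is only a sketch of a workaround, not a monit feature: group_summary is a hypothetical helper, and it assumes the control file names each service as the third word of its 'check' line, lists the group on a 'group NAME' line inside that block, and that 'monit summary' prints the quoted service name as the second column (as in the 4.6 output above):

```shell
# group_summary GROUP MONITRC
# Hypothetical helper: prints only the 'monit summary' lines of services
# that belong to GROUP according to MONITRC.
group_summary() {
    monit summary | awk -v grp="$1" '
        NR == FNR {                          # first input: the control file
            if ($1 == "check") svc = $3
            if ($1 == "group" && $2 == grp) want["\047" svc "\047"] = 1
            next
        }
        ($2 in want)                         # second input: the summary on stdin
    ' "$2" -
}
```

For example, 'group_summary ldap /etc/monitrc' (the path is an assumption) would show only the slapd line from the example configuration.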


Most annoyingly, for my cluster monit -g node1 stop all (as taken
directly from your documentation) kills the *entire* server (see problem 1)

Cannot be caused by monit - see above.


3rd issue.
Dependencies, it would appear that monit won't wait in between
dependencies. In my case I have it set to start drbd, followed by mount,
and finally starting nfs.
When I issue an monit start nfs, it attempts to start all 3 services in
the space of 1 second, which of course fails horribly as each takes a
little while to start up.

When using dependencies, monit currently doesn't check whether a service earlier in the chain is actually running before starting its dependants. It only guarantees the correct start order. If a prerequisite in your service chain starts slowly, you should modify the start scripts of the dependent services to wait until the parent service is running.

You can use, for example, a simple fixed 'sleep' in the start script, or some method that checks whether the service is running, such as 'monit summary'. The following example exits with status 1 if slapd is running and 0 otherwise (the inverse of the usual shell convention, so that a 'while' loop keeps waiting as long as the service is down):

monit summary | awk '/slapd/ {exit !($3 != "running")}';

You can then put this test into a loop in the start script which waits for the service to start, for example:

--8<--
while monit summary | awk '/slapd/ {exit !($3 != "running")}'
do
 sleep 5
done
start_service   # placeholder for the script's real start command
--8<--

(You can also add a give-up counter in case the prerequisite service remains down for a long time, etc.)
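A give-up counter of the kind mentioned above could look roughly like this. It is a sketch only: wait_for_service, service_is_running, MAX_TRIES and SLEEP_SECS are hypothetical names, and the test assumes the 'monit summary' format shown earlier (quoted service name in the second column, state in the third). Unlike the one-liner above, the check here follows the usual shell convention (status 0 means running) and matches the service name exactly:

```shell
# Hypothetical tunables: how many times to poll and how long to sleep between polls.
MAX_TRIES=${MAX_TRIES:-12}
SLEEP_SECS=${SLEEP_SECS:-5}

# Succeeds (status 0) only if monit summary lists service $1 as "running".
service_is_running() {
    monit summary | awk -v svc="'$1'" '
        $2 == svc && $3 == "running" { found = 1 }
        END { exit !found }
    '
}

# Polls until service $1 is running; returns 1 after MAX_TRIES failed checks.
wait_for_service() {
    tries=0
    until service_is_running "$1"; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$MAX_TRIES" ]; then
            echo "giving up: $1 is still not running" >&2
            return 1
        fi
        sleep "$SLEEP_SECS"
    done
    return 0
}
```

In a start script this would be used as 'wait_for_service slapd || exit 1' before the real start command, so the dependent service is not started when its prerequisite never comes up.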


Martin



