monit-general

Re: Monit 4 enhancement requests


From: Jan-Henrik Haukeland
Subject: Re: Monit 4 enhancement requests
Date: Thu, 25 Sep 2003 14:16:18 +0200
User-agent: Gnus/5.1002 (Gnus v5.10.2) XEmacs/21.4 (Reasonable Discussion, linux)

Martin Pala <address@hidden> writes:

> Similar case is the device - in the case that you have something
> like this:
>
> if size > 80% then alert
> if size > 99% then stop
>
> In the period between 80-90% you will receive alert for each
> monitoring cycle. You need to be alerted at 80% watermark to allow
> solve the problem before critical error will occure (extend or clear
> filesystem). You can't timeout the service after few alert cycles,
> because you need to stop the filesystem and all its dependant services
> gracefully in the emergency case.

I see your point.  The same problem applies to

 if cpu > 40% then alert
 if cpu > 99% then stop
 if mem > 80Mb then alert
 if mem > 150Mb then stop

And so on. Actually, I think there are two problems here. First,
timeout needs to cover events other than process restarts, as
suggested:

 IF x {event[, event]...} WITHIN y CYCLES THEN {timeout|timeout and exec}
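
A minimal sketch of how "x events WITHIN y CYCLES" could be tracked, assuming a fixed-size ring buffer with one slot per cycle. The names EventWindow, window_init, window_cycle and MAX_WINDOW are hypothetical and not taken from the monit source:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_WINDOW 64   /* hypothetical upper bound on y */

/* Hypothetical per-service state for "IF x events WITHIN y CYCLES". */
typedef struct {
    int  x;                 /* number of events that triggers the timeout */
    int  y;                 /* window length in monitoring cycles */
    bool hit[MAX_WINDOW];   /* ring buffer: did an event occur in that cycle? */
    int  pos;               /* slot that the next cycle will overwrite */
    int  count;             /* events currently inside the window */
} EventWindow;

void window_init(EventWindow *w, int x, int y) {
    assert(w != NULL && x > 0 && y > 0 && y <= MAX_WINDOW);
    memset(w, 0, sizeof *w);
    w->x = x;
    w->y = y;
}

/* Record one monitoring cycle. Returns true when at least x events
   fall within the last y cycles, i.e. when the timeout action is due. */
bool window_cycle(EventWindow *w, bool event_occurred) {
    assert(w != NULL);
    if (w->hit[w->pos])
        w->count--;                     /* the oldest cycle leaves the window */
    w->hit[w->pos] = event_occurred;
    if (event_occurred)
        w->count++;
    w->pos = (w->pos + 1) % w->y;
    return w->count >= w->x;
}
```

The caller would invoke window_cycle() once per validate cycle and run the timeout (and optional exec) action when it returns true.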

*BUT* second, and more importantly, changing the timeout statement
does not directly solve the problem that alerts will be sent en masse
between, e.g.:

 if size > 80% then alert

and 

 if size > 99% then stop

Especially if timeout is _not_ used.  We *need* to handle this within
monit and not in the configuration file. We need to implement an
algorithm in monit for each IF-TEST so that only one alert is sent per
test. Here is the algorithm for the size tests above:


 boolean seen_80= false;
 boolean seen_99= false;

 while validate
       if(size > 80%) {
          if(not seen_80) {
            send alert; seen_80= true;
          }
          /* nested, since a size above 99% is also above 80% */
          if(size > 99% and not seen_99) {
            send alert; seen_99= true;
          }
       } else {
          /* at or below the low watermark: re-arm both tests */
          seen_80= false;
          seen_99= false;
       }

This way, as long as disk usage keeps growing, only one alert is
sent per test. When usage drops back to 80% or below, the flags are
reset so we can start over. The same logic should be used for cpu
and mem. Checksum and timestamp are already okay, since after a
change the stored value is updated to the new one.

What do you think?

-- 
Jan-Henrik Haukeland



