monit-general

Re: Monit 4 enhancement requests


From: Martin Pala
Subject: Re: Monit 4 enhancement requests
Date: Thu, 25 Sep 2003 15:15:57 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030908 Debian/1.4-4

Jan-Henrik Haukeland wrote:

Martin Pala <address@hidden> writes:

A similar case is the device test. Suppose you have something
like this:

if size > 80% then alert
if size > 99% then stop

Between 80% and 99% you will receive an alert on every
monitoring cycle. You need to be alerted at the 80% watermark so that
you can solve the problem (extend or clean up the filesystem) before a
critical error occurs. You can't time out the service after a few alert
cycles, because in the emergency case you need to stop the filesystem
and all its dependent services gracefully.

I see your point.  The same problem applies to

if cpu > 40% then alert
if cpu > 99% then stop
if mem > 80Mb then alert
if mem > 150Mb then stop

And so on. Actually I think that there are two problems here. First,
there is a need to support timeout for events other than just process
restarts, as suggested:

IF x {event[, event]...} WITHIN y CYCLES THEN {timeout|timeout and exec}
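
To make the proposed statement concrete, here is a minimal sketch of how
"x events within y cycles" could be counted. It is only an illustration:
the class name EventTimeout, its parameters, and the "at least x events
in the last y cycles" reading are assumptions, not monit's actual
implementation or syntax semantics.

```python
from collections import deque


class EventTimeout:
    """Hypothetical helper: trips when enough events fall in a cycle window."""

    def __init__(self, max_events, within_cycles):
        self.max_events = max_events
        # Keep only the last `within_cycles` cycles; older ones drop off.
        self.history = deque(maxlen=within_cycles)

    def cycle(self, event_occurred):
        """Record one monitoring cycle and report whether to time out."""
        self.history.append(bool(event_occurred))
        return sum(self.history) >= self.max_events


# Assumed example: time out on 3 events within 5 cycles.
timeout = EventTimeout(max_events=3, within_cycles=5)
observed = [True, False, True, False, False, True, True, True]
results = [timeout.cycle(e) for e in observed]
```

With this reading, the timeout only trips once three events land inside
the same five-cycle window, so isolated events never trigger it.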

*BUT* secondly, and more importantly, changes to the timeout statement
do not directly solve the problem that alerts will be sent en masse
between e.g.:

if size > 80% then alert

and
if size > 99% then stop

Especially if timeout is _not_ used.  We *need* to handle this within
monit and not in the configuration file. We need to implement an
algorithm in monit for each IF-TEST so that only one alert is sent per
test. Here is the algorithm for the above size tests:


boolean seen_80 = false;
boolean seen_99 = false;

while validate {
    if (size > 80%) {
        // first crossing of the 80% watermark
        if (not seen_80) {
            send alert; seen_80 = true;
        }
        // first crossing of the 99% watermark
        if (size > 99% and not seen_99) {
            send alert; seen_99 = true;
        }
    } else {
        // back below 80%: re-arm both tests
        seen_80 = false;
        seen_99 = false;
    }
}

This way, as long as the disk usage keeps growing, only one alert is
sent per test. When usage drops back below 80% the flags are reset so
we can start over again. The same test should be used for cpu and mem.
Checksum and timestamp are already okay, since after a change the old
value is set to the new one.

What do you think?

Good point, I'm +1 for such a solution :)

It might also be good to support "recovery" alerts that notify the user when the "failed" state is over. The user would then be notified about both the beginning and the end of the error condition.
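
To sketch what I mean by recovery alerts: notify only on state changes,
once when the test starts failing and once when it passes again. The
function name transitions, the event labels, and the sample values are
assumptions for illustration only.

```python
def transitions(samples, limit):
    """Return (cycle, event) pairs only where the test changes state."""
    events = []
    failed = False
    for cycle, value in enumerate(samples):
        now_failed = value > limit
        if now_failed and not failed:
            events.append((cycle, "failed"))      # beginning of the error
        elif failed and not now_failed:
            events.append((cycle, "recovered"))   # end of the error
        failed = now_failed
    return events


# Assumed sample: disk usage in percent, one value per monitoring cycle.
events = transitions([70, 85, 90, 75, 60, 95], limit=80)
```

Cycles that stay in the same state produce no notification at all, which
also avoids the repeated alerts discussed above.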

Martin







