monit-general

Re: Monit 4 enhancement requests


From: Martin Pala
Subject: Re: Monit 4 enhancement requests
Date: Thu, 25 Sep 2003 09:10:30 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030908 Debian/1.4-4

Jan-Henrik Haukeland wrote:

Jan-Henrik Haukeland <address@hidden> writes:
ad. 4.) alert limitation
* Christian's suggestion; "alert [once] emailaddr".

When looking at this twice, I think we should drop this. The problem
is that it attacks the problem too well :) It will only send one
email alert while monit is running. Let's say apache goes down on
Monday and then again on Wednesday - you will then only get the Monday
alert and no further alerts from monit. This is clearly not what we
want.

The problem here is that if a service stops during the night and
monit tries unsuccessfully to start it the whole night, the mailbox
will be full of alert messages when you wake up the next morning.

BUT, it is exactly for this situation that the TIMEOUT statement was
introduced a long time ago. If timeout is used, you will only get a
few alerts until the process times out.

So the answer to the original request:

 *  Alert limitation
    When services that are checked often fail, monit spews a slew
    of alerts to the designated mailbox.  I'd like to see e.g. an
    "alertevery" keyword, that takes a time specification as its
    argument.  When != 0, monit should send out an alert only once
    per time period.

is simply to use the timeout statement. I don't think it is
necessary for us to do anything more with this request than to
recommend usage of the timeout statement.
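
For reference, the "alertevery" behaviour asked for in the request can be sketched in a few lines. This is only an illustration in Python of the idea (at most one alert per time period, with 0 meaning no limit); the class and method names are made up for the example and nothing like this exists in monit.

```python
class AlertThrottle:
    """Let through at most one alert per `period` seconds.

    Hypothetical helper, not monit code. A period of 0 disables
    the limit, matching the "When != 0" wording of the request.
    """

    def __init__(self, period):
        self.period = period
        self.last_sent = None  # time of the last alert that was let through

    def should_send(self, now):
        if (self.period == 0
                or self.last_sent is None
                or now - self.last_sent >= self.period):
            self.last_sent = now
            return True
        return False


# A service that fails on every 60-second cycle for 10 minutes,
# limited to one alert per 5 minutes.
throttle = AlertThrottle(period=300)
sent = [t for t in range(0, 600, 60) if throttle.should_send(t)]

# The same failures with the limit switched off.
unlimited = AlertThrottle(period=0)
sent_unlimited = [t for t in range(0, 600, 60) if unlimited.should_send(t)]
```

Here `sent` contains only the cycles at which a mail would actually go out, while `sent_unlimited` contains every failing cycle. Unlike "alert [once]", a later failure on another day would still produce an alert.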

##

Martin's proposal for a general statement of the form:

   IF x {event[, event]...} WITHIN y CYCLES THEN {timeout|alert}

could make sense in some cases, but I'm not sure that it is very
important to implement this. I must also admit that I have a problem
seeing the relevance of testing other stuff than processes within a
time period. That is, I'm not sure what kind of a problem would make
such statements necessary:

if 2 timestamp within 2 cycles then alert
if 6 timestamp within 6 cycles then timeout

(I'm not trying to be difficult here, I just can't see why testing
e.g. a timestamp or checksum event over a time period is important).

This was taken from real life - as you know, I'm monitoring timestamps
of database state files related to the iPlanet messaging server. When
an error occurs (and with iPlanet products this is not unusual), monit
floods the mailbox with messages. However, a failed timestamp test does
not necessarily mean a hard error - it means there is a problem. For
example, when the database is heavily loaded the timestamp test will
fail, but the problem may be temporary. I need to wait a few hours
before assuming a hard error and timing out the service, and I don't
need to be flooded by messages every cycle.

A similar case is the device test, where you have something like this:

if size > 80% then alert
if size > 99% then stop

Between 80% and 99% you will receive an alert on every monitoring
cycle. You need to be alerted at the 80% watermark so that you can
solve the problem (extend or clean up the filesystem) before a critical
error occurs. You can't time out the service after a few alert cycles,
because in the emergency case you need to stop the filesystem and all
its dependent services gracefully.
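
To make the proposed "IF x {event} WITHIN y CYCLES THEN action" form concrete, here is a rough sketch in Python of how such a rule could be evaluated as a count over the last y cycles. The names are invented for the illustration, and it assumes the rule matches whenever at least x of the last y cycles saw the event; whether the action should then fire once or on every matching cycle is not settled in this thread.

```python
from collections import deque


class WithinCycles:
    """Hypothetical evaluator for 'if <count> <event> within <cycles> cycles'."""

    def __init__(self, count, cycles):
        self.count = count
        # Only the outcome of the last `cycles` monitoring cycles is kept.
        self.window = deque(maxlen=cycles)

    def record(self, failed):
        """Record one cycle; return True if the rule matches after it."""
        self.window.append(bool(failed))
        return sum(self.window) >= self.count


# "if 6 timestamp within 6 cycles then timeout":
# five failed cycles, one good cycle, then six failed cycles.
rule = WithinCycles(count=6, cycles=6)
results = [rule.record(f) for f in [True] * 5 + [False] + [True] * 6]
```

In this run the rule matches only on the last cycle: the single good cycle has to leave the window before six failures out of six are seen again, so a temporary problem does not trigger the timeout.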


However, what I _can_ see is that we should probably change the
timeout statement to support more actions. That is, make the timeout
statement similar to the other if-tests. Now the timeout statement
looks like this:

      IF NUMBER RESTART NUMBER CYCLE(S) THEN TIMEOUT

Here, the last part, THEN TIMEOUT, simply means, then unmonitor the
service. To make this statement more general, it should support the
other actions also, like so:

IF NUMBER RESTART NUMBER CYCLE(S) THEN {ALERT|RESTART|STOP|EXEC|UNMONITOR}

This will make it the same type of if-test as the other tests. And it
does make sense to support such actions and not only unmonitor for
timeout. For instance, upon a timeout the user may want to send an SNMP
trap (it's called that, isn't it?) instead of an alert. This could be
done via the exec statement and a call to an external program to send
this trap.

What do you think?

I think we should have some way to specify an action for the case that
a particular event (of any type) reaches a specific ratio.

Martin






