Re: Monit 4 enhancement requests

monit-general

From:	Martin Pala
Subject:	Re: Monit 4 enhancement requests
Date:	Thu, 25 Sep 2003 09:10:30 +0200
User-agent:	Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030908 Debian/1.4-4

Jan-Henrik Haukeland wrote:

Jan-Henrik Haukeland <address@hidden> writes:

ad. 4.) alert limitation

Christian's suggestion; "alert [once] emailaddr".


Looking at this a second time, I think we should drop it. The problem
is that it attacks the problem too well :) It will only send one email
alert for as long as monit is running. Let's say apache goes down on
Monday and then again on Wednesday - you will then only get the Monday
alert and no further alerts from monit. This is clearly not what we
want.

The problem here is that if a service stops during the night and
monit tries unsuccessfully to start it the whole night, the mailbox
will be full of alert messages when you wake up the next morning.

BUT, it is exactly for this situation that the TIMEOUT statement was
introduced a long time ago. If timeout is used, you will only get a
few alerts until the process times out.

So the answer to the original request:

    Alert limitation

    When services that are checked often fail, monit spews a slew
    of alerts to the designated mailbox.  I'd like to see e.g. an
    "alertevery" keyword, that takes a time specification as its
    argument.  When != 0, monit should send out an alert only once
    per time period.

is simply to use the timeout statement, and I don't think it is
necessary for us to do anything more with this request than to
recommend usage of the timeout statement.
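
For comparison, the behaviour asked for in the request ("an alert only
once per time period") can be sketched in a few lines of Python. This is
only an illustration of the requested semantics, not monit code: the
class name AlertThrottle, the method should_send and the use of seconds
as the time unit are all assumptions made for the example.

```python
class AlertThrottle:
    """Hypothetical 'alertevery' logic: at most one alert per period."""

    def __init__(self, period):
        self.period = period    # seconds; 0 disables throttling, as in the request
        self.last_sent = None   # time of the last alert that was actually sent

    def should_send(self, now):
        """Return True if an alert raised at time `now` should be mailed."""
        if (self.period == 0
                or self.last_sent is None
                or now - self.last_sent >= self.period):
            self.last_sent = now
            return True
        return False


# A service that fails on every 60-second cycle for two hours,
# throttled to one alert per hour.
throttle = AlertThrottle(period=3600)
sent = [t for t in range(0, 7200, 60) if throttle.should_send(t)]
print(sent)  # [0, 3600]
```

Unlike "alert once", a throttle like this would still report a second
outage on Wednesday after one on Monday, because the period has long
expired by then.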

##

Martin's proposal for a general statement of the form:

   IF x {event[, event]...} WITHIN y CYCLES THEN {timeout|alert}

could make sense in some cases, but I'm not sure that it is very
important to implement this. I must also admit that I have a problem
seeing the relevance of testing other stuff than processes within a
time period. That is, I'm not sure what kind of a problem would make
such statements necessary:

if 2 timestamp within 2 cycles then alert
if 6 timestamp within 6 cycles then timeout

(I'm not trying to be difficult here, I just can't see why testing
e.g. a timestamp or checksum event over a time period is important).

This was taken from real life. As you know, I'm monitoring the
timestamps of database state files related to iPlanet messaging server.
When an error occurs (and with iPlanet products this is not unusual),
monit floods me with messages. However, a failed timestamp test does
not always mean a hard error; it means there is a problem. For example,
when the database is heavily loaded the timestamp test will fail, but
the problem may be temporary. I need to wait a few hours before
assuming a hard error and timing out the service, and I don't need to
be flooded by messages every cycle.
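
The two example rules quoted above ("if 2 timestamp within 2 cycles then
alert", "if 6 timestamp within 6 cycles then timeout") can be read as
counting failed tests over the last y cycles. A minimal Python sketch of
that reading follows; it is not monit's implementation, and the name
EventWindow and the sample failure sequence are made up for the
illustration.

```python
from collections import deque


class EventWindow:
    """Hypothetical 'IF x event WITHIN y CYCLES' condition."""

    def __init__(self, count, cycles):
        self.count = count                  # x: failures needed to trigger
        self.window = deque(maxlen=cycles)  # results of the last y cycles

    def record(self, failed):
        """Store this cycle's result; return True if the rule matches."""
        self.window.append(bool(failed))
        return sum(self.window) >= self.count


# One entry per monitoring cycle: True means the timestamp test failed.
failures = [False, True, True, False, True, True, True, True, True, True]

alert_rule = EventWindow(count=2, cycles=2)
timeout_rule = EventWindow(count=6, cycles=6)
alert_cycles, timeout_cycles = [], []
for cycle, failed in enumerate(failures):
    if alert_rule.record(failed):
        alert_cycles.append(cycle)
    if timeout_rule.record(failed):
        timeout_cycles.append(cycle)

print(alert_cycles)    # [2, 5, 6, 7, 8, 9]
print(timeout_cycles)  # [9]
```

A single failed cycle (cycle 1 followed by a pass, or cycle 4 on its
own) triggers nothing, and the service is only timed out once the
problem has persisted for six cycles in a row.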


A similar case is a device check, where you have something like this:

if size > 80% then alert
if size > 99% then stop

While usage sits above the 80% watermark and below the 99% stop
threshold, you will receive an alert on every monitoring cycle. You
need to be alerted at the 80% watermark so that you can solve the
problem (extend or clean up the filesystem) before a critical error
occurs. You can't time out the service after a few alert cycles,
because in an emergency you need to stop the filesystem and all its
dependent services gracefully.


However, what I _can_ see is that we should probably change the
timeout statement to support more actions. That is, make the timeout
statement similar to the other if-tests. Now the timeout statement
looks like this:

      IF NUMBER RESTART NUMBER CYCLE(S) THEN TIMEOUT

Here, the last part, THEN TIMEOUT, simply means, then unmonitor the
service. To make this statement more general, it should support the
other actions also, like so:

IF NUMBER RESTART NUMBER CYCLE(S) THEN {ALERT|RESTART|STOP|EXEC|UNMONITOR}

This will make it the same type of if-test as the other tests. And it
does make sense to support such actions and not only unmonitor for
timeout. For instance, upon a timeout the user may want to send an SNMP
trap (it's called that, isn't it?) instead of an alert. This could be
done via the exec statement and a call to an external program that
sends the trap.
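
To make the proposed grammar concrete, here is a small Python sketch
that parses statements of the form given above. It is an illustration
only, not monit's parser: the function name parse_timeout_rule, the
treatment of TIMEOUT as an alias for UNMONITOR (based on the
explanation above), the optional argument after EXEC, and the script
path in the usage example are all assumptions.

```python
import re

# Actions from the proposed statement, plus the existing TIMEOUT keyword.
ACTIONS = {"ALERT", "RESTART", "STOP", "EXEC", "UNMONITOR", "TIMEOUT"}

RULE = re.compile(
    r"^IF\s+(\d+)\s+RESTARTS?\s+(\d+)\s+CYCLES?\s+THEN\s+(\w+)(?:\s+(.+))?$",
    re.IGNORECASE,
)


def parse_timeout_rule(text):
    """Return (restarts, cycles, action, argument) or raise ValueError."""
    match = RULE.match(text.strip())
    if match is None:
        raise ValueError("not a timeout statement: %r" % text)
    action = match.group(3).upper()
    if action not in ACTIONS:
        raise ValueError("unknown action: %s" % action)
    if action == "TIMEOUT":
        action = "UNMONITOR"  # assumed alias: timeout means unmonitor
    return int(match.group(1)), int(match.group(2)), action, match.group(4)


print(parse_timeout_rule("IF 3 RESTART 5 CYCLES THEN TIMEOUT"))
# (3, 5, 'UNMONITOR', None)

# Hypothetical script path, used only to show the EXEC form.
print(parse_timeout_rule("if 5 restart 5 cycles then exec /usr/local/bin/send-trap"))
# (5, 5, 'EXEC', '/usr/local/bin/send-trap')
```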

What do you think?

I think we should have some way to specify an action for the case
where a particular event (of any type) reaches a specific ratio.


Martin
