Jan-Henrik Haukeland <address@hidden> writes:
ad. 4.) alert limitation
Christian's suggestion: "alert [once] emailaddr".
Having looked at this twice, I think we should drop it. The problem
is that it attacks the problem too well :) It will only send one
email alert for as long as monit is running. Let's say apache goes
down on Monday and then again on Wednesday - you will then only get
the Monday alert and no further alerts from monit. This is clearly
not what we want.
The underlying problem is that if a service stops during the night
and monit tries to start it unsuccessfully the whole night, the
mailbox will be full of alert messages when you wake up the next
morning.
BUT, it is exactly for this situation that the TIMEOUT statement was
introduced a long time ago. If timeout is used, you will only get a
few alerts until the process times out.
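
To put rough numbers on this, here is a minimal Python sketch that
counts alerts for a service that stays down all night. It is not
monit code. It assumes one alert per failed restart plus one final
alert when the timeout fires, which is my reading of the behaviour
described above; count_alerts and its parameters are made-up names.

```python
def count_alerts(down_cycles, max_restarts=None, window=None):
    """Return how many alerts are sent while a service stays down.

    Assumption (not taken from monit's source): one alert per failed
    restart, plus one final alert when the timeout fires.
    """
    alerts = 0
    restart_cycles = []  # cycle numbers at which a restart was attempted
    for cycle in range(down_cycles):
        alerts += 1                  # failed restart -> alert
        restart_cycles.append(cycle)
        if max_restarts is None:
            continue                 # no timeout statement configured
        # Restarts that fall inside the last `window` cycles.
        recent = [c for c in restart_cycles if cycle - c < window]
        if len(recent) >= max_restarts:
            alerts += 1              # the timeout alert itself
            break                    # service is unmonitored from here on
    return alerts


print(count_alerts(480))                            # no timeout: 480
print(count_alerts(480, max_restarts=5, window=5))  # with timeout: 6
```

With 480 poll cycles of downtime, the run without a timeout produces
one alert per cycle, while "5 restarts within 5 cycles" stops after a
handful.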
So the answer to the original request:
? Alert limitation
When services that are checked often fail, monit spews a slew
of alerts to the designated mailbox. I'd like to see e.g. an
"alertevery" keyword, that takes a time specification as its
argument. When != 0, monit should send out an alert only once
per time period.
is simply to use the timeout statement. I don't think it is
necessary for us to do anything more with this request than to
recommend using the timeout statement.
##
Martin's proposal for a general statement of the form:
IF x {event[, event]...} WITHIN y CYCLES THEN {timeout|alert}
could make sense in some cases, but I'm not sure that it is very
important to implement this. I must also admit that I have a problem
seeing the relevance of testing anything other than processes within
a time period. That is, I'm not sure what kind of problem would make
statements like these necessary:
if 2 timestamp within 2 cycles then alert
if 6 timestamp within 6 cycles then timeout
(I'm not trying to be difficult here, I just can't see why testing
e.g. a timestamp or checksum event over a time period is important).
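
For reference, this is how I read the semantics of "x events within
y cycles", as a small Python sketch. It is only an illustration of
the proposal, not an implementation plan; the class name and the
one-flag-per-cycle bookkeeping are my own assumptions.

```python
from collections import deque


class WithinCycles:
    """Hypothetical model of: IF <events> <event> WITHIN <cycles> CYCLES."""

    def __init__(self, events, cycles):
        self.events = events
        # One flag per poll cycle; deque drops the oldest automatically.
        self.history = deque(maxlen=cycles)

    def end_of_cycle(self, event_occurred):
        """Record this cycle and report whether the rule now matches."""
        self.history.append(bool(event_occurred))
        return sum(self.history) >= self.events


# "if 2 timestamp within 2 cycles then alert"
rule = WithinCycles(events=2, cycles=2)
print([rule.end_of_cycle(e) for e in (True, False, True, True)])
# prints [False, False, False, True]
```

The rule only matches once two timestamp events land in two
consecutive cycles.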
However, what I _can_ see is that we should probably change the
timeout statement to support more actions. That is, make the timeout
statement similar to the other if-tests. Currently the timeout
statement looks like this:
IF NUMBER RESTART NUMBER CYCLE(S) THEN TIMEOUT
Here, the last part, THEN TIMEOUT, simply means "then unmonitor the
service". To make this statement more general, it should also support
the other actions, like so:
IF NUMBER RESTART NUMBER CYCLE(S) THEN {ALERT|RESTART|STOP|EXEC|UNMONITOR}
This will make it the same type of if-test as the other tests. And it
does make sense to support such actions and not only unmonitor for
timeout. For instance, upon a timeout the user may want to send an
SNMP trap (it's called that, isn't it?) instead of an alert. This
could be done via the exec statement and a call to an external
program that sends the trap.
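
A rough Python sketch of what "support the other actions" amounts to:
the action named after THEN is looked up and run once the restart
limit is hit. The handler table, the function name and the
/usr/local/bin/send-trap path are all hypothetical, and the handlers
here only record what they were asked to do.

```python
ACTIONS = ("ALERT", "RESTART", "STOP", "EXEC", "UNMONITOR")


def run_timeout_action(service, action, handlers, argument=None):
    """Run the action named in the THEN part of the rule."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return handlers[action](service, argument)


# Stand-in handlers: each one just logs the request.
log = []
handlers = {
    name: (lambda service, arg, name=name: log.append((name, service, arg)))
    for name in ACTIONS
}

# Hypothetical external program that would send the SNMP trap.
run_timeout_action("apache", "EXEC", handlers, "/usr/local/bin/send-trap")
print(log)  # [('EXEC', 'apache', '/usr/local/bin/send-trap')]
```

With UNMONITOR as the action this gives the same result as today's
THEN TIMEOUT.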
What do you think?