monit-general

Re: uptime weirdness


From: Martin Pala
Subject: Re: uptime weirdness
Date: Thu, 19 Aug 2010 13:20:10 +0200

The next monit version (5.2) supports monitoring without pidfiles: you specify
a process pattern, which is compared against the running processes. This lets
you watch processes that have no pidfile, and the check no longer depends on
pidfile content.

Changelog excerpt:

--8<--
* Added support for monitoring processes without pidfile using pattern matching.
  You can use POSIX regular expressions or full strings matching process name.
  The process string corresponds to output of 'ps' utility. Some platforms
  like Mac OS X require super-user privileges to get full process name.
  The first match is used so this form of check is useful for unique
  pattern matching - the pidfile should be used where possible as it defines
  expected pid exactly (pattern matching won't be useful for Apache for example).
  Example usage (monitoring VMware virtual machine):
      check process vmware-debian matching "/usr/lib/vmware/bin/vmware-vmx .*debian4-x86.vmx"
          ...
--8<--

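To illustrate the "first match is used" rule from the changelog, here is a
minimal Python sketch. It is not monit's implementation: the function name and
the process table are hypothetical, Python's `re` module stands in for POSIX
regular expressions, and an unanchored search is assumed.

```python
import re

def first_matching_process(pattern, processes):
    """Return (pid, command) of the first process whose command line matches.

    `processes` is an iterable of (pid, command) pairs, in the order the
    process table is scanned. Returns None when nothing matches.
    """
    regex = re.compile(pattern)
    for pid, command in processes:
        if regex.search(command):  # unanchored search (an assumption)
            return pid, command
    return None

# Hypothetical process table, with command lines as 'ps' might print them.
processes = [
    (812, "/usr/sbin/sshd -D"),
    (1203, "/usr/lib/vmware/bin/vmware-vmx -x /vm/debian4-x86.vmx"),
    (1377, "/usr/lib/vmware/bin/vmware-vmx -x /vm/fedora13-x86.vmx"),
]

# A unique pattern selects exactly the intended virtual machine.
unique = first_matching_process(
    "/usr/lib/vmware/bin/vmware-vmx .*debian4-x86.vmx", processes)

# A pattern that fits several processes still yields only the first one,
# which is why this check suits unique patterns and not, say, Apache workers.
ambiguous = first_matching_process("vmware-vmx", processes)
```

With this sample table both calls return the entry for pid 1203, although the
second pattern also fits pid 1377.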

On Aug 19, 2010, at 2:56 AM, Gareth Pye wrote:

> Sorry for not replying earlier, your response had cleared up things for me.
> 
> Until today when it struck me how much of a huge bug this is. If a system is 
> power cycled (no normal shutdown procedure) so that the old pid files still 
> exist and some other random process is running with that pid then the task 
> that monit is meant to be monitoring will never be started.
> 
> The case I had just a few minutes ago was that the pid file ended up pointing 
> to monit itself.
> 
> Obviously the simple hack is to remove all pid files before starting monit 
> (or at least at some point in the boot procedure; doing it before monit has 
> started the processes seems most efficient). Wouldn't it make sense for monit 
> to ensure that the pid files aren't older than the current system uptime? 
> Obviously a process can't have been running longer than the host system.
> 
> Gareth Pye
> Engineer
> GPSat Systems Australia
> address@hidden
> Ph: 03 9455 0041
> Fax: 03 9455 0042
> 
> 
> On 12/08/10 20:56, Martin Pala wrote:
>> I'm not sure what the system uptime is in your case - the attached monit 
>> status output contains only the following uptimes:
>> 
>> 1.) monit uptime: 49m  =>  monit was started 49 minutes ago (system itself 
>> may be running much longer - this uptime is updated whenever monit itself is 
>> (re)started)
>> 2.) process 'BoomDataToMODBUS' uptime: 45m
>> 3.) process 'DataRouter' uptime: 21h9m
>> 
>> =>  if the system was started less than 21h9m ago at the point when monit 
>> status was taken, then the reported uptime of DataRouter process is wrong. 
>> With monit-5.0.3 it could happen because it's based on the pidfile's 
>> timestamp. The next monit release (5.2) fixes this problem. Monit-5.2 
>> changelog excerpt:
>> 
>> --8<--
>> * Show real process uptime - formerly the presented uptime was based on
>>   create/modify timestamp of process' pidfile which provides invalid uptime
>>   if the pidfile is replaced and process keeps running with original PID
>>   (such as on apache reload).
>>   Thanks to Nima Chavooshi for report.
>> --8<--
>> 
>> Regards,
>> Martin
>> 
>> 
>> 
>> On Aug 12, 2010, at 2:01 AM, Gareth Pye wrote:
>> 
>>   
>>> I've just noticed that the uptime for one of my processes as reported by 
>>> monit is greater than the system uptime. Is this plausible?
>>> 
>>> The Monit daemon 5.0.3 uptime: 49m
>>> 
>>> Process 'BoomDataToMODBUS'
>>>  status                            running
>>>  monitoring status                 monitored
>>>  pid                               909
>>>  parent pid                        1
>>>  uptime                            45m
>>>  children                          0
>>>  memory kilobytes                  1880
>>>  memory kilobytes total            1880
>>>  memory percent                    1.4%
>>>  memory percent total              1.4%
>>>  cpu percent                       0.0%
>>>  cpu percent total                 0.0%
>>>  data collected                    Wed Aug 11 16:40:27 2010
>>> 
>>> Process 'DataRouter'
>>>  status                            running
>>>  monitoring status                 monitored
>>>  pid                               901
>>>  parent pid                        886
>>>  uptime                            21h 9m
>>>  children                          0
>>>  memory kilobytes                  3232
>>>  memory kilobytes total            3232
>>>  memory percent                    2.5%
>>>  memory percent total              2.5%
>>>  cpu percent                       0.0%
>>>  cpu percent total                 0.0%
>>>  data collected                    Wed Aug 11 16:40:27 2010
>>> 
>>> File 'user.config'
>>>  status                            accessible
>>>  monitoring status                 monitored
>>>  permission                        644
>>>  uid                               0
>>>  gid                               0
>>>  timestamp                         Wed Aug 11 15:50:56 2010
>>>  size                              1295 B
>>>  checksum                          deeaffe3f625e93f00aeead0a0a3abd5(MD5)
>>>  data collected                    Wed Aug 11 16:40:27 2010
>>> 
>>> Filesystem 'root'
>>>  status                            accessible
>>>  monitoring status                 monitored
>>>  permission                        755
>>>  uid                               0
>>>  gid                               0
>>>  filesystem flags                  0
>>>  block size                        4096 B
>>>  blocks total                      120169 [469.4 MB]
>>>  blocks free for non superuser     14286 [55.8 MB] [11.9%]
>>>  blocks free total                 14286 [55.8 MB] [11.9%]
>>>  inodes total                      134976
>>>  inodes free                       116759 [86.5%]
>>>  data collected                    Wed Aug 11 16:40:27 2010
>>> 
>>> System 'Test-Base'
>>>  status                            running
>>>  monitoring status                 monitored
>>>  load average                      [0.00] [0.00] [0.00]
>>>  cpu                               0.0%us 0.1%sy 0.0%wa
>>>  memory usage                      12892 kB [10.1%]
>>>  data collected                    Wed Aug 11 16:40:27 2010
>>> 
>>> -- 
>>> Gareth Pye
>>> Engineer
>>> GPSat Systems Australia
>>> address@hidden
>>> Ph: 03 9455 0041
>>> Fax: 03 9455 0042
>>> 
>>> 
>>> --
>>> To unsubscribe:
>>> http://lists.nongnu.org/mailman/listinfo/monit-general

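The check proposed in the Aug 19 message quoted above (treat a pidfile that is
older than the last boot as stale) could be sketched roughly as follows. This
is an illustration only, not something monit is known to do: the function names
are hypothetical, reading /proc/uptime is Linux-specific, and the comparison
assumes the system clock has not been stepped since boot.

```python
import os
import time

def boot_time(uptime_path="/proc/uptime"):
    """Estimate boot time in epoch seconds (Linux-only assumption).

    The first field of /proc/uptime is the number of seconds since boot.
    """
    with open(uptime_path) as f:
        uptime_seconds = float(f.read().split()[0])
    return time.time() - uptime_seconds

def pidfile_is_stale(pidfile, booted_at):
    """True if the pidfile was last modified before the system booted.

    A process cannot have been running longer than the host, so such a
    pidfile must be left over from before the power cycle.
    """
    return os.path.getmtime(pidfile) < booted_at
```

A boot script could call `pidfile_is_stale(path, boot_time())` for each pidfile
and remove the stale ones before monit starts. Note that a pidfile rewritten
after boot by an unrelated process would pass this check, so it narrows the
problem and does not eliminate it.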


