monit-general
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: CPU usage in Monit


From: Martin Pala
Subject: Re: CPU usage in Monit
Date: Thu, 8 Dec 2011 13:50:21 +0100

Hi,

thanks for update. I have prepared the debug version, which logs the values computed based on /proc/stat right when they are ready and once again before the values are checked, so we can see whether the values were read+computed correctly and whether no memory corruption occurred before they were compared by the validation engine => there are two "CPUDEBUG" log entries per cycle.

You can get it here: http://www.mmonit.com/tmp/monit-5.3.1p2.tar.gz

To compile:
tar -xzf monit-5.3.1p2.tar.gz
cd monit-5.3.1p2
./configure
make

Then stop existing monit instance and run new monit binary:
./monit -vI  2>&1 | grep CPUDEBUG

after you'll replicate the problem, terminate monit with ^C and send the whole CPUDEBUG output since monit start

Regards,,
Martin


On Dec 8, 2011, at 11:39 AM, Lawrence, Wayne wrote:

Hi Martin just as a side note here i disabled the cpu ssystem test and tried again and it seems that the issue is present with all the cpu monitoring/
 
I used the restarting of httpd as i knew it would trigger and alert and these were the results.
 
Date:        Thu, 08 Dec 2011 10:27:59
      Action:      alert
      Host:        <hostname removed>
      Description: cpu user usage of 100.0% matches resource limit [cpu user usage>70.0%]

 

I ran vmstat 1 10 at the same time as you can see its the 4th line.
 

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 739220 142536 973532    0    0     4     7   10    6  0  0 99  0  0
 0  0      0 739088 142536 973532    0    0     0     0  114  160  0  1 99  0  0
 3  0      0 739088 142536 973536    0    0     0     0  126  169  1  2 97  0  0
 0  0      0 737336 142536 973544    0    0     0   168  721  796 35 14 50  1  0
 1  0      0 736964 142536 973544    0    0     0     0  109  160  1  1 98  0  0
 
and just to make it a little simpler i ran sar 1 10 as well as it is more human readable.
 
10:27:55        CPU     %user     %nice   %system   %iowait    %steal     %idle
10:27:56        all      1.01      0.00      1.01      0.00      0.00     97.98
10:27:57        all      0.00      0.00      1.00      0.00      0.00     99.00
10:27:58        all      3.96      0.00      3.96      0.00      0.00     92.08
10:27:59        all     32.00      0.00     12.00      1.00      0.00     55.00
 
Something struck me as odd while testing this yesterdays results reporting 50% system usage from 15.84% actual means the reported usage is 3.2 times the actual. todays reported user usage of 100% is 3.2 times the actual 32%. so it seems just need to work out why it is multiplying the results.
 
Regards
 
Wayne

On 7 December 2011 11:43, Lawrence, Wayne <address@hidden> wrote:
Hi Martin,
 
I downloaded the source from the Monit website and compiled it on the server.
I have started monit in verbose mode and this is the relevant information it outputs when the event occurs.
 
 cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
-------------------------------------------------------------------------------
   ../tools/bin/monit() [0x41a533]
 ../tools/bin/monit(LogError+0x9f) [0x41ad2f]
   ../tools/bin/monit(Event_post+0x328) [0x417ba8]
    ..t/tools/bin/monit() [0x428071]
    ../tools/bin/monit(check_system+0x2b) [0x4285bb]
    ../tools/bin/monit(validate+0x226) [0x42ad16]
   ../tools/bin/monit() [0x41422d]
    ../tools/bin/monit(main+0x511) [0x4149e1]
    /lib64/libc.so.6(__libc_start_main+0xfd) [0x3592c1ecdd]
    ../tools/bin/monit() [0x40b179]
-------------------------------------------------------------------------------
Unfortunately remote access is not an option but I will happily run a debug version to try and track down this problem as I really would like to use Monit for my current build.
 
Regards
 
Wayne
On 7 December 2011 11:17, Martin Pala <address@hidden> wrote:
Thanks for data.

The /proc/stat format is this: 

    cpu <user> <nice> <system> <idle> <wait> <irq> <softirq>

The values count the cpu cycles, so if we subtract the corresponding values from your output, we get this:

                   user   nice   system   idle   wait   irq   softirq   |   total
09:57:35    1         0        1              99     0       0      0          |    101
09:57:36    1         0        0              98     0       0      0          |    99
09:57:37    25       0        16           59     1       0      0          |    101
09:57:38    1         0        2              98     0       0      0          |    101

=> at  09:57:37 the cpu usage was:

user      = 24.75%
system =  15.84%
wait      =   0.99%

This corresponds to the previous vmstat output. Monit counts the cpu usage the same way as above and doesn't modify these values => your monit really reports strange cpu usage (reported 50% vs. real ~ 16%).

What's the origin of your monit binary? Did you compile it from original source code or some 3rd party source code distibution? (such as RHEL or Fedora repository). Or do you use the pre-compiled binaries from www.mmonit.com? Or some 3rd party binary, patches or source code from other site?

Please can you try to run monit in verbose mode and provide full output?:

   1.) stop monit
   2.) run monit in foreground with verbose mode enabled:
       ./monit -vI
   3.) after the problem happens, stop monit with "^C" and send output

I can also prepare debug version which will dump the cpu usage related informations or if you can provide remote access to the system, i can troubleshoot the problem remotely.


Regards,
Martin



On Dec 7, 2011, at 11:07 AM, Lawrence, Wayne wrote:

Hi Martin,
 
this is the output of the commands you requested.
 
1.) uname -m
 
x86_64
 
2.) file `which monit`
 
 ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, not stripped
I ran the command you supplied to get the cup usage directly as well while restarting the httpd service as i know this will generate an alert.
 

 

      Date:        Wed, 07 Dec 2011 09:57:37
      Action:      exec
      Host:        <hostname removed>
      Description: cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]

Wed Dec  7 09:57:34 GMT 2011
cpu  207060 501 103542 49452254 25303 83 1569 0 0
Wed Dec  7 09:57:35 GMT 2011
cpu  207061 501 103543 49452353 25303 83 1569 0 0
Wed Dec  7 09:57:36 GMT 2011
cpu  207062 501 103543 49452451 25303 83 1569 0 0
Wed Dec  7 09:57:37 GMT 2011
cpu  207087 501 103559 49452510 25304 83 1569 0 0
Wed Dec  7 09:57:38 GMT 2011
cpu  207088 501 103561 49452608 25304 83 1569 0 0
Wed Dec  7 09:57:40 GMT 2011
If my understanding of /proc/stat is coreect this still doesnt make any sense but i may be wrong.
 
Regards
 
Wayne
 

 
On 7 December 2011 09:37, Martin Pala <address@hidden> wrote:
Please can you check that your monit binary matches the system architecture? (i.e. for example 64-bit monit binary on 64-bit system - not 32-bit monit on 64-bit system) 

To verify provide please the output of following commands:
1.) uname -m
2.) file `which monit`

Monit takes the statistics from the /proc/stat kernel interface. You can collect the statistics manually like this - for example to fetch the state in 1 second intervals (30 samples):

$ for ((i=0; i<30; i++)); do date; grep "cpu " /proc/stat; sleep 1; done

Note: monit takes the first /proc/stat line ("cpu") which contains the overall cpu usage in the system (summary of all cpus). The /proc/stat also contains per-cpu statistics if you want to collect all the statistics, replace the "grep 'cpu '" simply with "cat".

Regards,
Martin


On Dec 7, 2011, at 10:04 AM, Lawrence, Wayne wrote:

Hi Martin,
 
I have tried various methods to dientify the cause of this and took your advice and used vmstat. I simply restarted the httpd process from the monit web interface while the comand was running and got the following warning.
 
      Description: cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
 
But vmstat doesnt show that level of usage at the point of alert. As you can see there is some usage in the 3rd line of the output when i restarted the httpd service but it doesnt seem enough to trigger an alert.
 
vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 859596 114684 856908    0    0     4     6   81   77  0  0 99  0  0
 0  0      0 859448 114684 856916    0    0     0     0  100   94  1  0 99  0  0
 0  0      0 898352 114692 815600    0    0     0   168  555  605 23 15 61  1  0
 
Not sure if there are any other tests i can run to narrow this down a bit further as it still isn't making sense.
 
Regards
 
Wayne
 
 


 
On 7 December 2011 08:27, Martin Pala <address@hidden> wrote:
Hi Lawrence,

the test which triggers the alert is "system" cpu => it's the time the system spend in kernel mode. The cpu usage could be triggered by some background kernel task, to verify the monit report matches the system cpu usage, you should use either "vmstat" or "top" instead of "ps".

Best regards,
Martin 


On Dec 6, 2011, at 1:19 PM, Lawrence, Wayne wrote:

Hi Igor,
 
the operating system is RHEL6 and monit version is 5.3.1
 
this is what i have in my config
 
    if cpu usage (user) > 70% then alert
    if cpu usage (system) > 30% then alert
    if cpu usage (wait) > 20% then alert

this is one of the errors
Description: cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
 
this is what i get in /var/log/messages
Dec  6 12:01:29 <hostname-removed> monit[864]: <hostname-removed> cpu system usage of 50.0% matches resource limit [cpu system usage>30.0%]
Dec  6 12:02:29 <hostname-removed> monit[864]: <hostname-removed><hostname-removed>' cpu system usage check succeeded [current cpu system usage=0.9%]
 
this is the output of ps --no-headers -A -o "%cpu sz ucomm" | sort -k1nr | head -20
 
 12:01:29 up 4 days, 20:24,  2 users,  load average: 0.04, 0.01, 0.00
             total       used       free     shared    buffers     cached
Mem:       2055108    1092176     962932          0      53156     811864
-/+ buffers/cache:     227156    1827952
Swap:      4128760          0    4128760
 1.2 44308 perl
 0.0     0 aio/0
 0.0     0 async/mgr
 0.0     0 ata/0
 0.0     0 ata_aux
 0.0     0 bdi-default
 0.0     0 cpuset
 0.0     0 crypto/0
 0.0     0 events/0
 0.0     0 ext4-dio-unwrit
 0.0     0 flush-253:0
 0.0     0 jbd2/dm-0-8
 0.0     0 kacpi_hotplug
 0.0     0 kacpi_notify
 0.0     0 kacpid
 0.0     0 kauditd
 0.0     0 kblockd/0
 0.0     0 kdmflush
 0.0     0 khelper
 0.0     0 khubd

Have to say i am at a total loss as there is no way the usage figures are accurate.
If there is any other info i can supply that will be useful please let me know.
 
Regards
 
Wayne


On 6 December 2011 12:03, Igor Homyakov <address@hidden> wrote:
Hi Lawrence,

Could you be a little bit more specific ?  Please provide information
about you operation system, monit version on which the problem
occurred and so on.

Regards
Igor Homyakov

On Tue, Dec 6, 2011 at 15:35, Lawrence, Wayne
<address@hidden> wrote:
> Hi,
>
> I have a few CPU usage checks in my monitrc but it seems monit is
> misreporting the usage.
>
> I have run several tests and it seems that monit is multiplying the actual
> usage by 10.
>
> I ran a process with top running in another shell and CPU usage for the user
> was never above 10% yet monit informed me that there was 100% cpu usage.
>
> I have tried various configurations including the one that came with the
> default config for system cpu monitoring and all seem to demonstrate the
> same issue.
>
> Any advice welcomed on this
>
> Regards
>
> Wayne Lawrence




--
To unsubscribe:
https://lists.nongnu.org/mailman/listinfo/monit-general


reply via email to

[Prev in Thread] Current Thread [Next in Thread]