monit-general

Re: total cpu process bug?


From: Martin Pala
Subject: Re: total cpu process bug?
Date: Wed, 11 Jan 2012 20:01:47 +0100

Hi Tom,

you're absolutely correct - there was a bug in the CPU usage computation which incorrectly capped a process's CPU usage to the fraction equivalent to a single CPU core. As you mentioned, the problem could occur when monitoring the CPU usage of multi-threaded processes on multi-core machines.

Thanks for the patch, it will be part of the next release.

Best regards,
Martin



--- monit/trunk/src/process.c (original)
+++ monit/trunk/src/process.c Wed Jan 11 19:55:27 2012
@@ -233,8 +233,8 @@
      /* The cpu_percent may be set already (for example by HPUX module) */
      if (pt[i].cpu_percent  == 0 && pt[i].cputime_prev != 0 && pt[i].cputime != 0 && pt[i].cputime > pt[i].cputime_prev) {
        pt[i].cpu_percent = (int)((1000 * (double)(pt[i].cputime - pt[i].cputime_prev) / (pt[i].time - pt[i].time_prev)) / systeminfo.cpus);
-        if (pt[i].cpu_percent > 1000 / systeminfo.cpus)
-          pt[i].cpu_percent = 1000 / systeminfo.cpus;
+        if (pt[i].cpu_percent > 1000)
+          pt[i].cpu_percent = 1000;
      }
    } else {
      pt[i].cputime_prev = 0;




On Jan 6, 2012, at 9:32 PM, Tom Pepper wrote:

Hi, Martin:

Can you clarify what exactly these two lines do in process.c's cpu percentage calculation?

        if (pt[i].cpu_percent > 1000 / systeminfo.cpus)
          pt[i].cpu_percent = 1000 / systeminfo.cpus;

They're causing total cpu to be misreported when processes use a large amount of CPU and many cores are present.  Shouldn't the "/ systeminfo.cpus" be dropped in both cases?  I assume it's meant to keep any strange math from causing process cpu percentage to ever exceed 100%.

For example, with a 120s query delay, a process I have on a 24 core box calculates with process.c's logic as:

cputime = 4809915 cputime_prev = 4803601 (delta 6314)
time = 13258814089.516930 time_prev = 13258812889.395201 (delta 1200)

(cputime - cputime_prev) / (time - time_prev) = 6314 / 1200 = 5.26
1000 * 5.26 / 24 cpus = 219 "pt[i].cpu_percent" (which appears to represent 21.9% in monitese), which is accurate.

1000 / num_cpus is 41.6 on my box.  Since 219 >> 41.6, it gets cut back to 41.6.

Thanks,
-t


On Jan 5, 2012, at 4:33 AM, Martin Pala wrote:

Yes, Wayne is correct and the usage is computed exactly as he described. Monit takes the summary of all CPU cores as 100%.

Regards,
Martin



On Jan 5, 2012, at 10:54 AM, Lawrence, Wayne wrote:

I may be wrong, and I am sure someone will correct me if I am, but it appears the way the CPU usage is worked out against the multiple cores is why you are getting this output.

The way I worked it out is the way I believe monit works it out, and the maths sort of makes sense:

24 cores: 24 x 100% = 2400

So if you divide 2400 by your usage from top:

2400 / 578 = 4.2

which would give you the percentage shown in monit.

Regards

Wayne


 
On 5 January 2012 08:13, Tom Pepper <address@hidden> wrote:
Hello:

I have a number of high-CPU processes that run on 24-core boxes configured e.g.:

check process emr-enc01-01 with pidfile /var/run/tada_liveenc_emr-enc01-01.pid
  start program = "/usr/local/tada/launch.sh -c emr-enc01-01"
  stop program = "/bin/bash -c 'kill -s SIGTERM `/bin/cat /var/run/tada_liveenc_emr-enc01-01.pid`'"
  if totalmem > 80% then alert
  if totalmem > 90% then restart
  if totalcpu < 10% for 10 cycles then alert

These processes create pidfiles which match correctly in top as:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                            
 1710 root      20   0 3064m 1.2g 7808 S  578 15.8  47:31.53 tada_liveenc                                                        
 1866 root      20   0 2954m 1.3g 7804 S  545 16.7  45:18.52 tada_liveenc     

However, monit sees these as a completely different total CPU usage:

Process 'emr-enc01-01'
  status                            Running
  monitoring status                 Monitored
  pid                               1710
  parent pid                        1
  uptime                            8m 
  children                          0
  memory kilobytes                  1372300
  memory kilobytes total            1372300
  memory percent                    16.7%
  memory percent total              16.7%
  cpu percent                       4.1%
  cpu percent total                 4.1%
  data collected                    Thu, 05 Jan 2012 00:05:49

Process 'emr-enc01-02'
  status                            Running
  monitoring status                 Monitored
  pid                               1866
  parent pid                        1
  uptime                            8m 
  children                          0
  memory kilobytes                  1362240
  memory kilobytes total            1362240
  memory percent                    16.6%
  memory percent total              16.6%
  cpu percent                       4.1%
  cpu percent total                 4.1%
  data collected                    Thu, 05 Jan 2012 00:05:49

Any thoughts on why this might be happening?  Hosts are Ubuntu Natty.  The master processes themselves spawn about 150 threads (not forks).

FYI:

662 address@hidden: $ uname -m
x86_64

663 address@hidden: $ file `which monit`
/usr/local/bin/monit: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.0, not stripped

664 address@hidden: $ monit -V
This is Monit version 5.3.2
Copyright (C) 2000-2011 Tildeslash Ltd. All Rights Reserved.

Thanks in advance,
-Tom

--
To unsubscribe:
https://lists.nongnu.org/mailman/listinfo/monit-general


