Re: [PATCH v8 74/74] cputlb: queue async flush jobs without the BQL

From:	Robert Foley
Subject:	Re: [PATCH v8 74/74] cputlb: queue async flush jobs without the BQL
Date:	Mon, 18 May 2020 09:46:36 -0400

We re-ran the numbers with the latest re-based series.

We used an aarch64 Ubuntu VM image. The host CPU was:
Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, 2 CPUs, 10 cores/CPU,
20 threads/CPU, 40 hardware threads total.

For the bare hardware and kvm tests (first chart), the host CPU was:
HiSilicon 1620 CPU 2600 MHz, 2 CPUs, 64 cores per CPU, 128 cores total.

First, we ran a test of building the kernel in the VM.
We saw neither major improvements nor major regressions.
The chart below shows the speedup of the kernel build on bare hardware,
kvm, and QEMU (both the baseline and the cpu lock series).


                   Speedup vs a single thread for kernel build

  40 +----------------------------------------------------------------------+
     |         +         +         +          +         +         +  **     |
     |                                                bare hardware ******* |
     |                                                          kvm ####### |
  35 |-+                                                   baseline $$$$$$$-|
     |                                                    *cpu lock %%%%%%% |
     |                                                 ***                  |
     |                                               **                     |
  30 |-+                                          ***                     +-|
     |                                         ***                          |
     |                                      ***                             |
     |                                    **                                |
  25 |-+                               ***                                +-|
     |                              ***                                     |
     |                            **                                        |
     |                          **                                          |
  20 |-+                      **                                          +-|
     |                      **                                #########     |
     |                    **                  ################              |
     |                  **          ##########                              |
     |                **         ###                                        |
  15 |-+             *       ####                                         +-|
     |             **     ###                                               |
     |            *    ###                                                  |
     |           *  ###                                                     |
  10 |-+       **###                                                      +-|
     |        *##                                                           |
     |       ##  $$$$$$$$$$$$$$$$                                           |
     |     #$$$$$%%%%%%%%%%%%%%%%%%%%                                       |
   5 |-+  $%%%%%%                    %%%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%    +-|
     |   %%                                                           %     |
     | %%                                                                   |
     |%        +         +         +          +         +         +         |
   0 +----------------------------------------------------------------------+
     0         10        20        30         40        50        60        70
                                   Guest vCPUs
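
For reference, the speedup plotted above is presumably the single-vCPU
build time divided by the build time at N vCPUs. A minimal sketch of that
calculation follows. The function name and the timings are illustrative
assumptions, not measurements from these runs:

```python
def speedup(times_by_vcpus):
    """Return {vcpus: T(1) / T(vcpus)} from wall-clock times keyed by vCPU count."""
    base = times_by_vcpus[1]  # the single-vCPU run is the reference point
    return {n: base / t for n, t in sorted(times_by_vcpus.items())}


# Placeholder wall-clock seconds for illustration only, NOT measured data.
example_times = {1: 1200.0, 2: 600.0, 4: 400.0}

for n, s in speedup(example_times).items():
    print(f"{n:3d} vCPUs: {s:.2f}x")
```

A flat curve in this metric means that adding vCPUs no longer shortens the
build, which is how the baseline and cpu lock lines behave in the chart.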


After seeing these results and the scaling limits inherent in the build itself,
we decided to run a test that might show the scaling improvements more clearly.
So we chose unix bench.

            Unix bench result (higher is better) vs number of vCPUs.

  3000 +--------------------------------------------------------------------+
       |      +      +      +      +      +     +      +      +      +      |
       |                                                   baseline ******* |
       |             #                                     cpu lock ####### |
       |           ##*#                                                     |
  2500 |-+        #** *#                                                  +-|
       |          #    *#                                                   |
       |         #*    *#                                                   |
       |         #      *#                                                  |
       |        #*       #                                                  |
       |        #        *#                                                 |
  2000 |-+     #*         #                                               +-|
       |       #          *#                                                |
       |      #*           *#                                               |
       |      #             *####                                           |
       |     #*             *    ###                                        |
  1500 |-+   #               ***    ##                                    +-|
       |     #                  *     ##                                    |
       |    #                    *      ###                                 |
       |    #                     **       ##                               |
       |    #                       *        ###                            |
       |   #                         *          ##                          |
  1000 |-+ #                          **          #                       +-|
       |  #                             *          ###                      |
       |  #                              **           #                     |
       |  #                                *           #                    |
       | #*                                 *           ##                  |
   500 |-#                                   **           #         #     +-|
       | #                                     *           #      ## #      |
       |#*                                      *           ##   #    #     |
       |#*                                       **            ##      #    |
       |*                                                     #         #   |
       |*     +      +      +      +      +     +  **********************#  |
     0 +--------------------------------------------------------------------+
       0      10     20     30     40     50    60     70     80     90    100
                                    Guest vCPUs

We also ran tests to compare the boot times.  This test showed the largest
improvement over the baseline.

           Boot time in seconds (lower is better) vs number of vCPUs.

  550 +---------------------------------------------------------------------+
      |      +      +      +      +      +      +      +      +      +   *  |
      |                                                    baseline ******* |
  500 |-+                                                  cpu lock #######-|
      |                                                              *      |
      |                                                             *       |
      |                                                            *        |
  450 |-+                                                        **      #+-|
      |                                                         *       #   |
      |                                            **          *      ##    |
  400 |-+                                         *  **      **      #    +-|
      |                                           *    *   **       #       |
      |                                          *       **       ##        |
  350 |-+                                       *       *        #        +-|
      |                                         *              ##           |
      |                                        *              #             |
  300 |-+                                     *             ##            +-|
      |                                       *            #                |
      |                                      *           ##                 |
      |                                     *           #                   |
  250 |-+                                 **           #                  +-|
      |                                  *           ##                     |
      |                                **           #                       |
  200 |-+                           ***           ##                      +-|
      |                           **           ###                          |
      |                          *         ####                             |
  150 |-+                       *    ######                               +-|
      |                     ****  ###                                       |
      |*                   *    ##                                          |
      |#*                #######                                            |
  100 |-#          ***###                                                 +-|
      | #*     #######                                                      |
      |  ######     +      +      +      +      +      +      +      +      |
   50 +---------------------------------------------------------------------+
      0      10     20     30     40     50     60     70     80     90    100
                                    Guest vCPUs

Pictures are also here:
https://drive.google.com/file/d/1ASg5XyP9hNfN9VysXC3qe5s9QSJlwFAt/view?usp=sharing

We plan to update this commit in the series with the final two results
(unix bench and boot times).

Regards,
-Rob


On Tue, 12 May 2020 at 15:26, Robert Foley <address@hidden> wrote:
>
> On Tue, 12 May 2020 at 12:27, Alex Bennée <address@hidden> wrote:
> > Robert Foley <address@hidden> writes:
> >
> > > From: "Emilio G. Cota" <address@hidden>
> > >
> > > This yields sizable scalability improvements, as the below results show.
> > >
> > > Host: Two Intel E5-2683 v3 14-core CPUs at 2.00 GHz (Haswell)
> > >
> > > Workload: Ubuntu 18.04 ppc64 compiling the linux kernel with
> > > "make -j N", where N is the number of cores in the guest.
> > >
> > >                       Speedup vs a single thread (higher is better):
> snip
> > >   png: https://imgur.com/zZRvS7q
> >
> > Can we re-run these numbers on the re-based series?
>
> Sure, we will re-run the numbers.
>
> Regards,
> -Rob
