
Re: TCG performance on PPC64


From: David Gibson
Subject: Re: TCG performance on PPC64
Date: Thu, 19 May 2022 14:13:03 +1000

On Wed, May 18, 2022 at 10:16:17AM -0300, Matheus K. Ferst wrote:
> Hi,
> 
> Since we started working with QEMU on PPC, we've noticed that
> emulating PPC64 VMs is faster on x86_64 than on PPC64 itself, even when
> compared with x86 machines that are slower in other workloads (like building
> QEMU or the Linux kernel).
> 
> We thought it would be related to the TCG backend, which would be better
> optimized on x86. As a first approach to better understand the problem, I
> ran some boot tests with Fedora Cloud Base 35-1.2[1] on both platforms.
> Using the command line
> 
> ./qemu-system-ppc64 -name Fedora-Cloud-Base-35-1.2.ppc64le -smp 2 -m 2G -vga
> none -nographic -serial pipe:Fedora-Cloud-Base-35-1.2.ppc64le -monitor
> unix:Fedora-Cloud-Base-35-1.2.ppc64le.mon,server,nowait -device
> virtio-net,netdev=vmnic -netdev user,id=vmnic -cdrom fedora-cloud-init.iso
> -cpu POWER10 -accel tcg -device virtio-scsi-pci -drive
> file=Fedora-Cloud-Base-35-1.2.ppc64le.temp.qcow2,if=none,format=qcow2,id=hd0
> -device scsi-hd,drive=hd0 -boot c
> 
> on a POWER9 DD2.2 and an Intel Xeon E5-2687W, a simple bash script reads the
> ".out" pipe until the "fedora login:" string is found and then issues a
> "system_powerdown" through the QEMU monitor. The ".temp.qcow2" file is backed
> by the original Fedora image and deleted at the end of the test, so every
> boot is fresh. Running the test 10 times gave us 235.26 ± 6.27 s on PPC64 and
> 192.92 ± 4.53 s on x86_64, i.e., TCG is ~20% slower on the POWER9.
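
[A rough sketch of the kind of timing loop described here, for reference; the
actual script isn't included in the mail, so the pipe handling, the socat
usage and the file names below are guesses:

    IMG=Fedora-Cloud-Base-35-1.2.ppc64le
    # fresh overlay backed by the pristine image, deleted again at the end
    qemu-img create -f qcow2 -F qcow2 -b ${IMG}.qcow2 ${IMG}.temp.qcow2
    # -serial pipe:${IMG} expects ${IMG}.in / ${IMG}.out fifos to exist
    mkfifo ${IMG}.in ${IMG}.out
    start=$(date +%s)
    ./qemu-system-ppc64 ... &                  # full command line as above
    grep -q "fedora login:" ${IMG}.out         # block until the login prompt
    echo system_powerdown | socat - UNIX-CONNECT:${IMG}.mon
    wait                                       # let the guest shut down
    echo "boot took $(( $(date +%s) - start )) s"
    rm -f ${IMG}.temp.qcow2 ${IMG}.in ${IMG}.out
]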
> 
> As a second step, I wondered if this gap would be the same when emulating
> other architectures on PPC64, so I used the same version of Fedora Cloud for
> aarch64[2] and s390x[3], using the following command lines:
> 
> ./qemu-system-aarch64 -name Fedora-Cloud-Base-35-1.2.aarch64 -smp 2 -m 2G
> -vga none -nographic -serial pipe:Fedora-Cloud-Base-35-1.2.aarch64 -monitor
> unix:Fedora-Cloud-Base-35-1.2.aarch64.mon,server,nowait -device
> virtio-net,netdev=vmnic -netdev user,id=vmnic -cdrom fedora-cloud-init.iso
> -machine virt -cpu max -accel tcg -device virtio-scsi-pci -drive
> file=Fedora-Cloud-Base-35-1.2.aarch64.temp.qcow2,if=none,format=qcow2,id=hd0
> -device scsi-hd,drive=hd0 -boot c -bios ./pc-bios/edk2-aarch64-code.fd
> 
> and
> 
> ./qemu-system-s390x -name Fedora-Cloud-Base-35-1.2.s390x -smp 2 -m 2G -vga
> none -nographic -serial pipe:Fedora-Cloud-Base-35-1.2.s390x -monitor
> unix:Fedora-Cloud-Base-35-1.2.s390x.mon,server,nowait -device
> virtio-net,netdev=vmnic -netdev user,id=vmnic -cdrom fedora-cloud-init.iso
> -machine s390-ccw-virtio -cpu max -accel tcg -hda
> Fedora-Cloud-Base-35-1.2.s390x.temp.qcow2 -boot c
> 
> With 50 runs, we got (times in seconds):
> 
> +---------+---------------------------------+
> |         |               Host              |
> |  Guest  +----------------+----------------+
> |         |      PPC64     |     x86_64     |
> +---------+----------------+----------------+
> | PPC64   |  194.72 ± 7.28 |  162.75 ± 8.75 |
> | aarch64 |  501.89 ± 9.98 | 586.08 ± 10.55 |
> | s390x   | 294.10 ± 21.62 | 223.71 ± 85.30 |
> +---------+----------------+----------------+
> 
> The difference with an s390x guest is around 30%, with greater variability
> on x86_64 whose source I couldn't track down. However, POWER9
> emulates aarch64 faster than this Xeon.
> 
> The particular workload of the guest could distort this result, since on the
> first boot Cloud-Init will create user accounts, generate SSH keys, etc. If
> the aarch64 guest uses many vector instructions for this initial setup, that
> might explain why an older Xeon would be slower here.
> 
> As a final test, I changed the images to have a normal user account already
> created and unlocked, disabled Cloud-Init, downloaded bc-1.07 sources[4][5],
> installed its build dependencies[6], and changed the test script to log in,
> extract, configure, build, and shut down the guest. I also added an
> aarch64-compatible machine (Apple M1 w/ 10 cores) to our test setup. Running
> 100 iterations gave us the following results (times in seconds):
> 
> +---------+----------------------------------------------------+
> |         |                        Host                        |
> |  Guest  +-----------------+-----------------+----------------+
> |         |      PPC64      |     x86_64      |     aarch64    |
> +---------+-----------------+-----------------+----------------+
> | PPC64   |  429.82 ± 11.57 |   352.34 ± 8.51 | 180.78 ± 42.02 |
> | aarch64 | 1029.78 ± 46.01 | 1207.98 ± 80.49 |  487.50 ± 7.54 |
> | s390x   |  589.97 ± 86.67 |  411.83 ± 41.88 | 221.86 ± 79.85 |
> +---------+-----------------+-----------------+----------------+
> 
> The pattern with PPC64 vs. x86_64 remains: PPC64/s390x guests are ~20%/~30%
> slower on POWER9, but the aarch64 VM is slower on this Xeon. If the PPC
> backend can perform better than the x86 when emulating some architectures, I
> guess that improving PPC64-on-PPC64 emulation isn't "just" TCG backend
> optimization but a more complex problem to tackle.
> 
> What would be different in aarch64 emulation that yields better
> performance on our POWER9?
>  - I suppose that aarch64 has more instructions with GVec implementations
> than PPC64 and s390x, so maybe aarch64 guests can make better use of host
> vector
> instructions?

As with Richard, I think it's pretty unlikely that this would make
such a difference.  With a pure number-crunching vector workload in
the guest, maybe; with kernel & userspace boot, not really.  It might
be interesting to configure a guest CPU without vector support to
double check whether it makes any difference, though.
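
Off the top of my head, something like this ought to do it for the
pseries guest (the spapr cap names below are from memory, so double
check them before relying on the numbers):

    ./qemu-system-ppc64 -machine pseries,cap-vsx=off,cap-dfp=off \
        -cpu POWER10 -accel tcg ...   # rest of the command line as before

For the aarch64 guest, SVE at least can be switched off with
-cpu max,sve=off; IIRC base Advanced SIMD can't easily be disabled
there.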

>  - Looking at the flame graphs of each test (attached), I can see that
> tb_gen_code takes proportionally less of the time in aarch64 emulation than
> in PPC64
> and s390x, so it might be that decodetree is faster?
>  - There is more than TCG at play, so perhaps the differences can be better
> explained by VirtIO performance or something else?

Also seems unlikely to me; I don't really see how this would differ
enough based on guest type to make the difference we see here.
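
FWIW, if you want to dig further on the host side, plain perf profiles
of the qemu process on both machines might make the comparison easier
to quantify.  Roughly (assuming perf plus Brendan Gregg's FlameGraph
scripts are available; untested):

    perf record -F 99 -g -p "$(pgrep -f qemu-system-ppc64)" -- sleep 60
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > qemu-ppc64.svg

That would let you put TCG codegen, helpers and I/O time side by side
for the two hosts.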

> Currently, Leandro Lupori is working to improve TLB invalidation[7], Victor
> Colombo is working to enable hardfpu in some scenarios, and I'm reviewing
> some older helpers that can use GVec or be easily implemented inline. We're
> also planning to add some Power ISA v3.1 instructions to the TCG backend,
> but it's probably better to test on hardware whether our changes are doing any
> good, and we don't have access to a POWER10 yet.
> 
> Are there any other known performance problems for TCG on PPC64 that we
> should investigate?

Known?  I don't think so.  The TCG code is pretty old and clunky
though, so there could be all manner of problems lurking in there.


A couple of thoughts:

 * I wonder how much emulation of guest-side synchronization
   instructions might be a factor here.  That's one of the few things
   I can think of where the matchup between host and guest models
   might make a difference.  It might be interesting to try these
   tests with single-core guests.  Likewise it might be interesting to
   get results with multi-core guests but with MTTCG explicitly
   disabled (rough sketch of both below).

 * It might also be interesting to get CPU time results as well as
   elapsed time.  That might indicate whether qemu is doing more
   actual work in the slow cases, or if it's blocking for some
   non-obvious reason.
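
Concretely, something along these lines is what I have in mind
(thread=single is the switch that turns MTTCG off, and /usr/bin/time
is GNU time, which prints the user/system split; treat the exact
option spellings as from memory):

    # single vCPU, then 2 vCPUs with MTTCG forced off, both with CPU accounting
    /usr/bin/time -v ./qemu-system-ppc64 -smp 1 -accel tcg ...
    /usr/bin/time -v ./qemu-system-ppc64 -smp 2 -accel tcg,thread=single ...

Comparing the "User time"/"System time" lines against the wall-clock
numbers you already have should show whether the slow combinations are
really doing more work or mostly waiting on something.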

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson

