
Re: TCG performance on PPC64


From: David Gibson
Subject: Re: TCG performance on PPC64
Date: Mon, 30 May 2022 14:17:18 +1000

On Thu, May 26, 2022 at 08:07:07AM -0300, Matheus K. Ferst wrote:
> On 19/05/2022 01:13, David Gibson wrote:
> >> What would be different in aarch64 emulation that yields a better
> >> performance on our POWER9?
> >>  - I suppose that aarch64 has more instructions with GVec implementations
> >> than PPC64 and s390x, so maybe aarch64 guests can better use host-vector
> >> instructions?
> >
> > As with Richard, I think it's pretty unlikely that this would make
> > such a difference.  With a pure number crunching vector workload in
> > the guest, maybe, with kernel & userspace boot, not really.  It might
> > be interesting to configure a guest CPU without vector support to
> > double check if it makes any difference, though.
> >
> >>  - Looking at the flame graphs of each test (attached), I can see that
> >> tb_gen_code takes proportionally less time in aarch64 emulation than in
> >> PPC64 and s390x, so it might be that decodetree is faster?
> >>  - There is more than TCG at play, so perhaps the differences can be
> >> better explained by VirtIO performance or something else?
> >
> > Also seems unlikely to me; I don't really see how this would differ
> > enough based on guest type to make the difference we see here.
> >
> >> Currently, Leandro Lupori is working to improve TLB invalidation[7],
> >> Victor Colombo is working to enable hardfpu in some scenarios, and I'm
> >> reviewing some older helpers that can use GVec or be easily implemented
> >> inline. We're also planning to add some Power ISA v3.1 instructions to
> >> the TCG backend, but it's probably better to test on hardware whether
> >> our changes are doing any good, and we don't have access to a POWER10 yet.
> >>
> >> Are there any other known performance problems for TCG on PPC64 that we
> >> should investigate?
> >
> > Known?  I don't think so.  The TCG code is pretty old and clunky
> > though, so there could be all manner of problems lurking in there.
> >
> >
> > A couple of thoughts:
> >
> >  * I wonder how much emulation of guest side synchronization
> >    instructions might be a factor here.  That's one of the few things
> >    I can think of where the matchup between host and guest models
> >    might make a difference.
> 
> That's an interesting suggestion; we'll be looking into this. It seems
> similar to Nicholas Piggin's recent work, and there is probably more to
> be done in this area.
> 
> >  It might be interesting to try these
> >    tests with single-core guests.  Likewise it might be interesting to
> >    get results with multi-core guests, but with MTTCG explicitly disabled.
> >
> 
> With 50 runs:
> 
> +---------+--------------------------------+
> |         |              Host              |
> | Options +---------------+----------------+
> |         |     PPC64     |     x86_64     |
> +---------+---------------+----------------+
> | -smp 2  | 427.41 ± 7.89 |  350.89 ± 7.62 |
> | -smp 1  | 574.01 ± 4.18 | 411.27 ± 17.14 |
> | No MTTCG| 588.84 ± 8.50 | 445.30 ± 21.66 |
> +---------+---------------+----------------+
> 
> The gap with x86 has increased in the two new cases, but I'm not sure if I
> can draw anything from this result. Maybe it's just SMT vs. Hyper-Threading
> that benefits POWER9 in the initial test, or the Xeon is better at boosting
> a single core when QEMU uses only one thread.

Ok.  That suggests to me that the problem is not related to synchronization
instructions; there should be fewer of those needed with a UP (uniprocessor)
guest.  It's not conclusive, of course.
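For anyone wanting to reproduce the three rows of the table above, the
invocations would look roughly like this (a sketch only: the machine type,
memory size, and disk.img are placeholder choices, not taken from the thread;
thread=single is the QEMU switch that disables MTTCG):

```shell
# "-smp 2": baseline, MTTCG on (one host thread per vCPU)
qemu-system-ppc64 -machine pseries -accel tcg -smp 2 -m 2G -drive file=disk.img

# "-smp 1": uniprocessor guest, so far fewer guest-side synchronization
# instructions need to be emulated
qemu-system-ppc64 -machine pseries -accel tcg -smp 1 -m 2G -drive file=disk.img

# "No MTTCG": two vCPUs round-robin on a single TCG execution thread
qemu-system-ppc64 -machine pseries -accel tcg,thread=single -smp 2 -m 2G \
    -drive file=disk.img
```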

> >  * It might also be interesting to get CPU time results as well as
> >    elapsed time.  That might indicate whether qemu is doing more
> >    actual work in the slow cases, or if it's blocking for some
> >    non-obvious reason.
> 
> The results above and in my first email were wall clock time, but I also
> have user and system times on a GitHub wiki page:
> https://github.com/PPC64/qemu/wiki/TCG-Performance-on-PPC64

Ok.  For the ppc64 guest, the user and elapsed times seem to be more
or less in proportion to each other.  They're a bit more mismatched
with the other guests, but in different directions.  Not really sure
what to make of that; I guess it does suggest that blocking delays
could be a factor.
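For anyone repeating the measurements, one way to capture wall, user, and
system time in a single run (a sketch; this uses GNU time's -v output rather
than the shell builtin, and guest.sh stands in for the actual benchmark
command):

```shell
# GNU time writes its verbose report to stderr; keep it in a log file
/usr/bin/time -v ./guest.sh 2> timing.log

# Pull out just the three figures of interest
grep -E 'Elapsed \(wall clock\)|User time|System time' timing.log
```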

The fact that the variance in both elapsed and user time for the s390x
guest is so much higher than for the others is... interesting.  I really
don't know what to make of that.
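For the record, the `mean ± stddev` figures quoted in the table can be
recomputed like this (a sketch assuming the ± values are sample standard
deviations over the runs; the run times below are made-up stand-ins for the
50 real measurements):

```python
import statistics

def summarize(samples):
    """Return (mean, sample standard deviation) for a list of wall-clock times."""
    return statistics.mean(samples), statistics.stdev(samples)

# Hypothetical run times in seconds, standing in for the 50 real measurements
runs = [572.1, 575.8, 570.3, 577.9, 574.0]
mean, sd = summarize(runs)
print(f"{mean:.2f} \u00b1 {sd:.2f}")
```

A larger spread in the samples, as with the s390x guest, shows up directly
as a larger second figure.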

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson


