qemu-ppc
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 0/3] Reorg ppc64 pmu insn counting


From: Alex Bennée
Subject: Re: [PATCH 0/3] Reorg ppc64 pmu insn counting
Date: Tue, 04 Jan 2022 10:32:01 +0000
User-agent: mu4e 1.7.5; emacs 28.0.90

Daniel Henrique Barboza <danielhb413@gmail.com> writes:

> On 1/3/22 12:07, Alex Bennée wrote:
>> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
>> 
>>> On 12/23/21 00:01, Richard Henderson wrote:
>>>> In contrast to Daniel's version, the code stays in power8-pmu.c,
>>>> but is better organized to not take so much overhead.
>>>> Before:
>>>>       32.97%  qemu-system-ppc  qemu-system-ppc64   [.] pmc_get_event
>>>>       20.22%  qemu-system-ppc  qemu-system-ppc64   [.] helper_insns_inc
>>>>        4.52%  qemu-system-ppc  qemu-system-ppc64   [.] 
>>>> hreg_compute_hflags_value
>>>>        3.30%  qemu-system-ppc  qemu-system-ppc64   [.] helper_lookup_tb_ptr
>>>>        2.68%  qemu-system-ppc  qemu-system-ppc64   [.] tcg_gen_code
>>>>        2.28%  qemu-system-ppc  qemu-system-ppc64   [.] cpu_exec
>>>>        1.84%  qemu-system-ppc  qemu-system-ppc64   [.] pmu_insn_cnt_enabled
>>>> After:
>>>>        8.42%  qemu-system-ppc  qemu-system-ppc64   [.]
>>>> hreg_compute_hflags_value
>>>>        6.65%  qemu-system-ppc  qemu-system-ppc64   [.] cpu_exec
>>>>        6.63%  qemu-system-ppc  qemu-system-ppc64   [.] helper_insns_inc
>>>>
>>>
>>> Thanks for looking this up. I had no idea the original C code was that slow.
>>>
>> <snip>
>>>
>>> With that in mind I decided to post a new version of my TCG rework, with 
>>> less repetition and
>>> a bit more concise, to have an alternative that can be used upstream to fix 
>>> the Avocado tests.
>>> Meanwhile I'll see if I can get your reorg working with all EBB tests we 
>>> need. All things
>>> equal - similar performance, all EBB tests passing - I'd rather stay with 
>>> your C code than my
>>> TCG rework since yours doesn't rely on TCG Ops knowledge to maintain
>>> it.
>> Reading this series did make me wonder if we need a more generic
>> service
>> from the TCG for helping with "internal" instrumentation needed for
>> things like decent PMU emulation. We haven't gone as much for it in ARM
>> yet but it would be nice to. It would be even nicer if such a facility
>> could be used by stuff like icount as well so we don't end up doing the
>> same thing twice.
>
> Back in May 2021 when I first starting working on this code I tried to base 
> myself in the
> ARM PMU code. In fact, the cycle and insn calculation done in the very first 
> version of
> this work was based on what ARM does in target/arm/helper.c, 
> cycles_get_count() and
> instructions_get_count(). The cycle calculation got simplified because our 
> PPC64 CPU
> has a 1Ghz clock so it's easier to just consider 1ns = 1 cycle.
>
> For instruction count, aside from my 2-3 weeks of spectacular failures trying 
> to count
> instructions inside translate.c, I also looked into how TCG plugins work and 
> tried to do
> something similar to what plugin_gen_tb_end() does at the end of the 
> translator_loop()
> in accel/tcg/translator.c. For some reason I wasn't able to replicate the 
> same behavior
> that I would have if I used the TCG plugin framework in the
> 'canonical' way.

plugin_gen_tb_end is probably overkill because we should already know
how many instructions there are in a translated block on account of the
insn_start and insn_end ops that mark them. In fact see gen_tb_end()
which is where icount updates the value used in the decrement at the
start of each block. Assuming no synchronous exceptions occur you could
just increment a counter at the end of the block as no async IRQs will
occur until we have executed all of those instructions.

Of course it's never quite so simple and when running in full icount
mode we have to take into account exceptions that can be triggered by IO
accesses. This involves doing a re-translation to ensures the IO
instruction is always the last we execute.

I'm guessing for PMU counters to be somewhat correct we would want to
ensure updates throughout the block (before each memory op and helper
call). This would hopefully avoid the cost of "full" icount support
which is only single threaded. However this is the opposite to icount's
budget and pre-decrement approach which feels messier than it could be.

> I ended up doing something similar to what instructions_get_count() from ARM 
> does, which
> relies on icount. Richard then aided me in figuring out that I could count 
> instructions
> directly by tapping into the end of each TB.

instructions_get_count will also work without icount but is affected by
wall clock time distortions in that case.

> So, for a generic service of sorts I believe it would be nice to re-use the 
> TCG plugins
> API in the internal instrumentation (I tried it once, failed, not sure if I 
> messed up
> or it's not possible ATM). That would be a good start to try to get all this 
> logic in a
> generic code for internal translate code to use.

Agreed - although the plugin specific stuff is really just focused on
our limited visibility API. Unless you are referring to
accel/tcg/plugin-gen.c which are just helpers for manipulating the TCG
ops after the initial translation.

-- 
Alex Bennée



reply via email to

[Prev in Thread] Current Thread [Next in Thread]