From: Richard Henderson
Subject: Re: [RFC] Don't lookup full CPU state in the indirect branch fast path on AArch64 when running in user mode.
Date: Mon, 19 Oct 2020 11:22:52 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0
On 9/29/20 2:32 PM, Owen Anderson wrote:
> Hello,
>
> I would like to request feedback on the following patch, which I do
> not believe should be applied to master as-is. The idea here is to
> avoid gathering the full CPU state in the fast path of an indirect
> branch lookup when running in user mode on a platform where the flags
> can only be changed in privileged mode. I believe this is true on the
> AArch64 scenario that I care about, but clearly not true in general.
> I'm particularly seeking feedback on how to clean this up into a
> version that checks the correct necessary and sufficient conditions to
> allow all users that can benefit from it to do so.
>
> On the workload that I am targeting (aarch64 on x86), this patch
> reduces execution wall time by approximately 20%, and eliminates
> indirect branch lookups from the hot stack traces entirely.
(1) What qemu version are you looking at, and
(2) do you have --enable-tcg-debug enabled?
Because you should not be seeing anything even close to 20% overhead.
In e979972a6a1 (included in qemu 4.2), the AArch64 path is

    uint32_t flags = env->hflags;

    *cs_base = 0;
    if (FIELD_EX32(flags, TBFLAG_ANY, AARCH64_STATE)) {
        *pc = env->pc;
        if (cpu_isar_feature(aa64_bti, env_archcpu(env))) {
            flags = FIELD_DP32(flags, TBFLAG_A64, BTYPE, env->btype);
        }
        pstate_for_ss = env->pstate;
    }

    if (FIELD_EX32(flags, TBFLAG_ANY, SS_ACTIVE) &&
        (pstate_for_ss & PSTATE_SS)) {
        flags = FIELD_DP32(flags, TBFLAG_ANY, PSTATE_SS, 1);
    }

    *pflags = flags;
With --enable-tcg-debug, there is an additional step wherein we validate that
env->hflags has the correct value, which has caught a number of bugs.
With a silly testcase like so:
    for (int x = 0; x < 10000000; ++x) {
        void *tmp;
        asm volatile("adr %0,1f; br %0; 1:" : "=r"(tmp));
    }
I see cpu_get_tb_cpu_state at no higher than 10% of the total runtime, which,
I admit, is higher than I expected, but still nothing like what you're
reporting. And a "reasonable" test case should surely have a lower proportion
of indirect branches per dynamic instruction.
> +#if !defined(TARGET_ARM) || !defined(CONFIG_USER_ONLY)
> cpu_get_tb_cpu_state(env, pc, cs_base, flags);
> +#else
> + if (is_a64(env)) {
> + *pc = env->pc;
> + } else {
> + *pc = env->regs[15];
> + }
> +#endif
...
> +#if !defined(TARGET_ARM) || !defined(CONFIG_USER_ONLY)
> tb->cs_base == *cs_base &&
> tb->flags == *flags &&
> +#endif
This assumes that all TBs have the same flags, and thus that the flags don't
need comparing. That is false, even for CONFIG_USER_ONLY.
I would guess that testing with -cpu cortex-a57 does not exercise any of the
bits that might change, but post-v8.2 features will: SVE, BTI, MTE. So, this
change breaks stuff.
r~