qemu-discuss

Re: How to tell if an emulated aarch64 CPU has stopped doing work?


From: Alex Bennée
Subject: Re: How to tell if an emulated aarch64 CPU has stopped doing work?
Date: Fri, 12 Jun 2020 19:46:09 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)

Dave Bort <dbort-PgRGKqEAcmkAvxtiuMwx3w@public.gmane.org> writes:

> We use qemu (4.0.0, about to flip the switch to 5.0.0) to test our aarch64 
> images, running in linux containers on x86_64 alongside other workloads.
>
> We've recently run into issues where it looks like an emulated CPU (out of
> four) sometimes stops making progress for ten or more seconds, and we're
> trying to characterize the problem. When this happens, the other emulated
> CPUs run just fine, though sometimes two will stall out at the same time.
>
> Any suggestions for how to tell if an emulated CPU stopped doing work?
>
> Based on our experiments, the guest-visible clocks and cycle counters
> continue to run when a qemu CPU thread is suspended, so it's hard to tell
> whether the emulation paused, or if our code is spinning with interrupts
> disabled (though evidence is mounting that that's not the case). We're
> adding a bunch more instrumentation to our code, but maybe qemu has some
> features that will help us out.
>
> I tried to find a way to count the number of TBs executed by an
> emulated core over time, but I didn't see a cheap way to do that with
> the plugin APIs.

It should be pretty cheap to do. You just need to extend the example bb
plugin to take cpu_index into account and do the proper locking to
update the instruction counter in vcpu_tb_exec.
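
As a rough illustration of the per-CPU counting described above, here is a
minimal sketch, not the bb plugin itself. It swaps the mutex used by the
example plugin for one C11 atomic counter per vCPU slot, which is a
different technique chosen to avoid contention between vCPU threads.
MAX_VCPUS, count_tb, insns_executed, tbs_executed and the
BUILD_AS_QEMU_PLUGIN guard are hypothetical names introduced for this
sketch; the guarded part assumes the plugin API as declared in
qemu-plugin.h.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: an upper bound on guest vCPUs, picked for this sketch. */
#define MAX_VCPUS 8
static_assert(MAX_VCPUS > 0, "need at least one vCPU slot");

/* One counter pair per vCPU; static storage starts out zeroed. */
static _Atomic uint64_t insn_count[MAX_VCPUS];
static _Atomic uint64_t tb_count[MAX_VCPUS];

/* Account for one executed translation block on the given vCPU. */
static void count_tb(unsigned int cpu_index, uint64_t n_insns)
{
    if (cpu_index >= MAX_VCPUS) {
        return; /* out-of-range vCPUs are ignored in this sketch */
    }
    /* Relaxed ordering: only atomicity of each counter is needed. */
    atomic_fetch_add_explicit(&tb_count[cpu_index], 1,
                              memory_order_relaxed);
    atomic_fetch_add_explicit(&insn_count[cpu_index], n_insns,
                              memory_order_relaxed);
}

static uint64_t insns_executed(unsigned int cpu_index)
{
    return atomic_load_explicit(&insn_count[cpu_index],
                                memory_order_relaxed);
}

static uint64_t tbs_executed(unsigned int cpu_index)
{
    return atomic_load_explicit(&tb_count[cpu_index],
                                memory_order_relaxed);
}

#ifdef BUILD_AS_QEMU_PLUGIN /* hypothetical build switch */
#include <qemu-plugin.h>

QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

/* Runs on every TB execution; udata carries the TB's instruction count. */
static void vcpu_tb_exec(unsigned int cpu_index, void *udata)
{
    count_tb(cpu_index, (uint64_t)(uintptr_t)udata);
}

/* Runs once per translation; attaches the exec callback to the TB. */
static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
{
    size_t n_insns = qemu_plugin_tb_n_insns(tb);
    qemu_plugin_register_vcpu_tb_exec_cb(tb, vcpu_tb_exec,
                                         QEMU_PLUGIN_CB_NO_REGS,
                                         (void *)(uintptr_t)n_insns);
}

QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                           const qemu_info_t *info,
                                           int argc, char **argv)
{
    qemu_plugin_register_vcpu_tb_trans_cb(id, vcpu_tb_trans);
    return 0;
}
#endif
```

Sampling insns_executed() for each vCPU at intervals and comparing
successive values would show which core, if any, has stopped retiring
instructions.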

The qemu_plugin_register_vcpu_idle_cb and
qemu_plugin_register_vcpu_resume_cb functions allow you to register
callbacks for every time we exit the main run loop and sleep for whatever
reason. You could even dump the total instruction counts there.
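
A hedged sketch of how the idle and resume callbacks might be used: it
records, per vCPU, how long the host clock advanced between an idle
callback and the matching resume callback, and logs that on resume. The
names MAX_VCPUS, note_idle, note_resume, total_idle_ns, idle_events,
host_now_ns and the BUILD_AS_QEMU_PLUGIN guard are hypothetical. The
sketch assumes both callbacks for a given vCPU run on that vCPU's own
thread, so each slot needs no locking; if that does not hold, the state
would need protecting.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: an upper bound on guest vCPUs, picked for this sketch. */
#define MAX_VCPUS 8
static_assert(MAX_VCPUS > 0, "need at least one vCPU slot");

struct vcpu_idle_state {
    uint64_t idle_since_ns; /* host time of the last idle callback */
    uint64_t total_idle_ns; /* accumulated time spent idle */
    uint64_t idle_events;   /* number of idle periods entered */
    int is_idle;            /* nonzero between idle and resume */
};

static struct vcpu_idle_state idle_state[MAX_VCPUS];

/* Record that a vCPU went idle at host time now_ns. */
static void note_idle(unsigned int cpu, uint64_t now_ns)
{
    if (cpu >= MAX_VCPUS || idle_state[cpu].is_idle) {
        return; /* ignore unknown vCPUs and duplicate idle events */
    }
    idle_state[cpu].is_idle = 1;
    idle_state[cpu].idle_since_ns = now_ns;
    idle_state[cpu].idle_events++;
}

/* Record a resume; returns how long the vCPU was idle, in ns. */
static uint64_t note_resume(unsigned int cpu, uint64_t now_ns)
{
    if (cpu >= MAX_VCPUS || !idle_state[cpu].is_idle) {
        return 0; /* resume without a matching idle */
    }
    uint64_t slept = now_ns - idle_state[cpu].idle_since_ns;
    idle_state[cpu].is_idle = 0;
    idle_state[cpu].total_idle_ns += slept;
    return slept;
}

static uint64_t total_idle_ns(unsigned int cpu)
{
    return idle_state[cpu].total_idle_ns;
}

static uint64_t idle_events(unsigned int cpu)
{
    return idle_state[cpu].idle_events;
}

#ifdef BUILD_AS_QEMU_PLUGIN /* hypothetical build switch */
#include <stdio.h>
#include <time.h>
#include <qemu-plugin.h>

QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

/* Host monotonic clock, so guest-visible time does not matter here. */
static uint64_t host_now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static void vcpu_idle(qemu_plugin_id_t id, unsigned int cpu_index)
{
    note_idle(cpu_index, host_now_ns());
}

static void vcpu_resume(qemu_plugin_id_t id, unsigned int cpu_index)
{
    char line[96];
    uint64_t slept = note_resume(cpu_index, host_now_ns());
    snprintf(line, sizeof(line), "vcpu %u resumed after %llu ns idle\n",
             cpu_index, (unsigned long long)slept);
    qemu_plugin_outs(line);
}

QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                           const qemu_info_t *info,
                                           int argc, char **argv)
{
    qemu_plugin_register_vcpu_idle_cb(id, vcpu_idle);
    qemu_plugin_register_vcpu_resume_cb(id, vcpu_resume);
    return 0;
}
#endif
```

A long gap in guest progress that shows up here as idle time suggests the
vCPU was asleep in QEMU; a gap with no matching idle period points
elsewhere, for example at the host scheduler or at the guest code itself.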

>
> We could maybe turn on instruction tracing, but this problem happens pretty
> rarely (<1%), we don't have a repro case yet, and we can't really afford the
> cost of slowing down every test run. There's a decent chance that this is
> caused by an overloaded host, but our host-side investigations haven't
> turned up anything concrete either.
>
> Any advice?
>
> --dbort
>

-- 
Alex Bennée


