[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent e
From: |
zhukeqian |
Subject: |
答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment |
Date: |
Tue, 19 Mar 2024 05:06:51 +0000 |
Hi David,
Thanks for reviewing.
On 17.03.24 09:37, Keqian Zhu via wrote:
>> Both main loop thread and vCPU thread are allowed to call
>> pause_all_vcpus(), and in general resume_all_vcpus() is called after
>> it. Two issues live in pause_all_vcpus():
>
>In general, calling pause_all_vcpus() from VCPU threads is quite dangerous.
>
>Do we have reproducers for the cases below?
>
I produce the issues by testing ARM vCPU hotplug feature:
QEMU changes for vCPU hotplug could be cloned from below site,
https://github.com/salil-mehta/qemu.git virt-cpuhp-armv8/rfc-v2
Guest Kernel changes (by James Morse, ARM) are available here:
https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git
virtual_cpu_hotplug/rfc/v2
The procedure to produce problems:
1. Startup a Linux VM (e.g., called OS-vcpuhotplug) with 32 possible vCPUs and
16 current vCPUs.
2. Log in guestOS and run script[1] to continuously online/offline CPU.
3. At host side, run script[2] to continuously hotplug/unhotplug vCPU.
After several minutes, we can hit these problems.
Script[1] to online/offline CPU:
for ((time=1;time<10000000;time++));
do
for ((cpu=16;cpu<32;cpu++));
do
echo 1 > /sys/devices/system/cpu/cpu$cpu/online
done
for ((cpu=16;cpu<32;cpu++));
do
echo 0 > /sys/devices/system/cpu/cpu$cpu/online
done
done
Script[2] to hotplug/unhotplug vCPU:
for ((time=1;time<1000000;time++));
do
echo $time
for ((cpu=16;cpu<=32;cpu++));
do
echo "virsh setvcpus OS-vcpuhotplug --count $cpu --live"
virsh setvcpus OS-vcpuhotplug --count $cpu --live
sleep 2
done
for ((cpu=32;cpu>=16;cpu--));
do
echo "virsh setvcpus OS-vcpuhotplug --count $cpu --live"
virsh setvcpus OS-vcpuhotplug --count $cpu --live
sleep 2
done
for ((cpu=16;cpu<=32;cpu+=2));
do
echo "virsh setvcpus OS-vcpuhotplug --count $cpu --live"
virsh setvcpus OS-vcpuhotplug --count $cpu --live
sleep 2
done
for ((cpu=32;cpu>=16;cpu-=2));
do
echo "virsh setvcpus OS-vcpuhotplug --count $cpu --live"
virsh setvcpus OS-vcpuhotplug --count $cpu --live
sleep 2
done
done
The script[1] will call PSCI CPU_ON which emulated by QEMU, which result in
calling cpu_reset() on vCPU thread.
For ARM architecture, it needs to reset GICC registers, which is only possible
when all vcpus paused. So script[1]
will call pause_all_vcpus() in vCPU thread.
The script[2] also calls cpu_reset() for newly hotplugged vCPU, which is done
in main loop thread.
So this scenario causes problems as I state in commit message.
>>
>> 1. There is possibility that during thread T1 waits on qemu_pause_cond
>> with bql unlocked, other thread has called
>> pause_all_vcpus() and resume_all_vcpus(), then thread T1 will stuck,
>> because the condition all_vcpus_paused() is always false.
>
>How can this happen?
>
>Two threads calling pause_all_vcpus() is borderline broken, as you note.
>
>IIRC, we should call pause_all_vcpus() only if some other mechanism prevents
>these races. For example, based on runstate changes.
>
We already has bql to prevent concurrent calling of pause_all_vcpus() and
resume_all_vcpus(). But pause_all_vcpus() will
unlock bql in the half way, which gives change for other thread to call pause
and resume. In the past, code does not consider
this problem, now I add retry mechanism to fix it.
>
>Just imagine one thread calling pause_all_vcpus() while another one
>calls resume_all_vcpus(). It cannot possibly work.
With bql, we can make sure all vcpus are paused after pause_all_vcpus() finish,
and all vcpus are resumed after resume_all_vcpus() finish.
For example, the following situation may occur:
Thread T1: lock bql -> pause_all_vcpus -> wait on cond and unlock bql
-> wait T2 unlock bql to lock bql
-> lock bql && all_vcpu_paused -> success and do other work -> unlock bql
Thread T2: wait T1 unlock bql to lock bql
-> lock bql -> resume_all_vcpus -> success and do other work
-> unlock bql
Thanks,
Keqian
>
>
>>
>> 2. After all_vcpus_paused() has been checked as true, we will
>> unlock bql to relock replay_mutex. During the bql was unlocked,
>> the vcpu's state may has been changed by other thread, so we
>> must retry.
>>
>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>> ---
>> system/cpus.c | 29 ++++++++++++++++++++++++-----
>> 1 file changed, 24 insertions(+), 5 deletions(-)
>>
> diff --git a/system/cpus.c b/system/cpus.c
> index 68d161d96b..4e41abe23e 100644
> --- a/system/cpus.c
> +++ b/system/cpus.c
> @@ -571,12 +571,14 @@ static bool all_vcpus_paused(void)
> return true;
> }
>
> -void pause_all_vcpus(void)
> +static void request_pause_all_vcpus(void)
> {
> CPUState *cpu;
>
> - qemu_clock_enable(QEMU_CLOCK_VIRTUAL, false);
> CPU_FOREACH(cpu) {
> + if (cpu->stopped) {
> + continue;
> + }
> if (qemu_cpu_is_self(cpu)) {
> qemu_cpu_stop(cpu, true);
> } else {
> @@ -584,6 +586,14 @@ void pause_all_vcpus(void)
> qemu_cpu_kick(cpu);
> }
> }
> +}
> +
> +void pause_all_vcpus(void)
> +{
> + qemu_clock_enable(QEMU_CLOCK_VIRTUAL, false);
> +
> +retry:
> + request_pause_all_vcpus();
>
> /* We need to drop the replay_lock so any vCPU threads woken up
> * can finish their replay tasks
> @@ -592,14 +602,23 @@ void pause_all_vcpus(void)
>
> while (!all_vcpus_paused()) {
> qemu_cond_wait(&qemu_pause_cond, &bql);
> - CPU_FOREACH(cpu) {
> - qemu_cpu_kick(cpu);
> - }
> + /* During we waited on qemu_pause_cond the bql was unlocked,
> + * the vcpu's state may has been changed by other thread, so
> + * we must request the pause state on all vcpus again.
> + */
> + request_pause_all_vcpus();
> }
>
> bql_unlock();
> replay_mutex_lock();
> bql_lock();
> +
> + /* During the bql was unlocked, the vcpu's state may has been
> + * changed by other thread, so we must retry.
> + */
> + if (!all_vcpus_paused()) {
> + goto retry;
> + }
> }
>
> void cpu_resume(CPUState *cpu)
--
Cheers,
David / dhildenb
- [PATCH v1 0/2] Some fixes for pause and resume all vcpus, Keqian Zhu, 2024/03/17
- [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment, Keqian Zhu, 2024/03/17
- Re: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment, David Hildenbrand, 2024/03/18
- 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment,
zhukeqian <=
- Re: 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment, David Hildenbrand, 2024/03/19
- Re: 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment, David Hildenbrand, 2024/03/19
- Re: 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment, Peter Maydell, 2024/03/19
- Re: 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment, David Hildenbrand, 2024/03/19
- Re: 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment, Peter Maydell, 2024/03/19
[PATCH v1 2/2] system/cpus: Fix resume_all_vcpus() under vCPU hotplug condition, Keqian Zhu, 2024/03/17