答复: [PATCH v1 1/2] system/cpus: Fix pause_all

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent e

From:	zhukeqian
Subject:	答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment
Date:	Tue, 19 Mar 2024 05:06:51 +0000

Hi David,

Thanks for reviewing.

On 17.03.24 09:37, Keqian Zhu via wrote:
>> Both main loop thread and vCPU thread are allowed to call 
>> pause_all_vcpus(), and in general resume_all_vcpus() is called after 
>> it. Two issues live in pause_all_vcpus():
>
>In general, calling pause_all_vcpus() from VCPU threads is quite dangerous.
>
>Do we have reproducers for the cases below? 
>

I produce the issues by testing ARM vCPU hotplug feature:
QEMU changes for vCPU hotplug could be cloned from below site,
     https://github.com/salil-mehta/qemu.git virt-cpuhp-armv8/rfc-v2
Guest Kernel changes (by James Morse, ARM) are available here:
     https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git 
virtual_cpu_hotplug/rfc/v2

The procedure to produce problems:
1. Startup a Linux VM (e.g., called OS-vcpuhotplug) with 32 possible vCPUs and 
16 current vCPUs.
2. Log in guestOS and run script[1] to continuously online/offline CPU.
3. At host side, run script[2] to continuously hotplug/unhotplug vCPU.
After several minutes, we can hit these problems.

Script[1] to online/offline CPU:
for ((time=1;time<10000000;time++));
do
        for ((cpu=16;cpu<32;cpu++));
        do
                echo 1 > /sys/devices/system/cpu/cpu$cpu/online
        done

        for ((cpu=16;cpu<32;cpu++));
        do
                echo 0 > /sys/devices/system/cpu/cpu$cpu/online
        done
done

Script[2] to hotplug/unhotplug vCPU:
for ((time=1;time<1000000;time++));
do
        echo $time
        for ((cpu=16;cpu<=32;cpu++));
        do
                echo "virsh setvcpus OS-vcpuhotplug --count  $cpu --live"
                virsh setvcpus OS-vcpuhotplug --count  $cpu --live
                sleep 2
        done

        for ((cpu=32;cpu>=16;cpu--));
        do
                echo "virsh setvcpus OS-vcpuhotplug --count  $cpu --live"
                virsh setvcpus OS-vcpuhotplug --count  $cpu --live
                sleep 2
        done

        for ((cpu=16;cpu<=32;cpu+=2));
        do
                echo "virsh setvcpus OS-vcpuhotplug --count  $cpu --live"
                virsh setvcpus OS-vcpuhotplug --count  $cpu --live
                sleep 2
        done

        for ((cpu=32;cpu>=16;cpu-=2));
        do
                echo "virsh setvcpus OS-vcpuhotplug --count  $cpu --live"
                virsh setvcpus OS-vcpuhotplug --count  $cpu --live
                sleep 2
        done
done

The script[1] will call PSCI CPU_ON which emulated by QEMU, which result in 
calling cpu_reset() on vCPU thread.
For ARM architecture, it needs to reset GICC registers, which is only possible 
when all vcpus paused. So script[1]
will call pause_all_vcpus() in vCPU thread.
The script[2] also calls cpu_reset() for newly hotplugged vCPU, which is done 
in main loop thread.
So this scenario causes problems as I state in commit message.

>> 
>> 1. There is possibility that during thread T1 waits on qemu_pause_cond 
>> with bql unlocked, other thread has called
>> pause_all_vcpus() and resume_all_vcpus(), then thread T1 will stuck, 
>> because the condition all_vcpus_paused() is always false.
>
>How can this happen?
>
>Two threads calling pause_all_vcpus() is borderline broken, as you note. 
>
>IIRC, we should call pause_all_vcpus() only if some other mechanism prevents 
>these races. For example, based on runstate changes.
>

We already has bql to prevent concurrent calling of pause_all_vcpus() and 
resume_all_vcpus(). But pause_all_vcpus() will
unlock bql in the half way, which gives change for other thread to call pause 
and resume. In the  past, code does not consider
this problem, now I add retry mechanism to fix it.

>
>Just imagine one thread calling pause_all_vcpus() while another one 
>calls resume_all_vcpus(). It cannot possibly work.

With bql, we can make sure all vcpus are paused after pause_all_vcpus() finish, 
 and all vcpus are resumed after resume_all_vcpus() finish.

For example, the following situation may occur:
Thread T1:     lock bql  ->    pause_all_vcpus ->   wait on cond and unlock bql 
 ->   wait T2 unlock bql to lock bql                                            
-> lock bql  &&  all_vcpu_paused ->   success and do other work -> unlock bql
Thread T2:                             wait T1 unlock bql to lock bql           
 ->   lock bql    ->      resume_all_vcpus   ->   success  and do other work   
-> unlock bql

Thanks,
Keqian

>
>
>> 
>> 2. After all_vcpus_paused() has been checked as true, we will
>> unlock bql to relock replay_mutex. During the bql was unlocked,
>> the vcpu's state may has been changed by other thread, so we
>> must retry.
>> 
>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>> ---
>>   system/cpus.c | 29 ++++++++++++++++++++++++-----
>>   1 file changed, 24 insertions(+), 5 deletions(-)
>> 
> diff --git a/system/cpus.c b/system/cpus.c
> index 68d161d96b..4e41abe23e 100644
> --- a/system/cpus.c
> +++ b/system/cpus.c
> @@ -571,12 +571,14 @@ static bool all_vcpus_paused(void)
>       return true;
>   }
>   
> -void pause_all_vcpus(void)
> +static void request_pause_all_vcpus(void)
>   {
>       CPUState *cpu;
>   
> -    qemu_clock_enable(QEMU_CLOCK_VIRTUAL, false);
>       CPU_FOREACH(cpu) {
> +        if (cpu->stopped) {
> +            continue;
> +        }
>           if (qemu_cpu_is_self(cpu)) {
>               qemu_cpu_stop(cpu, true);
>           } else {
> @@ -584,6 +586,14 @@ void pause_all_vcpus(void)
>               qemu_cpu_kick(cpu);
>           }
>       }
> +}
> +
> +void pause_all_vcpus(void)
> +{
> +    qemu_clock_enable(QEMU_CLOCK_VIRTUAL, false);
> +
> +retry:
> +    request_pause_all_vcpus();
>   
>       /* We need to drop the replay_lock so any vCPU threads woken up
>        * can finish their replay tasks
> @@ -592,14 +602,23 @@ void pause_all_vcpus(void)
>   
>       while (!all_vcpus_paused()) {
>           qemu_cond_wait(&qemu_pause_cond, &bql);
> -        CPU_FOREACH(cpu) {
> -            qemu_cpu_kick(cpu);
> -        }
> +        /* During we waited on qemu_pause_cond the bql was unlocked,
> +         * the vcpu's state may has been changed by other thread, so
> +         * we must request the pause state on all vcpus again.
> +         */
> +        request_pause_all_vcpus();
>       }
>   
>       bql_unlock();
>       replay_mutex_lock();
>       bql_lock();
> +
> +    /* During the bql was unlocked, the vcpu's state may has been
> +     * changed by other thread, so we must retry.
> +     */
> +    if (!all_vcpus_paused()) {
> +        goto retry;
> +    }
>   }
>   
>   void cpu_resume(CPUState *cpu)

-- 
Cheers,

David / dhildenb

[Prev in Thread]

Current Thread

[Next in Thread]

[PATCH v1 0/2] Some fixes for pause and resume all vcpus, Keqian Zhu, 2024/03/17
- [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment, Keqian Zhu, 2024/03/17
  - Re: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment, David Hildenbrand, 2024/03/18
    - 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment, zhukeqian <=
    - Re: 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment, David Hildenbrand, 2024/03/19
    - Re: 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment, David Hildenbrand, 2024/03/19
    - Re: 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment, Peter Maydell, 2024/03/19
    - Re: 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment, David Hildenbrand, 2024/03/19
    - Re: 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment, Peter Maydell, 2024/03/19
- [PATCH v1 2/2] system/cpus: Fix resume_all_vcpus() under vCPU hotplug condition, Keqian Zhu, 2024/03/17
  - Re: [PATCH v1 2/2] system/cpus: Fix resume_all_vcpus() under vCPU hotplug condition, David Hildenbrand, 2024/03/18
    - 答复: [PATCH v1 2/2] system/cpus: Fix resume_all_vcpus() under vCPU hotplug condition, zhukeqian, 2024/03/19
    - Re: 答复: [PATCH v1 2/2] system/cpus: Fix resume_all_vcpus() under vCPU hotplug condition, David Hildenbrand, 2024/03/19

Prev by Date: Re: [PATCH] target/i386: Export RFDS bit to guests
Next by Date: [PATCH] target/ppc/mmu-radix64: Use correct string format in walk_tree()
Previous by thread: Re: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment
Next by thread: Re: 答复: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment
Index(es):
- Date
- Thread