Re: [PATCH v5 00/10] KVM: Dirty ring support (QEMU part)

qemu-devel

From:	Keqian Zhu
Subject:	Re: [PATCH v5 00/10] KVM: Dirty ring support (QEMU part)
Date:	Thu, 25 Mar 2021 09:21:53 +0800
User-agent:	Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.7.1

Peter,

On 2021/3/24 23:09, Peter Xu wrote:
> On Wed, Mar 24, 2021 at 10:56:22AM +0800, Keqian Zhu wrote:
>> Hi Peter,
>>
>> On 2021/3/23 22:34, Peter Xu wrote:
>>> Keqian,
>>>
>>> On Tue, Mar 23, 2021 at 02:40:43PM +0800, Keqian Zhu wrote:
>>>>>> The second question is that you observed longer migration time (55s->73s) when the
>>>>>> guest has 24G RAM and the dirty rate is 800M/s. I am not clear about the reason. With
>>>>>> dirty ring enabled, QEMU can get dirty info faster, which means it handles dirty pages
>>>>>> more quickly, and the guest can be throttled, which means dirty pages are generated
>>>>>> more slowly. What's the rationale for the longer migration time?
>>>>>
>>>>> Because dirty ring is more sensitive to dirty rate, while dirty bitmap is more
>>>> Emm... Sorry that I'm not very clear about this... I think that a higher dirty rate
>>>> doesn't cause slower dirty_log_sync compared to that of legacy bitmap mode. Besides, a
>>>> higher dirty rate means we may have more full-exits, which can properly limit the dirty
>>>> rate. So it seems that dirty ring "prefers" a higher dirty rate.
>>>
>>> When I measured the 800MB/s it's in the guest, after throttling.
>>>
>>> Imagine another example: a VM has 1G of memory and keeps dirtying it at 10GB/s.  Dirty
>>> logging will need to collect even less for each iteration because the memory size
>>> shrank, and will collect even less frequently due to the high dirty rate; however, dirty
>>> ring will use 100% CPU power to collect dirty pages because the ring stays full.
>> Looks good.
>>
>> We have many places to collect dirty pages: the background reaper, the vCPU exit handler,
>> and the migration thread. I think migration time is closely related to the migration thread.
>>
>> The migration thread calls kvm_dirty_ring_flush().
>> 1. kvm_cpu_synchronize_kick_all() will wait until each vcpu handles its full-exit.
>> 2. kvm_dirty_ring_reap() collects and resets dirty pages.
>> The above two operations will take more time with a higher dirty rate.
>>
>> But I suddenly realize that the key problem may not be this. Though we have a separate
>> "reset" operation for dirty ring, it is actually performed right after we collect the
>> dirty ring into the kvmslot. So dirty ring mode is like legacy bitmap mode without
>> manual_dirty_clear.
>>
>> If we could "reset" the dirty ring just before we really handle the dirty pages, we could
>> have a shorter migration time. But the design of dirty ring doesn't allow this, because
>> we must perform the reset to free up space...
> 
> This is a very good point.
> 
> Dirty ring should have been better in quite some ways already, but from that
> pov as you said it goes a bit backwards on reprotection of pages (not to
> mention currently we can't even reset the ring per-vcpu; that seems to be not
> fully matching the full locality that the rings have provided as well; but
> Paolo and I discussed that issue, it's about TLB flush expensiveness, so
> we still need to think more of it..).
> 
> Ideally the ring could have been both per-vcpu but also bi-directional (then
> we'll have 2*N rings, N=vcpu number), so as to split the state transition into
> "dirty ring" and "reprotect ring", then that reprotect ring will be the clear
> dirty log.  That'll look more like virtio as used ring.  However we'll still
> need to think about the TLB flush issue too as Paolo used to mention, as
> that'll exist too with any per-vcpu flush model (each reprotect of page will
> need a tlb flush of all vcpus).
> 
> Or.. maybe we can make the flush ring a standalone one, so we need N dirty rings
> and one global flush ring.
Yep, having separate "reprotect" ring(s) is a good idea.
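
To check my own understanding of the "N dirty rings and one global flush ring"
idea, here is a rough toy model in Python. It is only a sketch of one possible
reading of the proposal: the class and method names (ReprotectModel,
harvest, queue_reprotect, kick_reprotect) are hypothetical and are not KVM or
QEMU interfaces, and a "TLB flush" is just a counter. The point it tries to
show is that harvesting frees ring space without reprotecting, so userspace
could delay reprotection until just before it really handles the pages.

```python
from collections import deque


class ReprotectModel:
    """Toy model: N per-vCPU dirty rings plus one global reprotect ring."""

    def __init__(self, nr_vcpus, ring_size):
        self.ring_size = ring_size
        self.dirty_rings = [deque() for _ in range(nr_vcpus)]
        self.reprotect_ring = deque()
        self.writable = set()   # pages that are currently not write-protected
        self.tlb_flushes = 0    # stand-in for the cost of reprotecting

    def guest_write(self, vcpu, gfn):
        """Return False if the vCPU would have to exit because its ring is full."""
        if gfn in self.writable:
            return True         # already logged and still writable: no new entry
        ring = self.dirty_rings[vcpu]
        if len(ring) >= self.ring_size:
            return False
        ring.append(gfn)
        self.writable.add(gfn)
        return True

    def harvest(self, vcpu):
        """Userspace collects entries; this frees ring space but does not reprotect."""
        ring = self.dirty_rings[vcpu]
        gfns = list(ring)
        ring.clear()
        return gfns

    def queue_reprotect(self, gfns):
        """Userspace asks for these pages to be write-protected again."""
        self.reprotect_ring.extend(gfns)

    def kick_reprotect(self):
        """Kernel side: reprotect every queued page, counting one flush per batch."""
        n = len(self.reprotect_ring)
        if n:
            self.writable.difference_update(self.reprotect_ring)
            self.reprotect_ring.clear()
            self.tlb_flushes += 1
        return n
```

In this model a page that is rewritten between harvest() and kick_reprotect()
does not produce a second ring entry, which is the behaviour we get from
manual_dirty_clear in bitmap mode. Whether one flush per batch is realistic is
exactly the TLB flush question mentioned above, so please treat that part as an
assumption.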

> 
> Anyway.. Before that, I'd still think the next step should be how to integrate
> qemu to fully leverage current ring model, so as to be able to throttle in
> per-vcpu fashion.
> 
> The major issue (IMHO) with huge VM migration is:
> 
>   1. Convergence
>   2. Responsiveness
> 
> Here we'll have a chance to solve (1) by heavily throttling the working vcpu
> threads, while still keeping (2) by not throttling user-interactive threads.
> I'm not sure whether this will really work as expected, but this just shows what
> I'm thinking about it.  These may not matter a lot yet with further improving ring
> reset mechanism, which definitely sounds even better, but seems orthogonal.
> 
> That's also why I think we should still merge this series first as a foundation
> for the rest.
I see.
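
Just to illustrate how I read the per-vcpu throttling idea, below is a small
Python sketch. It is not from the series: the function names and the
"water-filling" policy are my own assumptions, and it assumes a vcpu's dirty
rate scales linearly with the time it is allowed to run. Given a measured dirty
rate per vcpu and a total rate the migration can keep up with, it caps only the
heaviest vcpus and leaves light (e.g. interactive) vcpus alone.

```python
def per_vcpu_caps(dirty_rates, target_total):
    """Water-filling: pick caps so that sum(caps) <= target_total,
    lowering only the vCPUs with the highest dirty rates."""
    if sum(dirty_rates) <= target_total:
        return list(dirty_rates)      # already converging, nothing to throttle
    remaining = target_total
    caps = [0.0] * len(dirty_rates)
    # Visit vCPUs from lowest to highest rate; each takes at most a fair
    # share of what is left, and unused share goes to the heavier ones.
    order = sorted(range(len(dirty_rates)), key=lambda i: dirty_rates[i])
    left = len(order)
    for i in order:
        share = remaining / left
        caps[i] = min(dirty_rates[i], share)
        remaining -= caps[i]
        left -= 1
    return caps


def throttle_fractions(dirty_rates, target_total):
    """Fraction of time each vCPU would be put to sleep (0.0 = untouched)."""
    caps = per_vcpu_caps(dirty_rates, target_total)
    return [0.0 if r == 0 else 1.0 - c / r for r, c in zip(dirty_rates, caps)]


# Hypothetical numbers in MB/s: two light vCPUs and two heavy workers.
print(throttle_fractions([5, 10, 400, 600], target_total=215))
```

With these made-up rates, the two light vcpus are not throttled at all, while
the two heavy ones are each capped to 100 MB/s.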

> 
>>
>>>
>>>>
>>>>> sensitive to memory footprint.  In above 24G mem + 800MB/s dirty rate
>>>>> condition, dirty bitmap seems to be more efficient, say, collecting dirty
>>>>> bitmap of 24G mem (24G/4K/8=0.75MB) for each migration cycle is fast enough.
>>>>>
>>>>> Not to mention that current implementation of dirty ring in QEMU is not
>>>>> complete - we still have two more layers of dirty bitmap, so it's actually a
>>>>> mixture of dirty bitmap and dirty ring.  This series is more like a POC on
>>>>> dirty ring interface, so as to let QEMU be able to run on KVM dirty ring.
>>>>> E.g., we won't have hang issue when getting dirty pages since it's totally
>>>>> async, however we'll still have some legacy dirty bitmap issues e.g. memory
>>>>> consumption of userspace dirty bitmaps is still linear to memory footprint.
>>>> The plan looks good and coordinated, but I have a concern. Our dirty ring actually depends
>>>> on the structure of hardware logging buffer (PML buffer). We can't say it can be properly
>>>> adapted to all kinds of hardware design in the future.
>>>
>>> Sorry I don't get it - dirty ring can work with pure page wr-protect too?
>> Sure, it can. I just want to discuss many possible kinds of hardware logging buffer.
>> However, I'd like to stop at this, at least dirty ring works well with PML. :)
> 
> I see your point.  That'll be a good topic at least when we'd like to port
> dirty ring to other archs for sure.  However as you see I hoped we can start to
> use dirty ring first, find issues, fix it, even redesign some of it, make it
> really beneficial at least on one arch, then it'll make more sense to port it,
> or attract people porting it. :)
> 
> QEMU does not have a good solution for huge vm migration yet.  Maybe dirty
> ring is a good start for it, maybe not (e.g., with uffd minor mode postcopy has
> the other chance).  We'll see...
OK.
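
For the record, the 24G/4K/8 estimate quoted earlier in this mail can be
checked with a few lines of Python, together with a rough figure for how fast a
ring fills. This is only back-of-the-envelope arithmetic: the ring size of 4096
entries is an assumption of mine (it is not stated in this thread), and I
pretend a single vcpu dirties a new page every time at the full 800MB/s.

```python
PAGE_SIZE = 4096        # 4K pages, as in the 24G/4K/8 estimate
GiB = 1 << 30
MiB = 1 << 20


def bitmap_bytes(mem_bytes, page_size=PAGE_SIZE):
    """Size of a one-bit-per-page dirty bitmap, rounded up to whole bytes."""
    pages = -(-mem_bytes // page_size)   # ceiling division
    return -(-pages // 8)


def ring_fill_seconds(ring_entries, dirty_bytes_per_sec, page_size=PAGE_SIZE):
    """Time to fill a ring if every dirtied page creates a new entry."""
    pages_per_sec = dirty_bytes_per_sec / page_size
    return ring_entries / pages_per_sec


# Bitmap cost depends on memory size only: 24G of RAM gives 0.75 MiB.
print(bitmap_bytes(24 * GiB) / MiB)

# Ring pressure depends on dirty rate only (4096 entries is an assumed size).
print(ring_fill_seconds(4096, 800 * MiB))
```

Under those assumptions the bitmap is 0.75 MiB regardless of the dirty rate,
while the ring would fill roughly every 20 ms regardless of the memory size,
which matches the "sensitive to dirty rate" versus "sensitive to memory
footprint" description.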

Thanks,
Keqian
