Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO


From: Daniel P. Berrangé
Subject: Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
Date: Fri, 19 Apr 2024 11:20:03 +0100
User-agent: Mutt/2.2.12 (2023-09-09)

On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> On 18.04.2024 12:39, Daniel P. Berrangé wrote:
> > On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:
> > > On 17.04.2024 18:35, Daniel P. Berrangé wrote:
> > > > On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
> > > > > On 17.04.2024 10:36, Daniel P. Berrangé wrote:
> > > > > > On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> (..)
> > > > > > That said, the idea of reserving channels specifically for VFIO
> > > > > > doesn't make a whole lot of sense to me either.
> > > > > > 
> > > > > > Once we've done the RAM transfer, and are in the switchover phase
> > > > > > doing device state transfer, all the multifd channels are idle.
> > > > > > We should just use all those channels to transfer the device state,
> > > > > > in parallel.  Reserving channels just guarantees many idle channels
> > > > > > during RAM transfer, and further idle channels during vmstate
> > > > > > transfer.
> > > > > > 
> > > > > > IMHO it is more flexible to just use all available multifd channel
> > > > > > resources all the time.
> > > > > 
> > > > > The reason for having dedicated device state channels is that they
> > > > > provide lower downtime in my tests.
> > > > > 
> > > > > With either 15 or 11 mixed multifd channels (no dedicated device state
> > > > > channels) I get a downtime of about 1250 msec.
> > > > > 
> > > > > Comparing that with 15 total multifd channels / 4 dedicated device
> > > > > state channels, which gives a downtime of about 1100 ms, it means that
> > > > > using dedicated channels gets about a 14% downtime improvement.
> > > > 
> > > > Hmm, can you clarify: /when/ is the VFIO vmstate transfer taking
> > > > place ? Is it transferred concurrently with the RAM ? I had thought
> > > > this series still has the RAM transfer iterations running first,
> > > > and then the VFIO VM state at the end, simply making use of multifd
> > > > channels for parallelism of the end phase. Your reply, though, makes
> > > > me question my interpretation.
> > > > 
> > > > Let me try to illustrate channel flow in various scenarios, time
> > > > flowing left to right:
> > > > 
> > > > 1. serialized RAM, then serialized VM state  (ie historical migration)
> > > > 
> > > >         main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
> > > > 
> > > > 
> > > > 2. parallel RAM, then serialized VM state (ie today's multifd)
> > > > 
> > > >         main: | Init |                                            | VM state |
> > > >     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > 
> > > > 
> > > > 3. parallel RAM, then parallel VM state
> > > > 
> > > >         main: | Init |                                            | VM state |
> > > >     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd4:                                                     | VFIO VM state |
> > > >     multifd5:                                                     | VFIO VM state |
> > > > 
> > > > 
> > > > 4. parallel RAM and VFIO VM state, then remaining VM state
> > > > 
> > > >         main: | Init |                                            | VM state |
> > > >     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd4:        | VFIO VM state                              |
> > > >     multifd5:        | VFIO VM state                              |
> > > > 
> > > > 
> > > > I thought this series was implementing approx (3), but are you actually
> > > > implementing (4), or something else entirely ?
> > > 
> > > You are right that this series is approximately implementing the
> > > scheme described as number 3 in your diagrams.
> > 
> > > However, there are some additional details worth mentioning:
> > > * There's a relatively small (but non-zero) amount of VFIO data being
> > > transferred from the "save_live_iterate" SaveVMHandler while the VM is
> > > still running.
> > > 
> > > This is still happening via the main migration channel.
> > > Parallelizing this transfer in the future might make sense too,
> > > although obviously this doesn't impact the downtime.
> > > 
> > > * After the VM is stopped and downtime starts the main (~ 400 MiB)
> > > VFIO device state gets transferred via multifd channels.
> > > 
> > > However, these multifd channels (if they are not dedicated to device
> > > state transfer) aren't idle during that time.
> > > Rather they seem to be transferring the residual RAM data.
> > > 
> > > That's most likely what causes the additional observed downtime
> > > when dedicated device state transfer multifd channels aren't used.
> > 
> > Ahh yes, I forgot about the residual dirty RAM, that makes sense as
> > an explanation. Allow me to work through the scenarios though, as I
> > still think my suggestion not to have separate dedicated channels is
> > better....
> > 
> > 
> > Let's say, hypothetically, we have an existing deployment today that
> > uses 6 multifd channels for RAM. ie:
> >          main: | Init |                                            | VM state |
> >      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > 
> > That value of 6 was chosen because it corresponds to the amount
> > of network & CPU utilization the admin wants to allow for this
> > VM to migrate. All 6 channels are fully utilized at all times.
> > 
> > 
> > If we now want to parallelize VFIO VM state, the peak network
> > and CPU utilization the admin wants to reserve for the VM should
> > not change. Thus the admin will still want to configure only 6
> > channels total.
> > 
> > With your proposal the admin has to reduce RAM transfer to 4 of the
> > channels, in order to then reserve 2 channels for VFIO VM state, so we
> > get a flow like:
> > 
> >          main: | Init |                                            | VM state |
> >      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd5:                                                     | VFIO VM state |
> >      multifd6:                                                     | VFIO VM state |
> > 
> > This is bad, as it reduces performance of RAM transfer. VFIO VM
> > state transfer is better, but that's not a net win overall.
> > 
> > 
> > 
> > So let's say the admin was happy to increase the number of multifd
> > channels from 6 to 8.
> > 
> > This series proposes that they would leave RAM using 6 channels as
> > before, and now reserve the 2 extra ones for VFIO VM state:
> > 
> >          main: | Init |                                            | VM state |
> >      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd7:                                                     | VFIO VM state |
> >      multifd8:                                                     | VFIO VM state |
> > 
> > 
> > RAM would perform as well as it did historically, and VM state would
> > improve thanks to the 2 parallel channels and to not competing with the
> > residual RAM transfer.
> > 
> > This is what your latency comparison numbers show as a benefit for
> > this channel reservation design.
> > 
> > I believe this comparison is inappropriate / unfair though, as it is
> > comparing a situation with 6 total channels against a situation with
> > 8 total channels.
> > 
> > If the admin was happy to increase the total channels to 8, then they
> > should allow RAM to use all 8 channels, and then VFIO VM state +
> > residual RAM to also use the very same set of 8 channels:
> > 
> >          main: | Init |                                            | VM state |
> >      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
> >      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
> >      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
> >      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
> >      multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
> >      multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
> >      multifd7:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
> >      multifd8:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
> > 
> > This will speed up the initial RAM iters still further & the final
> > switchover phase even more. If residual RAM is larger than VFIO VM state,
> > then it will dominate the switchover latency, so having VFIO VM state
> > compete is not a problem. If VFIO VM state is larger than residual RAM,
> > then allowing it access to all 8 channels instead of only 2 channels
> > will be a clear win.
> 
> I re-did the measurement with an increased number of multifd channels,
> first to (total count/dedicated count) 25/0, then to 100/0.
> 
> The results did not improve:
> With the 25/0 mixed multifd channel config I still get around 1250 msec
> of downtime - the same as with the 15/0 or 11/0 mixed configs I measured
> earlier.
> 
> But with the (pretty insane) 100/0 mixed channel config the whole setup
> gets so far into the law of diminishing returns that the results actually
> get worse: the downtime is now about 1450 msec.
> I guess that's from all the extra overhead from switching between 100
> multifd channels.
> 
> I think one of the reasons for these results is that mixed (RAM + device
> state) multifd channels participate in the RAM sync process
> (MULTIFD_FLAG_SYNC) whereas dedicated device state channels don't.

Hmm, I wouldn't have expected the sync packets to have a significant
overhead on the wire. Looking at the code though I guess the issue
is that we're blocking I/O in /all/ threads, until all threads have
seen the sync packet.

eg in multifd_recv_sync_main

    for (i = 0; i < thread_count; i++) {
        /* wait until every recv thread has seen its sync packet */
        qemu_sem_wait(&multifd_recv_state->sem_sync);
    }

    for (i = 0; i < thread_count; i++) {
        MultiFDRecvParams *p = &multifd_recv_state->params[i];

        /* only then release each recv thread to carry on */
        qemu_sem_post(&p->sem_sync);
    }

and then in the recv thread we have:

    qemu_sem_post(&multifd_recv_state->sem_sync);
    qemu_sem_wait(&p->sem_sync);

so if any one of the recv threads is slow to receive the sync packet on
the wire, then its qemu_sem_post is delayed, and all the other recv
threads are kept idle until that sync packet arrives.
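
If it helps to see the pattern in isolation, here's a minimal standalone
sketch of that handshake using plain pthreads + POSIX semaphores. This is
not the QEMU code; the thread count, the names and the sleep() standing in
for a slow channel are all made up purely for illustration:

    /* barrier_sketch.c - build with: gcc -pthread barrier_sketch.c */
    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NTHREADS 4

    static sem_t state_sem_sync;            /* stands in for multifd_recv_state->sem_sync */
    static sem_t thread_sem_sync[NTHREADS]; /* stands in for each p->sem_sync */

    static void *recv_thread(void *opaque)
    {
        int id = (int)(long)opaque;

        /* pretend channel 0 is slow to receive its sync packet */
        if (id == 0) {
            sleep(2);
        }

        /* "sync packet seen": signal main, then block until main releases us */
        sem_post(&state_sem_sync);
        sem_wait(&thread_sem_sync[id]);

        printf("thread %d released\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t th[NTHREADS];
        int i;

        sem_init(&state_sem_sync, 0, 0);
        for (i = 0; i < NTHREADS; i++) {
            sem_init(&thread_sem_sync[i], 0, 0);
            pthread_create(&th[i], NULL, recv_thread, (void *)(long)i);
        }

        /* like the first loop above: wait until *every* thread has posted ... */
        for (i = 0; i < NTHREADS; i++) {
            sem_wait(&state_sem_sync);
        }
        /* ... and only then release any of them, so the one slow thread
         * keeps all the others idle for the full two seconds */
        for (i = 0; i < NTHREADS; i++) {
            sem_post(&thread_sem_sync[i]);
        }

        for (i = 0; i < NTHREADS; i++) {
            pthread_join(th[i], NULL);
        }
        return 0;
    }

Adding more channels just adds more waiters to the same barrier, which
would also fit your observation that going to 25 or even 100 mixed
channels didn't reduce the downtime.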

I'm not sure how much this all matters during the final switchover
phase though. We send syncs at the end of each iteration, and then
after sending the residual RAM. I'm not sure how that orders wrt
sending of the parallel VFIO state.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



