From: Peter Xu
Subject: Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
Date: Thu, 18 Apr 2024 16:02:49 -0400

On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> > On 18.04.2024 12:39, Daniel P. Berrangé wrote:
> > > On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:
> > > > On 17.04.2024 18:35, Daniel P. Berrangé wrote:
> > > > > On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
> > > > > > On 17.04.2024 10:36, Daniel P. Berrangé wrote:
> > > > > > On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> (..)
> > > > > > That said, the idea of reserving channels specifically for VFIO
> > > > > > doesn't make a whole lot of sense to me either.
> > > > > > 
> > > > > > Once we've done the RAM transfer, and are in the switchover phase
> > > > > > doing device state transfer, all the multifd channels are idle.
> > > > > > We should just use all those channels to transfer the device state,
> > > > > > in parallel.  Reserving channels just guarantees many idle channels
> > > > > > during RAM transfer, and further idle channels during vmstate
> > > > > > transfer.
> > > > > > 
> > > > > > IMHO it is more flexible to just use all available multifd channel
> > > > > > resources all the time.
> > > > > 
> > > > > The reason for having dedicated device state channels is that they
> > > > > provide lower downtime in my tests.
> > > > > 
> > > > > With either 15 or 11 mixed multifd channels (no dedicated device state
> > > > > channels) I get a downtime of about 1250 msec.
> > > > > 
> > > > > Comparing that with 15 total multifd channels / 4 dedicated device
> > > > > state channels, which give a downtime of about 1100 ms, using
> > > > > dedicated channels yields about a 14% downtime improvement.
> > > > 
> > > > Hmm, can you clarify /when/ the VFIO vmstate transfer takes
> > > > place? Is it transferred concurrently with the RAM? I had thought
> > > > this series still has the RAM transfer iterations running first,
> > > > and then the VFIO VMstate at the end, simply making use of multifd
> > > > channels for parallelism of the end phase. Your reply, though, makes
> > > > me question my interpretation.
> > > > 
> > > > Let me try to illustrate channel flow in various scenarios, time
> > > > flowing left to right:
> > > > 
> > > > 1. serialized RAM, then serialized VM state  (ie historical migration)
> > > > 
> > > >         main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
> > > > 
> > > > 
> > > > 2. parallel RAM, then serialized VM state (ie today's multifd)
> > > > 
> > > >         main: | Init |                                            | VM state |
> > > >     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > > 
> > > > 
> > > > 3. parallel RAM, then parallel VM state
> > > > 
> > > >         main: | Init |                                            | VM state |
> > > >     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd4:                                                     | VFIO VM state |
> > > >     multifd5:                                                     | VFIO VM state |
> > > > 
> > > > 
> > > > 4. parallel RAM and VFIO VM state, then remaining VM state
> > > > 
> > > >         main: | Init |                                            | VM state |
> > > >     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
> > > >     multifd4:        | VFIO VM state                               |
> > > >     multifd5:        | VFIO VM state                               |
> > > > 
> > > > 
> > > > I thought this series was implementing approx (3), but are you actually
> > > > implementing (4), or something else entirely ?
> > > 
> > > You are right that this series approximately implements the schema
> > > described as number 3 in your diagrams.
> > 
> > > However, there are some additional details worth mentioning:
> > > * There's some VFIO data being transferred from the
> > > "save_live_iterate" SaveVMHandler while the VM is still running, but
> > > the amount is relatively small.
> > > 
> > > This is still happening via the main migration channel.
> > > Parallelizing this transfer in the future might make sense too,
> > > although obviously this doesn't impact the downtime.
> > > 
> > > * After the VM is stopped and downtime starts, the main (~400 MiB)
> > > VFIO device state gets transferred via multifd channels.
> > > 
> > > However, these multifd channels (if they are not dedicated to device
> > > state transfer) aren't idle during that time.
> > > Rather, they seem to be transferring the residual RAM data.
> > > 
> > > That's most likely what causes the additional observed downtime
> > > when dedicated device state transfer multifd channels aren't used.
> > 
> > Ahh yes, I forgot about the residual dirty RAM; that makes sense as
> > an explanation. Allow me to work through the scenarios though, as I
> > still think my suggestion to not have separate dedicated channels is
> > better....
> > 
> > 
> > Let's say, hypothetically, we have an existing deployment today that
> > uses 6 multifd channels for RAM, i.e.:
> >          main: | Init |                                            | VM state |
> >      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> > 
> > That value of 6 was chosen because it corresponds to the amount
> > of network & CPU utilization the admin wants to allow for this
> > VM to migrate. All 6 channels are fully utilized at all times.
> > 
> > 
> > If we now want to parallelize VFIO VM state, the peak network
> > and CPU utilization the admin wants to reserve for the VM should
> > not change. Thus the admin will still want to configure only 6
> > channels total.
> > 
> > With your proposal the admin has to reduce RAM transfer to 4 of the
> > channels, in order to then reserve 2 channels for VFIO VM state, so we
> > get a flow like:
> > 
> >          main: | Init |                                            | VM state |
> >      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd5:                                                     | VFIO VM state |
> >      multifd6:                                                     | VFIO VM state |
> > 
> > This is bad, as it reduces performance of RAM transfer. VFIO VM
> > state transfer is better, but that's not a net win overall.
> > 
> > 
> > 
> > So let's say the admin was happy to increase the number of multifd
> > channels from 6 to 8.
> > 
> > This series proposes that they would leave RAM using 6 channels as
> > before, and now reserve the 2 extra ones for VFIO VM state:
> > 
> >          main: | Init |                                            | VM state |
> >      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> >      multifd7:                                                     | VFIO VM state |
> >      multifd8:                                                     | VFIO VM state |
> > 
> > 
> > RAM would perform as well as it did historically, and VM state would
> > improve due to the 2 parallel channels and to not competing with the
> > residual RAM transfer.
> > 
> > This is what your latency comparison numbers show as a benefit for
> > this channel reservation design.
> > 
> > I believe this comparison is inappropriate / unfair though, as it is
> > comparing a situation with 6 total channels against a situation with
> > 8 total channels.
> > 
> > If the admin was happy to increase the total channels to 8, then they
> > should allow RAM to use all 8 channels, and then VFIO VM state +
> > residual RAM to also use the very same set of 8 channels:
> > 
> >          main: | Init |                                            | VM state |
> >      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
> >      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
> >      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
> >      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
> >      multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
> >      multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
> >      multifd7:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
> >      multifd8:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state |
> > 
> > This will speed up the initial RAM iters still further & the final
> > switchover phase even more. If residual RAM is larger than VFIO VM state,
> > then it will dominate the switchover latency, so having VFIO VM state
> > compete is not a problem. If VFIO VM state is larger than residual RAM,
> > then allowing it access to all 8 channels instead of only 2 channels
> > will be a clear win.
> 
> I re-did the measurement with an increased number of multifd channels,
> first to (total count/dedicated count) 25/0, then to 100/0.
> 
> The results did not improve:
> With the 25/0 mixed multifd channel config I still get around 1250 msec
> downtime - the same as with the 15/0 or 11/0 mixed configs I measured
> earlier.
> 
> But with the (pretty insane) 100/0 mixed channel config the whole setup
> gets so far into the law of diminishing returns that the results actually
> get worse: the downtime is now about 1450 msec.
> I guess that's from all the extra overhead of switching between 100
> multifd channels.

100 threads are probably too much indeed.

However, I agree with the question Dan raised, and I'd like to second it.
So far it looks better if the multifd channels can be managed just like a
pool of workers without assignments to specific jobs.  It looks like this
series is already getting there; it's a pity we lose that genericity only
because of some side effects on the RAM sync semantics.

> 
> I think one of the reasons for these results is that mixed (RAM + device
> state) multifd channels participate in the RAM sync process
> (MULTIFD_FLAG_SYNC) whereas the dedicated device state channels don't.

Firstly, I'm wondering whether we can have better names for these new
hooks.  Currently (commenting only on the async* stuff):

  - complete_precopy_async
  - complete_precopy
  - complete_precopy_async_wait

But perhaps better:

  - complete_precopy_begin
  - complete_precopy
  - complete_precopy_end

?

I don't see why the device must do something asynchronous in such a hook.
To me it's more like you're splitting one process into multiple phases, so
begin/end sounds more generic.
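
To make the naming concrete, here's a minimal sketch of how such hooks could
sit next to the existing complete_precopy hook.  This is not the real
SaveVMHandlers definition - the struct name and the signatures below are my
assumptions for illustration only, mirroring the existing hook's
(QEMUFile *, void *opaque) form:

    /* Sketch only: illustrating the proposed _begin/_end naming.  The
     * struct name and signatures are assumptions, not actual QEMU code. */
    typedef struct QEMUFile QEMUFile;   /* opaque handle, as in migration code */

    typedef struct SaveVMHandlersSketch {
        /* start any per-device completion work, e.g. spawn transfer threads */
        int (*save_live_complete_precopy_begin)(QEMUFile *f, void *opaque);
        /* existing synchronous completion hook, unchanged */
        int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
        /* wait for / finish whatever the _begin hook started */
        int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
    } SaveVMHandlersSketch;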

Then, with that in mind, IIUC we can already split ram_save_complete()
into more than one phase too.  For example, I would be curious whether the
performance goes back to normal if we offload multifd_send_sync_main() into
complete_precopy_end(), because we really only need one shot of that, and I
am quite surprised it already greatly affects VFIO dumping its own things.
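
As a rough illustration of what I mean (a sketch only - the function name
ram_save_complete_precopy_end() is made up, and I'm assuming
multifd_send_sync_main() keeps roughly its current shape):

    /* Sketch, not actual QEMU code.  Prototypes assumed for illustration: */
    typedef struct QEMUFile QEMUFile;
    int multifd_send_sync_main(void);   /* signature assumed from current code */

    /* RAM's completion path would skip the per-call multifd sync; instead a
     * single sync is issued from the new _end hook, after all channels have
     * queued their residual RAM (and possibly device state). */
    static int ram_save_complete_precopy_end(QEMUFile *f, void *opaque)
    {
        (void)f;
        (void)opaque;
        return multifd_send_sync_main();   /* one shot for the whole phase */
    }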

I would even go one step further than what Dan was asking: have you thought
about dumping VFIO states via multifd even during iterations?  Would that
help even more than this series (which IIUC only helps during the blackout
phase)?

It could mean that the "async*" hooks can be done differently, and I'm not
sure whether they're needed at all, e.g. when threads are created during
save_setup but cleaned up in save_cleanup.
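
For instance, something along these lines might already be enough for thread
lifetime management (a sketch only, using plain pthreads instead of QEMU's
thread helpers, with made-up names and an arbitrary worker count):

    /* Sketch: per-device worker threads whose lifetime is bounded by
     * save_setup/save_cleanup, so no dedicated async hooks are needed
     * just to start and join them. */
    #include <pthread.h>
    #include <stddef.h>

    #define NUM_STATE_WORKERS 4   /* arbitrary for the example */

    static pthread_t state_workers[NUM_STATE_WORKERS];

    static void *device_state_worker(void *opaque)
    {
        /* ... queue device state chunks onto a multifd channel ... */
        return NULL;
    }

    static int sketch_save_setup(void *opaque)
    {
        for (size_t i = 0; i < NUM_STATE_WORKERS; i++) {
            if (pthread_create(&state_workers[i], NULL,
                               device_state_worker, opaque)) {
                return -1;
            }
        }
        return 0;
    }

    static void sketch_save_cleanup(void *opaque)
    {
        (void)opaque;
        for (size_t i = 0; i < NUM_STATE_WORKERS; i++) {
            pthread_join(state_workers[i], NULL);
        }
    }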

Thanks,

> 
> It is possible that there are other subtle performance interactions too,
> but I am not 100% sure about that.
> 
> > With regards,
> > Daniel
> 
> Best regards,
> Maciej
> 

-- 
Peter Xu



