qemu-devel

Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer


From: Maciej S. Szmigiero
Subject: Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
Date: Thu, 18 Apr 2024 11:50:12 +0200
User-agent: Mozilla Thunderbird

On 17.04.2024 18:35, Daniel P. Berrangé wrote:
On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
On 17.04.2024 10:36, Daniel P. Berrangé wrote:
On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

VFIO device state transfer is currently done via the main migration channel.
This means that transfers from multiple VFIO devices are done sequentially
and via just a single common migration channel.

Transferring VFIO device state migration data this way reduces
performance and severely impacts the migration downtime (~50%) for VMs
that have multiple such devices with large state sizes - see the test
results below.

However, we already have a way to transfer migration data using multiple
connections - that's what multifd channels are.

Unfortunately, multifd channels are currently utilized for RAM transfer
only.
This patch set adds a new framework allowing their use for device state
transfer too.

The wire protocol is based on Avihai's x-channel-header patches, which
introduce a header for migration channels that allows the migration source
to explicitly indicate the migration channel type, without having the
target deduce the channel type by peeking at the channel's content.

The new wire protocol can be switched on and off via the
migration.x-channel-header option for compatibility with older QEMU
versions and for testing.
Switching the new wire protocol off also disables device state transfer via
multifd channels.
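
For illustration, such an explicit channel header could look roughly like
the sketch below (field names and layout are hypothetical here, not taken
from Avihai's actual patches):

/* Hypothetical sketch of an explicit channel header, for illustration
 * only - not the actual layout from the x-channel-header patches.
 * The source writes this as the first bytes on every migration channel,
 * so the destination never has to peek at the payload magic. */
#include <stdint.h>

typedef enum {
    MIG_CHANNEL_TYPE_MAIN         = 0,
    MIG_CHANNEL_TYPE_MULTIFD      = 1,
    MIG_CHANNEL_TYPE_DEVICE_STATE = 2,   /* e.g. VFIO device state */
} MigChannelType;

typedef struct {
    uint32_t magic;      /* fixed value identifying a migration channel */
    uint32_t version;    /* header format version */
    uint8_t  type;       /* one of MigChannelType */
    uint8_t  id;         /* channel index within its type */
} __attribute__((packed)) MigChannelHeader;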

The device state transfer can happen either via the same multifd channels
that carry the RAM data, mixed with that data (when
migration.x-multifd-channels-device-state is 0), or exclusively via
dedicated device state transfer channels (when
migration.x-multifd-channels-device-state > 0).

Using dedicated device state transfer multifd channels brings further
performance benefits since these channels don't need to participate in
the RAM sync process.
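
For illustration, the resulting channel-split policy could be expressed
roughly like the sketch below (the helper name and example values are made
up, not taken from the patches):

/* Sketch of the channel-split policy described above.  With, say,
 * multifd_channels = 15 and multifd_channels_device_state = 4,
 * channels 0..10 would carry RAM and channels 11..14 device state. */
#include <stdbool.h>

static bool multifd_channel_is_device_state(int channel_id,
                                            int multifd_channels,
                                            int multifd_channels_device_state)
{
    if (multifd_channels_device_state == 0) {
        /* 0 means no dedicated channels: device state is mixed with RAM */
        return false;
    }
    /* Reserve the last N channels exclusively for device state */
    return channel_id >= multifd_channels - multifd_channels_device_state;
}
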
I'm not convinced there's any need to introduce the new "channel header"
protocol messages. The multifd channels already have an initialization
message that is extensible to allow extra semantics to be indicated.
So if we want some of the multifd channels to be reserved for device
state, we could indicate that via some data in the MultiFDInit_t
message struct.
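
Roughly speaking, something along the lines of the sketch below (the layout
is only an approximation of the existing MultiFDInit_t, and the 'role'
field is hypothetical):

/* Approximate sketch of the per-channel init message, with a role flag
 * carved out of the currently spare bytes; 'role' is hypothetical. */
#include <stdint.h>

typedef struct {
    uint32_t magic;
    uint32_t version;
    unsigned char uuid[16];   /* QemuUUID */
    uint8_t  id;              /* channel number */
    uint8_t  role;            /* hypothetical: 0 = RAM, 1 = device state */
    uint8_t  unused1[6];      /* remaining spare bytes */
    uint64_t unused2[4];      /* reserved for future use */
} __attribute__((packed)) MultiFDInit_t;
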
The reason for introducing x-channel-header was to avoid having to deduce
the channel type by peeking at the channel's content - where any channel
that does not start with QEMU_VM_FILE_MAGIC is currently treated as a
multifd one (see the simplified sketch below).
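
For reference, that heuristic boils down to roughly the following on the
destination side (a simplified sketch, not the literal QEMU code):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

#define QEMU_VM_FILE_MAGIC 0x5145564d   /* "QEVM" */

/* Peek the first 4 bytes of a newly accepted channel; anything that is
 * not the main-stream magic is assumed to be a multifd (or postcopy)
 * channel. */
static bool channel_is_main_stream(const uint8_t peeked[4])
{
    uint32_t magic;

    memcpy(&magic, peeked, sizeof(magic));
    return ntohl(magic) == QEMU_VM_FILE_MAGIC;
}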

But if this isn't desired then, as you say, the multifd channel type can
be indicated by using some unused field of the MultiFDInit_t message.

Of course, this would still keep the QEMU_VM_FILE_MAGIC heuristic then.
I don't like the heuristics we currently have, and would like to have
a better solution. What makes me cautious is that this proposal
is a protocol change, yet it only addresses one very narrow problem
with the migration protocol.

I'd like migration to see a more explicit bi-directional protocol
negotiation message set, where both QEMUs can auto-negotiate amongst
themselves many of the features that currently require tedious
manual configuration by mgmt apps via migrate parameters/capabilities.
That would address the problem you describe here, and so much more.
Isn't the capability negotiation handled automatically by libvirt
today?
I guess you'd prefer QEMU to handle it internally instead?

If we add this channel header feature now, it creates yet another
thing to keep around for backwards compatibility. So if this is not
strictly required in order to solve the VFIO VMstate problem, I'd
prefer to just solve the VMstate stuff on its own.
Okay, got it.

That said, the idea of reserving channels specifically for VFIO doesn't
make a whole lot of sense to me either.

Once we've done the RAM transfer, and are in the switchover phase
doing device state transfer, all the multifd channels are idle.
We should just use all those channels to transfer the device state,
in parallel.  Reserving channels just guarantees many idle channels
during RAM transfer, and further idle channels during vmstate
transfer.

IMHO it is more flexible to just use all available multifd channel
resources all the time.
The reason for having dedicated device state channels is that they
provide lower downtime in my tests.

With either 15 or 11 mixed multifd channels (no dedicated device state
channels) I get a downtime of about 1250 msec.

Comparing that with 15 total multifd channels / 4 dedicated device
state channels, which give a downtime of about 1100 ms, using
dedicated channels brings about a 14% downtime improvement.
Hmm, can you clarify /when/ the VFIO vmstate transfer takes
place? Is it transferred concurrently with the RAM? I had thought
this series still has the RAM transfer iterations running first,
and then the VFIO VMstate at the end, simply making use of multifd
channels for parallelism of the end phase. Your reply, though, makes
me question my interpretation.

Let me try to illustrate channel flow in various scenarios, time
flowing left to right:

1. serialized RAM, then serialized VM state  (ie historical migration)

       main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |


2. parallel RAM, then serialized VM state (ie today's multifd)

       main: | Init |                                            | VM state |
   multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
   multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
   multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |


3. parallel RAM, then parallel VM state

       main: | Init |                                            | VM state |
   multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
   multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
   multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
   multifd4:                                                     | VFIO VM state |
   multifd5:                                                     | VFIO VM state |


4. parallel RAM and VFIO VM state, then remaining VM state

       main: | Init |                                            | VM state |
   multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
   multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
   multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
   multifd4:        | VFIO VM state                                         |
   multifd5:        | VFIO VM state                                         |


I thought this series was implementing approx (3), but are you actually
implementing (4), or something else entirely ?
You are right that this series is approximately implementing
the scheme described as number 3 in your diagrams.

However, there are some additional details worth mentioning:
* There's some, but a relatively small, amount of VFIO data being
transferred from the "save_live_iterate" SaveVMHandler while the VM is
still running (see the sketch after this list).

This is still happening via the main migration channel.
Parallelizing this transfer in the future might make sense too,
although obviously this doesn't impact the downtime.

* After the VM is stopped and the downtime starts, the main (~400 MiB)
VFIO device state gets transferred via multifd channels.

However, these multifd channels (if they are not dedicated to device
state transfer) aren't idle during that time.
Rather, they seem to be transferring the residual RAM data.

That's most likely what causes the additional observed downtime
when dedicated device state transfer multifd channels aren't used.
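
As a rough sketch of the two save phases involved (stub types included so
it stands alone; the callback names mirror the generic SaveVMHandlers ones
and are not code lifted from this series):

/* Sketch of the two phases: the iterative part runs while the VM is
 * live and stays on the main channel, while the completion part runs
 * during downtime and, with this series, goes over multifd channels. */
#include <stddef.h>

typedef struct QEMUFile QEMUFile;                 /* opaque stand-in */

typedef struct SaveVMHandlersSketch {
    /* VM still running: called repeatedly, sends a relatively small
     * amount of VFIO data via the main migration channel. */
    int (*save_live_iterate)(QEMUFile *f, void *opaque);

    /* VM stopped, downtime running: sends the bulk of the device state
     * (~400 MiB in the tests above); with this series this part goes
     * over multifd channels instead of the main channel. */
    int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
} SaveVMHandlersSketch;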

With regards,
Daniel
Best regards,
Maciej



