qemu-devel

Re: MultiFD and default channel out of order mapping on receive side.


From: Peter Xu
Subject: Re: MultiFD and default channel out of order mapping on receive side.
Date: Mon, 17 Oct 2022 17:15:35 -0400

On Mon, Oct 17, 2022 at 12:38:30PM +0100, Daniel P. Berrangé wrote:
> On Mon, Oct 17, 2022 at 01:06:00PM +0530, manish.mishra wrote:
> > Hi Daniel,
> > 
> > I was thinking about some solutions for this, so I wanted to discuss them
> > before going ahead. I have also added Juan and Peter to the loop.
> > 
> > 1. Earlier I was thinking that, on the destination side, the first data
> > sent on both the default and multifd channels is currently MAGIC_NUMBER
> > and VERSION, so maybe we can decide the mapping based on that. But that
> > does not work for the newly added postcopy preempt channel, as it does
> > not send any magic number. Also, even for multifd, the magic number alone
> > does not tell us which multifd channel number it is, although as far as
> > I can see that does not matter. So the magic number should be good
> > enough for identifying default vs. multifd channels?
> 
> Yep, you don't need to know more than the MAGIC value.
> 
> In migration_io_process_incoming, we need to use MSG_PEEK to look at
> the first 4 bytes pending on the wire. If those bytes are 'QEVM' that's
> the primary channel, if those bytes are big endian 0x11223344, that's
> a multifd channel.  Using MSG_PEEK avoids the need to modify the later
> code that actually reads this data.
> 
> The challenge is how long to wait with the MSG_PEEK. If we do it
> in a blocking mode, it's fine for main channel and multifd, but
> IIUC for the post-copy pre-empt channel we'd be waiting for
> something that will never arrive.
> 
> Having suggested MSG_PEEK though, this may well not work if the
> channel has TLS present. In fact it almost definitely won't work.
> 
> To cope with TLS, migration_io_process_incoming would need to
> actually read the data off the wire, and later methods would need
> to be taught to skip reading the magic.
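
A minimal sketch of the MSG_PEEK idea described above, for a plain
(non-TLS) stream socket only. The helper name, the enum and the macro
names are hypothetical and are not existing QEMU code; only the two magic
values come from the thread. It peeks at the first four pending bytes and
leaves them in the socket buffer, so later readers would still see them.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Magic values as quoted in the thread, compared after ntohl(). */
#define MAIN_CHANNEL_MAGIC 0x5145564dU /* ASCII "QEVM" */
#define MULTIFD_MAGIC      0x11223344U

enum channel_kind {
    CHANNEL_ERROR = -1,  /* recv() failed or the peer closed the socket */
    CHANNEL_UNKNOWN = 0, /* fewer than 4 bytes pending, or no known magic */
    CHANNEL_MAIN,
    CHANNEL_MULTIFD,
};

/*
 * Hypothetical helper, not a QEMU function: look at the first four bytes
 * pending on the socket without consuming them. On a blocking socket the
 * call waits until at least one byte has arrived.
 */
static enum channel_kind peek_channel_kind(int fd)
{
    uint32_t wire;
    ssize_t n = recv(fd, &wire, sizeof(wire), MSG_PEEK);

    if (n <= 0) {
        return CHANNEL_ERROR;
    }
    if ((size_t)n < sizeof(wire)) {
        return CHANNEL_UNKNOWN;
    }
    switch (ntohl(wire)) {
    case MAIN_CHANNEL_MAGIC:
        return CHANNEL_MAIN;
    case MULTIFD_MAGIC:
        return CHANNEL_MULTIFD;
    default:
        return CHANNEL_UNKNOWN;
    }
}
```

Because the call blocks until some data arrives, a channel that never
sends a magic (the preempt case above) would stall here unless a timeout
is added. On a TLS channel the raw socket carries TLS records rather
than the magic itself, so this sketch does not apply there.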
> 
> > 2. For postcopy preempt, maybe we can initiate this channel only
> > after we have received a request from the remote, e.g. a remote page
> > fault. This looks safest to me, considering the postcopy recovery case
> > too. I cannot think of any dependency on the postcopy preempt channel
> > which requires it to be initialised very early. Maybe Peter can confirm
> > this.
> 
> I guess that could work

Currently all the preempt code still assumes that once postcopy is
activated, it is in preempt mode.  IIUC such a change will introduce an
extra phase of postcopy without preempt before preempt is enabled.  We may
need to teach QEMU to understand that if it's needed.

Meanwhile the initial page requests will not be able to benefit from the
new preempt channel either.

> 
> > 3. Another thing we can do is to have a 2-way handshake on every
> > channel creation with some additional metadata. This looks to me
> > like the cleanest and most durable approach. I understand that it can
> > break migration to/from old QEMU, but then it could come as a migration
> > capability?
> 
> The benefit of (1) is that the fix can be deployed for all existing
> QEMU releases by backporting it.  (3) will meanwhile need mgmt app
> updates to make it work, which is much more work to deploy.
> 
> We really should have had a more formal handshake, and I've described
> ways to achieve this in the past, but it is quite a lot of work.

I don't know whether (1) is a valid option if there are use cases that it
cannot cover (on either tls or preempt).  The handshake is definitely the
clean approach.
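
To make the handshake idea (3) concrete, here is a rough sketch of what a
per-channel hello could look like.  Everything in it is hypothetical: the
message layout, the HELLO_MAGIC value, the type numbers and the function
names are made up for illustration and do not exist in QEMU.  The source
sends a fixed 12-byte hello naming the channel type and index, and the
destination answers with a one-byte ack before any migration data flows.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Made-up constants for this sketch only. */
#define HELLO_MAGIC 0x48454c4fU /* ASCII "HELO" */

enum hello_type { HELLO_MAIN = 0, HELLO_MULTIFD = 1, HELLO_PREEMPT = 2 };

struct channel_hello {
    uint32_t type; /* one of enum hello_type */
    uint32_t id;   /* e.g. multifd channel index */
};

/* Source side: announce which channel this connection is. */
static int send_hello(int fd, uint32_t type, uint32_t id)
{
    uint32_t wire[3] = { htonl(HELLO_MAGIC), htonl(type), htonl(id) };

    return send(fd, wire, sizeof(wire), 0) == (ssize_t)sizeof(wire) ? 0 : -1;
}

/* Destination side: read the hello, then reply 1 (accepted) or 0. */
static int recv_hello(int fd, struct channel_hello *out)
{
    uint32_t wire[3];
    uint8_t ack;

    if (recv(fd, wire, sizeof(wire), MSG_WAITALL) != (ssize_t)sizeof(wire)) {
        return -1;
    }
    ack = (ntohl(wire[0]) == HELLO_MAGIC);
    if (send(fd, &ack, 1, 0) != 1 || !ack) {
        return -1;
    }
    out->type = ntohl(wire[1]);
    out->id = ntohl(wire[2]);
    return 0;
}

/* Source side: wait for the destination's verdict before sending data. */
static int wait_ack(int fd)
{
    uint8_t ack = 0;

    return (recv(fd, &ack, 1, MSG_WAITALL) == 1 && ack == 1) ? 0 : -1;
}
```

With something like this the destination would map each connection by
what it declares rather than by arrival order, and since the exchange
happens above the transport it would apply equally to TLS and to the
preempt channel.  The cost is the compatibility break already mentioned,
hence the need for a capability.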

What's the outcome of such wrongly ordered connections?  Will migration
fail immediately and safely?

For multifd, I think it should fail immediately after the connection is
established.

For preempt, I'd expect the same thing, because the only wrong ordering
that can happen right now is the preempt channel being taken as the
migration channel, and then it should also fail immediately on the first
qemu_get_byte().

Hopefully that's still not too bad - I mean, if we can fail consistently and
safely (never failing during postcopy), we can always retry, and as long as
the connections are created successfully we can start the migration safely.
But please correct me if that's not the case.

-- 
Peter Xu



