Re: [PATCH v2 2/6] vhost-user-blk: Don't reconnect during initialisation


From: Kevin Wolf
Subject: Re: [PATCH v2 2/6] vhost-user-blk: Don't reconnect during initialisation
Date: Tue, 4 May 2021 11:27:12 +0200

On 04.05.2021 at 10:59, Michael S. Tsirkin wrote:
> On Thu, Apr 29, 2021 at 07:13:12PM +0200, Kevin Wolf wrote:
> > This is a partial revert of commits 77542d43149 and bc79c87bcde.
> > 
> > Usually, an error during initialisation means that the configuration was
> > wrong. Reconnecting won't make the error go away, but just turn the
> > error condition into an endless loop. Avoid this and return errors
> > again.
> 
> So there are several possible reasons for an error:
> 
> 1. remote restarted - we would like to reconnect,
>    this was the original use-case for reconnect.
> 
>    I am not very happy that we are killing this use case.

This patch is killing it only during initialisation, where a restart of
the remote is quite unlikely compared to normal operation and where the
current implementation is rather broken. So reverting the broken feature
and going back to a simpler, correct state feels like a good idea to me.

The idea is to add the "retry during initialisation" feature back on top
of this, but it requires some more changes in the error paths so that we
can actually distinguish different kinds of errors and don't retry when
we already know that it can't succeed.
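
Concretely, that means the error paths have to return something the
callers can classify. Roughly along these lines (all names are made up
for illustration, this is not code from the series):

    #include <errno.h>

    typedef enum {
        VU_BLK_ERR_NONE = 0,
        VU_BLK_ERR_CONNECTION,  /* backend went away, reconnecting may help */
        VU_BLK_ERR_CONFIG,      /* e.g. num-queues mismatch, retrying is useless */
    } VuBlkErrClass;

    static VuBlkErrClass vu_blk_classify_error(int ret)
    {
        /* assumption for this sketch: connection-level failures surface as
         * -EPROTO or -ECONNRESET, everything else is a fatal local error */
        if (ret == -EPROTO || ret == -ECONNRESET) {
            return VU_BLK_ERR_CONNECTION;
        }
        return ret < 0 ? VU_BLK_ERR_CONFIG : VU_BLK_ERR_NONE;
    }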

> 2. qemu detected an error and closed the connection
>    looks like we try to handle that by reconnect,
>    this is something we should address.

Yes, if qemu produces the error locally, retrying is useless.

> 3. remote failed due to a bad command from qemu.
>    this use case isn't well supported at the moment.
> 
>    How about supporting it on the remote side? I think that if the
>    data is well-formed but just has a configuration the remote cannot
>    support, then instead of closing the connection the remote can wait
>    for commands with need_reply set and respond with an error. Or at
>    least do it if VHOST_USER_PROTOCOL_F_REPLY_ACK has been negotiated.
>    If VHOST_USER_SET_VRING_ERR is used then signalling that fd might
>    also be reasonable.
> 
>    OTOH if qemu is buggy and sends malformed data and the remote
>    detects that, then having qemu retry forever is ok; it might
>    actually be beneficial for debugging.

I haven't really checked this case yet; it seems to be less common.
Explicitly communicating an error is certainly better than just cutting
the connection. But as you say, it means QEMU is buggy, so blindly
retrying in this case is kind of acceptable.
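
For reference, the shape of what you describe on the backend side would
be roughly this. The flag values are the ones from the vhost-user spec;
the message struct is simplified and the helper is hypothetical:

    #include <stdint.h>

    #define VHOST_USER_VERSION          0x1u
    #define VHOST_USER_REPLY_MASK       (0x1u << 2)
    #define VHOST_USER_NEED_REPLY_MASK  (0x1u << 3)

    /* simplified message layout: the real payload is a union and the
     * struct is packed on the wire, none of which matters for the idea */
    typedef struct VhostUserMsg {
        uint32_t request;
        uint32_t flags;
        uint32_t size;
        uint64_t u64;
    } VhostUserMsg;

    /* hypothetical helper in a backend's request dispatcher: if the master
     * asked for a reply (need_reply set, REPLY_ACK negotiated), report the
     * failure in the reply instead of closing the socket */
    static int vu_reply_error(const VhostUserMsg *req, uint64_t err,
                              int (*send_msg)(const VhostUserMsg *))
    {
        VhostUserMsg reply = {
            .request = req->request,
            .flags   = VHOST_USER_VERSION | VHOST_USER_REPLY_MASK,
            .size    = sizeof(uint64_t),
            .u64     = err,              /* non-zero means the request failed */
        };

        if (!(req->flags & VHOST_USER_NEED_REPLY_MASK)) {
            return -1;                   /* master doesn't expect a reply */
        }
        return send_msg(&reply);
    }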

Raphael suggested that we could limit the number of retries during
initialisation so that it at least wouldn't result in a hang.
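
Combined with an error classification like the one sketched above, I
imagine the realize path would then end up looking roughly like this
(again just a sketch, the bound is arbitrary):

    #define VU_BLK_MAX_REALIZE_RETRIES 3    /* arbitrary bound */

    /* sketch only: retry the whole connect/init sequence a bounded number
     * of times, and only for errors where reconnecting can actually help */
    static int vu_blk_realize_connect(int (*connect_once)(void))
    {
        int ret = 0;
        int i;

        for (i = 0; i <= VU_BLK_MAX_REALIZE_RETRIES; i++) {
            ret = connect_once();
            if (vu_blk_classify_error(ret) != VU_BLK_ERR_CONNECTION) {
                break;  /* success, or an error that a retry cannot fix */
            }
        }
        return ret;
    }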

> > Additionally, calling vhost_user_blk_disconnect() from the chardev event
> > handler could result in use-after-free because none of the
> > initialisation code expects that the device could just go away in the
> > middle. So removing the call fixes crashes in several places.
> > For example, using a num-queues setting that is incompatible with the
> > backend would result in a crash like this (dereferencing dev->opaque,
> > which is already NULL):
> > 
> >  #0  0x0000555555d0a4bd in vhost_user_read_cb (source=0x5555568f4690, condition=(G_IO_IN | G_IO_HUP), opaque=0x7fffffffcbf0) at ../hw/virtio/vhost-user.c:313
> >  #1  0x0000555555d950d3 in qio_channel_fd_source_dispatch (source=0x555557c3f750, callback=0x555555d0a478 <vhost_user_read_cb>, user_data=0x7fffffffcbf0) at ../io/channel-watch.c:84
> >  #2  0x00007ffff7b32a9f in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
> >  #3  0x00007ffff7b84a98 in g_main_context_iterate.constprop () at /lib64/libglib-2.0.so.0
> >  #4  0x00007ffff7b32163 in g_main_loop_run () at /lib64/libglib-2.0.so.0
> >  #5  0x0000555555d0a724 in vhost_user_read (dev=0x555557bc62f8, msg=0x7fffffffcc50) at ../hw/virtio/vhost-user.c:402
> >  #6  0x0000555555d0ee6b in vhost_user_get_config (dev=0x555557bc62f8, config=0x555557bc62ac "", config_len=60) at ../hw/virtio/vhost-user.c:2133
> >  #7  0x0000555555d56d46 in vhost_dev_get_config (hdev=0x555557bc62f8, config=0x555557bc62ac "", config_len=60) at ../hw/virtio/vhost.c:1566
> >  #8  0x0000555555cdd150 in vhost_user_blk_device_realize (dev=0x555557bc60b0, errp=0x7fffffffcf90) at ../hw/block/vhost-user-blk.c:510
> >  #9  0x0000555555d08f6d in virtio_device_realize (dev=0x555557bc60b0, errp=0x7fffffffcff0) at ../hw/virtio/virtio.c:3660
> 
> Right. So that's definitely something to fix.
> 
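
Yes. To illustrate what removing that call amounts to, here is a
stand-alone mock-up of the event handler (all names are made up, this is
not the actual QEMU code): while realize() is still running, a close
event must not tear the device down; the pending read just fails and
realize() reports the error through its normal error path.

    #include <stdbool.h>
    #include <stdio.h>

    enum ChrEvent { CHR_EVENT_OPENED, CHR_EVENT_CLOSED };

    struct VuBlkState {
        bool realize_done;      /* illustrative flag, not a real field name */
    };

    static void vu_blk_disconnect(struct VuBlkState *s)
    {
        (void)s;                /* real code tears down the vhost device here */
        printf("disconnecting\n");
    }

    static void vu_blk_chr_event(struct VuBlkState *s, int event)
    {
        if (event != CHR_EVENT_CLOSED) {
            return;
        }
        if (!s->realize_done) {
            /* realize() is still on the stack: don't free anything behind
             * its back, just let the pending vhost_user_read() fail and
             * have realize() return the error */
            return;
        }
        vu_blk_disconnect(s);
    }
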
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>

Kevin



