qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2 5/5] vhost: add device started check in migration set log


From: Jason Wang
Subject: Re: [PATCH v2 5/5] vhost: add device started check in migration set log
Date: Wed, 13 May 2020 11:20:50 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.7.0


On 2020/5/12 下午5:35, Dima Stepanov wrote:
On Tue, May 12, 2020 at 11:32:50AM +0800, Jason Wang wrote:
On 2020/5/11 下午5:25, Dima Stepanov wrote:
On Mon, May 11, 2020 at 11:15:53AM +0800, Jason Wang wrote:
On 2020/4/30 下午9:36, Dima Stepanov wrote:
If vhost-user daemon is used as a backend for the vhost device, then we
should consider a possibility of disconnect at any moment. If such
disconnect happened in the vhost_migration_log() routine the vhost
device structure will be clean up.
At the start of the vhost_migration_log() function there is a check:
   if (!dev->started) {
       dev->log_enabled = enable;
       return 0;
   }
To be consistent with this check add the same check after calling the
vhost_dev_set_log() routine. This in general help not to break a
migration due the assert() message. But it looks like that this code
should be revised to handle these errors more carefully.

In case of vhost-user device backend the fail paths should consider the
state of the device. In this case we should skip some function calls
during rollback on the error paths, so not to get the NULL dereference
errors.

Signed-off-by: Dima Stepanov<address@hidden>
---
  hw/virtio/vhost.c | 39 +++++++++++++++++++++++++++++++++++----
  1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 3ee50c4..d5ab96d 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -787,6 +787,17 @@ static int vhost_dev_set_features(struct vhost_dev *dev,
  static int vhost_dev_set_log(struct vhost_dev *dev, bool enable_log)
  {
      int r, i, idx;
+
+    if (!dev->started) {
+        /*
+         * If vhost-user daemon is used as a backend for the
+         * device and the connection is broken, then the vhost_dev
+         * structure will be reset all its values to 0.
+         * Add additional check for the device state.
+         */
+        return -1;
+    }
+
      r = vhost_dev_set_features(dev, enable_log);
      if (r < 0) {
          goto err_features;
@@ -801,12 +812,19 @@ static int vhost_dev_set_log(struct vhost_dev *dev, bool 
enable_log)
      }
      return 0;
  err_vq:
-    for (; i >= 0; --i) {
+    /*
+     * Disconnect with the vhost-user daemon can lead to the
+     * vhost_dev_cleanup() call which will clean up vhost_dev
+     * structure.
+     */
+    for (; dev->started && (i >= 0); --i) {
          idx = dev->vhost_ops->vhost_get_vq_index(
Why need the check of dev->started here, can started be modified outside
mainloop? If yes, I don't get the check of !dev->started in the beginning of
this function.

No dev->started can't change outside the mainloop. The main problem is
only for the vhost_user_blk daemon. Consider the case when we
successfully pass the dev->started check at the beginning of the
function, but after it we hit the disconnect on the next call on the
second or third iteration:
      r = vhost_virtqueue_set_addr(dev, dev->vqs + i, idx, enable_log);
The unix socket backend device will call the disconnect routine for this
device and reset the structure. So the structure will be reset (and
dev->started set to false) inside this set_addr() call.
I still don't get here. I think the disconnect can not happen in the middle
of vhost_dev_set_log() since both of them were running in mainloop. And even
if it can, we probably need other synchronization mechanism other than
simple check here.
Disconnect isn't happened in the separate thread it is happened in this
routine inside vhost_dev_set_log. When for instance vhost_user_write()
call failed:
   vhost_user_set_log_base()
     vhost_user_write()
       vhost_user_blk_disconnect()
         vhost_dev_cleanup()
           vhost_user_backend_cleanup()
So the point is that if we somehow got a disconnect with the
vhost-user-blk daemon before the vhost_user_write() call then it will
continue clean up by running vhost_user_blk_disconnect() function. I
wrote a more detailed backtrace stack in the separate thread, which is
pretty similar to what we have here:
   Re: [PATCH v2 4/5] vhost: check vring address before calling unmap
The places are different but the problem is pretty similar.


Yes.



So if vhost-user commands handshake then everything is fine and
reconnect will work as expected. The only problem is how to handle
reconnect properly between vhost-user command send/receive.

As i wrote we have a test:
   - run src VM with vhost-usr-blk daemon used
   - run fio inside it
   - perform reconnect every X seconds (just kill and restart daemon),
     X is random
   - run dst VM
   - perform migration
   - fio should complete in dst VM
And we cycle this test like forever.
So it fails once per ~25 iteration. By adding some delays inside qemu we
were able to make the race window larger.


It would be better if we can draft some qtest for this.

Thanks







reply via email to

[Prev in Thread] Current Thread [Next in Thread]