qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Properly quitting qemu immediately after failing migration


From: Vladimir Sementsov-Ogievskiy
Subject: Re: Properly quitting qemu immediately after failing migration
Date: Wed, 1 Jul 2020 19:16:38 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.9.0

29.06.2020 18:00, Max Reitz wrote:
On 29.06.20 16:18, Vladimir Sementsov-Ogievskiy wrote:
29.06.2020 16:48, Max Reitz wrote:
Hi,

In an iotest, I’m trying to quit qemu immediately after a migration has
failed.  Unfortunately, that doesn’t seem to be possible in a clean way:
migrate_fd_cleanup() runs only at some point after the migration state
is already “failed”, so if I just wait for that “failed” state and
immediately quit, some cleanup functions may not have been run yet.

This is a problem with dirty bitmap migration at least, because it
increases the refcount on all block devices that are to be migrated, so
if we don’t call the cleanup function before quitting, the refcount will
stay elevated and bdrv_close_all() will hit an assertion because those
block devices are still around after blk_remove_all_bs() and
blockdev_close_all_bdrv_states().

In practice this particular issue might not be that big of a problem,
because it just means qemu aborts when the user intended to let it quit
anyway.  But on one hand I could imagine that there are other clean-up
paths that should definitely run before qemu quits (although I don’t
know), and on the other, it’s a problem for my test.

I tried working around the problem for my test by waiting on “Unable to
write” appearing on stderr, because that indicates that
migrate_fd_cleanup()’s error_report_err() has been reached.  But on one
hand, that isn’t really nice, and on the other, it doesn’t even work
when the failure is on the source side (because then there is no
s->error for migrate_fd_cleanup() to report).

(I’ve now managed to work around it by invoking blockdev-del on a node
affected by bitmap migration until it succeeds, because blockdev-del can
only succeed once the bitmap migration code has dropped its reference to
it.)

In all, I’m asking:
(1) Is there a nice solution for me now to delay quitting qemu until the
failed migration has been fully resolved, including the clean-up?

(2) Isn’t it a problem if qemu crashes when you issue “quit” via QMP at
the wrong time?  Like, maybe lingering subprocesses when using “exec”?



I'll look more closely tomorrow, but as a short answer: try my series
"[PATCH v2 00/22] Fix error handling during bitmap postcopy" it
handles different problems around migration failures & qemu shutdown,
probably it will help.

Not, it doesn’t seem to.

I’m not sure what exactly that series addresses, but FWIW I’m hitting
the problem in non-postcopy migration.  What my simplest reproducer does is:

Bitmaps migration is postcopy by nature (and it may not work without
migrate-start-postcopy, but work in most simple cases, as when we have
small bitmap data to migrate, it can migrate during migration downtime).
Most complicated part of the series were about postcopy. Still it fixes some 
other things.

It seems, that patch (see the second paragraph)
"[PATCH v2 10/22] migration/block-dirty-bitmap: cancel migration on shutdown"

   If target is turned of prior to postcopy finished, target crashes
   because busy bitmaps are found at shutdown.
   Canceling incoming migration helps, as it removes all unfinished (and
   therefore busy) bitmaps.

   Similarly on source we crash in bdrv_close_all which asserts that all
   bdrv states are removed, because bdrv states involved into dirty bitmap
   migration are referenced by it. So, we need to cancel outgoing
   migration as well.
should address exactly your issue.


On the source VM:

blockdev-add node-name='foo' driver='null-co'
block-dirty-bitmap-add node='foo' name='bmap0'

(Launch destination VM with some -incoming, e.g.
-incoming 'exec: cat /tmp/mig_file')

Both on source and destination:

migrate-set-capabilities capabilities=[
     {capability='events', state=true},
     {capability='dirty-bitmaps', state=true}
]

On source:

migrate uri='exec: cat > /tmp/mig_file'

Then wait for a MIGRATION event with data/status == 'failed', and then
issue 'quit'.

Max


Can you post exact reproducer iotest?

I've tried to reproduce by doing

diff --git a/tests/qemu-iotests/300 b/tests/qemu-iotests/300
index 621a60e179..eeec47f97f 100755
--- a/tests/qemu-iotests/300
+++ b/tests/qemu-iotests/300
@@ -94,23 +94,6 @@ class TestDirtyBitmapMigration(iotests.QMPTestCase):
self.assertEqual(status == 'completed', migration_success)
         if status == 'failed':
-            # Wait until the migration has been cleaned up
-            # (Otherwise, bdrv_close_all() will abort because the
-            # dirty bitmap migration code still holds a reference to
-            # the BDS)
-            # (Unfortunately, there does not appear to be a nicer way
-            # of waiting until a failed migration has been cleaned up)
-            timeout_msg = 'Timeout waiting for migration to be cleaned up'
-            with iotests.Timeout(30, timeout_msg):
-                while os.path.exists(mig_sock):
-                    pass
-
-                # Dropping src_node_name will only work once the
-                # bitmap migration code has released it
-                while 'error' in self.vm_a.qmp('blockdev-del',
-                                               node_name=self.src_node_name):
-                    pass
-
             return
self.vm_a.wait_for_runstate('postmigrate')


with your iotest, but it doesn't reproduce for me.



--
Best regards,
Vladimir



reply via email to

[Prev in Thread] Current Thread [Next in Thread]