
Re: Properly quitting qemu immediately after failing migration


From: Vladimir Sementsov-Ogievskiy
Subject: Re: Properly quitting qemu immediately after failing migration
Date: Thu, 2 Jul 2020 14:44:52 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0

02.07.2020 10:23, Max Reitz wrote:
On 01.07.20 18:16, Vladimir Sementsov-Ogievskiy wrote:
29.06.2020 18:00, Max Reitz wrote:
On 29.06.20 16:18, Vladimir Sementsov-Ogievskiy wrote:
29.06.2020 16:48, Max Reitz wrote:
Hi,

In an iotest, I’m trying to quit qemu immediately after a migration has
failed.  Unfortunately, that doesn’t seem to be possible in a clean way:
migrate_fd_cleanup() runs only at some point after the migration state
is already “failed”, so if I just wait for that “failed” state and
immediately quit, some cleanup functions may not have been run yet.

This is a problem with dirty bitmap migration at least, because it
increases the refcount on all block devices that are to be migrated, so
if we don’t call the cleanup function before quitting, the refcount will
stay elevated and bdrv_close_all() will hit an assertion because those
block devices are still around after blk_remove_all_bs() and
blockdev_close_all_bdrv_states().

In practice this particular issue might not be that big of a problem,
because it just means qemu aborts when the user intended to let it quit
anyway.  But on one hand I could imagine that there are other clean-up
paths that should definitely run before qemu quits (although I don’t
know), and on the other, it’s a problem for my test.

I tried working around the problem for my test by waiting on “Unable to
write” appearing on stderr, because that indicates that
migrate_fd_cleanup()’s error_report_err() has been reached.  But on one
hand, that isn’t really nice, and on the other, it doesn’t even work
when the failure is on the source side (because then there is no
s->error for migrate_fd_cleanup() to report).
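
For reference, that stderr polling would look roughly like the following
(a minimal sketch in plain Python outside the iotests harness; the
process handle, the pattern, and the timeout are all assumptions):

    import os
    import select
    import time

    def wait_for_stderr_pattern(proc, pattern, timeout=30.0):
        # Poll the qemu process’s stderr until `pattern` shows up.
        # Assumes proc was started with stderr=subprocess.PIPE.
        # Returns False on timeout -- and, as said above, it never
        # matches at all when the failure is on the source side,
        # because then there is no s->error to report.
        deadline = time.monotonic() + timeout
        buf = b''
        fd = proc.stderr.fileno()
        while time.monotonic() < deadline:
            ready, _, _ = select.select([fd], [], [], 0.1)
            if ready:
                chunk = os.read(fd, 4096)
                if not chunk:       # stderr closed: qemu exited
                    break
                buf += chunk
                if pattern in buf:
                    return True
        return False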

(I’ve now managed to work around it by invoking blockdev-del on a node
affected by bitmap migration until it succeeds, because blockdev-del can
only succeed once the bitmap migration code has dropped its reference to
it.)
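
In iotest terms, that retry loop is roughly the following (a sketch
against the Python iotests harness, where vm is an iotests.VM instance
and the node name 'foo' matches the reproducer below; the polling
interval is an assumption):

    import time

    # Retry blockdev-del until the bitmap migration code has dropped
    # its reference to node 'foo'; only then can the delete succeed.
    while 'error' in vm.qmp('blockdev-del', node_name='foo'):
        time.sleep(0.1)               # still referenced, try again

    vm.qmp('quit')   # now safe: bdrv_close_all() finds no stray refs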

In all, I’m asking:
(1) Is there a nice solution for me now to delay quitting qemu until
the failed migration has been fully resolved, including the clean-up?

(2) Isn’t it a problem if qemu crashes when you issue “quit” via QMP at
the wrong time?  Like, maybe lingering subprocesses when using “exec”?



I'll look more closely tomorrow, but as a short answer: try my series
"[PATCH v2 00/22] Fix error handling during bitmap postcopy"; it
handles different problems around migration failures & qemu shutdown,
so it will probably help.

No, it doesn’t seem to.

I’m not sure what exactly that series addresses, but FWIW I’m hitting
the problem in non-postcopy migration.  What my simplest reproducer
does is:

Bitmap migration is postcopy by nature (and it may not work without
migrate-start-postcopy; still, it works in the most simple cases: when
there is only a small amount of bitmap data to migrate, it can be
migrated during the migration downtime). The most complicated parts of
the series were about postcopy, but it fixes some other things, too.

It seems that this patch (see the second paragraph of its commit
message) should address exactly your issue:

"[PATCH v2 10/22] migration/block-dirty-bitmap: cancel migration on
shutdown"

    If the target is turned off prior to postcopy finishing, the target
    crashes because busy bitmaps are found at shutdown.
    Canceling the incoming migration helps, as it removes all unfinished
    (and therefore busy) bitmaps.

    Similarly, on the source we crash in bdrv_close_all(), which asserts
    that all bdrv states are removed, because bdrv states involved in
    dirty bitmap migration are referenced by it. So, we need to cancel
    the outgoing migration as well.

Hm.  I’ve tested your series and still hit the issue.

I could imagine that my problem lies with a failed migration that is
automatically “cancelled” by nature, so the problem isn’t that it isn’t
cancelled, but that the clean-up runs after accepting further QMP
commands (like quit).

Looking at my patch I see

+void dirty_bitmap_mig_cancel_outgoing(void)
+{
+    dirty_bitmap_do_save_cleanup(&dbm_state.save);
+}
+

So maybe "cancel" is just a bad name. It should work, but it doesn't.



On the source VM:

blockdev-add node-name='foo' driver='null-co'
block-dirty-bitmap-add node='foo' name='bmap0'

(Launch destination VM with some -incoming, e.g.
-incoming 'exec: cat /tmp/mig_file')

Both on source and destination:

migrate-set-capabilities capabilities=[
      {capability='events', state=true},
      {capability='dirty-bitmaps', state=true}
]

On source:

migrate uri='exec: cat > /tmp/mig_file'

Then wait for a MIGRATION event with data/status == 'failed', and then
issue 'quit'.
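
Wrapped up as a Python iotest, that is roughly the following (a sketch
that mirrors the steps above; the node/bitmap names and capabilities are
taken from this mail, while the file location and helper usage are
assumptions, not the attached test itself):

    import os
    import iotests

    caps = [{'capability': 'events', 'state': True},
            {'capability': 'dirty-bitmaps', 'state': True}]
    mig_file = os.path.join(iotests.test_dir, 'mig_file')

    source = iotests.VM(path_suffix='-src')
    dest = iotests.VM(path_suffix='-dst')
    dest.add_incoming("exec: cat '%s'" % mig_file)
    source.launch()
    dest.launch()

    source.qmp('blockdev-add', node_name='foo', driver='null-co')
    source.qmp('block-dirty-bitmap-add', node='foo', name='bmap0')
    source.qmp('migrate-set-capabilities', capabilities=caps)
    dest.qmp('migrate-set-capabilities', capabilities=caps)

    source.qmp('migrate', uri="exec: cat > '%s'" % mig_file)

    # Wait for the failed state, then quit immediately: if the
    # clean-up has not run yet, bdrv_close_all() asserts on shutdown.
    source.event_wait('MIGRATION',
                      match={'data': {'status': 'failed'}})
    source.qmp('quit')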

Max


Can you post the exact reproducer iotest?

I didn’t have an iotest until now (because it was a simple test written
up in Ruby), but what I’ve attached should work.

Note that you need system load to trigger the problem (or the clean-up
code is scheduled too quickly), so I usually just run like three
instances concurrently.

(while TEST_DIR=/tmp/t$i ./check 400; do; done)

Max


Thanks! Aha, reproduced on your branch: more than 500 rolls, with
several (5-6) instances.

Interestingly, if I drop the failure-waiting loop, it crashes without
any race; it just crashes.

Moving to my branch:

With the fail-waiting loop dropped, it crashes in about 17 tries, with
several instances.

Ahahaha, and with the fail-waiting loop as-is, it crashes immediately,
without any race.

So my patch makes it work vice-versa. Magic.

To me this looks like my patch just doesn't do what it should. I'll
work on this and resend the series together with a new test case. Or
maybe it is better to split the series, to address the different issues
separately.

--
Best regards,
Vladimir


