qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Properly quitting qemu immediately after failing migration


From: Max Reitz
Subject: Re: Properly quitting qemu immediately after failing migration
Date: Thu, 2 Jul 2020 09:23:33 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.9.0

On 01.07.20 18:16, Vladimir Sementsov-Ogievskiy wrote:
> 29.06.2020 18:00, Max Reitz wrote:
>> On 29.06.20 16:18, Vladimir Sementsov-Ogievskiy wrote:
>>> 29.06.2020 16:48, Max Reitz wrote:
>>>> Hi,
>>>>
>>>> In an iotest, I’m trying to quit qemu immediately after a migration has
>>>> failed.  Unfortunately, that doesn’t seem to be possible in a clean
>>>> way:
>>>> migrate_fd_cleanup() runs only at some point after the migration state
>>>> is already “failed”, so if I just wait for that “failed” state and
>>>> immediately quit, some cleanup functions may not have been run yet.
>>>>
>>>> This is a problem with dirty bitmap migration at least, because it
>>>> increases the refcount on all block devices that are to be migrated, so
>>>> if we don’t call the cleanup function before quitting, the refcount
>>>> will
>>>> stay elevated and bdrv_close_all() will hit an assertion because those
>>>> block devices are still around after blk_remove_all_bs() and
>>>> blockdev_close_all_bdrv_states().
>>>>
>>>> In practice this particular issue might not be that big of a problem,
>>>> because it just means qemu aborts when the user intended to let it quit
>>>> anyway.  But on one hand I could imagine that there are other clean-up
>>>> paths that should definitely run before qemu quits (although I don’t
>>>> know), and on the other, it’s a problem for my test.
>>>>
>>>> I tried working around the problem for my test by waiting on “Unable to
>>>> write” appearing on stderr, because that indicates that
>>>> migrate_fd_cleanup()’s error_report_err() has been reached.  But on one
>>>> hand, that isn’t really nice, and on the other, it doesn’t even work
>>>> when the failure is on the source side (because then there is no
>>>> s->error for migrate_fd_cleanup() to report).
>>
>> (I’ve now managed to work around it by invoking blockdev-del on a node
>> affected by bitmap migration until it succeeds, because blockdev-del can
>> only succeed once the bitmap migration code has dropped its reference to
>> it.)
>>
>>>> In all, I’m asking:
>>>> (1) Is there a nice solution for me now to delay quitting qemu until
>>>> the
>>>> failed migration has been fully resolved, including the clean-up?
>>>>
>>>> (2) Isn’t it a problem if qemu crashes when you issue “quit” via QMP at
>>>> the wrong time?  Like, maybe lingering subprocesses when using “exec”?
>>>>
>>>>
>>>
>>> I'll look more closely tomorrow, but as a short answer: try my series
>>> "[PATCH v2 00/22] Fix error handling during bitmap postcopy" it
>>> handles different problems around migration failures & qemu shutdown,
>>> probably it will help.
>>
>> Not, it doesn’t seem to.
>>
>> I’m not sure what exactly that series addresses, but FWIW I’m hitting
>> the problem in non-postcopy migration.  What my simplest reproducer
>> does is:
> 
> Bitmaps migration is postcopy by nature (and it may not work without
> migrate-start-postcopy, but work in most simple cases, as when we have
> small bitmap data to migrate, it can migrate during migration downtime).
> Most complicated part of the series were about postcopy. Still it fixes
> some other things.
> 
> It seems, that patch (see the second paragraph)
> "[PATCH v2 10/22] migration/block-dirty-bitmap: cancel migration on
> shutdown"
> 
>>    If target is turned of prior to postcopy finished, target crashes
>>    because busy bitmaps are found at shutdown.
>>    Canceling incoming migration helps, as it removes all unfinished (and
>>    therefore busy) bitmaps.
> 
>>    Similarly on source we crash in bdrv_close_all which asserts that all
>>    bdrv states are removed, because bdrv states involved into dirty
>> bitmap
>>    migration are referenced by it. So, we need to cancel outgoing
>>    migration as well.
>     should address exactly your issue.

Hm.  I’ve tested your series and still hit the issue.

I could imagine that my problem lies with a failed migration that is
automatically “cancelled” by nature, so the problem isn’t that it isn’t
cancelled, but that the clean-up runs after accepting further QMP
commands (like quit).

>>
>> On the source VM:
>>
>> blockdev-add node-name='foo' driver='null-co'
>> block-dirty-bitmap-add node='foo' name='bmap0'
>>
>> (Launch destination VM with some -incoming, e.g.
>> -incoming 'exec: cat /tmp/mig_file')
>>
>> Both on source and destination:
>>
>> migrate-set-capabilities capabilities=[
>>      {capability='events', state=true},
>>      {capability='dirty-bitmaps', state=true}
>> ]
>>
>> On source:
>>
>> migrate uri='exec: cat > /tmp/mig_file'
>>
>> Then wait for a MIGRATION event with data/status == 'failed', and then
>> issue 'quit'.
>>
>> Max
>>
> 
> Can you post exact reproducer iotest?

I didn’t have an iotest until now (because it was a simple test written
up in Ruby), but what I’ve attached should work.

Note that you need system load to trigger the problem (or the clean-up
code is scheduled too quickly), so I usually just run like three
instances concurrently.

(while TEST_DIR=/tmp/t$i ./check 400; do; done)

Max

Attachment: 0001-Quit-crash-reproducer.patch
Description: Text Data

Attachment: signature.asc
Description: OpenPGP digital signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]