Re: [RFC PATCH] Fix race in live migration failure path
From: Shivam Kumar
Subject: Re: [RFC PATCH] Fix race in live migration failure path
Date: Thu, 23 Jan 2025 18:05:28 +0000
> On 23 Jan 2025, at 9:57 PM, Peter Xu <peterx@redhat.com> wrote:
>
> !-------------------------------------------------------------------|
> CAUTION: External Email
>
> |-------------------------------------------------------------------!
>
> On Thu, Jan 23, 2025 at 09:53:16AM +0000, Shivam Kumar wrote:
>>
>>
>>> On 22 Jan 2025, at 10:10 PM, Peter Xu <peterx@redhat.com> wrote:
>>>
>>> !-------------------------------------------------------------------|
>>> CAUTION: External Email
>>>
>>> |-------------------------------------------------------------------!
>>>
>>> Hi, Shivam,
>>>
>>> On Wed, Jan 22, 2025 at 10:54:17AM +0000, Shivam Kumar wrote:
>>>> There is one place where we set the migration status to FAILED but we
>>>> don't set s->error, i.e. in the save_snapshot workflow (please check
>>>> qemu_savevm_state). I'm not sure setting s->error in that case is
>>>> possible (or required), as it seems a different workflow to me.
>>>
>>> Yes, it's very different indeed. I may not yet fully see how switching
>>> to s->error (which implies FAILING) would affect this one, though.
>>> I mean, in qemu_savevm_state() it tries to fetch qemufile errors with:
>>>
>>> ret = qemu_file_get_error(f);
>>>
>>> IIUC we could also try to fetch s->error just like the migration thread
>>> does, and if it's set, that means it's failing?
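>>>
>>> Something like this, perhaps (just a sketch, assuming migrate_has_error()
>>> is usable from this context):
>>>
>>>     ret = qemu_file_get_error(f);
>>>     if (ret == 0 && migrate_has_error(migrate_get_current())) {
>>>         /* another thread recorded a failure in s->error */
>>>         ret = -EIO;
>>>     }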
>> Yes, I can just set s->error in qemu_savevm_state:
>>
>> @@ -1686,7 +1686,6 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
>> -    status = MIGRATION_STATUS_FAILED;
>> +    s->error = *errp;
>>
>> But my question was: is migrate_fd_cleanup() called (where we will set
>> the status to FAILED if s->error is set) in the snapshot workflow?
>
> I see what you meant. It's not called indeed. We may need to handle it
> the same way migrate_fd_cleanup() does.
>
> So far the snapshot code reuses the migration code only partially, so it's
> not crystal clear where the line is. E.g., it obviously still moves the
> migration state machine, but it never shows an "active" phase at all (even
> though for a major chunk of its duration it is actively migrating data to
> the snapshot files). Here the state machine only goes from SETUP to either
> FAILED or COMPLETED.
>
> From that POV it looks like we should duplicate such s->error detection
> logic here when deciding whether to switch to FAILED, if we treat s->error
> as the internal "single failure point" for migration.
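Right. For reference, the detection we would be duplicating is essentially
this (my paraphrase of what migrate_fd_cleanup() ends up doing with s->error
in this series, not the verbatim code):

    if (migrate_has_error(ms)) {
        /* a failure was recorded somewhere; finalize the state machine */
        migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
    }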
Setting the state to FAILED at the end of save_snapshot might work:
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1722,7 +1722,7 @@ cleanup:
     qemu_savevm_state_cleanup();
     if (ret != 0) {
-        qatomic_set(&ms->failing, 1);
+        migrate_set_error(ms, *errp);
     } else {
         migrate_set_state(&ms->state, MIGRATION_STATUS_SETUP,
                           MIGRATION_STATUS_COMPLETED);
@@ -3054,6 +3054,7 @@ bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
     RunState saved_state = runstate_get();
     uint64_t vm_state_size;
     g_autoptr(GDateTime) now = g_date_time_new_now_local();
+    MigrationState *ms;

     GLOBAL_STATE_CODE();
@@ -3149,8 +3150,13 @@ bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
 the_end:
     bdrv_drain_all_end();
-
+    ms = migrate_get_current();
     vm_resume(saved_state);
+    if (migrate_has_error(ms)) {
+        migrate_set_state(&ms->state, MIGRATION_STATUS_SETUP,
+                          MIGRATION_STATUS_FAILED);
+    }
+
     return ret == 0;
 }
Does this make sense?
>>
>> I meant: in case of postcopy, what if we have another error somewhere and
>> s->error was set, but then we also saw a file error when we called
>> qemu_file_get_error_obj_any()? In that case, migration should fail IMO,
>> but it would be paused instead, right?
>
> Yeah, you got a point, but I see no good reason to cancel any postcopy
> migration, no matter which error it is - either a qemufile error or
> another - simply because cancelling postcopy means a VM crash. There's
> nothing worse than that.
>
> So IMHO we could treat it the same as EIO errors in this case as of now,
> and always pause postcopy no matter which kind of error it hits. At least
> for non-recoverable errors we then have some active process to look at on
> the src QEMU instance; OTOH there's no direct benefit in differentiating
> error cases just to crash the VM earlier.
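>
> In pseudo-code the rule is simply (a sketch; s is the MigrationState):
>
>     if (migration_in_postcopy()) {
>         /* cancelling postcopy == crashing the VM, so always pause */
>         migrate_set_state(&s->state, s->state,
>                           MIGRATION_STATUS_POSTCOPY_PAUSED);
>     }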
>
> Thanks,
>
> --
> Peter Xu
>
Yes, this makes sense. Thanks.
Thanks,
Shivam