[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [RFC PATCH] Fix race in live migration failure path
From: |
Peter Xu |
Subject: |
Re: [RFC PATCH] Fix race in live migration failure path |
Date: |
Mon, 13 Jan 2025 11:29:46 -0500 |
On Fri, Jan 10, 2025 at 10:09:38AM -0300, Fabiano Rosas wrote:
> Shivam Kumar <shivam.kumar1@nutanix.com> writes:
>
> > Even if a live migration fails due to some reason, migration status
> > should not be set to MIGRATION_STATUS_FAILED until migrate fd cleanup
> > is done, else the client can trigger another instance of migration
> > before the cleanup is complete (as it would assume no migration is
> > active) or reset migration capabilities affecting old migration's
> > cleanup. Hence, set the status to 'failing' when a migration failure
> > happens and once the cleanup is complete, set the migration status to
> > MIGRATION_STATUS_FAILED.
> >
> > Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
> > ---
> > migration/migration.c | 49 +++++++++++++++++++++----------------------
> > migration/migration.h | 9 ++++++++
> > migration/multifd.c | 6 ++----
> > migration/savevm.c | 7 +++----
> > 4 files changed, 38 insertions(+), 33 deletions(-)
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index df61ca4e93..f084f54f6b 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1143,8 +1143,9 @@ static bool migration_is_active(void)
>
> migration_is_running() is the one that gates qmp_migrate() and
> qmp_migrate_set_capabilities().
>
> > {
> > MigrationState *s = current_migration;
> >
> > - return (s->state == MIGRATION_STATUS_ACTIVE ||
> > - s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
> > + return ((s->state == MIGRATION_STATUS_ACTIVE ||
> > + s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) &&
> > + !qatomic_read(&s->failing));
> > }
> >
> > static bool migrate_show_downtime(MigrationState *s)
> > @@ -1439,6 +1440,11 @@ static void migrate_fd_cleanup(MigrationState *s)
> > MIGRATION_STATUS_CANCELLED);
> > }
> >
> > + if (qatomic_xchg(&s->failing, 0)) {
> > + migrate_set_state(&s->state, s->state,
> > + MIGRATION_STATUS_FAILED);
> > + }
>
> I hope you've verified that sure every place that used to set FAILED
> will also reach migrate_fd_cleanup() eventually.
>
> Also, we probably still need the FAILING state. Otherwise, this will
> trip code that expects a state change on failure. Anything that does:
>
> if (state != MIGRATION_STATUS_FOO) {
> ...
> }
>
> So I think the change overall should be
>
> -migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> +migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILING);
>
> void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
> MigrationStatus new_state)
> {
> assert(new_state < MIGRATION_STATUS__MAX);
> if (qatomic_cmpxchg(state, old_state, new_state) == old_state) {
> trace_migrate_set_state(MigrationStatus_str(new_state));
>
> + if (new_state == MIGRATION_STATUS_FAILING) {
> + qatomic_set(&s->failing, 1);
> + }
> migrate_generate_event(new_state);
> }
> }
>
> And we should proably do the same for CANCELLING actually, but there the
> (preexisting) issue is actual concurrency, while here it's just
> inconsistency in the state.
Yes something like FAILING sounds reasonable. Though since we have
s->error, I wonder whether that's a better place to represent a migration
as "failing" in one place, because otherwise we need to set two places
(both FAILING state, and the s->error) - whenever something fails, we'd
better always update s->error so as to remember what failed, then reported
via query-migrate.
>From that POV, s->failing is probably never gonna be needed (due to
s->error being present anyway)? So far, such Error* looks like the best
single point to say that the migration is failing - it also enforces the
Error to be provided whoever wants to set it to failing state.
--
Peter Xu