qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC PATCH] Fix race in live migration failure path


From: Peter Xu
Subject: Re: [RFC PATCH] Fix race in live migration failure path
Date: Mon, 13 Jan 2025 11:29:46 -0500

On Fri, Jan 10, 2025 at 10:09:38AM -0300, Fabiano Rosas wrote:
> Shivam Kumar <shivam.kumar1@nutanix.com> writes:
> 
> > Even if a live migration fails due to some reason, migration status
> > should not be set to MIGRATION_STATUS_FAILED until migrate fd cleanup
> > is done, else the client can trigger another instance of migration
> > before the cleanup is complete (as it would assume no migration is
> > active) or reset migration capabilities affecting old migration's
> > cleanup. Hence, set the status to 'failing' when a migration failure
> > happens and once the cleanup is complete, set the migration status to
> > MIGRATION_STATUS_FAILED.
> >
> > Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
> > ---
> >  migration/migration.c | 49 +++++++++++++++++++++----------------------
> >  migration/migration.h |  9 ++++++++
> >  migration/multifd.c   |  6 ++----
> >  migration/savevm.c    |  7 +++----
> >  4 files changed, 38 insertions(+), 33 deletions(-)
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index df61ca4e93..f084f54f6b 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1143,8 +1143,9 @@ static bool migration_is_active(void)
> 
> migration_is_running() is the one that gates qmp_migrate() and
> qmp_migrate_set_capabilities().
> 
> >  {
> >      MigrationState *s = current_migration;
> >  
> > -    return (s->state == MIGRATION_STATUS_ACTIVE ||
> > -            s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
> > +    return ((s->state == MIGRATION_STATUS_ACTIVE ||
> > +            s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) &&
> > +            !qatomic_read(&s->failing));
> >  }
> >  
> >  static bool migrate_show_downtime(MigrationState *s)
> > @@ -1439,6 +1440,11 @@ static void migrate_fd_cleanup(MigrationState *s)
> >                            MIGRATION_STATUS_CANCELLED);
> >      }
> >  
> > +    if (qatomic_xchg(&s->failing, 0)) {
> > +        migrate_set_state(&s->state, s->state,
> > +                          MIGRATION_STATUS_FAILED);
> > +    }
> 
> I hope you've verified that sure every place that used to set FAILED
> will also reach migrate_fd_cleanup() eventually.
> 
> Also, we probably still need the FAILING state. Otherwise, this will
> trip code that expects a state change on failure. Anything that does:
> 
> if (state != MIGRATION_STATUS_FOO) {
>    ...
> }
> 
> So I think the change overall should be
> 
> -migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> +migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILING);
> 
>  void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
>                         MigrationStatus new_state)
>  {
>      assert(new_state < MIGRATION_STATUS__MAX);
>      if (qatomic_cmpxchg(state, old_state, new_state) == old_state) {
>          trace_migrate_set_state(MigrationStatus_str(new_state));
> 
> +        if (new_state == MIGRATION_STATUS_FAILING) {
> +            qatomic_set(&s->failing, 1);
> +        }
>          migrate_generate_event(new_state);
>      }
>  }
> 
> And we should proably do the same for CANCELLING actually, but there the
> (preexisting) issue is actual concurrency, while here it's just
> inconsistency in the state.

Yes something like FAILING sounds reasonable.  Though since we have
s->error, I wonder whether that's a better place to represent a migration
as "failing" in one place, because otherwise we need to set two places
(both FAILING state, and the s->error) - whenever something fails, we'd
better always update s->error so as to remember what failed, then reported
via query-migrate.

>From that POV, s->failing is probably never gonna be needed (due to
s->error being present anyway)?  So far, such Error* looks like the best
single point to say that the migration is failing - it also enforces the
Error to be provided whoever wants to set it to failing state.

-- 
Peter Xu




reply via email to

[Prev in Thread] Current Thread [Next in Thread]