Re: [RFC PATCH] Fix race in live migration failure path

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC PATCH] Fix race in live migration failure path

From:	Peter Xu
Subject:	Re: [RFC PATCH] Fix race in live migration failure path
Date:	Thu, 23 Jan 2025 11:27:15 -0500

On Thu, Jan 23, 2025 at 09:53:16AM +0000, Shivam Kumar wrote:
> 
> 
> > On 22 Jan 2025, at 10:10 PM, Peter Xu <peterx@redhat.com> wrote:
> > 
> > !-------------------------------------------------------------------|
> >  CAUTION: External Email
> > 
> > |-------------------------------------------------------------------!
> > 
> > Hi, Shivam,
> > 
> > On Wed, Jan 22, 2025 at 10:54:17AM +0000, Shivam Kumar wrote:
> >> There is one place where we set the migration status to FAILED but we 
> >> don’t set
> >> s->error, i.e. in save_snapshot workflow, please check qemu_savevm_state 
> >> but
> >> not sure setting s->error in this case is possible (or required), as it 
> >> seems a
> >> different workflow to me.
> > 
> > Yes it's very different indeed.  I may not yet fully get the challenge on
> > how switching to s->error (implies FAILING) would affect this one, though.
> > I mean, in qemu_savevm_state() it tries to fetch qemufile errors with:
> > 
> >    ret = qemu_file_get_error(f);
> > 
> > IIUC we could also try to fetch s->error just like what migration thread
> > does, and if it sets means it's failing?
> Yes, I can just set s->error in qemu_savevm_state.
> @@ -1686,7 +1686,6 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
> -        status = MIGRATION_STATUS_FAILED;
> +       s->error = *errp;
> But my question was: is migrate_fd_cleanup called (where we will set the 
> status
> to FAILED if s->error is set) in the snapshot workflow? 

I see what you meant.  It's not called indeed. We may need to process it
the same as what migrate_fd_cleanup() does.

So far the snapshot code reuses migration code in a partial way, so it's
not crystal clear where the line is, e.g., obviously it still moves the
migration state machine but it never shows "active" phase at all (even if
it has a major chunk of duration that it's actively migrating the data to
the snapshot files).  Here the state machine only goes from SETUP to either
FAILED or COMPLETED.

>From that POV looks like we should duplicate such s->error detection logic
here on deciding whether to switch to FAILED, if we treat s->error as the
internal "single failure point" for migration.

> > 
> >> 
> >> In addition, one potentially real problem that I see is this comment in
> >> migration_detect_error:
> >> /*
> >> * For postcopy, we allow the network to be down for a
> >> * while. After that, it can be continued by a
> >> * recovery phase.
> >> */
> >> Let's say if we set s->error at some place and there was a file error on 
> >> either
> >> source or destination (qemu_file_get_error_obj_any returns a positive value
> > 
> > This is trivial, but I suppose you meant s/positive/negative/ here.. as
> > qemufile's last_error should always be negative, iiuc.
> > 
> >> when called by migration_detect_error). We expect migration to fail in this
> >> case but migration will continue to run since post-copy migration is 
> >> tolerant
> >> to file errors?
> > 
> > Yes it can halt at postcopy_pause().  I'm not yet understand why it's an
> > issue to using s->error, though.
> > 
> > In general, I'm thinking whether we could also check s->error in detect
> > error path like this:
> > 
> > ===8<===
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 2d1da917c7..fbd97395e0 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -3015,17 +3015,17 @@ static MigThrError 
> > migration_detect_error(MigrationState *s)
> >     ret = qemu_file_get_error_obj_any(s->to_dst_file,
> >                                       s->postcopy_qemufile_src,
> >                                       &local_error);
> > -    if (!ret) {
> > -        /* Everything is fine */
> > -        assert(!local_error);
> > -        return MIG_THR_ERR_NONE;
> > -    }
> > -
> > -    if (local_error) {
> > +    if (ret) {
> > +        /* Passover qemufile errors to s->error */
> > +        assert(local_error);
> >         migrate_set_error(s, local_error);
> >         error_free(local_error);
> >     }
> > 
> > +    if (!migrate_has_error(s)) {
> > +        return MIG_THR_ERR_NONE;
> > +    }
> > +
> >     if (state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret) {
> >         /*
> >          * For postcopy, we allow the network to be down for a
> > @@ -3037,6 +3037,8 @@ static MigThrError 
> > migration_detect_error(MigrationState *s)
> >         /*
> >          * For precopy (or postcopy with error outside IO), we fail
> >          * with no time.
> > +         *
> > +         * TODO: update FAILED only until the end of migration in BH.
> >          */
> >         migrate_set_state(&s->state, state, MIGRATION_STATUS_FAILED);
> >         trace_migration_thread_file_err();
> > ===8<===
> > 
> > I kept a TODO above, I would hope if you reworked everything to route
> > errors to s->error, then we can move this to the cleanup BH to avoid the
> > race.
> > 
> > Do you think that could work?
> 
> I meant: in case of post-copy, what if we have another error somewhere and
> s->error was set, but then we also saw a file error when we called
> qemu_file_get_error_obj_any. In this case, migration should fail IMO but it
> would be paused instead, right?

Yeah you got a point, but I see no good reason to cancel any postcopy
migration, no matter which error it is - either a qemufile error or another
- simply because postcopy cancel means VM crash.  There's nothing worse
that that..

So IMHO we could treat it the same as EIO errors in this case as of now,
and we always pause postcopy no matter which kind of error it hits.  At
least for non-recoverable errors we can have some active process to look
at on src QEMU instance, OTOH there's no direct benefit for us to
differenciate different error cases to crash VM earlier.

Thanks,

-- 
Peter Xu

[Prev in Thread]

Current Thread

[Next in Thread]

[RFC PATCH] Fix race in live migration failure path, Shivam Kumar, 2025/01/10
- Re: [RFC PATCH] Fix race in live migration failure path, Fabiano Rosas, 2025/01/10
  - Re: [RFC PATCH] Fix race in live migration failure path, Peter Xu, 2025/01/13
    - Re: [RFC PATCH] Fix race in live migration failure path, Shivam Kumar, 2025/01/22
    - Re: [RFC PATCH] Fix race in live migration failure path, Peter Xu, 2025/01/22
    - Re: [RFC PATCH] Fix race in live migration failure path, Shivam Kumar, 2025/01/23
    - Re: [RFC PATCH] Fix race in live migration failure path, Peter Xu <=
    - Re: [RFC PATCH] Fix race in live migration failure path, Shivam Kumar, 2025/01/23
    - Re: [RFC PATCH] Fix race in live migration failure path, Peter Xu, 2025/01/23

Prev by Date: [PATCH] rust/zeroable: Implement Zeroable with const_zero macro
Next by Date: Re: [PATCH v4 0/4] overcommit: introduce mem-lock-onfault
Previous by thread: Re: [RFC PATCH] Fix race in live migration failure path
Next by thread: Re: [RFC PATCH] Fix race in live migration failure path
Index(es):
- Date
- Thread