Re: [RFC PATCH] Fix race in live migration failure path

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC PATCH] Fix race in live migration failure path

From:	Shivam Kumar
Subject:	Re: [RFC PATCH] Fix race in live migration failure path
Date:	Thu, 23 Jan 2025 09:53:16 +0000


> On 22 Jan 2025, at 10:10 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> !-------------------------------------------------------------------|
>  CAUTION: External Email
> 
> |-------------------------------------------------------------------!
> 
> Hi, Shivam,
> 
> On Wed, Jan 22, 2025 at 10:54:17AM +0000, Shivam Kumar wrote:
>> There is one place where we set the migration status to FAILED but we don’t 
>> set
>> s->error, i.e. in save_snapshot workflow, please check qemu_savevm_state but
>> not sure setting s->error in this case is possible (or required), as it 
>> seems a
>> different workflow to me.
> 
> Yes it's very different indeed.  I may not yet fully get the challenge on
> how switching to s->error (implies FAILING) would affect this one, though.
> I mean, in qemu_savevm_state() it tries to fetch qemufile errors with:
> 
>    ret = qemu_file_get_error(f);
> 
> IIUC we could also try to fetch s->error just like what migration thread
> does, and if it sets means it's failing?
Yes, I can just set s->error in qemu_savevm_state.
@@ -1686,7 +1686,6 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
-        status = MIGRATION_STATUS_FAILED;
+       s->error = *errp;
But my question was: is migrate_fd_cleanup called (where we will set the status
to FAILED if s->error is set) in the snapshot workflow? 
> 
>> 
>> In addition, one potentially real problem that I see is this comment in
>> migration_detect_error:
>> /*
>> * For postcopy, we allow the network to be down for a
>> * while. After that, it can be continued by a
>> * recovery phase.
>> */
>> Let's say if we set s->error at some place and there was a file error on 
>> either
>> source or destination (qemu_file_get_error_obj_any returns a positive value
> 
> This is trivial, but I suppose you meant s/positive/negative/ here.. as
> qemufile's last_error should always be negative, iiuc.
> 
>> when called by migration_detect_error). We expect migration to fail in this
>> case but migration will continue to run since post-copy migration is tolerant
>> to file errors?
> 
> Yes it can halt at postcopy_pause().  I'm not yet understand why it's an
> issue to using s->error, though.
> 
> In general, I'm thinking whether we could also check s->error in detect
> error path like this:
> 
> ===8<===
> diff --git a/migration/migration.c b/migration/migration.c
> index 2d1da917c7..fbd97395e0 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3015,17 +3015,17 @@ static MigThrError 
> migration_detect_error(MigrationState *s)
>     ret = qemu_file_get_error_obj_any(s->to_dst_file,
>                                       s->postcopy_qemufile_src,
>                                       &local_error);
> -    if (!ret) {
> -        /* Everything is fine */
> -        assert(!local_error);
> -        return MIG_THR_ERR_NONE;
> -    }
> -
> -    if (local_error) {
> +    if (ret) {
> +        /* Passover qemufile errors to s->error */
> +        assert(local_error);
>         migrate_set_error(s, local_error);
>         error_free(local_error);
>     }
> 
> +    if (!migrate_has_error(s)) {
> +        return MIG_THR_ERR_NONE;
> +    }
> +
>     if (state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret) {
>         /*
>          * For postcopy, we allow the network to be down for a
> @@ -3037,6 +3037,8 @@ static MigThrError 
> migration_detect_error(MigrationState *s)
>         /*
>          * For precopy (or postcopy with error outside IO), we fail
>          * with no time.
> +         *
> +         * TODO: update FAILED only until the end of migration in BH.
>          */
>         migrate_set_state(&s->state, state, MIGRATION_STATUS_FAILED);
>         trace_migration_thread_file_err();
> ===8<===
> 
> I kept a TODO above, I would hope if you reworked everything to route
> errors to s->error, then we can move this to the cleanup BH to avoid the
> race.
> 
> Do you think that could work?

I meant: in case of post-copy, what if we have another error somewhere and
s->error was set, but then we also saw a file error when we called
qemu_file_get_error_obj_any. In this case, migration should fail IMO but it
would be paused instead, right?

[Prev in Thread]

Current Thread

[Next in Thread]

[RFC PATCH] Fix race in live migration failure path, Shivam Kumar, 2025/01/10
- Re: [RFC PATCH] Fix race in live migration failure path, Fabiano Rosas, 2025/01/10
  - Re: [RFC PATCH] Fix race in live migration failure path, Peter Xu, 2025/01/13
    - Re: [RFC PATCH] Fix race in live migration failure path, Shivam Kumar, 2025/01/22
    - Re: [RFC PATCH] Fix race in live migration failure path, Peter Xu, 2025/01/22
    - Re: [RFC PATCH] Fix race in live migration failure path, Shivam Kumar <=
    - Re: [RFC PATCH] Fix race in live migration failure path, Peter Xu, 2025/01/23
    - Re: [RFC PATCH] Fix race in live migration failure path, Shivam Kumar, 2025/01/23
    - Re: [RFC PATCH] Fix race in live migration failure path, Peter Xu, 2025/01/23

Prev by Date: Re: [PATCH 2/2] hw/cxl: Allow tracing component I/O accesses
Next by Date: Re: [PATCH 01/11] acpi/ghes: Prepare to support multiple sources on ghes
Previous by thread: Re: [RFC PATCH] Fix race in live migration failure path
Next by thread: Re: [RFC PATCH] Fix race in live migration failure path
Index(es):
- Date
- Thread