[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [RFC PATCH] Fix race in live migration failure path
From: |
Shivam Kumar |
Subject: |
Re: [RFC PATCH] Fix race in live migration failure path |
Date: |
Thu, 23 Jan 2025 09:53:16 +0000 |
> On 22 Jan 2025, at 10:10 PM, Peter Xu <peterx@redhat.com> wrote:
>
> !-------------------------------------------------------------------|
> CAUTION: External Email
>
> |-------------------------------------------------------------------!
>
> Hi, Shivam,
>
> On Wed, Jan 22, 2025 at 10:54:17AM +0000, Shivam Kumar wrote:
>> There is one place where we set the migration status to FAILED but we don’t
>> set
>> s->error, i.e. in save_snapshot workflow, please check qemu_savevm_state but
>> not sure setting s->error in this case is possible (or required), as it
>> seems a
>> different workflow to me.
>
> Yes it's very different indeed. I may not yet fully get the challenge on
> how switching to s->error (implies FAILING) would affect this one, though.
> I mean, in qemu_savevm_state() it tries to fetch qemufile errors with:
>
> ret = qemu_file_get_error(f);
>
> IIUC we could also try to fetch s->error just like what migration thread
> does, and if it sets means it's failing?
Yes, I can just set s->error in qemu_savevm_state.
@@ -1686,7 +1686,6 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
- status = MIGRATION_STATUS_FAILED;
+ s->error = *errp;
But my question was: is migrate_fd_cleanup called (where we will set the status
to FAILED if s->error is set) in the snapshot workflow?
>
>>
>> In addition, one potentially real problem that I see is this comment in
>> migration_detect_error:
>> /*
>> * For postcopy, we allow the network to be down for a
>> * while. After that, it can be continued by a
>> * recovery phase.
>> */
>> Let's say if we set s->error at some place and there was a file error on
>> either
>> source or destination (qemu_file_get_error_obj_any returns a positive value
>
> This is trivial, but I suppose you meant s/positive/negative/ here.. as
> qemufile's last_error should always be negative, iiuc.
>
>> when called by migration_detect_error). We expect migration to fail in this
>> case but migration will continue to run since post-copy migration is tolerant
>> to file errors?
>
> Yes it can halt at postcopy_pause(). I'm not yet understand why it's an
> issue to using s->error, though.
>
> In general, I'm thinking whether we could also check s->error in detect
> error path like this:
>
> ===8<===
> diff --git a/migration/migration.c b/migration/migration.c
> index 2d1da917c7..fbd97395e0 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3015,17 +3015,17 @@ static MigThrError
> migration_detect_error(MigrationState *s)
> ret = qemu_file_get_error_obj_any(s->to_dst_file,
> s->postcopy_qemufile_src,
> &local_error);
> - if (!ret) {
> - /* Everything is fine */
> - assert(!local_error);
> - return MIG_THR_ERR_NONE;
> - }
> -
> - if (local_error) {
> + if (ret) {
> + /* Passover qemufile errors to s->error */
> + assert(local_error);
> migrate_set_error(s, local_error);
> error_free(local_error);
> }
>
> + if (!migrate_has_error(s)) {
> + return MIG_THR_ERR_NONE;
> + }
> +
> if (state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret) {
> /*
> * For postcopy, we allow the network to be down for a
> @@ -3037,6 +3037,8 @@ static MigThrError
> migration_detect_error(MigrationState *s)
> /*
> * For precopy (or postcopy with error outside IO), we fail
> * with no time.
> + *
> + * TODO: update FAILED only until the end of migration in BH.
> */
> migrate_set_state(&s->state, state, MIGRATION_STATUS_FAILED);
> trace_migration_thread_file_err();
> ===8<===
>
> I kept a TODO above, I would hope if you reworked everything to route
> errors to s->error, then we can move this to the cleanup BH to avoid the
> race.
>
> Do you think that could work?
I meant: in case of post-copy, what if we have another error somewhere and
s->error was set, but then we also saw a file error when we called
qemu_file_get_error_obj_any. In this case, migration should fail IMO but it
would be paused instead, right?
- [RFC PATCH] Fix race in live migration failure path, Shivam Kumar, 2025/01/10
- Re: [RFC PATCH] Fix race in live migration failure path, Fabiano Rosas, 2025/01/10
- Re: [RFC PATCH] Fix race in live migration failure path, Peter Xu, 2025/01/13
- Re: [RFC PATCH] Fix race in live migration failure path, Shivam Kumar, 2025/01/22
- Re: [RFC PATCH] Fix race in live migration failure path, Peter Xu, 2025/01/22
- Re: [RFC PATCH] Fix race in live migration failure path,
Shivam Kumar <=
- Re: [RFC PATCH] Fix race in live migration failure path, Peter Xu, 2025/01/23
- Re: [RFC PATCH] Fix race in live migration failure path, Shivam Kumar, 2025/01/23
- Re: [RFC PATCH] Fix race in live migration failure path, Peter Xu, 2025/01/23