[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [RFC PATCH] Fix race in live migration failure path
From: |
Peter Xu |
Subject: |
Re: [RFC PATCH] Fix race in live migration failure path |
Date: |
Wed, 22 Jan 2025 11:40:10 -0500 |
Hi, Shivam,
On Wed, Jan 22, 2025 at 10:54:17AM +0000, Shivam Kumar wrote:
> There is one place where we set the migration status to FAILED but we don’t
> set
> s->error, i.e. in save_snapshot workflow, please check qemu_savevm_state but
> not sure setting s->error in this case is possible (or required), as it seems
> a
> different workflow to me.
Yes it's very different indeed. I may not yet fully get the challenge on
how switching to s->error (implies FAILING) would affect this one, though.
I mean, in qemu_savevm_state() it tries to fetch qemufile errors with:
ret = qemu_file_get_error(f);
IIUC we could also try to fetch s->error just like what migration thread
does, and if it sets means it's failing?
>
> In addition, one potentially real problem that I see is this comment in
> migration_detect_error:
> /*
> * For postcopy, we allow the network to be down for a
> * while. After that, it can be continued by a
> * recovery phase.
> */
> Let's say if we set s->error at some place and there was a file error on
> either
> source or destination (qemu_file_get_error_obj_any returns a positive value
This is trivial, but I suppose you meant s/positive/negative/ here.. as
qemufile's last_error should always be negative, iiuc.
> when called by migration_detect_error). We expect migration to fail in this
> case but migration will continue to run since post-copy migration is tolerant
> to file errors?
Yes it can halt at postcopy_pause(). I'm not yet understand why it's an
issue to using s->error, though.
In general, I'm thinking whether we could also check s->error in detect
error path like this:
===8<===
diff --git a/migration/migration.c b/migration/migration.c
index 2d1da917c7..fbd97395e0 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3015,17 +3015,17 @@ static MigThrError
migration_detect_error(MigrationState *s)
ret = qemu_file_get_error_obj_any(s->to_dst_file,
s->postcopy_qemufile_src,
&local_error);
- if (!ret) {
- /* Everything is fine */
- assert(!local_error);
- return MIG_THR_ERR_NONE;
- }
-
- if (local_error) {
+ if (ret) {
+ /* Passover qemufile errors to s->error */
+ assert(local_error);
migrate_set_error(s, local_error);
error_free(local_error);
}
+ if (!migrate_has_error(s)) {
+ return MIG_THR_ERR_NONE;
+ }
+
if (state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret) {
/*
* For postcopy, we allow the network to be down for a
@@ -3037,6 +3037,8 @@ static MigThrError migration_detect_error(MigrationState
*s)
/*
* For precopy (or postcopy with error outside IO), we fail
* with no time.
+ *
+ * TODO: update FAILED only until the end of migration in BH.
*/
migrate_set_state(&s->state, state, MIGRATION_STATUS_FAILED);
trace_migration_thread_file_err();
===8<===
I kept a TODO above, I would hope if you reworked everything to route
errors to s->error, then we can move this to the cleanup BH to avoid the
race.
Do you think that could work?
Thanks,
--
Peter Xu
- [RFC PATCH] Fix race in live migration failure path, Shivam Kumar, 2025/01/10
- Re: [RFC PATCH] Fix race in live migration failure path, Fabiano Rosas, 2025/01/10
- Re: [RFC PATCH] Fix race in live migration failure path, Peter Xu, 2025/01/13
- Re: [RFC PATCH] Fix race in live migration failure path, Shivam Kumar, 2025/01/22
- Re: [RFC PATCH] Fix race in live migration failure path,
Peter Xu <=
- Re: [RFC PATCH] Fix race in live migration failure path, Shivam Kumar, 2025/01/23
- Re: [RFC PATCH] Fix race in live migration failure path, Peter Xu, 2025/01/23
- Re: [RFC PATCH] Fix race in live migration failure path, Shivam Kumar, 2025/01/23
- Re: [RFC PATCH] Fix race in live migration failure path, Peter Xu, 2025/01/23