Re: [RFC PATCH] Fix race in live migration failure path

On 13 Jan 2025, at 9:59 PM, Peter Xu <peterx@redhat.com> wrote:

!-------------------------------------------------------------------|
CAUTION: External Email

|-------------------------------------------------------------------!

On Fri, Jan 10, 2025 at 10:09:38AM -0300, Fabiano Rosas wrote:

Shivam Kumar <shivam.kumar1@nutanix.com> writes:

Even if a live migration fails due to some reason, migration status
should not be set to MIGRATION_STATUS_FAILED until migrate fd cleanup
is done, else the client can trigger another instance of migration
before the cleanup is complete (as it would assume no migration is
active) or reset migration capabilities affecting old migration's
cleanup. Hence, set the status to 'failing' when a migration failure
happens and once the cleanup is complete, set the migration status to
MIGRATION_STATUS_FAILED.

Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
---
migration/migration.c | 49 +++++++++++++++++++++----------------------
migration/migration.h | 9 ++++++++
migration/multifd.c   | 6 ++----
migration/savevm.c    | 7 +++----
4 files changed, 38 insertions(+), 33 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index df61ca4e93..f084f54f6b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1143,8 +1143,9 @@ static bool migration_is_active(void)

migration_is_running() is the one that gates qmp_migrate() and
qmp_migrate_set_capabilities().

{
    MigrationState *s = current_migration;

-    return (s->state == MIGRATION_STATUS_ACTIVE ||
-            s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
+    return ((s->state == MIGRATION_STATUS_ACTIVE ||
+            s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) &&
+            !qatomic_read(&s->failing));
}

static bool migrate_show_downtime(MigrationState *s)
@@ -1439,6 +1440,11 @@ static void migrate_fd_cleanup(MigrationState *s)
                          MIGRATION_STATUS_CANCELLED);
    }

+    if (qatomic_xchg(&s->failing, 0)) {
+        migrate_set_state(&s->state, s->state,
+                          MIGRATION_STATUS_FAILED);
+    }

I hope you've verified that sure every place that used to set FAILED
will also reach migrate_fd_cleanup() eventually.

Also, we probably still need the FAILING state. Otherwise, this will
trip code that expects a state change on failure. Anything that does:

if (state != MIGRATION_STATUS_FOO) {
  ...
}

So I think the change overall should be

-migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
+migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILING);

void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
                       MigrationStatus new_state)
{
    assert(new_state < MIGRATION_STATUS__MAX);
    if (qatomic_cmpxchg(state, old_state, new_state) == old_state) {
        trace_migrate_set_state(MigrationStatus_str(new_state));

+        if (new_state == MIGRATION_STATUS_FAILING) {
+            qatomic_set(&s->failing, 1);
+        }
        migrate_generate_event(new_state);
    }
}

And we should proably do the same for CANCELLING actually, but there the
(preexisting) issue is actual concurrency, while here it's just
inconsistency in the state.

Yes something like FAILING sounds reasonable. Though since we have
s->error, I wonder whether that's a better place to represent a migration
as "failing" in one place, because otherwise we need to set two places
(both FAILING state, and the s->error) - whenever something fails, we'd
better always update s->error so as to remember what failed, then reported
via query-migrate.

From that POV, s->failing is probably never gonna be needed (due to
s->error being present anyway)? So far, such Error* looks like the best
single point to say that the migration is failing - it also enforces the
Error to be provided whoever wants to set it to failing state.

--
Peter Xu

There is one place where we set the migration status to FAILED but we don’t set

s->error, i.e. in save_snapshot workflow, please check qemu_savevm_state but

not sure setting s->error in this case is possible (or required), as it seems a

different workflow to me.

In addition, one potentially real problem that I see is this comment in

migration_detect_error:

* For postcopy, we allow the network to be down for a

* while. After that, it can be continued by a

* recovery phase.

Let's say if we set s->error at some place and there was a file error on either

source or destination (qemu_file_get_error_obj_any returns a positive value

when called by migration_detect_error). We expect migration to fail in this

case but migration will continue to run since post-copy migration is tolerant

to file errors?

From:	Shivam Kumar
Subject:	Re: [RFC PATCH] Fix race in live migration failure path
Date:	Wed, 22 Jan 2025 10:54:17 +0000