Re: [PATCH 2/2] migration/multifd: Fix rb->receivedmap cleanup race

On Fri, Sep 13, 2024 at 3:07 PM Fabiano Rosas <farosas@suse.de> wrote:

Fix a segmentation fault in multifd when rb->receivedmap is cleared
too early.

After commit 5ef7e26bdb ("migration/multifd: solve zero page causing
multiple page faults"), multifd started using the rb->receivedmap
bitmap, which belongs to ram.c and is initialized and *freed* from the
ram SaveVMHandlers.

Multifd threads are live until migration_incoming_state_destroy(),
which is called after qemu_loadvm_state_cleanup(), leading to a crash
when accessing rb->receivedmap.

process_incoming_migration_co() ...
qemu_loadvm_state() multifd_nocomp_recv()
qemu_loadvm_state_cleanup() ramblock_recv_bitmap_set_offset()
rb->receivedmap = NULL set_bit_atomic(..., rb->receivedmap)
...
migration_incoming_state_destroy()
multifd_recv_cleanup()
multifd_recv_terminate_threads(NULL)

Move the loadvm cleanup into migration_incoming_state_destroy(), after
multifd_recv_cleanup() to ensure multifd thread have already exited
when rb->receivedmap is cleared.

The have_listen_thread logic can now be removed because its purpose
was to delay cleanup until postcopy_ram_listen_thread() had finished.

CC: qemu-stable@nongnu.org
Fixes: 5ef7e26bdb ("migration/multifd: solve zero page causing multiple page faults")
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/migration.c | 1 +
migration/migration.h | 1 -
migration/savevm.c | 9 ---------
3 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 3dea06d577..b190a574b1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -378,6 +378,7 @@ void migration_incoming_state_destroy(void)
struct MigrationIncomingState *mis = migration_incoming_get_current();

multifd_recv_cleanup();
+ qemu_loadvm_state_cleanup();

if (mis->to_src_file) {
/* Tell source that we are done */
diff --git a/migration/migration.h b/migration/migration.h
index 38aa1402d5..20b0a5b66e 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -101,7 +101,6 @@ struct MigrationIncomingState {
/* Set this when we want the fault thread to quit */
bool fault_thread_quit;

- bool have_listen_thread;
QemuThread listen_thread;

/* For the kernel to send us notifications */
diff --git a/migration/savevm.c b/migration/savevm.c
index d0759694fd..532ee5e4b0 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2076,10 +2076,8 @@ static void *postcopy_ram_listen_thread(void *opaque)
* got a bad migration state).
*/
migration_incoming_state_destroy();
- qemu_loadvm_state_cleanup();

rcu_unregister_thread();
- mis->have_listen_thread = false;
postcopy_state_set(POSTCOPY_INCOMING_END);

object_unref(OBJECT(migr));
@@ -2130,7 +2128,6 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
return -1;
}

- mis->have_listen_thread = true;
postcopy_thread_create(mis, &mis->listen_thread, "mig/dst/listen",
postcopy_ram_listen_thread, QEMU_THREAD_DETACHED);
trace_loadvm_postcopy_handle_listen("return");
@@ -2978,11 +2975,6 @@ int qemu_loadvm_state(QEMUFile *f)

trace_qemu_loadvm_state_post_main(ret);

- if (mis->have_listen_thread) {
- /* Listen thread still going, can't clean up yet */
- return ret;
- }
-
if (ret == 0) {
ret = qemu_file_get_error(f);
}
@@ -3022,7 +3014,6 @@ int qemu_loadvm_state(QEMUFile *f)
}
}

- qemu_loadvm_state_cleanup();
cpu_synchronize_all_post_init();

Hi Fabiano

I have a question. By removing qemu_loadvm_state_cleanup() here, the failure path that ends up with exit(EXIT_FAILURE)

in process_incoming_migration_co() end up not calling the qemu_loadvm_state_cleanup(). I am not sure how this is important since there is exit, but the

vfio, for example, will not call the VF reset.

Another more general question is why destination Qemu has to terminate there if there was an error detected during live migration?

Could just failing the migration and leave destination running be a more expected scenario?

Thank you!

return ret;
--
2.35.3

Elena

From:	Elena Ufimtseva
Subject:	Re: [PATCH 2/2] migration/multifd: Fix rb->receivedmap cleanup race
Date:	Fri, 20 Sep 2024 11:55:30 -0700