[PATCH] migration/postcopy: Fix high frequency sync
From: peterx
Date: Wed, 20 Mar 2024 17:44:53 -0400
From: Peter Xu <peterx@redhat.com>
On the current code base, I observe an extremely high sync count during
precopy as long as postcopy-ram=on is enabled, i.e. even before the
switchover to postcopy has happened.
For context on when we decide to do a full sync: we check must_precopy
(meaning "data that must be sent during the precopy phase"), and whenever
it is at or below the threshold size (calculated from the bandwidth and the
expected downtime), we kick off the slow sync.
However, when postcopy is enabled (even while still in the precopy phase),
RAM reports all of its pages as can_postcopy and reports must_precopy==0.
The check "must_precopy <= threshold_size" then almost always triggers,
forcing a slow sync on every call to migration_iteration_run() whenever
postcopy is enabled, even if postcopy is never used. That is insane.
It turns out this is a regression introduced by the earlier refactoring in
QEMU 8.0 (late 2022). Fix it by checking the total pending RAM size
(must_precopy + can_postcopy) rather than must_precopy alone, as before.
Not copying stable yet, as many things have changed since then; and
although this should be a major performance regression, no functional
change has been observed (which is probably also why nobody found it). I
only noticed this while looking into another bug reported by Nina.
While at it, clean up the surrounding lines a little.
Cc: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Fixes: c8df4a7aef ("migration: Split save_live_pending() into state_pending_*")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
Nina: I copied you only because this might still be relevant, as this issue
also mysteriously points back to c8df4a7aef. However, I don't think it
fixes your problem; at most it may change how easily it reproduces.
This is not a regression for this release, but I still want to have it for
9.0. Fabiano, any opinions / objections?
---
migration/migration.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 047b6b49cf..9fe8fd2afd 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3199,17 +3199,16 @@ typedef enum {
  */
 static MigIterateState migration_iteration_run(MigrationState *s)
 {
-    uint64_t must_precopy, can_postcopy;
+    uint64_t must_precopy, can_postcopy, pending_size;
     Error *local_err = NULL;
     bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
     bool can_switchover = migration_can_switchover(s);
 
     qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
-    uint64_t pending_size = must_precopy + can_postcopy;
-
+    pending_size = must_precopy + can_postcopy;
     trace_migrate_pending_estimate(pending_size, must_precopy, can_postcopy);
 
-    if (must_precopy <= s->threshold_size) {
+    if (pending_size < s->threshold_size) {
         qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
         pending_size = must_precopy + can_postcopy;
         trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
--
2.44.0