Re: [PATCH v4 10/25] migration: Add Error** argument to qemu_savevm_state_setup()


From: Peter Xu
Subject: Re: [PATCH v4 10/25] migration: Add Error** argument to qemu_savevm_state_setup()
Date: Fri, 8 Mar 2024 22:17:50 +0800

On Fri, Mar 08, 2024 at 02:55:30PM +0100, Cédric Le Goater wrote:
> On 3/8/24 14:39, Cédric Le Goater wrote:
> > On 3/8/24 14:14, Cédric Le Goater wrote:
> > > On 3/8/24 13:56, Peter Xu wrote:
> > > > On Wed, Mar 06, 2024 at 02:34:25PM +0100, Cédric Le Goater wrote:
> > > > > This prepares ground for the changes coming next which add an Error**
> > > > > argument to the .save_setup() handler. Callers of qemu_savevm_state_setup()
> > > > > now handle the error and fail earlier setting the migration state from
> > > > > MIGRATION_STATUS_SETUP to MIGRATION_STATUS_FAILED.
> > > > > 
> > > > > In qemu_savevm_state(), move the cleanup to preserve the error
> > > > > reported by .save_setup() handlers.
> > > > > 
> > > > > Since the previous behavior was to ignore errors at this step of
> > > > > migration, this change should be examined closely to check that
> > > > > cleanups are still correctly done.
> > > > > 
> > > > > Signed-off-by: Cédric Le Goater <clg@redhat.com>
> > > > > ---
> > > > > 
> > > > >   Changes in v4:
> > > > >   - Merged cleanup change in qemu_savevm_state()
> > > > >   Changes in v3:
> > > > >   - Set migration state to MIGRATION_STATUS_FAILED
> > > > >   - Fixed error handling to be done under lock in bg_migration_thread()
> > > > >   - Made sure an error is always set in case of failure in
> > > > >     qemu_savevm_state_setup()
> > > > >   migration/savevm.h    |  2 +-
> > > > >   migration/migration.c | 27 ++++++++++++++++++++++++---
> > > > >   migration/savevm.c    | 26 +++++++++++++++-----------
> > > > >   3 files changed, 40 insertions(+), 15 deletions(-)
> > > > > 
> > > > > diff --git a/migration/savevm.h b/migration/savevm.h
> > > > > index 74669733dd63a080b765866c703234a5c4939223..9ec96a995c93a42aad621595f0ed58596c532328 100644
> > > > > --- a/migration/savevm.h
> > > > > +++ b/migration/savevm.h
> > > > > @@ -32,7 +32,7 @@
> > > > >   bool qemu_savevm_state_blocked(Error **errp);
> > > > >   void qemu_savevm_non_migratable_list(strList **reasons);
> > > > >   int qemu_savevm_state_prepare(Error **errp);
> > > > > -void qemu_savevm_state_setup(QEMUFile *f);
> > > > > +int qemu_savevm_state_setup(QEMUFile *f, Error **errp);
> > > > >   bool qemu_savevm_state_guest_unplug_pending(void);
> > > > >   int qemu_savevm_state_resume_prepare(MigrationState *s);
> > > > >   void qemu_savevm_state_header(QEMUFile *f);
> > > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > > index a49fcd53ee19df1ce0182bc99d7e064968f0317b..6d1544224e96f5edfe56939a9c8395d88ef29581 100644
> > > > > --- a/migration/migration.c
> > > > > +++ b/migration/migration.c
> > > > > @@ -3408,6 +3408,8 @@ static void *migration_thread(void *opaque)
> > > > >       int64_t setup_start = qemu_clock_get_ms(QEMU_CLOCK_HOST);
> > > > >       MigThrError thr_error;
> > > > >       bool urgent = false;
> > > > > +    Error *local_err = NULL;
> > > > > +    int ret;
> > > > >       thread = migration_threads_add("live_migration", qemu_get_thread_id());
> > > > > @@ -3451,9 +3453,17 @@ static void *migration_thread(void *opaque)
> > > > >       }
> > > > >       bql_lock();
> > > > > -    qemu_savevm_state_setup(s->to_dst_file);
> > > > > +    ret = qemu_savevm_state_setup(s->to_dst_file, &local_err);
> > > > >       bql_unlock();
> > > > > +    if (ret) {
> > > > > +        migrate_set_error(s, local_err);
> > > > > +        error_free(local_err);
> > > > > +        migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> > > > > +                          MIGRATION_STATUS_FAILED);
> > > > > +        goto out;
> > > > > +     }
> > > > 
> > > > There's a small indent issue, I can fix it.
> > > 
> > > checkpatch did not report anything.
> > > 
> > > > 
> > > > The bigger problem is I _think_ this will trigger a ci failure in the
> > > > virtio-net-failover test:
> > > > 
> > > > ▶ 121/464 ERROR:../tests/qtest/virtio-net-failover.c:1203:test_migrate_abort_wait_unplug: assertion failed (status == "cancelling"): ("cancelled" == "cancelling") ERROR
> > > > 121/464 qemu:qtest+qtest-x86_64 / qtest-x86_64/virtio-net-failover    ERROR            4.77s   killed by signal 6 SIGABRT
> > > > >>> PYTHON=/builds/peterx/qemu/build/pyvenv/bin/python3.8 G_TEST_DBUS_DAEMON=/builds/peterx/qemu/tests/dbus-vmstate-daemon.sh MALLOC_PERTURB_=161 QTEST_QEMU_IMG=./qemu-img QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon QTEST_QEMU_BINARY=./qemu-system-x86_64 /builds/peterx/qemu/build/tests/qtest/virtio-net-failover --tap -k
> > > > ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> > > > stderr:
> > > > qemu-system-x86_64: ram_save_setup failed: Input/output error
> > > > **
> > > > ERROR:../tests/qtest/virtio-net-failover.c:1203:test_migrate_abort_wait_unplug: assertion failed (status == "cancelling"): ("cancelled" == "cancelling")
> > > > (test program exited with status code -6)
> > > > ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
> > > > 
> > > > I am not familiar enough with the failover code, and may not have time
> > > > today to follow this up, copy Laurent.  Cedric, if you have time, please
> > > > have a look.
> > > 
> > > 
> > > Sure. Weird because I usually run make check on x86_64, s390x, ppc64 and
> > > aarch64. Let me check again.
> > 
> > I see one timeout error on s390x but not always. See below. It occurs with
> > or without this patchset. The other arches (x86_64, ppc64) run fine (apart
> > from one io test failing from time to time)
> 
> Ah ! I got this once on aarch64 :
> 
> 161/486 ERROR:../tests/qtest/virtio-net-failover.c:1222:test_migrate_abort_wait_unplug: 'device' should not be NULL ERROR
> 161/486 qemu:qtest+qtest-x86_64 / qtest-x86_64/virtio-net-failover    ERROR            5.98s   killed by signal 6 SIGABRT
> >>> G_TEST_DBUS_DAEMON=/home/legoater/work/qemu/qemu.git/tests/dbus-vmstate-daemon.sh MALLOC_PERTURB_=119 QTEST_QEMU_BINARY=./qemu-system-x86_64 QTEST_QEMU_IMG=./qemu-img PYTHON=/home/legoater/work/qemu/qemu.git/build/pyvenv/bin/python3 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /home/legoater/work/qemu/qemu.git/build/tests/qtest/virtio-net-failover --tap -k
> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
> stderr:
> qemu-system-x86_64: ram_save_setup failed: Input/output error
> **
> ERROR:../tests/qtest/virtio-net-failover.c:1222:test_migrate_abort_wait_unplug: 'device' should not be NULL
> 
> (test program exited with status code -6)
> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

Hmm, this one seems different..

> 
> I couldn't reproduce yet :/

I never reproduced it locally on x86, and my failure is always at the
"cancelling" vs. "cancelled" check rather than the NULL check.  It's much
easier to trigger on CI in check-system-centos (I don't know why centos):

https://gitlab.com/peterx/qemu/-/jobs/6351020546

I think, at least for the error I hit, the problem is that the failover test
cancels the migration, and if it cancels fast enough, while we are still in
setup, the migration can now already fail there (it could not before, when
we ignored qemu_savevm_state_setup() errors).  In that case I think it'll
skip:

    qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
                               MIGRATION_STATUS_ACTIVE);

It seems the test wants the "cancelling" state to hold until later:

    /* while the card is not ejected, we must be in "cancelling" state */
    ret = migrate_status(qts);

    status = qdict_get_str(ret, "status");
    g_assert_cmpstr(status, ==, "cancelling");
    qobject_unref(ret);

    /* OS unplugs the cards, QEMU can move from wait-unplug state */
    qtest_outl(qts, ACPI_PCIHP_ADDR_ICH9 + PCI_EJ_BASE, 1);
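
To make the suspected race concrete, here is a small standalone toy model in
C.  It is not QEMU code: the ToyStatus enum and the toy_* helpers are made up
for illustration.  It also assumes two things this mail does not spell out:
that the state-setting helper only performs a transition when the current
state matches the expected old state, and that setup fails whenever the
cancel lands during setup.  Under those assumptions it sketches why bailing
out on a setup error would skip the wait-for-unplug step and let a status
query see "cancelled" instead of "cancelling".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy stand-ins for the migration states named in this thread. */
typedef enum {
    TOY_SETUP,
    TOY_WAIT_UNPLUG,
    TOY_CANCELLING,
    TOY_CANCELLED,
    TOY_FAILED,
} ToyStatus;

static const char *toy_status_str(ToyStatus s)
{
    switch (s) {
    case TOY_SETUP:       return "setup";
    case TOY_WAIT_UNPLUG: return "wait-unplug";
    case TOY_CANCELLING:  return "cancelling";
    case TOY_CANCELLED:   return "cancelled";
    case TOY_FAILED:      return "failed";
    }
    return "unknown";
}

/* Assumption: a transition only happens if the state is still old_state. */
static void toy_set_state(ToyStatus *state, ToyStatus old_state,
                          ToyStatus new_state)
{
    if (*state == old_state) {
        *state = new_state;
    }
}

/*
 * Returns the status a query would observe while the card has not been
 * ejected yet.  setup_errors_are_fatal == false models the old flow that
 * ignored setup errors; true models the flow with the early "goto out".
 */
static ToyStatus toy_status_before_eject(bool cancel_during_setup,
                                         bool setup_errors_are_fatal)
{
    ToyStatus state = TOY_SETUP;
    int ret;

    if (cancel_during_setup) {
        toy_set_state(&state, TOY_SETUP, TOY_CANCELLING);
    }
    assert(state == TOY_SETUP || state == TOY_CANCELLING);

    /* Assumption: setup reports an error when the cancel raced with it. */
    ret = cancel_during_setup ? -1 : 0;

    if (ret && setup_errors_are_fatal) {
        /* No-op if we are already cancelling. */
        toy_set_state(&state, TOY_SETUP, TOY_FAILED);
        goto out;
    }

    /*
     * Stand-in for the wait-for-unplug step: the thread parks here until
     * the guest ejects the card, so the current state stays visible.
     */
    toy_set_state(&state, TOY_SETUP, TOY_WAIT_UNPLUG);
    return state;

out:
    /* Cleanup completes a pending cancel right away. */
    toy_set_state(&state, TOY_CANCELLING, TOY_CANCELLED);
    return state;
}

/* Mirrors the test's string comparison against "cancelling". */
static bool toy_test_assertion_holds(ToyStatus observed)
{
    return strcmp(toy_status_str(observed), "cancelling") == 0;
}
```

In this model, toy_status_before_eject(true, false) stands for the old
behaviour and stays in "cancelling", while toy_status_before_eject(true, true)
stands for the patched flow and ends up in "cancelled", which is the mismatch
the test reports.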

Again, I'd need to read the failover code first, so there's not much more I
can tell.  Laurent might have a clue.

/me disappears..

-- 
Peter Xu



