
Re: recent flakiness (intermittent hangs) of migration-test


From: Peter Maydell
Subject: Re: recent flakiness (intermittent hangs) of migration-test
Date: Fri, 30 Oct 2020 11:48:28 +0000

On Thu, 29 Oct 2020 at 20:28, Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Oct 29, 2020 at 07:34:33PM +0000, Dr. David Alan Gilbert wrote:
> > > Here's qemu process 3514:
> > > Thread 5 (Thread 0x3ff4affd910 (LWP 3628)):
> > > #0  0x000003ff94c8d936 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x2aa26cd74dc)
> > >     at ../sysdeps/unix/sysv/linux/futex-internal.h:88
> > > #1  0x000003ff94c8d936 in __pthread_cond_wait_common (abstime=0x0, mutex=0x2aa26cd7488, cond=0x2aa26cd74b0)
> > >     at pthread_cond_wait.c:502
> > > #2  0x000003ff94c8d936 in __pthread_cond_wait (cond=cond@entry=0x2aa26cd74b0, mutex=mutex@entry=0x2aa26cd7488)
> > >     at pthread_cond_wait.c:655
> > > #3  0x000002aa2497072c in qemu_sem_wait (sem=sem@entry=0x2aa26cd7488) at ../../util/qemu-thread-posix.c:328
> > > #4  0x000002aa244f4a02 in postcopy_pause (s=0x2aa26cd7000) at ../../migration/migration.c:3192
>
> So the paused postcopy never resumed successfully on the source side for
> some reason ...
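
As I understand it, that qemu_sem_wait() is the migration thread parking
itself until a recovery kick arrives. A minimal sketch of the pattern,
using illustrative names (MigState, pause_sem) rather than QEMU's actual
fields:

#include <semaphore.h>

typedef struct MigState {
    sem_t pause_sem;    /* posted by whatever re-establishes the channel */
} MigState;

/* Runs on the migration thread once a stream error is detected. */
static void pause_until_recovered(MigState *s)
{
    /* The qemu_sem_wait() in frame #3: if nothing ever posts
     * pause_sem, the thread stays parked here forever, which is
     * exactly the hang in the backtrace above. */
    sem_wait(&s->pause_sem);
}

The wake-up half of the pattern is sketched after the destination-side
trace below.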
>
> > > #5  0x000002aa244f4a02 in migration_detect_error (s=0x2aa26cd7000) at ../../migration/migration.c:3255
> > > #6  0x000002aa244f4a02 in migration_thread (opaque=opaque@entry=0x2aa26cd7000) at ../../migration/migration.c:3564
> > > #7  0x000002aa2496fa3a in qemu_thread_start (args=<optimized out>) at ../../util/qemu-thread-posix.c:521
> > > #8  0x000003ff94c87aa8 in start_thread (arg=0x3ff4affd910) at pthread_create.c:463
> > > #9  0x000003ff94b7a896 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
>
> [...]
>
> > > And here's 3528:
> > > Thread 6 (Thread 0x3ff6ccfd910 (LWP 3841)):
> > > #0  0x000003ffb1b8d936 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x2aa387a6aac)
> > >     at ../sysdeps/unix/sysv/linux/futex-internal.h:88
> > > #1  0x000003ffb1b8d936 in __pthread_cond_wait_common (abstime=0x0, mutex=0x2aa387a6a58, cond=0x2aa387a6a80)
> > >     at pthread_cond_wait.c:502
> > > #2  0x000003ffb1b8d936 in __pthread_cond_wait (cond=cond@entry=0x2aa387a6a80, mutex=mutex@entry=0x2aa387a6a58)
> > >     at pthread_cond_wait.c:655
> > > #3  0x000002aa36bf072c in qemu_sem_wait (sem=sem@entry=0x2aa387a6a58) at ../../util/qemu-thread-posix.c:328
> > > #4  0x000002aa366c369a in postcopy_pause_incoming (mis=<optimized out>) at ../../migration/savevm.c:2541
>
> Same on the destination side.
>
> > > #5  0x000002aa366c369a in qemu_loadvm_state_main (f=f@entry=0x2aa38897930, mis=mis@entry=0x2aa387a6820)
> > >     at ../../migration/savevm.c:2615
> > > #6  0x000002aa366c44fa in postcopy_ram_listen_thread (opaque=opaque@entry=0x0) at ../../migration/savevm.c:1830
> > > #7  0x000002aa36befa3a in qemu_thread_start (args=<optimized out>) at ../../util/qemu-thread-posix.c:521
> > > #8  0x000003ffb1b87aa8 in start_thread (arg=0x3ff6ccfd910) at pthread_create.c:463
> > > #9  0x000003ffb1a7a896 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
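
And the wake-up half, again as a sketch with the same made-up names
(reusing MigState from above): whatever re-establishes the migration
channel is supposed to post the semaphore. If that kick is never sent,
for instance because the recovery handshake itself fails, both sides
stay parked, which matches the two backtraces.

#include <semaphore.h>

typedef struct MigState { sem_t pause_sem; } MigState;

/* Runs on the recovery path once a new migration channel is attached. */
static void kick_paused_migration(MigState *s)
{
    sem_post(&s->pause_sem);    /* wakes the sem_wait() in both traces */
}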
>
> Peter, could you enable QTEST_LOG=1 in your future migration-test runs
> and try to capture the stderr?  With the help of commit a47295014d
> ("migration-test: Only hide error if !QTEST_LOG", 2020-10-26), the test
> should be able to dump quite a lot of helpful information to help
> identify the issue further.
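
For anyone following along: my reading of that commit is that it merely
gates the stderr redirection on QTEST_LOG. A sketch of the effect, not
the verbatim diff (hide_stderr and the returned strings are illustrative,
not the exact code in tests/qtest/migration-test.c):

#include <stdlib.h>

/* Returns the redirection fragment appended to the QEMU command line. */
static const char *stderr_redirect(int hide_stderr)
{
    /* With QTEST_LOG set, stderr is no longer thrown away, so
     * migration errors from the QEMU processes land in the test log. */
    if (hide_stderr && !getenv("QTEST_LOG")) {
        return " 2>/dev/null";
    }
    return "";
}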

Here's the result of running just the migration test with
QTEST_LOG=1:
https://people.linaro.org/~peter.maydell/migration.log
It's 300MB because when the test hangs, one of the processes
apparently sits in a polling state and keeps sending status
queries.

My impression is that the test is OK on an unloaded machine but
more likely to fail if the box is doing other things at the
same time. Alternatively it might be a 'parallel make check' bug.

thanks
-- PMM


