qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: s390 migration crash


From: Peter Xu
Subject: Re: s390 migration crash
Date: Tue, 21 Mar 2023 20:19:00 -0400

On Tue, Mar 21, 2023 at 08:24:37PM +0000, Dr. David Alan Gilbert wrote:
> Hi Peter's,
>   Peter M pointed me to a seg in a migration test in CI; I can reproduce
> it:
>   * On an s390 host

How easy to reproduce?

>   * only as part of a make check - running migration-test by itself
> doesn't trigger for me.
>   * It looks like it's postcopy preempt
> 
> (gdb) bt full
> #0  iov_size (iov=iov@entry=0x2aa00e60670, iov_cnt=<optimized out>) at 
> ../util/iov.c:88
>         len = 13517923312037845750
>         i = 17305
> #1  0x000002aa004d068c in qemu_fflush (f=0x2aa00e58630) at 
> ../migration/qemu-file.c:307
>         local_error = 0x0
> #2  0x000002aa004d0e04 in qemu_fflush (f=<optimized out>) at 
> ../migration/qemu-file.c:297
> #3  0x000002aa00613962 in postcopy_preempt_shutdown_file 
> (s=s@entry=0x2aa00d1b4e0) at ../migration/ram.c:4657
> #4  0x000002aa004e12b4 in migration_completion (s=0x2aa00d1b4e0) at 
> ../migration/migration.c:3469
>         ret = <optimized out>
>         current_active_state = 5
>         must_precopy = 0
>         can_postcopy = 0
>         in_postcopy = true
>         pending_size = 0
>         __func__ = "migration_iteration_run"
>         iter_state = <optimized out>
>         s = 0x2aa00d1b4e0
>         thread = <optimized out>
>         setup_start = <optimized out>
>         thr_error = <optimized out>
>         urgent = <optimized out>
> #5  migration_iteration_run (s=0x2aa00d1b4e0) at ../migration/migration.c:3882
>         must_precopy = 0
>         can_postcopy = 0
>         in_postcopy = true
>         pending_size = 0
>         __func__ = "migration_iteration_run"
>         iter_state = <optimized out>
>         s = 0x2aa00d1b4e0
>         thread = <optimized out>
>         setup_start = <optimized out>
>         thr_error = <optimized out>
>         urgent = <optimized out>
> #6  migration_thread (opaque=opaque@entry=0x2aa00d1b4e0) at 
> ../migration/migration.c:4124
>         iter_state = <optimized out>
>         s = 0x2aa00d1b4e0
> --Type <RET> for more, q to quit, c to continue without paging--
>         thread = <optimized out>
>         setup_start = <optimized out>
>         thr_error = <optimized out>
>         urgent = <optimized out>
> #7  0x000002aa00819b8c in qemu_thread_start (args=<optimized out>) at 
> ../util/qemu-thread-posix.c:541
>         __cancel_buf = 
>             {__cancel_jmp_buf = {{__cancel_jmp_buf = {{__gregs = 
> {4396782422080, 4393751543808, 4397299389454, 4396844235904, 2929182727824, 
> 2929182933488, 4396843986792, 4397299389455, 33679382915066768, 
> 33678512846981306}, __fpregs = {4396774031360, 8392704, 2929182933488, 0, 
> 4396782422272, 2929172491858, 4396774031360, 1}}}, __mask_was_saved = 0}}, 
> __pad = {0x3ffb4a77a60, 0x0, 0x0, 0x0}}
>         __cancel_routine = 0x2aa00819bf0 <qemu_thread_atexit_notify>
>         __not_first_call = <optimized out>
>         start_routine = 0x2aa004e08f0 <migration_thread>
>         arg = 0x2aa00d1b4e0
>         r = <optimized out>
> #8  0x000003ffb7b1e2e6 in start_thread () at /lib64/libc.so.6
> #9  0x000003ffb7aafdbe in thread_start () at /lib64/libc.so.6
> 
> It looks like it's in the preempt test:
> 
> (gdb) where
> #0  0x000003ffb17a0126 in __pthread_kill_implementation () from 
> /lib64/libc.so.6
> #1  0x000003ffb1750890 in raise () from /lib64/libc.so.6
> #2  0x000003ffb172a340 in abort () from /lib64/libc.so.6
> #3  0x000002aa0041c130 in qtest_check_status (s=<optimized out>) at 
> ../tests/qtest/libqtest.c:194
> #4  0x000003ffb1a3b5de in g_hook_list_invoke () from /lib64/libglib-2.0.so.0
> #5  <signal handler called>
> #6  0x000003ffb17a0126 in __pthread_kill_implementation () from 
> /lib64/libc.so.6
> #7  0x000003ffb1750890 in raise () from /lib64/libc.so.6
> #8  0x000003ffb172a340 in abort () from /lib64/libc.so.6
> #9  0x000002aa00420318 in qmp_fd_receive (fd=<optimized out>) at 
> ../tests/qtest/libqmp.c:80
> #10 0x000002aa0041d5ee in qtest_qmp_receive_dict (s=0x2aa01eb2700) at 
> ../tests/qtest/libqtest.c:713
> #11 qtest_qmp_receive (s=0x2aa01eb2700) at ../tests/qtest/libqtest.c:701
> #12 qtest_vqmp (s=s@entry=0x2aa01eb2700, fmt=fmt@entry=0x2aa00487100 "{ 
> 'execute': 'query-migrate' }", ap=ap@entry=0x3ffc247cc68)
>     at ../tests/qtest/libqtest.c:765
> #13 0x000002aa00413f1e in wait_command (who=who@entry=0x2aa01eb2700, 
> command=command@entry=0x2aa00487100 "{ 'execute': 'query-migrate' }")
>     at ../tests/qtest/migration-helpers.c:73
> #14 0x000002aa00414078 in migrate_query (who=who@entry=0x2aa01eb2700) at 
> ../tests/qtest/migration-helpers.c:139
> #15 migrate_query_status (who=who@entry=0x2aa01eb2700) at 
> ../tests/qtest/migration-helpers.c:161
> #16 0x000002aa00414480 in check_migration_status (ungoals=0x0, 
> goal=0x2aa00495c7e "completed", who=0x2aa01eb2700) at 
> ../tests/qtest/migration-helpers.c:177
> #17 wait_for_migration_status (who=0x2aa01eb2700, goal=<optimized out>, 
> ungoals=0x0) at ../tests/qtest/migration-helpers.c:202
> #18 0x000002aa0041300e in migrate_postcopy_complete 
> (from=from@entry=0x2aa01eb2700, to=to@entry=0x2aa01eb3000, 
> args=args@entry=0x3ffc247cf48)
>     at ../tests/qtest/migration-test.c:1137
> #19 0x000002aa004131a4 in test_postcopy_common (args=0x3ffc247cf48) at 
> ../tests/qtest/migration-test.c:1162
> #20 test_postcopy_preempt () at ../tests/qtest/migration-test.c:1178
> 
> Looking at the iov and file it's garbage; so it makes me think this is
> something like a flush on a closed file.

I didn't figure out how that could be closed, but I think there's indeed a
possible race that the qemufile can be accessed by both the return path
thread and the migration thread concurrently, while qemufile is not thread
safe on that.

What postcopy_preempt_shutdown_file() does was: the src uses this EOS to
kick the dest QEMU preempt thread out of the migration and shut it off.
After some thought I think this is unnecessary complexity, since postcopy
should end at the point where dest received all the data, then it sends a
SHUT to src.  So potentially it's not good to have dest relying on anything
from src to shutdown anything (the preempt thread here) because it's the
dest qemu that makes the final decision to finish.  Ideally the preempt
thread on dest should be able to shutdown itself.

The trick here is preempt thread will block at read() (aka, recvmsg()) at
the channel at that time and the only way to kick it out from that is a
shutdown() on dest.  I attached a patch did it.  I'm not 100% sure whether
it'll already resolve our problem but worth trying.  This also made me
notice we forgot to enable SHUTDOWN feature on tls server when I was
running the patch 1 with qtest, so two patches needed.

-- 
Peter Xu

Attachment: 0001-io-tls-Inherit-QIO_CHANNEL_FEATURE_SHUTDOWN-on-serve.patch
Description: Text document

Attachment: 0002-migration-Fix-potential-race-on-postcopy_qemufile_sr.patch
Description: Text document


reply via email to

[Prev in Thread] Current Thread [Next in Thread]