qemu-block

Re: [PATCH] mirror: Avoid assertion failed in mirror_run


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [PATCH] mirror: Avoid assertion failed in mirror_run
Date: Thu, 9 Dec 2021 19:33:42 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.2.0

08.12.2021 12:52, wang.yi59@zte.com.cn wrote:
>> [CC-ing qemu-block, Vladimir, Kevin, and John – when sending patches,
>> please look into the MAINTAINERS file or use the
>> scripts/get_maintainer.pl script to find out who to CC on them.  It’s
>> very easy to overlook patches on qemu-devel :/]

>> On 07.12.21 11:56, Yi Wang wrote:
>>> From: Long YunJian <long.yunjian@zte.com.cn>
>>>
>>> When running blockcommit from the active leaf node, we sometimes hit an
>>> assertion failure with the message
>>> "mirror_run: Assertion `QLIST_EMPTY(&bs->tracked_requests)' failed".
>>> According to the core file, bs->tracked_requests still contains an I/O
>>> request, so the assertion fails.
>>> (gdb) bt
>>> #0  0x00007f410df707cf in raise () from /lib64/libc.so.6
>>> #1  0x00007f410df5ac05 in abort () from /lib64/libc.so.6
>>> #2  0x00007f410df5aad9 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
>>> #3  0x00007f410df68db6 in __assert_fail () from /lib64/libc.so.6
>>> #4  0x0000556915635371 in mirror_run (job=0x556916ff8600, errp=<optimized out>) at block/mirror.c:1092
>>> #5  0x00005569155e6c53 in job_co_entry (opaque=0x556916ff8600) at job.c:904
>>> #6  0x00005569156d9483 in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:115
>>> (gdb) p s->mirror_top_bs->backing->bs->tracked_requests
>>> $12 = {lh_first = 0x7f3f07bfb8b0}
>>> (gdb) p s->mirror_top_bs->backing->bs->tracked_requests->lh_first
>>> $13 = (struct BdrvTrackedRequest *) 0x7f3f07bfb8b0

>>> Actually, before executing assert(QLIST_EMPTY(&bs->tracked_requests)),
>>> mirror_run executes mirror_flush(s), which may handle new I/O requests,
>>> so I/O may be pending during this flush. Just as bdrv_close() pairs
>>> bdrv_flush(bs) with a bdrv_drain(bs) call, we should add a bdrv_drain()
>>> call after mirror_flush() to handle the pending I/O.

>> Oh.  How is that happening, though?  I would have expected that flushing
>> the target BB (and associated BDS) only flushes requests to the OS and
>> lower layers, but the source node (which is `bs`) should (in the case of
>> commit) always be above the target, so I wouldn’t have expected it to
>> get any new requests due to this flush.
>>
>> Do you have a reproducer for this?

> As far as I know, a flush may perform some writes, and then in the
> qcow2_co_pwritev function, if another coroutine already holds "s->lock",
> qemu_co_mutex_lock(&s->lock) will go into qemu_coroutine_yield and do
> other things. Maybe it will handle new I/O at that point.

No, they must not, as we are in a drained section. All possible producers of
new I/O requests should be aware of it and should not create new requests.
Still, history knows of bugs where requests were created during a drained
section; see commit cf3129323f900ef5ddbccbe8.
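To make the rule concrete, here is a toy model in C of a request producer that honors a drained section: while a counter named after bs->quiesce_counter is non-zero, it parks new requests instead of tracking them, and releases them when the last drained section ends. All names (ToyNode, toy_submit and so on) are hypothetical, and the "drain" simply marks in-flight requests as done; this is a sketch of the invariant, not QEMU's implementation.

```c
#include <assert.h>

/* Hypothetical, heavily simplified node: counters instead of real lists. */
typedef struct ToyNode {
    int quiesce_counter; /* > 0 while inside a drained section */
    int tracked;         /* requests currently in flight (tracked) */
    int parked;          /* requests deferred by a drain-aware producer */
} ToyNode;

/* Enter a drained section. A real drain waits for in-flight requests to
 * complete; this toy simply marks them all as completed. */
void toy_drained_begin(ToyNode *n)
{
    n->quiesce_counter++;
    n->tracked = 0;
}

/* Leave a drained section; the last one out releases parked requests. */
void toy_drained_end(ToyNode *n)
{
    assert(n->quiesce_counter > 0);
    if (--n->quiesce_counter == 0) {
        n->tracked += n->parked;
        n->parked = 0;
    }
}

/* A producer that is aware of drained sections: it never creates a tracked
 * request while quiesce_counter > 0; it parks the request instead. */
void toy_submit(ToyNode *n)
{
    if (n->quiesce_counter > 0) {
        n->parked++;
    } else {
        n->tracked++;
    }
}
```

In this model, the mirror job's emptiness assertion corresponds to "tracked == 0" between toy_drained_begin() and toy_drained_end(); a producer that skipped the quiesce_counter check is exactly what would break it.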

So, if inside a drained section (after the bdrv_drained_begin() call has
returned) we see something in bs->tracked_requests, that's probably a deeper
bug, and we shouldn't try to mask it with an additional bdrv_drain().
bdrv_drain() inside a drained section for the same bs should be a no-op.
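Why an extra drain would only mask the problem can be shown with another toy model (hypothetical names, not QEMU code): a drain helper that reports how many requests it had to wait for. Inside a correct drained section the extra drain finds nothing and returns 0; a non-zero count means some request was created while drained, and the extra drain would make the later emptiness check pass without ever explaining where that request came from.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified node state. */
typedef struct ToyBs {
    int quiesce_counter; /* > 0 inside a drained section */
    int in_flight;       /* tracked requests not yet completed */
} ToyBs;

/* Toy drain: completes every in-flight request and reports how many
 * it had to wait for. */
int toy_drain(ToyBs *bs)
{
    int waited = bs->in_flight;
    bs->in_flight = 0;
    return waited;
}

void toy_drained_begin(ToyBs *bs)
{
    bs->quiesce_counter++;
    (void)toy_drain(bs);
}

/* An extra drain inside a drained section. Returns true if it was a no-op,
 * false if it silently completed requests that should never have been
 * created while drained (i.e. it masked a bug). */
bool toy_extra_drain_is_noop(ToyBs *bs)
{
    assert(bs->quiesce_counter > 0);
    return toy_drain(bs) == 0;
}
```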

Could you investigate a bit more? The simplest thing to do is to look at the
coroutine of this tracked request; it may help to catch the source of the
request. To do this, you can use the coroutine command from
scripts/qemu-gdb.py, which shows the backtrace of a coroutine. Unfortunately,
it doesn't work for core dumps, only for a live process.

So, you'll need to:

1. start your VM;
2. attach gdb to the QEMU process, and in gdb run
   "source /path/to/qemu/scripts/qemu-gdb.py";
3. reproduce the problem;
4. in gdb, run "qemu coroutine COROUTINE_POINTER", where COROUTINE_POINTER is
   the .co field of s->mirror_top_bs->backing->bs->tracked_requests->lh_first.

It should print the backtrace of the coroutine.
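Step 4 amounts to following lh_first to the first tracked request and reading its co member. The C sketch below mimics that layout with simplified stand-in types: the lh_first and le_next names follow the QLIST_HEAD/QLIST_ENTRY conventions, but ToyTrackedRequest and ToyCoroutine are hypothetical and omit most real fields (including le_prev). It walks the whole list, since more than one request may be tracked and each has its own coroutine worth inspecting.

```c
#include <assert.h>
#include <stddef.h>

/* Opaque stand-in for QEMU's Coroutine; only its address matters here. */
typedef struct ToyCoroutine { int id; } ToyCoroutine;

/* Simplified, hypothetical stand-in for BdrvTrackedRequest. */
typedef struct ToyTrackedRequest {
    ToyCoroutine *co;                      /* coroutine that issued the request */
    struct {
        struct ToyTrackedRequest *le_next; /* QLIST_ENTRY-style link */
    } list;
} ToyTrackedRequest;

/* QLIST_HEAD-style head, like bs->tracked_requests. */
typedef struct {
    ToyTrackedRequest *lh_first;
} ToyTrackedList;

/* Collect the coroutine pointer of every tracked request, in list order.
 * Returns the number of pointers written (at most max). */
size_t toy_collect_coroutines(const ToyTrackedList *head,
                              ToyCoroutine **out, size_t max)
{
    size_t n = 0;
    for (ToyTrackedRequest *req = head->lh_first;
         req != NULL && n < max;
         req = req->list.le_next) {
        out[n++] = req->co; /* the value to pass to "qemu coroutine" */
    }
    return n;
}
```

In the live gdb session, the first pointer would then be printed with something like "p s->mirror_top_bs->backing->bs->tracked_requests.lh_first->co", following the field path given above.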

Another approach could be to set a breakpoint on adding an element to
tracked_requests, with the condition bs->quiesce_counter > 0 (which, as I
understand it, is effectively a kind of "drain counter").

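The conditional-breakpoint idea can also be expressed as an in-code check, which is sometimes easier to leave running while reproducing. Below is a toy C sketch with hypothetical names: every time a request is added to the tracked list, the hook checks quiesce_counter and, if it is positive, records a caller tag (standing in for the backtrace a breakpoint would give) instead of aborting. In QEMU, the corresponding spot is assumed to be tracked_request_begin() in block/io.c; verify that against your tree before relying on it.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_VIOLATIONS 8

/* Hypothetical, simplified node state. */
typedef struct ToyNode {
    int quiesce_counter;                        /* the "drain counter" */
    int tracked;                                /* number of tracked requests */
    const char *violations[TOY_MAX_VIOLATIONS]; /* callers that added while drained */
    size_t n_violations;
} ToyNode;

/* Analog of a conditional breakpoint on "add to tracked_requests" with the
 * condition quiesce_counter > 0: the request is still tracked, but the
 * caller is logged (up to TOY_MAX_VIOLATIONS entries) for later inspection. */
void toy_track_request(ToyNode *bs, const char *caller)
{
    if (bs->quiesce_counter > 0 && bs->n_violations < TOY_MAX_VIOLATIONS) {
        bs->violations[bs->n_violations++] = caller;
    }
    bs->tracked++;
}
```

Logging rather than asserting keeps the guest running, so several offending callers can be collected in one reproduction run.
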
--
Best regards,
Vladimir


