From: Denis Plotnikov
Subject: Re: [Qemu-block] [PATCH] blk: postpone request execution on a context protected with "drained section"
Date: Tue, 2 Apr 2019 08:35:14 +0000
On 13.03.2019 19:04, Kevin Wolf wrote:
> Am 14.12.2018 um 12:54 hat Denis Plotnikov geschrieben:
>> On 13.12.2018 15:20, Kevin Wolf wrote:
>>> Am 13.12.2018 um 12:07 hat Denis Plotnikov geschrieben:
>>>> Sounds like it should be so, but it doesn't work that way, and here is
>>>> why: when doing mirror we may resume postponed coroutines too early,
>>>> while the underlying bs is protected from writing, and thus we hit the
>>>> assert on write request execution in bdrv_co_write_req_prepare when
>>>> resuming the postponed coroutines.
>>>>
>>>> The thing is that the bs is protected from writing before execution of
>>>> bdrv_replace_node in mirror_exit_common, and bdrv_replace_node calls
>>>> bdrv_replace_child_noperm which, in turn, calls child->role->drained_end,
>>>> where one of the callbacks is blk_root_drained_end, which checks
>>>> if (--blk->quiesce_counter == 0) and runs the postponed requests
>>>> (coroutines) if the condition is true.
>>>
>>> Hm, so something is messed up with the drain sections in the mirror
>>> driver. We have:
>>>
>>> bdrv_drained_begin(target_bs);
>>> bdrv_replace_node(to_replace, target_bs, &local_err);
>>> bdrv_drained_end(target_bs);
>>>
>>> Obviously, the intention was to keep the BlockBackend drained during
>>> bdrv_replace_node(). So how could blk->quiesce_counter ever get to 0
>>> inside bdrv_replace_node() when target_bs is drained?
>>>
>>> Looking at bdrv_replace_child_noperm(), it seems that the function has
>>> a bug: Even if old_bs and new_bs are both drained, the quiesce_counter
>>> for the parent reaches 0 for a moment because we call .drained_end for
>>> the old child first and .drained_begin for the new one later.
>>>
>>> So it seems the fix would be to reverse the order and first call
>>> .drained_begin for the new child and then .drained_end for the old
>>> child. Sounds like a good new testcase for tests/test-bdrv-drain.c, too.
>> Yes, it's true, but it's not enough...
>
> Did you ever implement the changes suggested so far, so that we could
> continue from there? Or should I try and come up with something myself?
Sorry for the late reply...
Yes, I did ...
>
>> In mirror_exit_common() we actively manipulate block driver states.
>> When we replace a node in the snippet you showed, we can't allow the
>> postponed coroutines to run because the block tree isn't ready to
>> receive the requests yet.
>> To be ready, we need to insert a proper block driver state into the block
>> backend, which is done here
>>
>> blk_remove_bs(bjob->blk);
>> blk_set_perm(bjob->blk, 0, BLK_PERM_ALL, &error_abort);
>> blk_insert_bs(bjob->blk, mirror_top_bs, &error_abort); << << << <<
>>
>> bs_opaque->job = NULL;
>>
>> bdrv_drained_end(src);
>
> Did you actually encounter a bug here or is this just theory? bjob->blk
> is the BlockBackend of the job and isn't in use at this point any more.
> We only insert the old node in it again because block_job_free() must
> set bs->job = NULL, and it gets bs with blk_bs(bjob->blk).
>
> So if there is an actual bug here, I don't understand it yet.
And I did encounter the bug that I described above.
When a postponed coroutine resumes, it fails on the assert:
bdrv_co_write_req_prepare: Assertion `child->perm & BLK_PERM_WRITE' failed
Here is why it happens: we have the mirror filter BDS in the blk root, which
receives all the requests. On mirror completion we call
mirror_exit_common to finish mirroring. To finish mirroring we need to
remove the mirror filter from the graph and set the mirror file as the blk root.
We call block_job_complete. Assume an IDE request has come in after the
completion call and has been postponed because blk->quiesce_counter
is not 0. block_job_complete runs mirror_exit_common, which drops the
permissions:
    /* We don't access the source any more. Dropping any WRITE/RESIZE is
     * required before it could become a backing file of target_bs. */
    bdrv_child_try_set_perm(mirror_top_bs->backing, 0, BLK_PERM_ALL,
                            &error_abort);

Then it replaces the source with the target:

    /* The mirror job has no requests in flight any more, but we need to
     * drain potential other users of the BDS before changing the graph. */
    // here, target_bs has no parents and doesn't begin to drain
    bdrv_drained_begin(target_bs);
    // after execution of the function below
    // target_bs has mirror_top_bs->backing as a parent
    bdrv_replace_node(to_replace, target_bs, &local_err);
    // now target_bs has source's blk as a parent
    // the following call sets blk->quiesce_counter to 0
    // and executes the postponed coroutine on blk with
    // the mirror filter set, which eventually writes
    // to the mirror_top_bs->backing child, which has no write
    // (or read) permissions
    bdrv_drained_end(target_bs);
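
To make the counter behaviour easier to see, here is a minimal standalone
sketch of the ordering issue in bdrv_replace_child_noperm() that was pointed
out earlier in the thread. It is not QEMU code: ToyBackend, toy_drained_begin,
toy_drained_end and toy_replace_child are hypothetical names, and the model
assumes a single parent whose postponed requests are resumed as soon as its
quiesce counter drops to 0.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a BlockBackend; only the fields needed here. */
typedef struct ToyBackend {
    int quiesce_counter;  /* > 0 means the parent is drained */
    int postponed;        /* requests queued while drained */
    int resumed_early;    /* requests resumed before the graph was ready */
    bool graph_ready;     /* set once the graph manipulation is finished */
} ToyBackend;

static void toy_drained_begin(ToyBackend *blk)
{
    blk->quiesce_counter++;
}

static void toy_drained_end(ToyBackend *blk)
{
    assert(blk->quiesce_counter > 0);
    if (--blk->quiesce_counter == 0) {
        /* Postponed requests run now, whether or not the graph is ready. */
        if (!blk->graph_ready) {
            blk->resumed_early += blk->postponed;
        }
        blk->postponed = 0;
    }
}

/*
 * Swap a drained old child for a drained new child under one parent.
 * Returns how many postponed requests were resumed too early.
 */
static int toy_replace_child(bool begin_first)
{
    /* The old child is drained and one request has been postponed. */
    ToyBackend blk = { .quiesce_counter = 1, .postponed = 1 };

    if (begin_first) {
        toy_drained_begin(&blk);  /* new child: counter 1 -> 2 */
        toy_drained_end(&blk);    /* old child: counter 2 -> 1 */
    } else {
        toy_drained_end(&blk);    /* old child: counter 1 -> 0 */
        toy_drained_begin(&blk);  /* new child: counter 0 -> 1 */
    }

    blk.graph_ready = true;
    toy_drained_end(&blk);        /* caller leaves its drained section */
    return blk.resumed_early;
}
```

In this toy model, toy_replace_child(false) reports one request resumed too
early, while toy_replace_child(true) reports none. It only covers the
ordering inside the child replacement; it does not model the permission
drop described above.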
Does that make things clearer?
Denis
>
>> If the tree isn't ready and we resume the coroutines, we'll end up with
>> the request landing in the wrong block driver state.
>>
>> So, we should explicitly stop all activities on all the driver states
>> and their parents and allow the activities when everything is ready to go.
>>
>> Why explicitly? Because the block driver states may belong to different
>> block backends at the moment the manipulation begins.
>>
>> So, it seems we need to disable all their contexts until the
>> manipulation ends.
>
> If there actually is a bug, it is certainly not solved by calling
> aio_disable_external() (it is bad enough that this even exists), but by
> keeping the node drained.
>
> Kevin
>
--
Best,
Denis