qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC PATCH v2 0/8] Removal of AioContext lock, bs->parents and ->chi


From: Emanuele Giuseppe Esposito
Subject: Re: [RFC PATCH v2 0/8] Removal of AioContext lock, bs->parents and ->children: new rwlock
Date: Wed, 18 May 2022 14:28:41 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.2.0


Am 17/05/2022 um 12:59 schrieb Stefan Hajnoczi:
> On Wed, May 04, 2022 at 02:39:05PM +0100, Stefan Hajnoczi wrote:
>> On Tue, Apr 26, 2022 at 04:51:06AM -0400, Emanuele Giuseppe Esposito wrote:
>>> This is a new attempt to replace the need to take the AioContext lock to
>>> protect graph modifications. In particular, we aim to remove
>>> (or better, substitute) the AioContext around bdrv_replace_child_noperm,
>>> since this function changes BlockDriverState's ->parents and ->children
>>> lists.
>>>
>>> In the previous version, we decided to discard using subtree_drains to
>>> protect the nodes, for different reasons: for those unfamiliar with it,
>>> please see 
>>> https://patchew.org/QEMU/20220301142113.163174-1-eesposit@redhat.com/
>>
>> I reread the thread and it's unclear to me why drain is the wrong
>> mechanism for protecting graph modifications. We theorized a lot but
>> ultimately is this new mechanism sufficiently different from
>> bdrv_drained_begin()/end() to make it worth implementing?
>>
>> Instead of invoking .drained_begin() callbacks to stop further I/O,
>> we're now queuing coroutines (without backpressure information that
>> whoever is spawning I/O needs so they can stop). The writer still waits
>> for in-flight I/O to finish, including I/O not associated with the bdrv
>> graph we wish to modify (because rdlock is per-AioContext and unrelated
>> to a specific graph). Is this really more lightweight than drain?
>>
>> If I understand correctly, the original goal was to avoid the need to
>> hold the AioContext lock across bdrv_replace_child_noperm(). I would
>> focus on that and use drain for now.
>>
>> Maybe I've missed an important point about why the new mechanism is
>> needed?
> 
> Ping?

label: // read till the end to see why I wrote this here

I was hoping someone from the "No" party would answer to your question,
because as you know we reached this same conclusion together.

We thought about dropping the drain for various reasons, the main one
(at least as far as I understood) is that we are not sure if something
can still happen in between drain_begin/end, and it is a little bit
confusing to use the same mechanism to block I/O and protect the graph.

We then thought about implementing a rwlock. A rdlock would clarify what
we are protecting and who is using the lock. I had a rwlock draft
implementation sent in this thread, but this also lead to additional
problems.
Main problem was that this new lock would introduce nested event loops,
that together with such locking would just create deadlocks.
If readers are in coroutines and writers are not (because graph
operations are not running in coroutines), we have a lot of deadlocks.
If a writer has to take the lock, it must wait all other readers to
finish. And it does it by internally calling AIO_WAIT_WHILE, creating
nested event loop. We don't know what could execute when polling for
events, and for example another writer could be resumed.
Ideally, we want writers in coroutines too.

Additionally, many readers are running in what we call "mixed"
functions: usually implemented automatically with generated_co_wrapper
tag, they let a function (usually bdrv callback) run always in a
coroutine, creating one if necessary. For example, bdrv_flush() makes
sure hat bs->bdrv_co_flush() is always run in a coroutine.
Such mixed functions are used in other callbacks too, making it really
difficult to understand if we are in a coroutine or not, and mostly
important make rwlock usage very difficult.

Which lead us to stepping back once more and try to convert all
BlockDriverState callbacks in coroutines. This would greatly simplify
rwlock usage, because we could make the rwlock coroutine-frendly
(without any AIO_WAIT_WHILE, allowing a writer to wait for readers by
just yielding and queuing itself in coroutine queues).

First step was then to convert all callbacks in coroutines, using
generated_coroutine_wrapper (g_c_w).
A typical g_c_w is implemented in this way:
        if (qemu_in_coroutine()) {
                callback();
        } else { // much simplified
                co = qemu_coroutine_create(callback);
                bdrv_coroutine_enter(bs, co);
                BDRV_POLL_WHILE(bs, coroutine_in_progress);
        }
Once all callbacks are implemented using g_c_w, we can start splitting
the two sides of the if function to only create a coroutine when we are
outside from a bdrv callback.

However, we immediately found a problem while starting to convert the
first callbacks: the AioContext lock is taken around some non coroutine
callbacks! For example, bs->bdrv_open() is always called with the
AioContext lock taken. In addition, callbacks like bdrv_open are
graph-modifying functions, which is probably why we are taking the
Aiocontext lock, and they do not like to run in coroutines.
Anyways, the real problem comes when we create a coroutine in such
places where the AioContext lock is taken and we have a graph-modifying
function.

bdrv_coroutine_enter() calls aio_co_enter(), which in turns first checks
 if the coroutine is entering another context from the current (which is
not the case for open) and if we are already in coroutine (for sure
not). Therefore it resorts to the following calls;
        aio_context_acquire(ctx);
        qemu_aio_coroutine_enter(ctx, co);
        aio_context_release(ctx);
Which is clearly a problem, because we are taking the lock twice: once
from the original caller of the callback, and once here due to the
coroutine. This creates a lot of deadlock situations.

For example, all callers of bdrv_open() always take the AioContext lock.
Often it is taken very high in the call stack, but it's always taken.

Getting rid of the lock around qemu_aio_coroutine_enter() is difficult
too, because coroutines expect to have the lock taken. For example, if
we want to drain from a coroutine, bdrv_co_yield_to_drain releases the
lock for us.

The solution is to get rid of the aio_context lock that was originally
taken in the caller, but the problem is: can we get rid of the
AioContext lock? What is that lock protecting, being so high in the call
stack?
What is the alternative for its usage? We are getting rid of it and
replacing it with nothing.
We are going to drop the AioContext lock to allow coroutines callback,
but leave graph operations unprotected.

We might suppose that many callbacks are called under drain and in
GLOBAL_STATE, which should be enough, but from our experimentation in
the previous series we saw that currently not everything is under drain,
leaving some operations unprotected (remember assert_graph_writable
temporarily disabled, since drain coverage for bdrv_replace_child_noperm
was not 100%?).
Therefore we need to add more drains. But isn't drain what we decided to
drop at the beginning? Why isn't drain good?
goto label;

Thank you,
Emanuele
> 
> Stefan
> 




reply via email to

[Prev in Thread] Current Thread [Next in Thread]