From: Kevin Wolf
Subject: Re: [RFC PATCH v2 0/8] Removal of AioContext lock, bs->parents and ->children: new rwlock
Date: Mon, 23 May 2022 15:02:05 +0200

On 22.05.2022 at 17:06, Stefan Hajnoczi wrote:
> Hi,
> Sorry for the long email. I've included three topics that may help us discuss
> draining and AioContext removal further.
> 
> The shortcomings of drain
> -------------------------
> I wanted to explore the logical argument that making graph modifications
> within a drained section is correct:
> - Graph modifications and BB/BDS lookup are Global State (GS).
> - Graph traversal from a BB/BDS pointer is IO.
> - Only one thread executes GS code at a time.
> - IO is quiesced within a drained section.

I think you're mixing two things here and calling both of them IO.

If a function is marked as IO, that means that it is safe to call from
I/O requests, which may be running in any iothread (they currently only
run in the home thread of the node's AioContext, but the function
annotations already promise that any thread is safe).

However, if a function is marked as IO, this doesn't necessarily mean
that it is always only called in the context of an I/O request. It can
be called by any other code, too.

So while drain does quiesce all I/O requests, this doesn't mean that
functions marked as IO won't run any more.
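
For reference, the runtime side of these annotations is roughly this
(simplified; see Emanuele's patches for the real macros):

    /* GS functions can assert that they run in the main thread under
     * the BQL. For IO functions no equivalent runtime check exists,
     * because they may legitimately run in any thread. */
    #define GLOBAL_STATE_CODE() \
        do { assert(qemu_in_main_thread()); } while (0)

    #define IO_CODE() \
        do { /* marker only, no runtime check */ } while (0)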

> - Therefore a drained section in GS code suspends graph traversal, other graph
>   modifications, and BB/BDS lookup.
> - Therefore it is safe to modify the graph from a GS drained section.

So unfortunately, I don't think this conclusion is correct.

In order to make your assumption true, we would have to require that all
callers of IO functions must have called bdrv_inc_in_flight(). We would
also have to find a way to enforce this, preferably at compile time or
with static analysis, or at least with runtime assertions, because it
would be very easy to get wrong.
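
A runtime assertion could look like this (just a sketch; bdrv_foo() is
a made-up IO function):

    /* Hypothetical: every IO-marked function asserts that its caller
     * holds an in_flight reference on the node, so drain cannot think
     * the node is quiescent while the caller is still active. */
    void bdrv_foo(BlockDriverState *bs)
    {
        assert(qatomic_read(&bs->in_flight) > 0);

        /* ... the actual IO work ... */
    }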

> However, I hit on a problem that I think Emanuele and Paolo have already
> pointed out: draining is GS & IO. This might have worked under the 1 IOThread
> model but it does not make sense for multi-queue. It is possible to submit I/O
> requests in drained sections. How can multiple threads be in drained sections
> simultaneously and possibly submit further I/O requests in their drained
> sections? Those sections wouldn't be "drained" in any useful sense of
> the word.

Drain asks other users not to touch the block node. Currently this
covers only I/O requests, but the desire to use drain for locking means
that we need to extend it to pretty much any operation on the node,
which would include calling drain on that block node. So you should
never have two callers in a drained section at the same time; that
would be a bug in this model.

Of course, we know that currently drain is not respected by all relevant
callers (most importantly, the monitor). If you want to use drain as a
form of locking, you need to solve this first.

> It is necessary to adjust how recursive drained sections work:
> bdrv_drained_begin() must not return while there are deeper nested drained
> sections.
> 
> This is allowed:
> 
>      Monitor command            Block job
>      ---------------            ---------
>   > bdrv_drained_begin()
>            .                 > bdrv_drained_begin()
>            .                 < bdrv_drained_begin()
>            .                          .
>            .                          .
>            .                 > bdrv_drained_end()
>            .                 < bdrv_drained_end()
>   < bdrv_drained_begin()
>            .
>            .
>   > bdrv_drained_end()
>   < bdrv_drained_end()

Just to make sure I understand the scenario that you have in mind here:
The monitor command is not what causes the block job to do the draining
because this is inside bdrv_drained_begin(), so the block job just
randomly got a callback that caused it to do so (maybe completion)?

Before bdrv_drained_begin() returns, anything is still allowed to run,
so I agree that this is valid.

> This is not allowed:
> 
>      Monitor command            Block job
>      ---------------            ---------
>   > bdrv_drained_begin()
>            .                 > bdrv_drained_begin()
>            .                 < bdrv_drained_begin()
>            .                          .
>            .                          .
>   < bdrv_drained_begin()              .
>            .                          .
>            .                 > bdrv_drained_end()
>            .                 < bdrv_drained_end()
>   > bdrv_drained_end()
>   < bdrv_drained_end()
> 
> This restriction creates an ordering between bdrv_drained_begin() callers. In
> this example the block job must not depend on the monitor command completing
> first. Otherwise there would be a deadlock, just like when two threads
> wait for each other while each holds a mutex.

So essentially drain needs to increase bs->in_flight, so that the outer
drain has to wait for the inner drain section to end before its
bdrv_drained_begin() can return.
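
In code, the idea would be something like this (a sketch of the
concept, not what the tree does today):

    /* The drained section itself holds an in_flight reference, so an
     * outer bdrv_drained_begin() cannot finish polling before every
     * inner section has called bdrv_drained_end(). The begin side then
     * waits for in_flight to drop to its own contribution, not to 0. */
    void bdrv_drained_begin(BlockDriverState *bs)
    {
        bdrv_inc_in_flight(bs);
        /* ... quiesce parents/children, poll until only our own
         * reference remains ... */
    }

    void bdrv_drained_end(BlockDriverState *bs)
    {
        /* ... resume parents/children ... */
        bdrv_dec_in_flight(bs);
    }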

> For sanity I think draining should be GS-only. IO code should not invoke
> bdrv_drained_begin() to avoid ordering problems when multiple threads drain
> simultaneously and have dependencies on each other.
> 
> block/mirror.c needs to be modified because it currently drains from IO when
> mirroring is about to end.
> 
> With this change to draining I think the logical argument for correctness with
> graph modifications holds.

What is your suggestion for how to modify mirror? It drains so that no new
requests can sneak in and source and target stay in sync while it
switches to the completion code running in the main AioContext. You
can't just drop this.

> Enforcing GS/IO separation at compile time
> ------------------------------------------
> Right now GS/IO asserts check assumptions at runtime. The next step may be
> using the type system to separate GS and IO APIs so it's impossible
> for IO code to accidentally call GS code, for example.
> 
>   typedef struct {
>       BlockDriverState *bs;
>   } BlockDriverStateIOPtr;
> 
>   typedef struct {
>       BlockDriverState *bs;
>   } BlockDriverStateGSPtr;
> 
> Then APIs can be protected as follows:
> 
>   void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...);
> 
> A function that only has a BlockDriverStateIOPtr cannot call
> bdrv_set_aio_context_ignore() by mistake since the compiler will complain that
> the first argument must be a BlockDriverStateGSPtr.

And then you have a set of functions that cast from GS to IO pointers,
but not the other way around? Or maybe getting an IO pointer would even
be coupled with automatically increasing bs->in_flight?
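
Something like this, I suppose (made-up helper name, just to illustrate
the direction):

    /* One-way conversion: GS code can produce an IO pointer, and doing
     * so could automatically take an in_flight reference. There is no
     * reverse helper, so IO code can never reach GS-only APIs. */
    static inline BlockDriverStateIOPtr
    bdrv_gs_to_io(BlockDriverStateGSPtr gs)
    {
        bdrv_inc_in_flight(gs.bs);
        return (BlockDriverStateIOPtr) { .bs = gs.bs };
    }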

The other option that we were looking into for this was static analysis.
I had hacked up a small script to check consistency of these annotations
(without covering function pointers, though) to help me with the review
of Emanuele's patches that introduced them. If I understand correctly,
Paolo has scripts to do the same a little bit better.

As long as we can integrate such a script in 'make check', we wouldn't
necessarily have to have the churn in the C code to switch everything to
new wrapper types.

> Possible steps for AioContext removal
> -------------------------------------
> I also wanted to share my assumptions about multi-queue and AioContext
> removal. Please let me know if anything seems wrong or questionable:
> 
> - IO code can execute in any thread that has an AioContext.
> - Multiple threads may execute IO code at the same time.
> - GS code only executes under the BQL.
> 
> For AioContext removal this means:
> 
> - bdrv_get_aio_context() becomes mostly meaningless since there is no need for
>   a special "home" AioContext.
> - bdrv_coroutine_enter() becomes mostly meaningless because there is no
>   need to run a coroutine in the BDS's AioContext.
> - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many
>   threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin()
>   may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue
>   virtio-blk device).
> - AIO_WAIT_WHILE() simplifies to
> 
>     while ((cond)) {
>         aio_poll(qemu_get_current_aio_context(), true);
>         ...
>     }

Probably not exactly, because you still need aio_wait_kick() to wake up
the thread. We use AioWait.num_waiters != 0 to decide whether we need to
schedule a BH in the main thread (because doing so unconditionally for
every completed request would be very wasteful).

If you want to be able to run this loop from any thread instead of just
the main thread, you would have to store somewhere which thread to wake.
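
So the loop would have to grow into something like this (sketch; the
waiter registration helpers are made up):

    /* Completed requests call aio_wait_kick(), which today only knows
     * how to wake the main loop. To run the loop from any thread, the
     * kick would need to know this thread's AioContext. */
    AioContext *ctx = qemu_get_current_aio_context();

    aio_waiter_register(ctx);        /* hypothetical */
    while ((cond)) {
        aio_poll(ctx, true);         /* woken by aio_wait_kick() */
    }
    aio_waiter_unregister(ctx);      /* hypothetical */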

>   and the distinction between home AioContext and non-home context is
>   eliminated. AioContext unlocking is dropped.
> 
> Does this make sense? I haven't seen these things in recent patch series.

Intuitively I would agree with most of it. There may be details that
I'm not aware of at the moment. I'm not surprised that we haven't seen
any such things in recent patch series because there is an awful lot of
preparatory work to be done before we can reach this final state.

Kevin
