qemu-block
From: Stefan Hajnoczi
Subject: Re: [RFC PATCH v2 0/8] Removal of AioContext lock, bs->parents and ->children: new rwlock
Date: Mon, 23 May 2022 16:13:32 +0100

On Mon, May 23, 2022 at 03:02:05PM +0200, Kevin Wolf wrote:
> Am 22.05.2022 um 17:06 hat Stefan Hajnoczi geschrieben:
> > Hi,
> > Sorry for the long email. I've included three topics that may help us
> > discuss draining and AioContext removal further.
> > 
> > The shortcomings of drain
> > -------------------------
> > I wanted to explore the logical argument that making graph
> > modifications within a drained section is correct:
> > - Graph modifications and BB/BDS lookup are Global State (GS).
> > - Graph traversal from a BB/BDS pointer is IO.
> > - Only one thread executes GS code at a time.
> > - IO is quiesced within a drained section.
> 
> I think you're mixing two things here and calling both of them IO.
> 
> If a function is marked as IO, that means that it is safe to call from
> I/O requests, which may be running in any iothread (they currently only
> run in the home thread of the node's AioContext, but the function
> annotations already promise that any thread is safe).
> 
> However, if a function is marked as IO, this doesn't necessarily mean
> that it is always only called in the context of an I/O request. It can
> be called by any other code, too.
> 
> So while drain does quiesce all I/O requests, this doesn't mean that
> functions marked as IO won't run any more.

My model is that between bdrv_drained_begin() and bdrv_drained_end()
only the caller is allowed to invoke BB/BDS functions (including
functions marked IO).

The caller isn't strictly a single thread with at most one coroutine. The
caller could use multiple threads and coroutines, but the important thing
is that everything else that accesses the BB/BDS is suspended by the
.drained_begin() callbacks (and is_external).

So when you say a function marked IO can be called from outside an I/O
request, that "outside an I/O request" code must be quiesced too.
Otherwise drain is definitely not safe for graph modifications.
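
(As a concrete illustration of this model, the pattern might look like
the following sketch; bdrv_replace_node() stands in for an arbitrary
graph modification, and bs/from/to are assumed to exist:)

    /* GS code, BQL held */
    bdrv_drained_begin(bs);
    /* Everything else that accesses the BB/BDS is now quiesced, either
     * because its requests have drained or because .drained_begin()
     * callbacks / is_external stopped it. Only we may touch the graph. */
    bdrv_replace_node(from, to, &error_abort);
    bdrv_drained_end(bs);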

> > - Therefore a drained section in GS code suspends graph traversal,
> >   other graph modifications, and BB/BDS lookup.
> > - Therefore it is safe to modify the graph from a GS drained section.
> 
> So unfortunately, I don't think this conclusion is correct.
> 
> In order to make your assumption true, we would have to require that all
> callers of IO functions must have called bdrv_inc_in_flight(). We would
> also have to find a way to enforce this, preferably at compile time or
> with static analysis, or at least with runtime assertions, because it
> would be very easy to get wrong.

Or that they are quiesced by .drained_begin() callbacks or is_external.

Do you have a concrete example?
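
(For what it's worth, the runtime assertion you mention might be as
simple as the following sketch at the top of each IO function, assuming
quiesced callers can never reach it:)

    /* Hypothetical check: IO functions may only run while some caller
     * holds bs->in_flight; anything quiesced by drain never gets here. */
    assert(qatomic_read(&bs->in_flight) > 0);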

> 
> > However, I hit on a problem that I think Emanuele and Paolo have
> > already pointed out: draining is GS & IO. This might have worked under
> > the 1 IOThread model but it does not make sense for multi-queue. It is
> > possible to submit I/O requests in drained sections. How can multiple
> > threads be in drained sections simultaneously and possibly submit
> > further I/O requests in their drained sections? Those sections wouldn't
> > be "drained" in any useful sense of the word.
> 
> Drain asks other users not to touch the block node. Currently this only

BTW interesting language choice here: you're using the more general
definition of "other users" and "touch[ing] the block node" rather than
the narrow definition of just "I/O requests". That's exactly how I think
of drain but based on what you wrote above I'm not sure that's how you
think of it too?

> includes I/O requests, but the desire to use drain for locking means
> that we need to extend it to pretty much any operation on the node,
> which would include calling drain on that block node. So you should
> never have two callers in a drain section at the same time; it would be
> a bug in this model.

Yes! Drain in its current form doesn't make sense for multi-queue since
multiple threads could be in drained sections at the same time and they
would all be allowed to submit new I/O requests.

> Of course, we know that currently drain is not respected by all relevant
> callers (most importantly, the monitor). If you want to use drain as a
> form of locking, you need to solve this first.
> 
> > It is necessary to adjust how recursive drained sections work:
> > bdrv_drained_begin() must not return while there are deeper nested
> > drained sections.
> > 
> > This is allowed:
> > 
> >      Monitor command            Block job
> >      ---------------            ---------
> >   > bdrv_drained_begin()
> >            .                 > bdrv_drained_begin()
> >            .                 < bdrv_drained_begin()
> >            .                          .
> >            .                          .
> >            .                 > bdrv_drained_end()
> >            .                 < bdrv_drained_end()
> >   < bdrv_drained_begin()
> >            .
> >            .
> >   > bdrv_drained_end()
> >   < bdrv_drained_end()
> 
> Just to make sure I understand the scenario that you have in mind here:
> The monitor command is not what causes the block job to do the draining
> because this is inside bdrv_drained_begin(), so the block job just
> randomly got a callback that caused it to do so (maybe completion)?

Yes, exactly that completion scenario. When the mirror block job
completes just as a monitor command calls bdrv_drained_begin(),
mirror_exit_common() deletes a temporary filter BDS node. That code path
involves drain and modifies the graph.

> 
> Before bdrv_drained_begin() returns, anything is still allowed to run,
> so I agree that this is valid.

Thanks for confirming!

> > This is not allowed:
> > 
> >      Monitor command            Block job
> >      ---------------            ---------
> >   > bdrv_drained_begin()
> >            .                 > bdrv_drained_begin()
> >            .                 < bdrv_drained_begin()
> >            .                          .
> >            .                          .
> >   < bdrv_drained_begin()              .
> >            .                          .
> >            .                 > bdrv_drained_end()
> >            .                 < bdrv_drained_end()
> >   > bdrv_drained_end()
> >   < bdrv_drained_end()
> > 
> > This restriction creates an ordering between bdrv_drained_begin()
> > callers. In this example the block job must not depend on the monitor
> > command completing first. Otherwise there would be a deadlock, just
> > like with two threads waiting for each other while each holds a mutex.
> 
> So essentially drain needs to increase bs->in_flight, so that the outer
> drain has to wait for the inner drain section to end before its
> bdrv_drained_begin() can return.
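
(If I follow, a minimal sketch of that idea, using the existing
bdrv_inc/dec_in_flight() helpers:)

    void bdrv_drained_begin(BlockDriverState *bs)
    {
        /* ...existing quiesce and wait logic... */

        /* Count the drained section itself as in flight, *after* our
         * own wait so we don't wait for ourselves. An outer drain that
         * is still polling now waits for our bdrv_drained_end(). */
        bdrv_inc_in_flight(bs);
    }

    void bdrv_drained_end(BlockDriverState *bs)
    {
        bdrv_dec_in_flight(bs);
        /* ...existing resume logic... */
    }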
> 
> > For sanity I think draining should be GS-only. IO code should not
> > invoke bdrv_drained_begin() to avoid ordering problems when multiple
> > threads drain simultaneously and have dependencies on each other.
> > 
> > block/mirror.c needs to be modified because it currently drains from
> > IO when mirroring is about to end.
> > 
> > With this change to draining I think the logical argument for
> > correctness with graph modifications holds.
> 
> What is your suggestion how to modify mirror? It drains so that no new
> requests can sneak in and source and target stay in sync while it
> switches to the completion code running in the main AioContext. You
> can't just drop this.

I haven't read the code in enough depth to come up with a solution.

> 
> > Enforcing GS/IO separation at compile time
> > ------------------------------------------
> > Right now GS/IO asserts check assumptions at runtime. The next step
> > may be using the type system to separate GS and IO APIs so it's
> > impossible for IO code to accidentally call GS code, for example.
> > 
> >   typedef struct {
> >       BlockDriverState *bs;
> >   } BlockDriverStateIOPtr;
> > 
> >   typedef struct {
> >       BlockDriverState *bs;
> >   } BlockDriverStateGSPtr;
> > 
> > Then APIs can be protected as follows:
> > 
> >   void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...);
> > 
> > A function that only has a BlockDriverStateIOPtr cannot call
> > bdrv_set_aio_context_ignore() by mistake since the compiler will
> > complain that the first argument must be a BlockDriverStateGSPtr.
> 
> And then you have a set of functions that cast from GS to IO pointers,
> but not the other way around? Or maybe getting an IO pointer would even
> be coupled with automatically increasing bs->in_flight?
> 
> The other option that we were looking into for this was static analysis.
> I had hacked up a small script to check consistency of these annotations
> (without covering function pointers, though) to help me with the review
> of Emanuele's patches that introduced them. If I understand correctly,
> Paolo has scripts to do the same a little bit better.
> 
> As long as we can integrate such a script in 'make check', we wouldn't
> necessarily have to have the churn in the C code to switch everything to
> new wrapper types.

Yes, that's the problem with the typedef Ptr approach. It would involve
a lot of code changes. Maybe static analysis tools are better.
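
(For completeness, the one-way conversion could look something like this
sketch; bdrv_gs_to_io() is a hypothetical helper that couples the cast
with bs->in_flight, as suggested above:)

    typedef struct { BlockDriverState *bs; } BlockDriverStateGSPtr;
    typedef struct { BlockDriverState *bs; } BlockDriverStateIOPtr;

    /* GS -> IO is allowed; deliberately no IO -> GS helper exists. */
    static inline BlockDriverStateIOPtr
    bdrv_gs_to_io(BlockDriverStateGSPtr gs)
    {
        bdrv_inc_in_flight(gs.bs);  /* paired with bdrv_dec_in_flight() */
        return (BlockDriverStateIOPtr){ .bs = gs.bs };
    }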

> 
> > Possible steps for AioContext removal
> > -------------------------------------
> > I also wanted to share my assumptions about multi-queue and
> > AioContext removal. Please let me know if anything seems wrong or
> > questionable:
> > 
> > - IO code can execute in any thread that has an AioContext.
> > - Multiple threads may execute IO code at the same time.
> > - GS code only executes under the BQL.
> > 
> > For AioContext removal this means:
> > 
> > - bdrv_get_aio_context() becomes mostly meaningless since there is no
> >   need for a special "home" AioContext.
> > - bdrv_coroutine_enter() becomes mostly meaningless because there is
> >   no need to run a coroutine in the BDS's AioContext.
> > - aio_disable_external(bdrv_get_aio_context(bs)) no longer works
> >   because many threads/AioContexts may submit new I/O requests.
> >   BlockDevOps.drained_begin() may be used instead (e.g. to temporarily
> >   disable ioeventfds on a multi-queue virtio-blk device).
> > - AIO_WAIT_WHILE() simplifies to
> > 
> >     while ((cond)) {
> >         aio_poll(qemu_get_current_aio_context(), true);
> >         ...
> >     }
> 
> Probably not exactly, because you still need aio_wait_kick() to wake up
> the thread. We use AioWait.num_waiters != 0 to decide whether we need to
> schedule a BH in the main thread (because doing so unconditionally for
> every completed request would be very wasteful).
> 
> If you want to be able to run this loop from any thread instead of just
> the main thread, you would have to store somewhere which thread to wake.

qemu_cond_broadcast() can be used since the wakeup is a slow path.
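
(Roughly like this sketch, using the existing qemu-thread primitives;
wait_lock/wait_cond are hypothetical, and this ignores the case where
the waiting thread must also service its own AioContext while waiting:)

    /* Waiter, in any thread: */
    qemu_mutex_lock(&wait_lock);
    while (cond) {
        qemu_cond_wait(&wait_cond, &wait_lock);
    }
    qemu_mutex_unlock(&wait_lock);

    /* Slow-path kick when an in-flight request completes: */
    qemu_mutex_lock(&wait_lock);
    qemu_cond_broadcast(&wait_cond);
    qemu_mutex_unlock(&wait_lock);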

> >   and the distinction between home AioContext and non-home context is
> >   eliminated. AioContext unlocking is dropped.
> > 
> > Does this make sense? I haven't seen these things in recent patch series.
> 
> Intuitively I would agree with most. There may be details that I'm not
> aware of at the moment. I'm not surprised that we haven't seen any such
> things in recent patch series because there is an awful lot of
> preparatory work to be done before we can reach this final state.

Yes. I'm mostly hoping to find out whether my view is fundamentally
flawed or very different from what you and others think.

Stefan
