From: Stefan Hajnoczi
Subject: Re: [RFC PATCH v2 0/8] Removal of AioContext lock, bs->parents and ->children: new rwlock
Date: Sun, 22 May 2022 16:06:34 +0100

On Wed, May 18, 2022 at 06:14:17PM +0200, Kevin Wolf wrote:
> On 18.05.2022 at 14:43, Paolo Bonzini wrote:
> > On 5/18/22 14:28, Emanuele Giuseppe Esposito wrote:
> > > For example, all callers of bdrv_open() always take the AioContext lock.
> > > Often it is taken very high in the call stack, but it's always taken.
> > 
> > I think it's actually not a problem of who takes the AioContext lock or
> > where; the requirements are contradictory:
> > 
> > * IO_OR_GS_CODE() functions, when called from coroutine context, expect to
> > be called with the AioContext lock taken (example: bdrv_co_yield_to_drain)
> > 
> > * to call these functions with the lock taken, the code has to run in the
> > BDS's home iothread.  Attempts to do otherwise result in deadlocks (the
> > main loop's AIO_WAIT_WHILEs expect progress from the iothread, which cannot
> > happen without releasing the AioContext lock)
> > 
> > * running the code in the BDS's home iothread is not possible for
> > GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main
> > thread, but that cannot be guaranteed in general)
> > 
> > > We might suppose that many callbacks are called under drain and in
> > > GLOBAL_STATE, which should be enough, but from our experimentation in
> > > the previous series we saw that currently not everything is under drain,
> > > leaving some operations unprotected (remember assert_graph_writable
> > > temporarily disabled, since drain coverage for bdrv_replace_child_noperm
> > > was not 100%?).
> > > Therefore we need to add more drains. But isn't drain what we decided to
> > > drop at the beginning? Why isn't drain good?
> > 
> > To sum up the patch ordering deadlock that we have right now:
> > 
> > * in some cases, graph manipulations are protected by the AioContext lock
> > 
> > * eliminating the AioContext lock is needed to move callbacks to coroutine
> > contexts (see above for the deadlock scenario)
> > 
> > * moving callbacks to coroutine context is needed by the graph rwlock
> > implementation
> > 
> > On one hand, we cannot protect the graph across manipulations with a graph
> > rwlock without removing the AioContext lock; on the other hand, the
> > AioContext lock is what _right now_ protects the graph.
> > 
> > So I'd rather go back to Emanuele's draining approach.  It may not be
> > beautiful, but it allows progress.  Once that is in place, we can remove the
> > AioContext lock (which mostly protects virtio-blk/virtio-scsi code right
> > now) and reevaluate our next steps.
> 
> If we want to use drain for locking, we need to make sure that drain
> actually does the job correctly. I see two major problems with it:
> 
> The first one is that drain only covers I/O paths, but we need to
> protect against _anything_ touching block nodes. This might mean a
> massive audit and making sure that everything in QEMU that could
> possibly touch a block node is integrated with drain.
> 
> I think Emanuele has argued before that because writes to the graph only
> happen in the main thread and we believe that currently only I/O
> requests are processed in iothreads, this is safe and we don't actually
> need to audit everything.
> 
> This is true as long as the assumption holds (how do we ensure that
> nobody ever introduces non-I/O code touching a block node in an
> iothread?) and as long as the graph writer never yields or polls. I
> think the latter condition is violated today; a random example is that
> adjusting drain counts in bdrv_replace_child_noperm() does poll. Without
> cooperation from all relevant places, the effectively locked code
> section ends right there, even if the drained section continues. Even if
> we can fix this, verifying that the conditions are met everywhere seems
> far from trivial.
> 
> And that's exactly my second major concern: Even if we manage to
> correctly implement things with drain, I don't see a way to meaningfully
> review it. I just don't know how to verify with some confidence that
> it's actually correct and covering everything that needs to be covered.
> 
> Drain is not really a lock. But if you use it as one, the best it can
> provide is advisory locking (callers, inside and outside the block
> layer, need to explicitly support drain instead of having the lock
> applied somewhere in the block layer functions). And even if all
> relevant pieces actually make use of it, it still has an awkward
> interface for locking:
> 
> /* Similar to rdlock(), but doesn't wait for writers to finish. It is
>  * the caller's responsibility to make sure that there are no writers. */
> bdrv_inc_in_flight()
> 
> /* Similar to wrlock(). Waits for readers to finish. New readers are not
>  * prevented from starting after it returns. Third parties are politely
>  * asked not to touch the block node while it is drained. */
> bdrv_drained_begin()
> 
> (I think the unlock counterparts actually behave as expected from a real
> lock.)
> 
> Having an actual rdlock() (that waits for writers), and using static
> analysis to verify that all relevant places use it (so that wrlock()
> doesn't rely on politely asking third parties to leave the node alone)
> gave me some confidence that we could verify the result.
> 
> I'm not sure at all how to achieve the same with the drain interface. In
> theory, it's possible. But it complicates the conditions so much that
> I'm pretty much sure that the review wouldn't only be very time
> consuming, but I would make mistakes during the review, rendering it
> useless.
> 
> Maybe throwing some more static analysis on the code can help, not sure.
> It's going to be a bit more complex than with the other approach,
> though.

Hi,
Sorry for the long email. I've included three topics that may help us discuss
draining and AioContext removal further.

The shortcomings of drain
-------------------------
I wanted to explore the logical argument that making graph modifications within
a drained section is correct:
- Graph modifications and BB/BDS lookup are Global State (GS).
- Graph traversal from a BB/BDS pointer is IO.
- Only one thread executes GS code at a time.
- IO is quiesced within a drained section.
- Therefore a drained section in GS code suspends graph traversal, other graph
  modifications, and BB/BDS lookup.
- Therefore it is safe to modify the graph from a GS drained section.
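
To make the pattern concrete, a graph modification under this argument would
look roughly like the following in GS code (a minimal sketch; the actual graph
operation is just a placeholder):

  /* GS code, BQL held */
  bdrv_drained_begin(bs);   /* quiesce IO; graph traversal is suspended */
  ... modify the graph, e.g. replace a child or change a backing file ...
  bdrv_drained_end(bs);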

However, I hit on a problem that I think Emanuele and Paolo have already
pointed out: draining is GS & IO. This might have worked under the 1 IOThread
model but it does not make sense for multi-queue. It is possible to submit I/O
requests in drained sections. How can multiple threads be in drained sections
simultaneously and possibly submit further I/O requests in their drained
sections? Those sections wouldn't be "drained" in any useful sense of the word.

It is necessary to adjust how recursive drained sections work:
bdrv_drained_begin() must not return while there are deeper nested drained
sections.

This is allowed:

     Monitor command            Block job
     ---------------            ---------
  > bdrv_drained_begin()
           .                 > bdrv_drained_begin()
           .                 < bdrv_drained_begin()
           .                          .
           .                          .
           .                 > bdrv_drained_end()
           .                 < bdrv_drained_end()
  < bdrv_drained_begin()
           .
           .
  > bdrv_drained_end()
  < bdrv_drained_end()

This is not allowed:

     Monitor command            Block job
     ---------------            ---------
  > bdrv_drained_begin()
           .                 > bdrv_drained_begin()
           .                 < bdrv_drained_begin()
           .                          .
           .                          .
  < bdrv_drained_begin()              .
           .                          .
           .                 > bdrv_drained_end()
           .                 < bdrv_drained_end()
  > bdrv_drained_end()
  < bdrv_drained_end()

This restriction creates an ordering between bdrv_drained_begin() callers. In
this example the block job must not depend on the monitor command completing
first. Otherwise there would be a deadlock, just like two threads waiting for
each other while each holds a mutex.
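
In pseudo-code, the adjusted semantics could look roughly like this (purely a
conceptual sketch; the two helpers are hypothetical names, not existing QEMU
code):

  void bdrv_drained_begin(BlockDriverState *bs)
  {
      quiesce_io(bs);  /* hypothetical: stop new requests, wait for in-flight ones */

      /* Hypothetical check: do not return while other callers (e.g. a block
       * job) still hold deeper nested drained sections on this node. */
      while (other_drained_sections_active(bs)) {
          aio_poll(qemu_get_current_aio_context(), true);
      }
  }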

For sanity I think draining should be GS-only. IO code should not invoke
bdrv_drained_begin() to avoid ordering problems when multiple threads drain
simultaneously and have dependencies on each other.

block/mirror.c needs to be modified because it currently drains from IO when
mirroring is about to end.

With this change to draining I think the logical argument for correctness with
graph modifications holds.

Enforcing GS/IO separation at compile time
------------------------------------------
Right now GS/IO asserts check assumptions at runtime. The next step may be
using the type system to separate GS and IO APIs so it's impossible for IO code
to accidentally call GS code, for example.

  typedef struct {
      BlockDriverState *bs;
  } BlockDriverStateIOPtr;

  typedef struct {
      BlockDriverState *bs;
  } BlockDriverStateGSPtr;

Then APIs can be protected as follows:

  void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...);

A function that only has a BlockDriverStateIOPtr cannot call
bdrv_set_aio_context_ignore() by mistake since the compiler will complain that
the first argument must be a BlockDriverStateGSPtr.
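
One way to produce and pass around these wrappers might be explicit conversion
helpers (hypothetical names, just to sketch the idea; the existing runtime
assertion macros could back up the static types):

  static inline BlockDriverStateGSPtr bdrv_gs_ptr(BlockDriverState *bs)
  {
      GLOBAL_STATE_CODE();  /* runtime assertion backing the static type */
      return (BlockDriverStateGSPtr){ .bs = bs };
  }

  /* GS code may hand an IO-typed pointer down to IO code, never the reverse. */
  static inline BlockDriverStateIOPtr bdrv_io_ptr(BlockDriverStateGSPtr gs)
  {
      return (BlockDriverStateIOPtr){ .bs = gs.bs };
  }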

Possible steps for AioContext removal
-------------------------------------
I also wanted to share my assumptions about multi-queue and AioContext removal.
Please let me know if anything seems wrong or questionable:

- IO code can execute in any thread that has an AioContext.
- Multiple threads may execute IO code at the same time.
- GS code only executes under the BQL.

For AioContext removal this means:

- bdrv_get_aio_context() becomes mostly meaningless since there is no need for
  a special "home" AioContext.
- bdrv_coroutine_enter() becomes mostly meaningless because there is no need to
  run a coroutine in the BDS's AioContext.
- aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many
  threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin()
  may be used instead, e.g. to temporarily disable ioeventfds on a multi-queue
  virtio-blk device (sketched after this list).
- AIO_WAIT_WHILE() simplifies to

    while ((cond)) {
        aio_poll(qemu_get_current_aio_context(), true);
        ...
    }

  and the distinction between home AioContext and non-home context is
  eliminated. AioContext unlocking is dropped.
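
Here is a rough sketch of the BlockDevOps.drained_begin() idea for virtio-blk.
blk_set_dev_ops() and the drained_begin/drained_end callbacks exist today; the
ioeventfd start/stop helpers are hypothetical stand-ins for whatever virtio-blk
would actually use:

  static void virtio_blk_drained_begin(void *opaque)
  {
      VirtIOBlock *s = opaque;
      /* Stop ioeventfd handlers on all queues so that no AioContext submits
       * new requests while the node is drained (hypothetical helper). */
      virtio_blk_stop_ioeventfd(s);
  }

  static void virtio_blk_drained_end(void *opaque)
  {
      VirtIOBlock *s = opaque;
      virtio_blk_start_ioeventfd(s);  /* hypothetical counterpart */
  }

  static const BlockDevOps virtio_blk_dev_ops = {
      .drained_begin = virtio_blk_drained_begin,
      .drained_end   = virtio_blk_drained_end,
  };

  /* Registration at device realize time would look something like:
   *     blk_set_dev_ops(blk, &virtio_blk_dev_ops, s);
   * where blk is the device's BlockBackend. */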

Does this make sense? I haven't seen these things in recent patch series.

Stefan
