qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v4 1/4] qapi: Add a 'coroutine' flag for commands


From: Kevin Wolf
Subject: Re: [PATCH v4 1/4] qapi: Add a 'coroutine' flag for commands
Date: Wed, 22 Jan 2020 15:35:18 +0100
User-agent: Mutt/1.12.1 (2019-06-15)

Am 22.01.2020 um 13:15 hat Markus Armbruster geschrieben:
> Kevin Wolf <address@hidden> writes:
> 
> > Am 22.01.2020 um 07:32 hat Markus Armbruster geschrieben:
> >> Kevin Wolf <address@hidden> writes:
> >> 
> >> > This patch adds a new 'coroutine' flag to QMP command definitions that
> >> > tells the QMP dispatcher that the command handler is safe to be run in a
> >> > coroutine.
> >> 
> >> I'm afraid I missed this question in my review of v3: when is a handler
> >> *not* safe to be run in a coroutine?
> >
> > That's a hard one to answer fully.
> 
> You're welcome ;)
> 
> > Basically, I think the biggest problem is with calling functions that
> > change their behaviour if run in a coroutine compared to running them
> > outside of coroutine context. In most cases the differences like having
> > a nested event loop instead of yielding are just fine, but they are
> > still subtly different.
> >
> > I know this is vague, but I can assure you that problematic cases exist.
> > I hit one of them with my initial hack that just moved everything into a
> > coroutine. It was related to graph modifications and bdrv_drain and
> > resulted in a hang. For the specifics, I would have to try and reproduce
> > the problem again.
> 
> Interesting.
> 
> Is coroutine-incompatible command handler code necessary or accidental?
> 
> By "necessary" I mean there are (and likely always will be) commands
> that need to do stuff that cannot or should not be done on coroutine
> context.
> 
> "Accidental" is the opposite: coroutine-incompatibility can be regarded
> as a fixable flaw.

I expect it's mostly accidental, but maybe not all of it.

bdrv_drain() is an example of a function that has a problem with running
inside a coroutine: If you schedule another coroutine to run, it will
only run after the currently active coroutine yields. However,
bdrv_drain() doesn't yield, but runs a nested event loop, so this can
lead to deadlocks.

For this reason, bdrv_drain() has code to drop out of coroutine context,
so that you can call it from a coroutine anyway:

    /* Calling bdrv_drain() from a BH ensures the current coroutine yields and
     * other coroutines run if they were queued by aio_co_enter(). */

Now, after reading this, I think the problem in the QMP handler I
observerd was that it called some parts of the drain code without
dropping out of coroutine context first. Well possible that it was the
exact deadlock described by this comment.

(Even in bdrv_drain() it might not be strictly necessary. Maybe it's
possible to redesign bdrv_drain() so that it doesn't have a nested event
loop, but can somehow rely on the normal main loop. Touching it is
scary, though, so I'd rather not unless we have a good reason.)

> How widespread is coroutine-incompatibility?  Common, or just a few
> commands?

Answering this would require a massive code audit. Way out of scope for
this series. This is the reason why I decided to just make it optional
instead of trying to find and fix the cases that would need a fix if we
enabled coroutine handlers unconditionally.

> If coroutine-incompatibility is accidental, then your code to drop out
> of coroutine context can be regarded as a temporary work-around.  Such
> work-arounds receive a modest extra ugliness & complexity budget.
> 
> If coroutine-incompatibility is rare, we'll eventually want "mark the
> known-bad ones with 'coroutine': false" instead of "mark the known-good
> ones with 'coroutine': true".  I'm okay with marking the known-good ones
> initially, and flipping only later.
> 
> Inability to recognize coroutine-incompatibility by inspection worries
> me.  Can you think of ways to identify causes other than testing things
> to death?

I think it is possible to recognise by inspection, it would just be a
massive effort. Maybe a way to start it could be adding something like a
non_coroutine_fn marker to functions that we know don't want to be
called from coroutine context, and propagating it to their callers.

I guess I would start by looking for nested event loops (AIO_WAIT_WHILE
or BDRV_POLL_WHILE), and then you would have to decide for each whether
it could run into problems. I can't promise this will cover everything,
but it would at least cover the case known from bdrv_drain.

Kevin




reply via email to

[Prev in Thread] Current Thread [Next in Thread]