qemu-stable
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-stable] [Qemu-devel] [PATCH] job: Fix nested aio_poll() hangin


From: Fam Zheng
Subject: Re: [Qemu-stable] [Qemu-devel] [PATCH] job: Fix nested aio_poll() hanging in job_txn_apply
Date: Wed, 22 Aug 2018 13:48:33 +0800
User-agent: Mutt/1.10.1 (2018-07-13)

On Tue, 08/21 11:15, Eric Blake wrote:
> On 08/21/2018 01:45 AM, Fam Zheng wrote:
> > All callers have acquired ctx already. Doing that again results in
> > aio_poll() hang. This fixes the problem that a BDRV_POLL_WHILE() in the
> > callback cannot make progress because ctx is recursively locked, for
> > example, when drive-backup finishes.
> > 
> > Cc: address@hidden
> > Reported-by: Gu Nini <address@hidden>
> > Signed-off-by: Fam Zheng <address@hidden>
> > ---
> >   job.c | 18 +++++-------------
> >   1 file changed, 5 insertions(+), 13 deletions(-)
> > 
> > diff --git a/job.c b/job.c
> > index fa671b431a..b07c5d0ffc 100644
> > --- a/job.c
> > +++ b/job.c
> > @@ -136,21 +136,13 @@ static void job_txn_del_job(Job *job)
> >       }
> >   }
> > -static int job_txn_apply(JobTxn *txn, int fn(Job *), bool lock)
> > +static int job_txn_apply(JobTxn *txn, int fn(Job *))
> 
> Is it worth adding a comment here that callers must have already acquired
> the job context?

I think most job accessing functions require it. If any, the comment should go
to the public functions like job_finalize().

> 
> However, I was unable to quickly audit whether all callers really did have
> the lock (it balloons into whether all callers of job_finalize() have the
> lock), so I'm reluctant to give R-b.
> 
> > @@ -857,10 +849,10 @@ static void job_completed_txn_success(Job *job)
> >           assert(other_job->ret == 0);
> >       }
> > -    job_txn_apply(txn, job_transition_to_pending, false);
> > +    job_txn_apply(txn, job_transition_to_pending);
> >       /* If no jobs need manual finalization, automatically do so */
> > -    if (job_txn_apply(txn, job_needs_finalize, false) == 0) {
> > +    if (job_txn_apply(txn, job_needs_finalize) == 0) {
> >           job_do_finalize(job);
> 
> That said, the change makes sense here: since the first direct call to
> job_txn_apply() did not need to lock, why should the second indirect call
> through job_do_finalize() need it?
> 
> Or, is the fix to have job_do_finalize() gain a bool parameter, where
> job_completed_txn_success() would pass false to that parameter, while
> job_finalize() would pass true (again, going back to auditing whether all
> callers of job_finalize() have already acquired the context).

There are two callers of job_finalize():

    address@hidden:~/work/qemu [master]$ git grep -w -A1 '^\s*job_finalize'
    blockdev.c:    job_finalize(&job->job, errp);
    blockdev.c-    aio_context_release(aio_context);
    --
    job-qmp.c:    job_finalize(job, errp);
    job-qmp.c-    aio_context_release(aio_context);
    --
    tests/test-blockjob.c:    job_finalize(&job->job, &error_abort);
    tests/test-blockjob.c-    assert(job->job.status == JOB_STATUS_CONCLUDED);

Ignoring the test, it's very easy to see both callers have acquired the context.

Fam



reply via email to

[Prev in Thread] Current Thread [Next in Thread]