From: Eric Blake
Subject: Re: [PATCH v1] coroutine: avoid inserting duplicate coroutine to co_queue_wakeup
Date: Wed, 27 Mar 2024 17:13:03 -0500
User-agent: NeoMutt/20240201
On Mon, Mar 25, 2024 at 11:50:41AM -0400, Stefan Hajnoczi wrote:
> On Mon, Mar 25, 2024 at 05:18:50PM +0800, zhuyangyang wrote:
> > If g_main_loop_run()/aio_poll() is called in coroutine context, the
> > pending coroutine may be woken up repeatedly, and co_queue_wakeup may
> > become disordered.
>
> aio_poll() must not be called from coroutine context:
>
> bool no_coroutine_fn aio_poll(AioContext *ctx, bool blocking);
> ^^^^^^^^^^^^^^^
>
> Coroutines are not supposed to block. Instead, they should yield.
>
> > When the poll() syscall exits in g_main_loop_run()/aio_poll(), it means
> > some of the listened events have completed, so the completion callback
> > function is dispatched.
> >
> > If this callback function needs to invoke aio_co_enter(), it will only
> > wake up the coroutine (because we are already in coroutine context),
> > which may mean the data on the listening event_fd/socket_fd is never
> > read/cleared. When the next poll() exits, the coroutine is woken up and
> > inserted into the wakeup queue again.
> >
> > For example, if TLS is enabled in NBD, the server calls g_main_loop_run()
> > in the coroutine and repeatedly wakes up the io_read event on a socket.
> > The call stack is as follows:
> >
> > aio_co_enter()
> > aio_co_wake()
> > qio_channel_restart_read()
> > aio_dispatch_handler()
> > aio_dispatch_handlers()
> > aio_dispatch()
> > aio_ctx_dispatch()
> > g_main_context_dispatch()
> > g_main_loop_run()
> > nbd_negotiate_handle_starttls()
>
> This code does not look like it was designed to run in coroutine
> context. Two options:
>
> 1. Don't run it in coroutine context (e.g. use a BH instead). This
> avoids blocking the coroutine but calling g_main_loop_run() is still
> ugly, in my opinion.
>
> 2. Get rid of data.loop and use coroutine APIs instead:
>
> while (!data.complete) {
>     qemu_coroutine_yield();
> }
>
> and update nbd_tls_handshake() to call aio_co_wake(data->co) instead
> of g_main_loop_quit(data->loop).
>
> This requires auditing the code to check whether the event loop might
> invoke something that interferes with
> nbd_negotiate_handle_starttls(). Typically this means monitor
> commands or fd activity that could change the state of this
> connection while it is yielded. This is where the real work is but
> hopefully it will not be that hard to figure out.
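For concreteness, here is a rough, untested sketch of what option 2 might
look like; the new 'co' field, dropping data.loop, and the
qemu_coroutine_entered() guard are my assumptions about the shape of the
change, not a finished patch:

    /* nbd/nbd-internal.h: drop GMainLoop *loop and remember the waiting
     * coroutine instead (sketch only). */
    struct NBDTLSHandshakeData {
        bool complete;
        Error *error;
        Coroutine *co;
    };

    /* nbd/common.c */
    void nbd_tls_handshake(QIOTask *task, void *opaque)
    {
        struct NBDTLSHandshakeData *data = opaque;

        qio_task_propagate_error(task, &data->error);
        data->complete = true;
        /* Guard in case the handshake finished before the coroutine ever
         * yielded; waking ourselves would trip aio_co_enter()'s
         * self != co assertion. */
        if (!qemu_coroutine_entered(data->co)) {
            aio_co_wake(data->co);
        }
    }

    /* nbd/server.c, in nbd_negotiate_handle_starttls(), after calling
     * qio_channel_tls_handshake(): */
        data.co = qemu_coroutine_self();
        while (!data.complete) {
            qemu_coroutine_yield();
        }
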
I agree that 1) is ugly. So far, I've added some temporary
assertions to see when qio_channel_tls_handshake is reached; it looks
like nbd/server.c is calling it from within coroutine context, but
nbd/client.c is calling it from the main loop. The qio code handles
either, but now I'm stuck in trying to get client.c into having the
right coroutine context; the TLS handshake is done before the usual
BlockDriverState *bs object is available, so I'm not even sure what
aio context, if any, I should be using. But on my first try, I'm
hitting:
qemu-img: ../util/async.c:707: aio_co_enter: Assertion `self != co' failed.
so I obviously got something wrong.
It may be possible to use block_gen_c to turn nbd_receive_negotiate
and nbd_receive_export_list into co_wrapper functions, if I can audit
that all of their code goes through qio (and is therefore
coroutine-safe), but that work is still ongoing.
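
If that audit works out, the shape would be roughly the following; this is
purely illustrative, and the parameter list and generated-wrapper naming
are assumptions on my part rather than a concrete proposal:

    /* Illustrative only: declare the wrapper in a header handled by
     * scripts/block-coroutine-wrapper.py... */
    int co_wrapper_mixed
    nbd_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds,
                          const char *hostname, QIOChannel **outioc,
                          NBDExportInfo *info, Error **errp);

    /* ...and move the existing body into a coroutine_fn implementation
     * that the generated wrapper polls to completion for non-coroutine
     * callers. */
    int coroutine_fn
    nbd_co_receive_negotiate(QIOChannel *ioc, QCryptoTLSCreds *tlscreds,
                             const char *hostname, QIOChannel **outioc,
                             NBDExportInfo *info, Error **errp);
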
At any rate, qemu-iotest 233 should be good for testing that changes
in this area work; I've also been testing with libnbd (test
interop/interop-qemu-nbd-tls-certs hits qemu's server.c) and nbdkit
(test tests/test-tls-psk.sh hits qemu's client.c).
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org