Re: [Qemu-devel] [PATCH v6 6/7] block/nbd-client: nbd reconnect
From: Kevin Wolf
Subject: Re: [Qemu-devel] [PATCH v6 6/7] block/nbd-client: nbd reconnect
Date: Fri, 7 Jun 2019 16:54:22 +0200
User-agent: Mutt/1.11.3 (2019-02-01)
On 07.06.2019 at 16:27, Vladimir Sementsov-Ogievskiy wrote:
> 07.06.2019 16:26, Kevin Wolf wrote:
> > On 07.06.2019 at 14:02, Vladimir Sementsov-Ogievskiy wrote:
> >> 07.06.2019 11:06, Kevin Wolf wrote:
> >>> On 07.06.2019 at 05:17, Eric Blake wrote:
> >>>> On 4/11/19 12:27 PM, Vladimir Sementsov-Ogievskiy wrote:
> >>>>> +static coroutine_fn void nbd_reconnect_loop(NBDConnection *con)
> >>>>> +{
> >>>>> +    NBDClientSession *s = nbd_get_client_session(con->bs);
> >>>>> +    uint64_t start_time_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> >>>>> +    uint64_t delay_ns = s->reconnect_delay * 1000000000UL;
> >>>>
> >>>> Do we have a #define constant for nanoseconds in a second to make this
> >>>> more legible than counting '0's?
> >>>>
> >>>>> +    uint64_t timeout = 1000000000UL; /* 1 sec */
> >>>>> +    uint64_t max_timeout = 16000000000UL; /* 16 sec */
> >>>>
> >>>> 1 * constant, 16 * constant
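As a point of reference, QEMU's "qemu/timer.h" provides NANOSECONDS_PER_SECOND,
so the initializers could plausibly become something like this (just a sketch
with unchanged semantics):

    uint64_t delay_ns = s->reconnect_delay * NANOSECONDS_PER_SECOND;
    uint64_t timeout = NANOSECONDS_PER_SECOND;          /* 1 sec */
    uint64_t max_timeout = 16 * NANOSECONDS_PER_SECOND; /* 16 sec */
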
> >>>>
> >>>>> +
> >>>>> +    nbd_reconnect_attempt(con);
> >>>>> +
> >>>>> +    while (nbd_client_connecting(s)) {
> >>>>> +        if (s->state == NBD_CLIENT_CONNECTING_WAIT &&
> >>>>> +            qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start_time_ns > delay_ns)
> >>>>> +        {
> >>>>> +            s->state = NBD_CLIENT_CONNECTING_NOWAIT;
> >>>>> +            qemu_co_queue_restart_all(&s->free_sema);
> >>>>> +        }
> >>>>> +
> >>>>> +        bdrv_dec_in_flight(con->bs);
> >>>>> +        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, timeout);
> >>>>
> >>>> Another place where I'd like someone more familiar with coroutines to
> >>>> also have a look.
> >>>
> >>> What's the exact question you'd like me to answer?
> >>>
> >>> But anyway, bdrv_dec/inc_in_flight() around the sleep looks wrong to me.
> >>> Either the operation must be waited for in drain, then you can't
> >>> decrease the counter even during the sleep. Or drain doesn't have to
> >>> consider it, then why is the counter even increased in the first place?
> >>>
> >>> The way it currently is, drain can return assuming that there are no
> >>> requests, but after the timeout the request automatically comes back
> >>> while the drain caller assumes that there is no more activity. This
> >>> doesn't look right.
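Spelled out against the quoted loop, the questionable window would be
(annotation only, no new names assumed):

    bdrv_dec_in_flight(con->bs);      /* in_flight can reach 0 here, so a
                                       * concurrent bdrv_drained_begin() is
                                       * free to return to its caller */
    qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, timeout);
    bdrv_inc_in_flight(con->bs);      /* the "request" silently comes back
                                       * while the caller still believes the
                                       * node is quiescent */
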
> >>
> >> Hmm.
> >>
> >> You added this inc/dec around the whole lifetime of the connection coroutine in
> >>
> >> commit 5ad81b4946baf948c65cf7e1436d9b74803c1280
> >> Author: Kevin Wolf <address@hidden>
> >> Date: Fri Feb 15 16:53:51 2019 +0100
> >>
> >> nbd: Restrict connection_co reentrance
> >>
> >>
> >> And now I try to connect in an endless loop, with qemu_co_sleep_ns() inside.
> >> I need to give bdrv_drain a chance to complete, so I sometimes have to
> >> drop this in_flight request. But I understand your point.
> >
> > Ah, right, I see. I think it's fine to add a second point where we
> > decrease the counter (though I would add a comment telling why we do
> > this) as long as the right conditions are met.
> >
> > The right conditions are probably something like: Once drained, we
> > guarantee that we don't call any callbacks until the drained section
> > ends. In nbd_read_eof() this is true because we can't get an answer if
> > we had no request running.
> >
> > Or actually... This assumes a nice compliant server that doesn't just
> > arbitrarily send unexpected messages. Is the existing code broken if we
> > connect to a malicious server?
> >
> >> Will the following work better?
> >>
> >>     bdrv_dec_in_flight(con->bs);
> >>     qemu_co_sleep_ns(...);
> >>     while (s->drained) {
> >>         s->wait_drained_end = true;
> >>         qemu_coroutine_yield();
> >>     }
> >>     bdrv_inc_in_flight(con->bs);
> >>
> >>     ...
> >>     .drained_begin() {
> >>         s->drained = true;
> >>     }
> >>
> >>     .drained_end() {
> >>         if (s->wait_drained_end) {
> >>             s->wait_drained_end = false;
> >>             aio_co_wake(s->connection_co);
> >>         }
> >>     }
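A minimal sketch of how those callbacks might be wired up, assuming the usual
.bdrv_co_drain_begin/.bdrv_co_drain_end BlockDriver hooks; the function names
and the s->drained/s->wait_drained_end fields are taken from the snippet above
and are illustrative, not from the posted patch:

static void coroutine_fn nbd_client_co_drain_begin(BlockDriverState *bs)
{
    NBDClientSession *s = nbd_get_client_session(bs);

    s->drained = true;
}

static void coroutine_fn nbd_client_co_drain_end(BlockDriverState *bs)
{
    NBDClientSession *s = nbd_get_client_session(bs);

    s->drained = false;                 /* lets the while (s->drained) loop exit */
    if (s->wait_drained_end) {
        s->wait_drained_end = false;
        aio_co_wake(s->connection_co);  /* resume the reconnect coroutine */
    }
}

    /* in the BlockDriver definition(s) in block/nbd.c */
    .bdrv_co_drain_begin = nbd_client_co_drain_begin,
    .bdrv_co_drain_end   = nbd_client_co_drain_end,
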
> >
> > This should make sure that we don't call any callbacks before the drain
> > section has ended.
> >
> > We probably still have a problem because nbd_client_attach_aio_context()
> > reenters the coroutine with qemu_aio_coroutine_enter(), which will cause
> > an assertion failure if co->scheduled is set. So this needs to use a
> > version that can cancel a sleep.
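One hypothetical shape for such a cancellable sleep (every name below is
illustrative; this is not an existing helper, just the timer-plus-yield
pattern that would let the attach path wake the coroutine instead of
reentering one whose co->scheduled is still set):

typedef struct NBDSleep {
    Coroutine *co;
    QEMUTimer *timer;
} NBDSleep;

static void nbd_sleep_timer_cb(void *opaque)
{
    NBDSleep *sleep = opaque;

    aio_co_wake(sleep->co);
}

static void coroutine_fn nbd_co_sleep(NBDSleep *sleep, AioContext *ctx,
                                      uint64_t ns)
{
    sleep->co = qemu_coroutine_self();
    sleep->timer = aio_timer_new(ctx, QEMU_CLOCK_REALTIME, SCALE_NS,
                                 nbd_sleep_timer_cb, sleep);
    timer_mod(sleep->timer, qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + ns);
    qemu_coroutine_yield();
    timer_del(sleep->timer);
    timer_free(sleep->timer);
    sleep->timer = NULL;
}

/* Cancel the sleep early, e.g. when the AioContext changes. */
static void nbd_co_sleep_wake(NBDSleep *sleep)
{
    if (sleep->timer) {
        timer_del(sleep->timer);
        aio_co_wake(sleep->co);
    }
}
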
> >
> > I see that your patch currently simply ignores attaching a new context,
> > but then the coroutine stays in the old AioContext. Did I miss some
> > additional code that moves it to the new context somehow or will it just
> > stay where it was if the coroutine happens to be reconnecting when the
> > AioContext was supposed to change?
>
>
> Hmm. Do you mean "in latter case we do nothing" in
>
> void nbd_client_attach_aio_context(BlockDriverState *bs,
>                                    AioContext *new_context)
> {
>     NBDClientSession *client = nbd_get_client_session(bs);
>
>     /*
>      * client->connection_co is either yielded from nbd_receive_reply or from
>      * nbd_reconnect_loop(), in latter case we do nothing
>      */
>     if (client->state == NBD_CLIENT_CONNECTED) {
>         qio_channel_attach_aio_context(QIO_CHANNEL(client->ioc),
>                                        new_context);
>
>         bdrv_inc_in_flight(bs);
>
>         /*
>          * Need to wait here for the BH to run because the BH must run while
>          * the node is still drained.
>          */
>         aio_wait_bh_oneshot(new_context, nbd_client_attach_aio_context_bh,
>                             bs);
>     }
> }
>
> ?
> Not sure why I ignored this case. So, I should run the if() body
> unconditionally here and support interrupting the timer-sleeping coroutine
> in nbd_client_attach_aio_context_bh, yes?
Yes, I think so. We now have two places where the coroutine could be
yielded: the old place that simply yielded in a loop and can be
reentered without a problem, and the new one in a sleep. Both need to be
supported when we switch the coroutine to a new AioContext.
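A rough sketch of how the BH could handle both, reusing the hypothetical wake
helper sketched above; the client->sleep field is equally hypothetical, only
qemu_aio_coroutine_enter() and bdrv_dec_in_flight() follow the existing
nbd_client_attach_aio_context_bh():

static void nbd_client_attach_aio_context_bh(void *opaque)
{
    BlockDriverState *bs = opaque;
    NBDClientSession *client = nbd_get_client_session(bs);

    if (client->sleep) {
        /* connection_co is parked in the reconnect sleep: cancel the timer
         * and wake the coroutine rather than reentering one that is still
         * scheduled. */
        nbd_co_sleep_wake(client->sleep);
    } else {
        /* Old case: yielded in nbd_receive_reply(), safe to reenter. */
        qemu_aio_coroutine_enter(bs->aio_context, client->connection_co);
    }
    bdrv_dec_in_flight(bs);
}
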
Kevin