Re: [PATCH 0/2] keepalive default


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [PATCH 0/2] keepalive default
Date: Fri, 10 Jul 2020 22:25:33 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0

09.07.2020 20:14, Vladimir Sementsov-Ogievskiy wrote:
09.07.2020 18:34, Eric Blake wrote:
On 7/9/20 3:35 AM, Daniel P. Berrangé wrote:
On Wed, Jul 08, 2020 at 10:15:37PM +0300, Vladimir Sementsov-Ogievskiy wrote:
Hi all!

We realized that keepalive is of little use with the default of 2 hours in
/proc/sys/net/ipv4/tcp_keepalive_time. Forcing the user to configure keepalive for the
whole system doesn't seem right; it is better to set it up per socket.
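
(For a sense of scale: with the usual Linux defaults of tcp_keepalive_time=7200,
tcp_keepalive_intvl=75 and tcp_keepalive_probes=9, a dead peer on an otherwise idle
connection is only detected after about 7200 + 75*9 = 7875 seconds, i.e. roughly
2 hours 11 minutes.)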

Adding the ability to explicitly configure the keepalive settings makes
sense for QEMU. Completely ignoring system defaults when no explicit
settings are given though is not valid IMHO.

We already have the ability to add per-socket keepalive (see commit aec21d3175, 
in 4.2).  I guess what you are trying to further do is determine whether the 
default value for that field, when not explicitly specified by the user, can 
have saner semantics (default off for chardev sockets, default on for nbd 
clients where retry was enabled).  But since you already have to explicitly 
opt-in to nbd retry, what's so hard about opting in to keepalive at the same 
time, other than more typing?  Given that the easiest way to do this is via a 
machine-coded generation of the command line or QMP command, it doesn't seem 
that hard to just keep things as they are with explicit opt-in to per-socket 
keepalive.


Hmm. Looking at the code, I remember that reconnect is not optional; it is enabled by
default now. The option we have is "reconnect-delay", which only specifies after how
many seconds we automatically fail the requests instead of retrying them (0 seconds by
default). Still, NBD keeps trying to reconnect in the background anyway.
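For example, with the default reconnect-delay=0 an in-flight request is failed as soon
as the connection drops; with reconnect-delay=300 requests are held for up to five
minutes waiting for the reconnect before being failed.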

In our downstream we currently have an old version of the nbd-reconnect interface, with
a non-zero "delay" enabled by default.



And now we need to migrate to the upstream code. Hmm. So I have several options:

1. Set a downstream default for reconnect-delay to something like 5 min.
[the simplest thing to do]
2. Set the upstream default to 5 min. [needs a discussion, but may be useful]
3. Force all users (I mean other teams, and even another company, not customers)
to set the reconnect-delay option. [just for completeness, I will not go this way]


So, what do you think about [2]? This includes:

 - a non-zero reconnect-delay by default, so that requests wait some time for the
reconnect and are retried
 - keep-alive enabled by default, with some default keep-alive parameters, so that we
do not hang in recvmsg for a long time (see the sketch below)
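
To illustrate what I mean by keep-alive parameters: in plain POSIX terms it would be
roughly the following (just a sketch, not QEMU code; set_keepalive and the 60/10/5
values are only placeholders, not a proposed default):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable keepalive on one socket and tighten its timers (Linux-specific
 * TCP_KEEP* options). */
static int set_keepalive(int fd)
{
    int on = 1;
    int idle = 60;      /* start probing after 60s of idle time */
    int interval = 10;  /* probe every 10s */
    int count = 5;      /* give up after 5 unanswered probes */

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0) {
        return -1;
    }
    return 0;
}

With numbers like these a dead peer is noticed after at most 60 + 10*5 = 110 seconds
instead of ~2 hours.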


Some other related ideas:

 - non-blocking nbd_open
   - move open to the coroutine
   - use a non-blocking socket connect (for example qio_channel_socket_connect_async,
which runs connect() in a thread; see the sketch after this list). Currently, if you
try to blockdev-add an unavailable NBD host, the VM hangs during the connect() call,
which may take about a minute.
 - make blockdev-add an async QMP command (Kevin's series)
 - allow reconnect on open
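
On the non-blocking connect point, the generic pattern is roughly this (a plain POSIX
sketch with a hypothetical connect_nonblocking() helper; in QEMU it would of course go
through the QIO channel layer rather than raw sockets):

#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>

/* Start a connect() without blocking the caller; completion is detected later
 * by polling the socket for writability and checking SO_ERROR. */
static int connect_nonblocking(int fd, const struct sockaddr *addr, socklen_t len)
{
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    if (connect(fd, addr, len) == 0) {
        return 0;                 /* connected immediately */
    }
    if (errno != EINPROGRESS) {
        return -1;                /* real failure */
    }

    /* Return to the main loop; once the fd polls as writable (POLLOUT), the
     * result is available via getsockopt(fd, SOL_SOCKET, SO_ERROR, ...). */
    return 1;
}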

Also, I recently noticed that bdrv_close may hang due to reconnect: it does
bdrv_flush, which waits for NBD to reconnect. This seems inconvenient;
probably we should disable reconnect before bdrv_drained_begin() in
bdrv_close().

--
Best regards,
Vladimir



