qemu-devel

Re: [PATCH v7 18/21] multi-process: heartbeat messages to remote


From: Stefan Hajnoczi
Subject: Re: [PATCH v7 18/21] multi-process: heartbeat messages to remote
Date: Thu, 2 Jul 2020 14:16:32 +0100

On Sat, Jun 27, 2020 at 10:09:40AM -0700, elena.ufimtseva@oracle.com wrote:
> From: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> 
> In order to detect remote processes which are hung, the
> proxy periodically sends heartbeat messages to confirm if
> the remote process is alive. The remote process responds
> to this heartbeat message to confirm it is alive.
> 
> Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
> Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> ---
>  hw/i386/remote-msg.c     | 14 ++++++++++
>  hw/pci/proxy.c           | 58 ++++++++++++++++++++++++++++++++++++++++
>  include/hw/pci/proxy.h   |  2 ++
>  include/io/mpqemu-link.h |  1 +
>  io/mpqemu-link.c         |  1 +
>  5 files changed, 76 insertions(+)
> 

This patch seems incomplete since no action is taken when the device
fails to respond. vCPU threads that access the device will still get
stuck.

The simplest way to make this useful is to close the connection when a
timeout occurs. Then the G_IO_HUP handler for the UNIX domain socket
should perform connection cleanup. At that point there are a few
choices:

1. Stop guest execution and wait for the host admin to restore the
   mplink so execution can resume. This is similar to how -drive
   rerror=stop pauses the guest when a disk I/O error is encountered.

2. Stop guest execution, but only once the stale device is actually
   accessed. This maximizes guest uptime. Guests that rarely access the
   device may not notice at all.

3. Return 0 from MemoryRegion read operations and ignore writes. The
   guest continues executing but the device is broken. This is risky
   because device drivers inside the guest may not be ready to deal with
   this. The result could be data loss or corruption.

4. Raise a bus-level event. Maybe PCI error reporting can be used to
   offline the device.

5. Terminate the guest with an error message.

6. ?
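
To make the difference between options 2 and 3 concrete, here is a rough
sketch of what the proxy's access path might do once the link is down.
Everything in it is hypothetical: ProxyState, LinkDownPolicy, proxy_read()
and proxy_write() are illustration names, not existing QEMU code, and the
stop_requested flag merely stands in for actually pausing the guest.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical policy names; none of these exist in QEMU. */
typedef enum {
    LINK_DOWN_STOP_ON_ACCESS,  /* option 2: pause the guest lazily */
    LINK_DOWN_READ_ZERO,       /* option 3: reads return 0, writes dropped */
} LinkDownPolicy;

typedef struct {
    bool link_up;           /* cleared by the connection cleanup path */
    bool stop_requested;    /* stand-in for actually pausing the guest */
    LinkDownPolicy policy;
    uint64_t reg;           /* stand-in for state held by the remote device */
} ProxyState;

/* Read path: only consult the remote side while the link is up. */
static uint64_t proxy_read(ProxyState *s)
{
    if (s->link_up) {
        return s->reg;
    }
    if (s->policy == LINK_DOWN_STOP_ON_ACCESS) {
        s->stop_requested = true;   /* first access triggers the stop */
    }
    return 0;                       /* nothing meaningful to return */
}

/* Write path: forwarded while the link is up, otherwise dropped. */
static void proxy_write(ProxyState *s, uint64_t val)
{
    if (s->link_up) {
        s->reg = val;
        return;
    }
    if (s->policy == LINK_DOWN_STOP_ON_ACCESS) {
        s->stop_requested = true;
    }
    /* With LINK_DOWN_READ_ZERO the write is silently ignored. */
}
```

With the stop-on-access policy a guest that never touches the device
keeps running. With the read-zero policy the guest driver sees zeros and
lost writes, which is where the data loss risk in option 3 comes from.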

Until the heartbeat is fully implemented and tested I suggest dropping
it from this patch series. Remember the G_IO_HUP will happen anyway if
the remote device process terminates.
