[Qemu-discuss] vfio_err_notifier_handler: Unrecoverable error detected (guest halted)

This just happened overnight:

Oct 19 05:49:59 host bash[4647]: qemu-system-x86_64: vfio_err_notifier_handler(0000:03:00.1) Unrecoverable error detected. Please collect any data possible and then kill the guest

Oct 19 05:50:00 host bash[4647]: qemu-system-x86_64: vfio_err_notifier_handler(0000:03:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest

which ended up stopping the guest. Some quick googling yields a few threads that look related:

https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg04868.html

https://lists.nongnu.org/archive/html/qemu-devel/2016-07/msg04103.html

However, there doesn't seem to be any actual solution to prevent the error in the future. It looks as if "someone's working on it", but it's not ready yet.

I also noticed this in dmesg (0000:00:02.0 is the Root Port that the device at 0000:03:00.0 sits behind):

[208697.190826] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010

[208697.190832] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)

[208697.190834] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000

[208697.190835] pcieport 0000:00:02.0: [14] Completion Timeout (First)

[208697.190837] pcieport 0000:00:02.0: broadcast error_detected message

[208697.190840] pcieport 0000:00:02.0: broadcast mmio_enabled message

[208697.190841] pcieport 0000:00:02.0: broadcast resume message

[208697.190843] pcieport 0000:00:02.0: AER: Device recovery successful
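
For reference, the status word and the id in those lines can be decoded by hand. Below is a minimal Python sketch, assuming the standard bit layout of the PCIe AER Uncorrectable Error Status register and the usual bus/device/function packing of a Requester ID. The function names are my own for illustration, not taken from any existing tool.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register
# (assumed from the PCIe spec; only the commonly defined bits are listed).
UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_uncor_status(status, mask=0):
    """Return the names of the error bits set in status and not masked."""
    active = status & ~mask
    return [
        UNCOR_BITS.get(bit, f"bit {bit}")
        for bit in range(32)
        if (active >> bit) & 1
    ]


def decode_requester_id(rid):
    """Split a 16-bit Requester ID into bus:device.function."""
    bus = (rid >> 8) & 0xFF
    dev = (rid >> 3) & 0x1F
    fn = rid & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"


if __name__ == "__main__":
    # Values copied from the dmesg lines above.
    print(decode_uncor_status(0x00004000, 0x00000000))
    print(decode_requester_id(0x0010))
```

With the values above, this gives Completion Timeout (bit 14) and 00:02.0, which matches what the kernel printed: the Root Port itself is the one reporting the timeout.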

Does anyone know the status of this hang/crash and what can be done about it in the short term?

Thanks,

Chuck

From:	Charles Mason
Subject:	[Qemu-discuss] vfio_err_notifier_handler: Unrecoverable error detected (guest halted)
Date:	Wed, 19 Oct 2016 12:37:41 +0700