This just happened overnight:
Oct 19 05:49:59 host bash[4647]: qemu-system-x86_64: vfio_err_notifier_handler(0000:03:00.1) Unrecoverable error detected. Please collect any data possible and then kill the guest
Oct 19 05:50:00 host bash[4647]: qemu-system-x86_64: vfio_err_notifier_handler(0000:03:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest
which ended up stopping the guest. Some quick googling yields a few threads that look related:
However, there doesn't seem to be any actual solution to prevent the error in the future. It looks as if "someone's working on it", but it's not ready yet.
I also noticed this in dmesg (0000:00:02.0 is the Root Port that the device at 03:00.0 sits behind):
[208697.190826] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
[208697.190832] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
[208697.190834] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
[208697.190835] pcieport 0000:00:02.0: [14] Completion Timeout (First)
[208697.190837] pcieport 0000:00:02.0: broadcast error_detected message
[208697.190840] pcieport 0000:00:02.0: broadcast mmio_enabled message
[208697.190841] pcieport 0000:00:02.0: broadcast resume message
[208697.190843] pcieport 0000:00:02.0: AER: Device recovery successful
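As a sanity check on the dmesg output above, here is a minimal Python sketch of how the status value and the id field decode. It assumes the standard PCIe AER Uncorrectable Error Status bit layout and the usual bus/device/function packing of a Requester ID. The helper names are hypothetical and only for illustration; this is not taken from the kernel or QEMU source.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register
# (assumed from the standard AER layout; only the common bits are listed).
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_uncorrectable_status(status):
    """Return the names of all known error bits set in `status` (hypothetical helper)."""
    return [
        name
        for bit, name in sorted(UNCORRECTABLE_BITS.items())
        if status & (1 << bit)
    ]


def decode_requester_id(rid):
    """Split a 16-bit Requester ID into bus:device.function (hypothetical helper)."""
    bus = (rid >> 8) & 0xFF   # bits 15:8
    dev = (rid >> 3) & 0x1F   # bits 7:3
    fn = rid & 0x7            # bits 2:0
    return f"{bus:02x}:{dev:02x}.{fn:x}"


# Values copied from the dmesg lines above.
print(decode_uncorrectable_status(0x00004000))  # ['Completion Timeout']
print(decode_requester_id(0x0010))              # 00:02.0
```

So status 00004000 is just bit 14 (Completion Timeout), matching the "[14]" line, and id=0010 decodes to 00:02.0, i.e. the error was reported against the Root Port itself rather than against 03:00.0 or 03:00.1.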
Does anyone know the status of this hang/crash and what can be done about it in the short term?
Thanks,
Chuck