Re: DMA region abruptly removed from PCI device

qemu-devel

From:	Alex Williamson
Subject:	Re: DMA region abruptly removed from PCI device
Date:	Tue, 7 Jul 2020 09:54:53 -0600

On Tue, 7 Jul 2020 10:38:01 +0000
Felipe Franciosi <felipe@nutanix.com> wrote:

> > On Jul 6, 2020, at 3:20 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> > On Mon, 6 Jul 2020 10:55:00 +0000
> > Thanos Makatos <thanos.makatos@nutanix.com> wrote:
> >   
> >> We have an issue when using the VFIO-over-socket libmuser PoC
> >> (https://www.mail-archive.com/qemu-devel@nongnu.org/msg692251.html)
> >> instead of the VFIO kernel module: we notice that DMA regions used by
> >> the emulated device can be abruptly removed while the device is still
> >> using them.
> >> 
> >> The PCI device we've implemented is an NVMe controller using SPDK, so
> >> it polls the submission queues for new requests. We use the latest
> >> SeaBIOS where it tries to boot from the NVMe controller. Several DMA
> >> regions are registered (VFIO_IOMMU_MAP_DMA) and then the admin and a
> >> submission queues are created. From this point SPDK polls both queues.
> >> Then, the DMA region where the submission queue lies is removed
> >> (VFIO_IOMMU_UNMAP_DMA) and then re-added at the same IOVA but at a
> >> different offset. SPDK crashes soon after as it accesses invalid
> >> memory. There is no other event (e.g. some PCI config space or NVMe
> >> register write) happening between unmapping and mapping the DMA
> >> region. My guess is that this behavior is legitimate and that this is
> >> solved in the VFIO kernel module by releasing the DMA region only
> >> after all references to it have been released, which is handled by
> >> vfio_pin/unpin_pages, correct? If this is the case then I suppose we
> >> need to implement the same logic in libmuser, but I just want to make
> >> sure I'm not missing anything as this is a substantial change.
> > 
> > The vfio_{pin,unpin}_pages() interface only comes into play for mdev
> > devices and even then it's an announcement that a given mapping is
> > going away and the vendor driver is required to release references.
> > For normal PCI device assignment, vfio-pci is (aside from a few quirks)
> > device agnostic and the IOMMU container mappings are independent of the
> > device.  We do not have any device specific knowledge to know if DMA
> > pages still have device references.  The user's unmap request is
> > absolute, it cannot fail (aside from invalid usage) and upon return
> > there must be no residual mappings or references of the pages.
> > 
> > If you say there's no config space write, ex. clearing bus master from
> > the command register, then something like turning on a vIOMMU might
> > cause a change in the entire address space accessible by the device.
> > This would cause the identity map of IOVA to GPA to be replaced by a
> > new one, perhaps another identity map if iommu=pt or a more restricted
> > mapping if the vIOMMU is used for isolation.
> > 
> > It sounds like you have an incomplete device model, physical devices
> > have their address space adjusted by an IOMMU independent of, but
> > hopefully in collaboration with a device driver.  If a physical device
> > manages to bridge this transition, do what it does.  Thanks,  
> 
> Hi,
> 
> That's what we are trying to work out. IIUC, the problem we are having
> is that a mapping removal was requested but the device was still
> operational. We can surely figure out how to handle that gracefully,
> but I'm trying to get my head around how real hardware handles that.
> Maybe you can add some colour here. :)
> 
> What happens when a device tries to write to a physical address that
> has no memory behind it? Is it an MCE of sorts?

It depends on the system: the write might be silently dropped (a), it
might generate an IOMMU fault (b), or firmware-first platform error
handling might freak out from either (a) or (b) and decide to trigger a
fatal error.  If mappings are getting removed due to bus master enable
getting cleared, I would expect device-specific behavior: the device
could either stall or drop transactions.
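
To make the bus master case concrete, here is a minimal sketch in C of
a device model that gates DMA on the bus master enable bit and then
either stalls or drops, depending on a configurable policy.  The
register offset and bit value are the standard PCI ones; struct
dev_model, the policy enum and the function names are hypothetical and
not taken from libmuser, SPDK or QEMU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard PCI config space values (same as in Linux's pci_regs.h). */
#define PCI_COMMAND        0x04   /* offset of the 16-bit command register */
#define PCI_COMMAND_MASTER 0x0004 /* bus master enable bit */

/* Hypothetical device-model state, for illustration only. */
enum bme_policy  { BME_DROP, BME_STALL };
enum dma_verdict { DMA_ISSUE, DMA_DROPPED, DMA_STALLED };

struct dev_model {
    uint16_t        command; /* shadow copy of the command register */
    enum bme_policy policy;  /* what to do with DMA while bus master is off */
};

/* Record a 16-bit config space write; only the command register is tracked. */
void dev_cfg_write16(struct dev_model *d, uint16_t offset, uint16_t val)
{
    assert(d != NULL);
    if (offset == PCI_COMMAND)
        d->command = val;
}

/* Decide whether a DMA transaction may be issued right now. */
enum dma_verdict dev_dma_check(const struct dev_model *d)
{
    assert(d != NULL);
    if (d->command & PCI_COMMAND_MASTER)
        return DMA_ISSUE;
    return d->policy == BME_STALL ? DMA_STALLED : DMA_DROPPED;
}
```

A polling loop would call dev_dma_check() before touching a queue and
skip the access unless it returns DMA_ISSUE.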

> I haven't really ever looked at memory hot unplug in detail, but
> after reading some QEMU code this is my understanding:
>
> 1) QEMU makes an ACPI request to the guest OS for mem unplug
> 2) Guest OS acks that memory can be pulled out
> 3) QEMU pulls the memory from the guest
> 
> Before step 3, I'm guessing that QEMU tells all device backends that
> this memory is going away. I suppose that in normal operation, the
> Guest OS will have already stopped using the memory (ie. before step
> 2), so there shouldn't be much to it. But I also suppose a malicious
> guest could go "ah, you want to remove this dimm? sure, let me just
> ask all these devices to start using it first... ok, there you go."
> 
> Is this understanding correct?

Memory hot-unplug is cooperative: the guest OS needs to be able to
vacate the necessary range.  If it can't do that or doesn't want to do
that, it just rejects the operation.  The unplugged memory is removed
from the VM address space, so there's no way it can be malicious.
Devices don't own memory, they just use it.  Drivers within the guest
OS having allocations within the requested memory range, especially if
those allocations are for DMA, would be reason for the guest to reject
the unplug operation.  Drivers within QEMU have no business getting a
vote in this matter: if the guest OS has completed the unplug
operation, the memory must be unmapped.  If the guest OS has overlooked
some inflight DMA target, that's on the guest, and the above error
handling, or lack thereof, comes into play for those transactions.
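
From the device model's side, the above might look like the following
sketch (C, with hypothetical names throughout; this is not the libmuser
or VFIO API).  Unmap drops the region unconditionally, with no
reference counting, and any later access that misses the map is dropped
and counted, which is roughly case (a) with a fault counter standing in
for (b).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_REGIONS 8 /* arbitrary table size for the sketch */

struct dma_region {
    uint64_t iova;  /* start of the IOVA range */
    uint64_t size;  /* length in bytes */
    uint8_t *vaddr; /* where the range is mapped in this process */
    int      valid;
};

struct dma_map {
    struct dma_region regions[MAX_REGIONS];
    unsigned long     faults; /* accesses that hit no mapping */
};

/* Register a region; returns 0 on success, -1 if the table is full. */
int dma_map_add(struct dma_map *m, uint64_t iova, uint64_t size, uint8_t *vaddr)
{
    assert(m != NULL);
    for (int i = 0; i < MAX_REGIONS; i++) {
        if (!m->regions[i].valid) {
            m->regions[i] = (struct dma_region){ iova, size, vaddr, 1 };
            return 0;
        }
    }
    return -1;
}

/* Unmap is absolute: every region fully inside [iova, iova + size) goes
 * away immediately, whether or not the device is still using it.
 * Returns the number of regions removed. */
int dma_map_remove(struct dma_map *m, uint64_t iova, uint64_t size)
{
    int removed = 0;
    assert(m != NULL);
    for (int i = 0; i < MAX_REGIONS; i++) {
        struct dma_region *r = &m->regions[i];
        if (r->valid && r->iova >= iova && r->iova - iova <= size &&
            r->size <= size - (r->iova - iova)) {
            r->valid = 0;
            removed++;
        }
    }
    return removed;
}

/* Translate an IOVA range to a local pointer, or NULL if it is not mapped. */
static uint8_t *dma_translate(struct dma_map *m, uint64_t iova, uint64_t len)
{
    for (int i = 0; i < MAX_REGIONS; i++) {
        struct dma_region *r = &m->regions[i];
        if (r->valid && iova >= r->iova && iova - r->iova <= r->size &&
            len <= r->size - (iova - r->iova))
            return r->vaddr + (iova - r->iova);
    }
    return NULL;
}

/* Device-side write: returns 0 on success, -1 if the write was dropped. */
int dma_write(struct dma_map *m, uint64_t iova, const void *buf, uint64_t len)
{
    uint8_t *dst;
    assert(m != NULL);
    dst = dma_translate(m, iova, len);
    if (dst == NULL) {
        m->faults++; /* target is gone: drop instead of touching stale memory */
        return -1;
    }
    memcpy(dst, buf, (size_t)len);
    return 0;
}
```

The point of translating on every access, rather than caching a pointer
to the queue, is that a remove followed by a re-add at a different
offset is picked up on the next access.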

> I don't think that's the case we're running into, though, but I think
> we need to consider it at this time. What's probably happening here is
> that the guest went from SeaBIOS to the kernel, a PCI reset happened
> and we didn't plumb that message through correctly. While we are at
> it, we should review the memory hot unplug business.

Looking at IOVAs mapped to the device from the device perspective,
clearing bus master will remove all the mappings.  That will happen
when the guest OS or SeaBIOS sizes the PCI BARs, but the description
above said that no config space accesses were occurring.  Enabling the
vIOMMU would also change the entire address space of the device.  In
transitioning from SeaBIOS to guest kernel, why is the device still
active?  The normal expectation here would be that SeaBIOS accesses the
device to load the kernel and initrd into memory, the device is
quiesced, the guest OS boots, enumerating the I/O and IOMMU,
potentially involving multiple address space changes, then device
drivers load, which should make sure the device is performing DMA to
valid targets.  I'll be curious to see what's causing this mysterious
remove and shift operation.  Thanks,

Alex

Current Thread

DMA region abruptly removed from PCI device, Thanos Makatos, 2020/07/06
- Re: DMA region abruptly removed from PCI device, Alex Williamson, 2020/07/06
  - Re: DMA region abruptly removed from PCI device, Felipe Franciosi, 2020/07/07
    - Re: DMA region abruptly removed from PCI device, Alex Williamson <=
