From: Jean-Philippe Brucker
Subject: Re: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling for hotplugged devices
Date: Thu, 25 Jan 2024 18:48:22 +0000

Hi,

On Thu, Jan 18, 2024 at 10:43:55AM +0100, Eric Auger wrote:
> Hi Zhenzhong,
> On 1/18/24 08:10, Duan, Zhenzhong wrote:
> > Hi Eric,
> >
> >> -----Original Message-----
> >> From: Eric Auger <eric.auger@redhat.com>
> >> Cc: mst@redhat.com; clg@redhat.com
> >> Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling
> >> for hotplugged devices
> >>
> >> In [1] we attempted to fix a case where a VFIO-PCI device protected
> >> with a virtio-iommu was assigned to an x86 guest. On x86 the physical
> >> IOMMU may have an address width (gaw) of 39 or 48 bits whereas the
> >> virtio-iommu used to expose a 64b address space by default.
> >> Hence the guest was trying to use the full 64b space and we hit
> >> DMA MAP failures. To work around this issue we managed to pass
> >> usable IOVA regions (excluding the out-of-range space) from VFIO
> >> to the virtio-iommu device. This was made feasible by introducing
> >> a new IOMMU Memory Region callback dubbed iommu_set_iova_regions().
> >> The latter gets called when the IOMMU MR is enabled, which
> >> causes vfio_listener_region_add() to be called.
> >>
> >> However with VFIO-PCI hotplug, this technique fails due to the
> >> race between the call to the callback in the add memory listener
> >> and the virtio-iommu probe request. Indeed the probe request gets
> >> issued before the attach to the domain. So in that case the usable
> >> regions are communicated after the probe request and fail to be
> >> conveyed to the guest. To be honest the problem was hinted at by
> >> Jean-Philippe in [1] and I should have been more careful about
> >> listening to him and testing with hotplug :-(
> > It looks like the global virtio_iommu_config.bypass is never cleared in
> > the guest. When the guest virtio_iommu driver enables the IOMMU, should
> > it clear this bypass attribute?
> > If it were cleared in viommu_probe(), then QEMU would call
> > virtio_iommu_set_config() and then virtio_iommu_switch_address_space_all()
> > to enable the IOMMU MR. Then both coldplugged and hotplugged devices
> > would work.
> 
> this field is iommu-wide whereas the probe applies to a single device. In
> general I would prefer not to depend on the MR enablement. We know
> that the device is likely to be protected and we can collect its
> requirements beforehand.
> 
> >
> > The Intel IOMMU has a similar bit, GCMD_REG.TE. When the guest
> > intel_iommu driver sets it at probe time, QEMU calls
> > vtd_address_space_refresh_all() to enable the IOMMU MRs.
> interesting.
> 
> Would be curious to get Jean-Philippe's pov.

I'd rather not rely on this; it's hard to justify a driver change based
only on QEMU internals. And QEMU can't count on the driver always clearing
bypass. There could be situations where the guest can't afford to do it,
like if an endpoint is owned by the firmware and has to keep running.
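
For clarity, the driver change being discussed would amount to something
like the following hunk in viommu_probe() -- a sketch only, assuming the
VIRTIO_IOMMU_F_BYPASS_CONFIG feature is negotiated, and not something I'm
advocating for (see above):

  /* Sketch: clear global bypass once the guest driver takes over
   * translation, instead of leaving it set.
   */
  if (virtio_has_feature(vdev, VIRTIO_IOMMU_F_BYPASS_CONFIG))
          virtio_cwrite8(vdev,
                         offsetof(struct virtio_iommu_config, bypass),
                         0); /* 0 = no global bypass */

On the QEMU side that config write is what would trigger
virtio_iommu_set_config() and the address space switch mentioned above.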

There may be a separate argument for clearing bypass. With a coldplugged
VFIO device the flow is:

1. Map the whole guest address space in VFIO to implement boot-bypass
   (see the sketch after this list). This allocates and pins all guest
   pages, which takes a while and is wasteful.
   I've actually crashed a host that way, when spawning a guest with too
   much RAM.
2. Start the VM
3. When the virtio-iommu driver attaches a (non-identity) domain to the
   assigned endpoint, then unmap the whole address space in VFIO, and most
   pages are given back to the host.
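
For reference, step 1 boils down to one big VFIO map covering all of guest
RAM, which is what forces the pinning. A rough sketch of the type1 uAPI
call; container_fd, guest_ram_hva and guest_ram_size are placeholders for
what QEMU tracks internally:

  #include <linux/vfio.h>
  #include <stdint.h>
  #include <sys/ioctl.h>

  /* Sketch: map (and therefore pin) all of guest RAM up front, which is
   * what boot-bypass requires before the guest configures any domain.
   */
  static int map_all_guest_ram(int container_fd, void *guest_ram_hva,
                               uint64_t guest_ram_size)
  {
          struct vfio_iommu_type1_dma_map map = {
                  .argsz = sizeof(map),
                  .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                  .vaddr = (uintptr_t)guest_ram_hva, /* HVA backing guest RAM */
                  .iova  = 0,                        /* GPA 0 */
                  .size  = guest_ram_size,           /* every page gets pinned */
          };

          return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
  }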

We can't disable boot-bypass because the BIOS needs it. But instead the
flow could be:

1. Start the VM, with only the virtual endpoints. Nothing to pin.
2. The virtio-iommu driver disables bypass during boot
3. Hotplug the VFIO device. With bypass disabled there is no need to pin
   the whole guest address space, unless the guest explicitly asks for an
   identity domain.

However, I don't know if this is a realistic scenario that will actually
be used.

By the way, do you have an easy way to reproduce the issue described here?
I've had to enable iommu.forcedac=1 on the command line; otherwise Linux
just allocates 32-bit IOVAs.
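
(That's a guest kernel parameter; with direct kernel boot it can be
appended like this, where everything except iommu.forcedac=1 is just a
placeholder:)

 -kernel Image -append "root=/dev/vda iommu.forcedac=1"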

> >
> >> For coldplugged devices the technique works because we make sure all
> >> the IOMMU MRs are enabled once, on machine init done: 94df5b2180
> >> ("virtio-iommu: Fix 64kB host page size VFIO device assignment")
> >> for granule freeze. But I would be keen to get rid of this trick.
> >>
> >> Using an IOMMU MR Ops is impractical because it relies on the IOMMU
> >> MR having been enabled and the corresponding vfio_listener_region_add()
> >> having been executed. Instead this series proposes to replace the usage
> >> of this API by the recently introduced PCIIOMMUOps: ba7d12eb8c ("hw/pci:
> >> modify pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback
> >> can be called earlier, once the usable IOVA regions have been collected
> >> by VFIO, without the need for the IOMMU MR to be enabled.
> >>
> >> This looks cleaner. In the short term this may also be used for
> >> passing the page size mask, which would allow getting rid of the
> >> hacky transient IOMMU MR enablement mentioned above.
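
(For readers who haven't looked at that commit: since ba7d12eb8c the
per-bus IOMMU callbacks live in a PCIIOMMUOps structure registered through
pci_setup_iommu(). Adding the new hook would look roughly like the sketch
below; "set_iova_ranges" is only a placeholder for whatever name the series
ends up using.)

  /* Sketch: register the per-bus callbacks in the virtio-iommu realize
   * path. get_address_space is the existing mandatory hook; the second
   * entry is the hypothetical new one that lets VFIO pass usable IOVA
   * ranges before the IOMMU MR is enabled. bus and s are placeholders for
   * the PCI root bus and the VirtIOIOMMU state.
   */
  static const PCIIOMMUOps virtio_iommu_ops = {
      .get_address_space = virtio_iommu_find_add_as,
      .set_iova_ranges   = virtio_iommu_set_iova_ranges, /* placeholder */
  };

  pci_setup_iommu(bus, &virtio_iommu_ops, s);
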
> >>
> >> [1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space
> >>    https://lore.kernel.org/all/20231019134651.842175-1-eric.auger@redhat.com/
> >>
> >> [2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/
> >>
> >> Extra Notes:
> >> With that series, the reserved memory regions are communicated in time
> >> so that the virtio-iommu probe request grabs them. However this is not
> >> sufficient. In some cases (my case), I still see some DMA MAP failures
> >> and the guest keeps on using IOVA ranges outside the geometry of the
> >> physical IOMMU. This is due to the fact that the VFIO-PCI device is in
> >> the same iommu group as the pcie root port. Normally the kernel
> >> iova_reserve_iommu_regions() (dma-iommu.c) is supposed to call
> >> reserve_iova() for each reserved IOVA, which carves them out of the
> >> allocator. When iommu_dma_init_domain() gets called for the hotplugged
> >> vfio-pci device, the iova domain is already allocated and set and we
> >> don't call iova_reserve_iommu_regions() again for the vfio-pci device.
> >> So its corresponding reserved regions are not properly taken into
> >> account.
> > I suspect there is the same issue with coldplugged devices. If those
> > devices are in the same group, iova_reserve_iommu_regions() is only
> > called for the first device, so the other devices' reserved regions are
> > missed.
> 
> Correct
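
For reference, the carve-out that gets skipped for the other devices is the
loop done by iova_reserve_iommu_regions() in drivers/iommu/dma-iommu.c,
roughly like this (a loose sketch, ignoring the special region types; the
function name here is just a placeholder):

  /* Loose sketch of what iova_reserve_iommu_regions() does: walk the
   * device's reserved regions and punch them out of the IOVA allocator.
   * iommu_dma_init_domain() only runs this when the iova domain is first
   * set up, so a second device in the same group never gets here.
   */
  static void reserve_dev_resv_regions(struct device *dev,
                                       struct iova_domain *iovad)
  {
          struct iommu_resv_region *region;
          LIST_HEAD(resv_regions);

          iommu_get_resv_regions(dev, &resv_regions);
          list_for_each_entry(region, &resv_regions, list) {
                  unsigned long lo = iova_pfn(iovad, region->start);
                  unsigned long hi = iova_pfn(iovad, region->start +
                                                     region->length - 1);

                  reserve_iova(iovad, lo, hi);
          }
          iommu_put_resv_regions(dev, &resv_regions);
  }
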
> >
> > Curious how you get the passthrough device and the pcie root port into
> > the same group.
> > When I start an x86 guest with a passthrough device, I see the
> > passthrough device and the pcie root port in different groups.
> >
> > -[0000:00]-+-00.0
> >            +-01.0
> >            +-02.0
> >            +-03.0-[01]----00.0
> >
> > /sys/kernel/iommu_groups/3/devices:
> > 0000:00:03.0
> > /sys/kernel/iommu_groups/7/devices:
> > 0000:01:00.0
> >
> > My qemu cmdline:
> > -device pcie-root-port,id=root0,slot=0
> > -device vfio-pci,host=6f:01.0,id=vfio0,bus=root0
> 
> I just replayed the scenario:
> - if you have a coldplugged vfio-pci device, the pci root port and the
> passed-through device end up in different iommu groups. On my end I use
> ioh3420, but you confirmed that's the same for the generic pcie-root-port.
> - however, if you hotplug the vfio-pci device, that's a different story:
> they end up in the same group. Don't ask me why. I tried with
> both virtio-iommu and intel iommu and I end up with the same topology.
> That looks really weird to me.

It also took me a while to get hotplug to work on x86:
pcie_cap_slot_plug_cb() didn't get called; instead it would call
ich9_pm_device_plug_cb(). Not sure what I'm doing wrong.
To work around that I instantiated a second pxb-pcie root bus and then a
pcie root port on there. So my command-line looks like:

 -device virtio-iommu
 -device pxb-pcie,id=pcie.1,bus_nr=1
 -device pcie-root-port,chassis=2,id=pcie.2,bus=pcie.1

 device_add vfio-pci,host=00:04.0,bus=pcie.2

And somehow pcieport and the assigned device do end up in separate IOMMU
groups.

Thanks,
Jean



