From: Tian, Kevin
Subject: Re: [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface
Date: Thu, 12 Sep 2019 23:00:03 +0000

> From: Alex Williamson [mailto:address@hidden]
> Sent: Thursday, September 12, 2019 10:41 PM
> 
> On Tue, 3 Sep 2019 06:57:27 +0000
> "Tian, Kevin" <address@hidden> wrote:
> 
> > > From: Alex Williamson [mailto:address@hidden]
> > > Sent: Saturday, August 31, 2019 12:33 AM
> > >
> > > On Fri, 30 Aug 2019 08:06:32 +0000
> > > "Tian, Kevin" <address@hidden> wrote:
> > >
> > > > > From: Tian, Kevin
> > > > > Sent: Friday, August 30, 2019 3:26 PM
> > > > >
> > > > [...]
> > > > > > How does QEMU handle the fact that IOVAs are potentially dynamic
> > > > > > while performing the live portion of a migration?  For example,
> > > > > > each time a guest driver calls dma_map_page() or dma_unmap_page(),
> > > > > > a MemoryRegionSection pops in or out of the AddressSpace for the
> > > > > > device (I'm assuming a vIOMMU where the device AddressSpace is not
> > > > > > system_memory).  I don't see any QEMU code that intercepts that
> > > > > > change in the AddressSpace such that the IOVA dirty pfns could be
> > > > > > recorded and translated to GFNs.  The vendor driver can't track
> > > > > > these beyond getting an unmap notification since it only knows the
> > > > > > IOVA pfns, which can be re-used with different GFN backing.  Once
> > > > > > the DMA mapping is torn down, it seems those dirty pfns are lost
> > > > > > in the ether.  If this works in QEMU, please help me find the code
> > > > > > that handles it.
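
For context, the guest driver calls referred to above are the ordinary
Linux DMA API.  A minimal, illustrative sketch (not taken from any real
driver) of a guest driver creating and tearing down such a mapping; with
a vIOMMU present, each call below turns into a vIOMMU map/unmap that
QEMU can observe, and the returned IOVA may later be re-used for a
different guest page:

    #include <linux/dma-mapping.h>

    static dma_addr_t demo_map_rx_page(struct device *dev, struct page *page)
    {
        /* Creates an IOVA->GFN mapping in the (v)IOMMU for this page. */
        return dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
    }

    static void demo_unmap_rx_page(struct device *dev, dma_addr_t iova)
    {
        /* Tears the mapping down; the same IOVA may now back another GFN. */
        dma_unmap_page(dev, iova, PAGE_SIZE, DMA_FROM_DEVICE);
    }
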
> > > > >
> > > > > I'm curious about this part too. Interestingly, I didn't find any
> > > > > log_sync callback registered by emulated devices in Qemu. It looks
> > > > > like dirty pages from emulated DMAs are recorded in some implicit
> > > > > way. But KVM always reports dirty pages in GFN instead of IOVA,
> > > > > regardless of the presence of a vIOMMU. If Qemu also tracks dirty
> > > > > pages in GFN for emulated DMAs (translation can be done when the
> > > > > DMA happens), then we don't need to worry about the transient
> > > > > mapping from IOVA to GFN. Along this line we also want a GFN-based
> > > > > dirty bitmap to be reported through VFIO, similar to what KVM does.
> > > > > Vendor drivers then need to translate from IOVA to HVA to GFN when
> > > > > tracking DMA activities on VFIO devices. IOVA->HVA is provided by
> > > > > VFIO; for HVA->GFN, it can be provided by KVM but I'm not sure
> > > > > whether it's exposed now.
> > > > >
> > > >
> > > > HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.
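
A rough sketch of that HVA->GFN step, assuming the vendor driver already
holds a kvm reference and has located the kvm_memory_slot covering the
HVA (that lookup, its SRCU locking, and the dependency itself are
exactly what is questioned below):

    #include <linux/kvm_host.h>

    /*
     * Sketch only: 'slot' is assumed to be the memslot whose
     * userspace_addr range covers 'hva'; finding it requires walking
     * kvm_memslots() under SRCU, i.e. a hard dependency on KVM.
     */
    static gfn_t vendor_hva_to_gfn(struct kvm_memory_slot *slot,
                                   unsigned long hva)
    {
        return hva_to_gfn_memslot(hva, slot);
    }
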
> > >
> > > I thought it was bad enough that we have vendor drivers that depend
> > > on KVM, but designing a vfio interface that only supports a KVM
> > > interface is more undesirable.  I also note without comment that
> > > gfn_to_memslot() is a GPL symbol.  Thanks,
> >
> > yes it is bad, but sometimes inevitable. If you recall our discussions
> > from 3 years back (when discussing the 1st mdev framework), there were
> > similar hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> > creating some shadow structures. gpa->hpa is definitely
> > hypervisor-specific knowledge, which is easy in KVM (gpa->hva->hpa) but
> > needs a hypercall in Xen. But VFIO already makes a KVM-only assumption
> > when implementing vfio_{un}pin_page_external.
> 
> Where's the KVM assumption there?  The MAP_DMA ioctl takes an IOVA and
> HVA.  When an mdev vendor driver calls vfio_pin_pages(), we GUP the HVA
> to get an HPA and provide an array of HPA pfns back to the caller.  The
> other vGPU mdev vendor manages to make use of this without KVM... the
> KVM interface used by GVT-g is GPL-only.

To be clear, it's the assumption of host-based hypervisors, e.g. KVM.
GUP is a perfect example: it doesn't work for Xen, since a DomU's memory
doesn't belong to Dom0, so VFIO in Dom0 has to find the HPA through
Xen-specific hypercalls.
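
For reference, the type1 interface Alex describes is expressed purely in
terms of IOVA plus process virtual address (HVA).  A minimal userspace
sketch (error handling omitted; container_fd is assumed to be an open
VFIO container with the IOMMU type already set):

    #include <linux/vfio.h>
    #include <sys/ioctl.h>

    /* Map one 4 KiB page of our address space at a chosen IOVA. */
    static int demo_map_one_page(int container_fd, void *buf, __u64 iova)
    {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (__u64)(unsigned long)buf,  /* HVA */
            .iova  = iova,                       /* IOVA seen by the device */
            .size  = 4096,
        };

        /* The kernel pins (GUPs) the HVA and programs the IOMMU;
         * no GPA/GFN appears anywhere in this interface. */
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }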

> 
> > So GVT-g
> > has to maintain an internal abstraction layer to support both Xen and
> > KVM. Maybe someday we will re-consider introducing some hypervisor
> > abstraction layer in VFIO, if this issue starts to hurt other devices and
> > Xen guys are willing to support VFIO.
> 
> Once upon a time, we had a KVM-specific device assignment interface,
> i.e. legacy KVM device assignment.  We developed VFIO specifically to get
> KVM out of the business of being a (bad) device driver.  We do have
> some awareness and interaction between VFIO and KVM in the vfio-kvm
> pseudo device, but we still try to keep those interfaces generic.  In
> some cases we're not very successful at that, see vfio_group_set_kvm(),
> but that's largely just a mechanism to associate a cookie with a group
> to be consumed by the mdev vendor driver such that it can work with kvm
> external to vfio.  I don't intend to add further hypervisor awareness
> to vfio.
> 
> > Back to this IOVA issue, I discussed with Yan and we found another
> > hypervisor-agnostic alternative, by learning from vhost. vhost is very
> > similar to VFIO - DMA also happens in the kernel, while it already
> > supports vIOMMU.
> >
> > Generally speaking, there are three paths of dirty page collection
> > in Qemu so far (as previously noted, Qemu always tracks the dirty
> > bitmap in GFN):
> 
> GFNs or simply PFNs within an AddressSpace?
> 
> > 1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty bitmaps
> > are updated directly when the guest memory is being updated. For
> > example, PCI writes are completed through pci_dma_write, which
> > goes through the vIOMMU to translate the IOVA into a GPA and then
> > updates the bitmap through cpu_physical_memory_set_dirty_range.
> 
> Right, so the IOVA to GPA (GFN) occurs through an explicit translation
> on the IOMMU AddressSpace.
> 
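
To make path 1) concrete: a trimmed sketch of an emulated PCI device
model completing a DMA write (illustrative, not from any particular
device).  pci_dma_write() resolves the address through
pci_get_address_space(dev), which is the vIOMMU AddressSpace when one is
present, so the IOVA is translated to a GPA before the write lands and
QEMU marks that GPA range dirty internally (via
cpu_physical_memory_set_dirty_range); no .log_sync() hook is involved on
this path:

    #include "qemu/osdep.h"
    #include "hw/pci/pci.h"

    /* 'iova' is whatever the guest driver programmed into the device. */
    static void demo_dma_complete(PCIDevice *dev, dma_addr_t iova,
                                  const void *buf, dma_addr_t len)
    {
        pci_dma_write(dev, iova, buf, len);
    }
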
> > 2) Memory writes that are not tracked by Qemu are collected by
> > registering a .log_sync() callback, which is invoked during the dirty
> > logging process. Now there are two users: kvm and vhost.
> >
> >   2.1) KVM tracks CPU-side memory writes, through write-protection
> > or EPT A/D bits (+PML). This part is always based on GFN and returned
> > to Qemu when kvm_log_sync is invoked;
> >
> >   2.2) vhost tracks kernel-side DMA writes by interpreting the vring
> > data structures. It maintains an internal iotlb which is synced with
> > the Qemu vIOMMU through a specific interface:
> >     - new vhost message type (VHOST_IOTLB_UPDATE/INVALIDATE)
> > for Qemu to keep vhost iotlb in sync
> >     - new VHOST_IOTLB_MISS message to notify Qemu in case of
> > a miss in vhost iotlb.
> >     - Qemu registers a log buffer with the kernel vhost driver. The
> > latter updates the buffer (using the internal iotlb to get the GFN)
> > when serving vring descriptors.
> >
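
For reference, the vhost iotlb message mentioned above is already part
of the kernel uapi; abridged from the vhost uapi header
(linux/vhost_types.h), shown only for comparison with what a VFIO
equivalent would need to carry:

    struct vhost_iotlb_msg {
        __u64 iova;
        __u64 size;
        __u64 uaddr;                     /* HVA backing this IOVA range */
    #define VHOST_ACCESS_RO      0x1
    #define VHOST_ACCESS_WO      0x2
    #define VHOST_ACCESS_RW      0x3
        __u8 perm;
    #define VHOST_IOTLB_MISS           1
    #define VHOST_IOTLB_UPDATE         2
    #define VHOST_IOTLB_INVALIDATE     3
    #define VHOST_IOTLB_ACCESS_FAIL    4
        __u8 type;
    };
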
> > VFIO could also implement an internal iotlb, so vendor drivers can
> > utilize that iotlb to update the GFN-based dirty bitmap. Ideally we
> > don't need to re-invent another iotlb protocol as vhost does: the
> > vIOMMU already sends map/unmap ioctl cmds upon any change of IOVA
> > mapping. We may introduce a v2 map/unmap interface, allowing Qemu to
> > pass {iova, gpa, hva} together to keep the internal iotlb in sync.
> > But we may also need an iotlb_miss_upcall interface, if VFIO doesn't
> > want to cache full-size vIOMMU mappings.
> >
> > Definitely this alternative needs more work and is possibly less
> > performant (if maintaining a small-sized iotlb) than calling straight
> > into the KVM interface. But the gain is also obvious, since it is
> > fully contained within VFIO.
> >
> > Thoughts? :-)
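
Purely to illustrate the v2 map/unmap idea sketched above (the struct
name and layout here are hypothetical, not part of this series or of the
existing uapi), such an interface might carry the GPA alongside today's
HVA and IOVA:

    /* Hypothetical extension of struct vfio_iommu_type1_dma_map. */
    struct vfio_iommu_type1_dma_map_v2 {
        __u32 argsz;
        __u32 flags;
        __u64 vaddr;   /* HVA, as today */
        __u64 iova;    /* IOVA programmed through the (v)IOMMU */
        __u64 gpa;     /* new: guest-physical address backing this IOVA */
        __u64 size;
    };

With such a record the kernel side could translate IOVA-based dirty pfns
to GFNs itself, which is exactly the tracking structure Alex pushes back
on below.
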
> 
> So vhost must then be configuring a listener across system memory
> rather than only against the device AddressSpace like we do in vfio,
> such that it gets log_sync() callbacks for the actual GPA space rather
> than only the IOVA space.  OTOH, QEMU could understand that the device
> AddressSpace has a translate function and apply the IOVA dirty bits to
> the system memory AddressSpace.  Wouldn't it make more sense for QEMU
> to perform a log_sync() prior to removing a MemoryRegionSection within
> an AddressSpace and update the GPA rather than pushing GPA awareness
> and potentially large tracking structures into the host kernel?  Thanks,
> 
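
A very rough QEMU-side shape of that suggestion, for illustration only:
the vIOMMU unmap already reaches vfio through an IOMMUNotifier callback,
so the dirty bits for the dying IOVA range could be harvested and
applied before the mapping is dropped.  The two steps named in the
comments are hypothetical; the first is essentially the dirty-bitmap
interface this series is trying to define:

    #include "qemu/osdep.h"
    #include "exec/memory.h"

    static void demo_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
    {
        if ((iotlb->perm & IOMMU_RW) == IOMMU_NONE) {  /* unmap notification */
            /*
             * 1) hypothetical: fetch dirty bits from the kernel/vendor
             *    driver for [iotlb->iova, iotlb->iova + iotlb->addr_mask];
             * 2) hypothetical: mark the GPAs this IOVA range translated to
             *    (recorded at map time on the QEMU side) dirty in the
             *    system memory AddressSpace;
             * 3) then proceed with the usual vfio DMA unmap.
             */
        }
    }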

It is an interesting idea.  One drawback is that log_sync might be
invoked frequently in the IOVA case, but I guess the overhead is small
compared to the total overhead of emulating the IOTLB invalidation.
Maybe other folks can better comment on why this model was not
considered before, e.g. when the vhost iotlb was introduced.

Thanks
Kevin


