Re: [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface


From: Tian, Kevin
Subject: Re: [Qemu-devel] [PATCH v8 01/13] vfio: KABI for migration interface
Date: Tue, 3 Sep 2019 06:57:27 +0000

> From: Alex Williamson [mailto:address@hidden]
> Sent: Saturday, August 31, 2019 12:33 AM
> 
> On Fri, 30 Aug 2019 08:06:32 +0000
> "Tian, Kevin" <address@hidden> wrote:
> 
> > > From: Tian, Kevin
> > > Sent: Friday, August 30, 2019 3:26 PM
> > >
> > [...]
> > > > How does QEMU handle the fact that IOVAs are potentially dynamic
> while
> > > > performing the live portion of a migration?  For example, each time a
> > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > MemoryRegionSection pops in or out of the AddressSpace for the device
> > > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > > system_memory).  I don't see any QEMU code that intercepts that
> change
> > > > in the AddressSpace such that the IOVA dirty pfns could be recorded and
> > > > translated to GFNs.  The vendor driver can't track these beyond getting
> > > > an unmap notification since it only knows the IOVA pfns, which can be
> > > > re-used with different GFN backing.  Once the DMA mapping is torn
> down,
> > > > it seems those dirty pfns are lost in the ether.  If this works in QEMU,
> > > > please help me find the code that handles it.
> > >
> > > I'm curious about this part too. Interestingly, I didn't find any log_sync
> > > callback registered by emulated devices in Qemu. Looks dirty pages
> > > by emulated DMAs are recorded in some implicit way. But KVM always
> > > reports dirty page in GFN instead of IOVA, regardless of the presence of
> > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
> > >  (translation can be done when DMA happens), then we don't need
> > > worry about transient mapping from IOVA to GFN. Along this way we
> > > also want GFN-based dirty bitmap being reported through VFIO,
> > > similar to what KVM does. For vendor drivers, it needs to translate
> > > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > > provided by KVM but I'm not sure whether it's exposed now.
> > >
> >
> > HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.
> 
> I thought it was bad enough that we have vendor drivers that depend on
> KVM, but designing a vfio interface that only supports a KVM interface
> is more undesirable.  I also note without comment that gfn_to_memslot()
> is a GPL symbol.  Thanks,

Yes, it is bad, but sometimes inevitable. If you recall our discussions
from three years ago (around the first mdev framework), there were similar
hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
creating shadow structures. gpa->hpa is definitely hypervisor-specific
knowledge: it is easy in KVM (gpa->hva->hpa), but requires a
hypercall in Xen. VFIO already makes a KVM-only assumption when
implementing vfio_{un}pin_page_external, so GVT-g
has to maintain an internal abstraction layer to support both Xen and
KVM. Maybe someday we will reconsider introducing a hypervisor
abstraction layer in VFIO, if this issue starts to hurt other devices and
the Xen folks are willing to support VFIO.

Back to this IOVA issue: I discussed it with Yan and we found another,
hypervisor-agnostic alternative by learning from vhost. vhost is very
similar to VFIO in that DMA also happens in the kernel, and it already
supports the vIOMMU.

Generally speaking, there are three paths of dirty page collection
in Qemu so far (as previously noted, Qemu always tracks the dirty
bitmap in GFN):

1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty bitmaps
are updated directly when the guest memory is written. For example,
PCI writes are completed through pci_dma_write, which goes through
the vIOMMU to translate the IOVA into a GPA and then updates the
bitmap through cpu_physical_memory_set_dirty_range (see the first
sketch after this list).

2) Memory writes that are not tracked by Qemu are collected by
registering a .log_sync() callback, which is invoked during the dirty
logging process (the second sketch after the list shows how such a
callback is hooked up). There are currently two users: kvm and vhost.

  2.1) KVM tracks CPU-side memory writes, through write protection
or EPT A/D bits (plus PML). This part is always GFN-based and is
returned to Qemu when kvm_log_sync is invoked;

  2.2) vhost tracks kernel-side DMA writes by interpreting the vring
data structures. It maintains an internal iotlb which is synced with
the Qemu vIOMMU through a specific interface:
        - vhost message types (VHOST_IOTLB_UPDATE/INVALIDATE) allow
Qemu to keep the vhost iotlb in sync;
        - a VHOST_IOTLB_MISS message notifies Qemu of a miss in the
vhost iotlb;
        - Qemu registers a log buffer with the kernel vhost driver,
which updates the buffer (using the internal iotlb to get the GFN)
when serving vring descriptors.
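
For 1), a minimal sketch of what I mean (illustrative only, QEMU-tree
code rather than standalone; "mydev_dma_write" is a made-up helper,
while pci_dma_write() and cpu_physical_memory_set_dirty_range() are
the existing QEMU interfaces):

#include "qemu/osdep.h"
#include "hw/pci/pci.h"

static void mydev_dma_write(PCIDevice *pdev, dma_addr_t iova,
                            const void *buf, dma_addr_t len)
{
    /*
     * pci_dma_write() resolves the address through the device's
     * AddressSpace.  With a vIOMMU present, that walk translates
     * IOVA -> GPA, and the underlying address_space_write() marks
     * the backing RAM dirty via cpu_physical_memory_set_dirty_range(),
     * which is why the bitmap Qemu maintains is always GFN-based.
     */
    pci_dma_write(pdev, iova, buf, len);
}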
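
For 2), this is roughly how a dirty-logging client hooks in
("my_log_sync"/"my_listener"/"my_register" are made-up names;
MemoryListener and memory_listener_register() are the interfaces that
kvm-all.c and vhost actually use):

#include "qemu/osdep.h"
#include "exec/memory.h"
#include "exec/address-spaces.h"

/* Invoked during the dirty-logging process, e.g. by migration code. */
static void my_log_sync(MemoryListener *listener,
                        MemoryRegionSection *section)
{
    /*
     * Fetch dirty state that Qemu cannot see by itself (a KVM ioctl,
     * the vhost log buffer, ...) for this section and fold it into
     * the GFN-based bitmap, e.g. with
     * memory_region_set_dirty(section->mr, ...).
     */
}

static MemoryListener my_listener = {
    .log_sync = my_log_sync,
};

static void my_register(void)
{
    /* Sections delivered on address_space_memory are in GPA space. */
    memory_listener_register(&my_listener, &address_space_memory);
}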

VFIO could also implement an internal iotlb, so vendor drivers can
use it to update the GFN-based dirty bitmap. Ideally we don't need to
reinvent another iotlb protocol the way vhost did: the vIOMMU already
issues map/unmap ioctls upon any change of an IOVA mapping. We could
introduce a v2 map/unmap interface, allowing Qemu to pass {iova, gpa,
hva} together to keep the internal iotlb in sync. But we may also need
an iotlb_miss_upcall interface if VFIO doesn't want to cache the full
set of vIOMMU mappings.
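
To make the idea concrete, here is a purely hypothetical sketch of the
v2 map layout and of how a vendor driver could consume the resulting
iotlb. None of these names exist in the current VFIO uAPI or kernel
code - vfio_iommu_type1_dma_map_v2, vfio_iotlb_lookup() and the
dirty_bitmap field are all invented for illustration:

#include <linux/types.h>

/* Hypothetical v2 of struct vfio_iommu_type1_dma_map: userspace passes
 * the GPA alongside the IOVA/HVA so the kernel can keep an IOVA->GFN
 * iotlb. */
struct vfio_iommu_type1_dma_map_v2 {
        __u32 argsz;
        __u32 flags;
#define VFIO_DMA_MAP_FLAG_READ  (1 << 0)
#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)
        __u64 vaddr;    /* HVA of the mapping in the Qemu process */
        __u64 iova;     /* IOVA programmed through the vIOMMU */
        __u64 gpa;      /* new: guest physical address backing this IOVA */
        __u64 size;
};

/* Hypothetical kernel-side helper a vendor driver could call when it
 * observes a device DMA write: translate IOVA->GFN through the
 * internal iotlb and record the page in the GFN-based dirty bitmap. */
static void vfio_iotlb_mark_dirty(struct vfio_iommu *iommu,
                                  unsigned long iova, size_t len)
{
        unsigned long gfn;
        size_t off;

        for (off = 0; off < len; off += PAGE_SIZE) {
                if (vfio_iotlb_lookup(iommu, iova + off, &gfn))
                        set_bit(gfn, iommu->dirty_bitmap);
                /* else: iotlb miss -> issue the iotlb_miss_upcall
                 * to Qemu and retry */
        }
}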

Definitely this alternative needs more work, and it is possibly less
performant (if only a small iotlb is maintained) than calling straight
into the KVM interface. But the gain is also obvious, since it is
fully contained within VFIO.

Thoughts? :-)

Thanks
Kevin

> 
> Alex
> 
> > Above flow works for software-tracked dirty mechanism, e.g. in
> > KVMGT, where GFN-based 'dirty' is marked when a guest page is
> > mapped into device mmu. IOVA->HPA->GFN translation is done
> > at that time, thus immune from further IOVA->GFN changes.
> >
> > When hardware IOMMU supports D-bit in 2nd level translation (e.g.
> > VT-d rev3.0), there are two scenarios:
> >
> > 1) nested translation: guest manages 1st-level translation (IOVA->GPA)
> > and host manages 2nd-level translation (GPA->HPA). The 2nd-level
> > is not affected by guest mapping operations. So it's OK for IOMMU
> > driver to retrieve GFN-based dirty pages by directly scanning the 2nd-
> > level structure, upon request from user space.
> >
> > 2) shadowed translation (IOVA->HPA) in 2nd level: in such case the dirty
> > information is tied to IOVA. the IOMMU driver is expected to maintain
> > an internal dirty bitmap. Upon any change of IOVA->GPA notification
> > from VFIO, the IOMMU driver should flush dirty status of affected 2nd-level
> > entries to the internal GFN-based bitmap. At this time, again IOVA->HVA
> > ->GPA translation required for GFN-based recording. When userspace
> > queries dirty bitmap, the IOMMU driver needs to flush latest 2nd-level
> > dirty status to internal bitmap, which is then copied to user space.
> >
> > Given the trickiness of 2), we aim to enable 1) on intel-iommu driver.
> >
> > Thanks
> > Kevin



