RE: [PATCH v8 01/13] vfio: KABI for migration interface


From: Tian, Kevin
Subject: RE: [PATCH v8 01/13] vfio: KABI for migration interface
Date: Thu, 26 Sep 2019 03:07:08 +0000

> From: Alex Williamson [mailto:address@hidden]
> Sent: Thursday, September 26, 2019 3:06 AM
[...]
> > > > The second point is about write-protection:
> > > >
> > > > > There is another value of recording GPA in VFIO. Vendor drivers
> > > > > (e.g. GVT-g) may need to selectively write-protect guest memory
> > > > > pages when interpreting certain workload descriptors. Those pages
> > > > > are recorded in IOVA when vIOMMU is enabled, however the KVM
> > > > > write-protection API only knows GPA. So currently vIOMMU must
> > > > > be disabled on Intel vGPUs when GVT-g is enabled. To make it working
> > > > > we need a way to translate IOVA into GPA in the vendor drivers.
> > > > > There are two options. One is having KVM export a new API for such
> > > > > translation purpose. But as you explained earlier it's not good to
> > > > > have vendor drivers depend on KVM. The other is having VFIO
> > > > > maintaining such knowledge through extended MAP interface,
> > > > > then providing a uniform API for all vendor drivers to use.
> > >
> > > So the argument is that in order to interact with KVM (write protecting
> > > guest memory) there's a missing feature (IOVA to GPA translation), but
> > > we don't want to add an API to KVM for this feature because that would
> > > create a dependency on KVM (for interacting with KVM), so lets add an
> > > API to vfio instead.  That makes no sense to me.  What am I missing?
> > > Thanks,
> > >
> >
> > Then do you have a recommendation how such feature can be
> > implemented cleanly in vendor driver, without introducing direct
> > dependency on KVM?
> 
> I think the disconnect is that these sorts of extensions don't reflect
> things that a physical device can actually do.  The idea of vfio is
> that it's a userspace driver interface.  It provides a channel for the
> user to interact with the device, map device resources, receive
> interrupts, map system memory through the iommu, etc.  Mediated devices
> augment this by replacing the physical device the user accesses with a
> software virtualized device.  So then the question becomes why this
> device virtualizing software, ie. the mdev vendor driver, needs to do
> things that a physical device clearly cannot do.  For example, how can
> a physical device write-protect portions of system memory?  Or even,
> why would it need to?  It makes me suspect that mdev is being used to
> bypass the hypervisor, or maybe fill in the gaps for hardware that
> isn't as "mediation friendly" as it claims to be.

We do have one such example on Intel GPUs. To support direct cmd
submission from userspace (SVA), the kernel driver allocates a doorbell
page (in system memory) for each application and then registers the
page with the GPU. Once the doorbell is armed, the GPU starts to
monitor CPU writes to that page. The application can then ring the GPU
simply by writing to the doorbell page to submit cmds. This probably
makes sense only for integrated devices.

In case direct submission is not allowed for a mediated device (some
auditing work is required in GVT-g), we need to write-protect the
doorbell page with the hypervisor's help to mimic the hardware
behavior. We have prototype work internally but haven't sent it out yet.
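
For reference, here is a rough, purely illustrative sketch of how such
write-protection can be wired up through KVM's page-track API, similar
to what kvmgt already does for guest page tables (and hence carrying
exactly the KVM dependency under discussion). The function names
gvt_track_doorbell() and gvt_doorbell_write() are invented for this
sketch, not code from our prototype:

#include <linux/kvm_host.h>
#include <asm/kvm_page_track.h>

/* called by KVM when the guest writes the tracked (write-protected) gfn */
static void gvt_doorbell_write(struct kvm_vcpu *vcpu, gpa_t gpa,
                               const u8 *val, int len,
                               struct kvm_page_track_notifier_node *node)
{
        /* audit/emulate the doorbell write, then forward it to the GPU */
}

static struct kvm_page_track_notifier_node gvt_track_node = {
        .track_write = gvt_doorbell_write,
};

static int gvt_track_doorbell(struct kvm *kvm, gfn_t gfn)
{
        struct kvm_memory_slot *slot;
        int idx;

        /* one-time registration in real code; kept here for brevity */
        kvm_page_track_register_notifier(kvm, &gvt_track_node);

        idx = srcu_read_lock(&kvm->srcu);
        slot = gfn_to_memslot(kvm, gfn);
        if (!slot) {
                srcu_read_unlock(&kvm->srcu, idx);
                return -EINVAL;
        }

        spin_lock(&kvm->mmu_lock);
        /* write faults on this gfn now reach gvt_doorbell_write() */
        kvm_slot_page_track_add_page(kvm, slot, gfn, KVM_PAGE_TRACK_WRITE);
        spin_unlock(&kvm->mmu_lock);
        srcu_read_unlock(&kvm->srcu, idx);

        return 0;
}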

> 
> In the case of a physical device discovering an iova translation, this
> is what device iotlbs are for, but as an acceleration and offload
> mechanism for the system iommu rather than a lookup mechanism as seems
> to be wanted here.  If we had a system iommu with dirty page tracking,
> I believe that tracking would live in the iommu page tables and
> therefore reflect dirty pages relative to iova.  We'd need to consume
> those dirty page bits before we tear down the iova mappings, much like
> we're suggesting QEMU do here.

Yes. There are two cases:

1) iova shadowing, i.e. using only the 2nd level as today. Here the
dirty bits are associated with the iova. Once Qemu is revised to invoke
log_sync before tearing down any iova mapping, vfio can get the dirty
info from the iommu driver for the affected range (a sketch of that
Qemu change follows after case 2).

2) iova nesting, where iova->gpa is in the 1st level and gpa->hpa is
in the 2nd level. In that case the iova carried in the map/unmap ioctls
is actually a gpa, thus the dirty bits are associated with the gpa.
There Qemu should continue to consume the gpa-based dirty bitmap, as if
the viommu were disabled.
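
For case 1, the basic Qemu change mentioned above could look roughly
like the following sketch against today's vfio_iommu_map_notify() in
hw/vfio/common.c. vfio_get_dirty_bitmap() is only a placeholder name
for whatever log_sync entry point this migration series ends up
providing:

static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
{
    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
    VFIOContainer *container = giommu->container;
    hwaddr iova = iotlb->iova + giommu->iommu_offset;
    hwaddr size = iotlb->addr_mask + 1;

    if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
        /* existing map path (vfio_get_xlat_addr + vfio_dma_map), unchanged */
        return;
    }

    /*
     * Unmap: harvest the dirty bits for this iova range first, while
     * the iova mapping (and the dirty log indexed by it) still exists.
     */
    vfio_get_dirty_bitmap(container, iova, size);
    vfio_dma_unmap(container, iova, size);
}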

> 
> Unfortunately I also think that KVM and vhost are not really the best
> examples of what we need to do for vfio.  KVM is intimately involved
> with GPAs, so clearly dirty page tracking at that level is not an
> issue.  Vhost tends to circumvent the viommu; it's trying to poke
> directly into guest memory without the help of a physical iommu.  So
> I can't say that I have much faith that QEMU is already properly wired
> with respect to viommu and dirty page tracking, leaving open the
> possibility that a log_sync on iommu region unmap is simply a gap in
> the QEMU migration story.  The vfio migration interface we have on the
> table seems like it could work, but QEMU needs an update and we need to
> define the interface in terms of pfns relative to the address space.

Yan and I had a brief discussion about this. Besides the basic change
of doing log_sync for every iova unmap, there are two other gaps to
be fixed:

1) Today the iova->gpa mapping is maintained in two places: the viommu
page table in guest memory and the viotlb in Qemu. The viotlb is filled
when a walk of the viommu page table happens, due to emulation of a
virtual DMA operation from an emulated device or a request from a
vhost device. It's not affected by passthrough device activity though,
since the latter goes through the physical iommu. Per the iommu spec,
the guest iommu driver first clears the viommu page table and then
issues a viotlb invalidation request. It's the latter that is trapped
by Qemu, and vfio is notified at that point, where iova->gpa
translation will simply fail since there is no longer a valid mapping
in the viommu page table and very likely no hit in the viotlb. To fix
this gap, we need to extend Qemu to cache all the valid iova mappings
from the viommu page table, similar to what vfio does (a sketch of one
possible shape follows after gap 2).

2) Then there will be parallel log_sync requests on each vfio device.
One is from the vcpu thread, when an iotlb invalidation request is
being emulated. The other is from the migration thread, where log_sync
is requested for the entire guest memory in iterative copies. The
contention among multiple vcpu threads is already protected by the
iommu lock, but we didn't find such protection between the migration
thread and the vcpu threads. Maybe we overlooked something, but ideally
the whole iova address space should be locked while the migration
thread is doing its mega-sync/translation.
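
To make both gaps concrete, here is a rough, purely illustrative sketch
of one possible Qemu-side shape: a shadow iova->gpa cache kept in sync
with the viommu page table (gap 1), plus a single lock serializing the
vcpu-thread invalidation path against the migration thread's full-range
sync (gap 2). All names below (IovaGpaEntry, iova_space_lock,
vfio_get_dirty_bitmap, ...) are invented for this sketch, not existing
Qemu code:

typedef struct IovaGpaEntry {
    hwaddr iova;
    hwaddr gpa;
    hwaddr size;
} IovaGpaEntry;

static GTree *iova_gpa_cache;        /* keyed by iova */
static QemuMutex iova_space_lock;    /* guards cache + dirty sync */

static gint iova_cmp(gconstpointer a, gconstpointer b, gpointer data)
{
    const IovaGpaEntry *ea = a, *eb = b;

    return ea->iova < eb->iova ? -1 : (ea->iova > eb->iova ? 1 : 0);
}

static void iova_cache_init(void)
{
    iova_gpa_cache = g_tree_new_full(iova_cmp, NULL, g_free, NULL);
    qemu_mutex_init(&iova_space_lock);
}

/*
 * Gap 1: record every valid iova->gpa mapping when the viommu page
 * table is walked, so the translation is still available after the
 * guest has already cleared its page table.
 */
static void iova_cache_map(hwaddr iova, hwaddr gpa, hwaddr size)
{
    IovaGpaEntry *e = g_new0(IovaGpaEntry, 1);

    e->iova = iova;
    e->gpa = gpa;
    e->size = size;

    qemu_mutex_lock(&iova_space_lock);
    g_tree_insert(iova_gpa_cache, e, e);   /* the entry is its own key */
    qemu_mutex_unlock(&iova_space_lock);
}

/* Gap 2, vcpu thread: emulated viotlb invalidation for one range. */
static void iova_invalidate(VFIOContainer *container,
                            hwaddr iova, hwaddr size)
{
    IovaGpaEntry key = { .iova = iova };

    qemu_mutex_lock(&iova_space_lock);
    vfio_get_dirty_bitmap(container, iova, size);    /* placeholder */
    vfio_dma_unmap(container, iova, size);
    g_tree_remove(iova_gpa_cache, &key);
    qemu_mutex_unlock(&iova_space_lock);
}

/* Gap 2, migration thread: iterative sync of the whole iova space. */
static void migration_log_sync_all(VFIOContainer *container)
{
    qemu_mutex_lock(&iova_space_lock);
    vfio_get_dirty_bitmap(container, 0, (hwaddr)-1); /* placeholder */
    qemu_mutex_unlock(&iova_space_lock);
}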

+Yi and Peter for their opinions.

> 
> If GPAs are still needed, what are they for?  The write-protect example
> is clearly a hypervisor level interaction as I assume it's write
> protection relative to the vCPU.  It's a hypervisor specific interface
> to perform that write-protection, so why wouldn't we use a hypervisor
> specific interface to collect the data to perform that operation?
> IOW, if GVT-g already has a KVM dependency, why be concerned about
> adding another GVT-g KVM dependency?  It seems like vfio is just a

This is possibly the way that we have to go, based on the discussions
so far. Earlier I just held the same argument that you emphasized for
vfio - although there are existing KVM dependencies, we want to
minimize them. :-) Another worry is that other vendor drivers may have
similar requirements; could we then invent some generic way to avoid
pushing each of them to do the same tricky thing again? Of course, we
can revisit that later if this does become a common requirement.

> potentially convenient channel, but as discussed above, vfio has no
> business in GPAs because devices don't operate on GPAs and I've not
> been sold that there's value in vfio getting involved in that address
> space.  Convince me otherwise ;)  Thanks,
> 

It looks like none of my arguments convinces you :-), so we'll move on
to investigate what should be changed in Qemu to support your proposal
(as discussed above). While that part is ongoing, let me have a last
try at my original idea. ;) I'm just curious what your further thoughts
are regarding the earlier doorbell-monitoring example, where the device
operates on GPA. If it's an Intel-GPU-only thing, yes, we can still fix
it in GVT-g itself as you suggested. But I'm just not sure about other
integrated devices, and also new accelerators connected to the cpu
package over a coherent bus. Also, we don't need to call it GPA - it
could be named user_target_address: the address the iova is mapped to,
i.e. the address space in which userspace expects the device to operate
for purposes (logging, monitoring, etc.) other than dma (using the
iova) and accessing userspace/guest memory (using the hva).
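
To make the naming idea concrete, a purely hypothetical sketch of such
an extended MAP is below. The _v2 name, the new flag and the new field
do not exist in the vfio uapi and are not being proposed here as-is;
they only illustrate what user_target_address would mean:

struct vfio_iommu_type1_dma_map_v2 {
    __u32 argsz;
    __u32 flags;
#define VFIO_DMA_MAP_FLAG_READ         (1 << 0)  /* readable from device */
#define VFIO_DMA_MAP_FLAG_WRITE        (1 << 1)  /* writable from device */
#define VFIO_DMA_MAP_FLAG_TARGET_ADDR  (1 << 2)  /* hypothetical */
    __u64 vaddr;                /* process virtual address (hva) */
    __u64 iova;                 /* IO virtual address used for dma */
    __u64 size;                 /* size of mapping (bytes) */
    __u64 user_target_address;  /* e.g. gpa; valid iff TARGET_ADDR is set */
};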

Thanks
Kevin


