From: Michael Roth
Subject: Re: [PATCH v3 19/49] kvm: Make kvm_convert_memory() obey ram_block_discard_is_enabled()
Date: Wed, 20 Mar 2024 14:47:54 -0500
On Wed, Mar 20, 2024 at 05:26:00PM +0100, Paolo Bonzini wrote:
> On 3/20/24 09:39, Michael Roth wrote:
> > Some subsystems like VFIO might disable ram block discard for
> > uncoordinated cases. Since kvm_convert_memory()/guest_memfd don't
> > implement a RamDiscardManager handler to convey discard operations to
> > the various listeners like VFIO, sequences like the following can
> > result in stale IOMMU mappings:
>
> Alternatively, should guest-memfd memory regions call
> ram_block_discard_require(true)? This will prevent VFIO from operating, but
> it will avoid consuming twice the memory.
>
> If desirable, guest-memfd support can be changed to implement an extension
> of RamDiscardManager that notifies about private/shared memory changes, and
> then guest-memfd would be able to support coordinated discard. But I wonder
In an earlier/internal version of the SNP+gmem patches (when there was still
a dedicated hostmem-memfd-private backend for restrictedmem/gmem), we had a
rough implementation of RamDiscardManager that did this:
https://github.com/AMDESE/qemu/blob/snp-latest-gmem-v12/backends/hostmem-memfd-private.c#L75
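The core of that approach was fanning a shared->private state change
out to the registered RamDiscardListeners so that e.g. VFIO could tear
down the affected IOMMU mappings. A minimal sketch of what that looks
like, using the existing RamDiscardListener API but with a hypothetical
GmemDiscardManager type standing in for the old backend state:

  #include "qemu/osdep.h"
  #include "exec/memory.h"

  /* Hypothetical manager state: just the registered listeners. */
  typedef struct GmemDiscardManager {
      QLIST_HEAD(, RamDiscardListener) rdl_list;
  } GmemDiscardManager;

  /*
   * Called when [offset, offset + size) flips shared->private. From the
   * point of view of shared-memory users the range is going away, so
   * tell each listener (e.g. VFIO) to discard it, which tears down the
   * IOMMU mappings before the shared page can be re-allocated.
   */
  static void gmem_notify_to_private(GmemDiscardManager *gdm,
                                     uint64_t offset, uint64_t size)
  {
      RamDiscardListener *rdl;

      QLIST_FOREACH(rdl, &gdm->rdl_list, next) {
          MemoryRegionSection tmp = *rdl->section;

          if (!memory_region_section_intersect_range(&tmp, offset, size)) {
              continue;
          }
          rdl->notify_discard(rdl, &tmp);
      }
  }

A private->shared flip would symmetrically invoke notify_populate() so
listeners can re-establish their mappings.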
Now that gmem handling is mostly done transparently to whatever HostMem
backend is in use, I'm not sure what the right place would be to
implement something similar, but maybe it can be done in a more generic
way.
There were some notable downsides to that approach, though. I'm a
little hazy on the details now, but I think they were both kernel
limitations:
  - VFIO seemed to have a limitation where it expects the DMA mapping
    for a particular iova to be unmapped/mapped with the same
    granularity, but for an SNP guest there's no guarantee that a 2MB
    page flipped from shared->private won't later be flipped
    private->shared again with 4K granularity for some sub-range. I
    think the current code still treats this as an -EINVAL case (see
    the sketch after this list). So we end up needing to do everything
    with 4K granularity, which I *think* results in 4K IOMMU page
    table mappings, but I'd need to confirm.
  - VFIO doesn't seem to be optimized for this sort of use case and
    generally expects a much larger granularity; it defaults to a max
    of 64K DMA entries, so mapping a 16GB guest at 4K granularity
    (16GB / 4K = 4194304 entries) requires configuring VFIO with
    something like:
      vfio_iommu_type1.dma_entry_limit=4194304
    I didn't see anything to suggest that's problematic, but it makes
    me wonder what other stuff we might run into.
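To illustrate the first limitation above: with the vfio_iommu_type1
backend, VFIO_IOMMU_UNMAP_DMA refuses to bisect an existing mapping,
so trying to unmap a 4K sub-range of a previously-mapped 2MB range
fails rather than splitting the mapping. A rough userspace sketch, not
QEMU code (it assumes 'container' is an already-configured VFIO type1
container fd with an existing 2MB mapping starting at 'iova'):

  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  /*
   * Attempt to unmap just the first 4K of an existing 2MB DMA mapping.
   * With the type1 (v2) IOMMU backend this is expected to fail with
   * -EINVAL, since unmaps that begin or end inside an existing mapping
   * are rejected rather than splitting the mapping.
   */
  static int demo_partial_unmap(int container, __u64 iova)
  {
      struct vfio_iommu_type1_dma_unmap unmap = {
          .argsz = sizeof(unmap),
          .iova  = iova,  /* start of the existing 2MB mapping */
          .size  = 4096,  /* try to unmap only the first 4K */
      };

      if (ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap) < 0) {
          perror("VFIO_IOMMU_UNMAP_DMA");  /* expected: EINVAL */
          return -1;
      }
      return 0;
  }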
> if that's doable at all - how common are shared<->private flips, and is it
> feasible to change the IOMMU page tables every time?
  - For OVMF+guest kernel combinations that don't do lazy-acceptance:
    I think the bulk of the flipping happens during boot, where most
    shared GPA ranges get converted to private memory, and then later
    on the guest kernel switches some memory back to shared for stuff
    like SWIOTLB; after that I think DMA mappings would be fairly
    stable.
  - For OVMF+guest kernel combinations that support lazy-acceptance:
    The first 4GB get converted to private, and the rest remains
    shared until the guest kernel needs to allocate memory from it.
    I'm not sure if SWIOTLB allocation is optimized to avoid
    unnecessary flipping when it's allocated from that pool of
    still-shared memory, but normal/private allocations will result
    in a steady stream of DMA unmap operations as the guest faults in
    its working set.
>
> If the real solution is SEV-TIO (which means essentially guest_memfd support
> for VFIO), calling ram_block_discard_require(true) may be the simplest
> stopgap solution.
Hard to guess how cloud vendors will feel about waiting for trusted
I/O. It does make sense in the context of CoCo to expect them to wait,
but it would be nice to have a stop-gap to offer, like disabling
discard, since it has minimal requirements on the QEMU/VFIO side and
might be enough to get early adopters up and running at least.
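For reference, the stop-gap Paolo suggests would amount to something
like the following when setting up a guest_memfd-backed region (where
exactly the call would live is hypothetical; ram_block_discard_require()
is the existing API):

  #include "qemu/osdep.h"
  #include "qapi/error.h"
  #include "exec/memory.h"

  /*
   * Hypothetical hook at guest_memfd-backed region creation: register
   * as a user that requires discarding. If VFIO has already called
   * ram_block_discard_disable(true) this fails with -EBUSY (and once
   * it succeeds, VFIO's disable fails instead), so discard and
   * uncoordinated VFIO DMA can never be active at the same time.
   */
  static int guest_memfd_require_discard(Error **errp)
  {
      if (ram_block_discard_require(true) < 0) {
          error_setg(errp, "guest_memfd needs to discard pages on "
                     "shared<->private conversion, but discard is "
                     "currently disabled (e.g. by a VFIO device)");
          return -EBUSY;
      }
      return 0;
  }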
All that said, if you think something based around RamDiscardManager
seems tenable given all of the above, then we can revisit that
approach as well.
-Mike
>
> Paolo
>
> > - convert page shared->private
> > - discard shared page
> > - convert page private->shared
> > - new page is allocated
> > - issue DMA operations against that shared page
> >
> > Address this by taking ram_block_discard_is_enabled() into account when
> > deciding whether or not to discard pages.
> >
> > Signed-off-by: Michael Roth <michael.roth@amd.com>
> > ---
> > accel/kvm/kvm-all.c | 8 ++++++--
> > 1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> > index 53ce4f091e..6ae03c880f 100644
> > --- a/accel/kvm/kvm-all.c
> > +++ b/accel/kvm/kvm-all.c
> > @@ -2962,10 +2962,14 @@ static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
> >                  */
> >                 return 0;
> >             } else {
> > -               ret = ram_block_discard_range(rb, offset, size);
> > +               ret = ram_block_discard_is_disabled()
> > +                     ? 0
> > +                     : ram_block_discard_range(rb, offset, size);
> >             }
> >         } else {
> > -           ret = ram_block_discard_guest_memfd_range(rb, offset, size);
> > +           ret = ram_block_discard_is_disabled()
> > +                 ? 0
> > +                 : ram_block_discard_guest_memfd_range(rb, offset, size);
> >         }
> >     } else {
> >         error_report("Convert non guest_memfd backed memory region "
>