Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
discard") effectively disables device assignment when using guest_memfd.
This poses a significant challenge because guest_memfd is essential for
confidential guests, so device assignment to these VMs is blocked. The
initial rationale for disabling device assignment was the risk of stale
IOMMU mappings (see the Problem section) and the assumption that TEE I/O
(SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-assignment
problem for confidential guests [1]. However, this assumption has proven
incorrect. TEE I/O relies on the ability to operate devices against
"shared" or untrusted memory, which is crucial for device initialization
and error recovery. As a result, device assignment for confidential
guests is not adequately supported today, and the approach needs to be
reevaluated.

This series enables shared device assignment by notifying VFIO of page
conversions using an existing framework named RamDiscardListener.
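As a rough illustration of the notification idea (not the code in this
series), the sketch below models a per-block shared/private bitmap and a
listener that is told to map a block when it becomes shared and to unmap
it when it becomes private. All names (ModelListener, ModelManager,
model_convert, MODEL_BLOCKS) are hypothetical and only loosely mirror
the populate/discard callbacks of RamDiscardListener.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BLOCKS 8 /* number of blocks tracked by this toy model */

/* Hypothetical listener, standing in for the VFIO side. */
typedef struct ModelListener {
    int (*notify_populate)(struct ModelListener *l, size_t first, size_t count);
    void (*notify_discard)(struct ModelListener *l, size_t first, size_t count);
    int mapped; /* blocks currently mapped for DMA in this model */
} ModelListener;

/* Hypothetical manager: true = shared (populated), false = private. */
typedef struct ModelManager {
    bool shared[MODEL_BLOCKS];
    ModelListener *listener;
} ModelManager;

/* Listener callbacks: a real listener would update IOMMU mappings here. */
int model_map(ModelListener *l, size_t first, size_t count)
{
    (void)first;
    l->mapped += (int)count;
    return 0;
}

void model_unmap(ModelListener *l, size_t first, size_t count)
{
    (void)first;
    l->mapped -= (int)count;
}

/*
 * Convert blocks [first, first + count) to shared or private and notify
 * the listener only for blocks whose state actually changes.
 * Returns 0 on success, -1 if the range is out of bounds, or the
 * listener's error code if a populate notification fails.
 */
int model_convert(ModelManager *m, size_t first, size_t count, bool to_shared)
{
    if (first > MODEL_BLOCKS || count > MODEL_BLOCKS - first) {
        return -1;
    }
    for (size_t i = first; i < first + count; i++) {
        if (m->shared[i] == to_shared) {
            continue; /* already in the requested state */
        }
        if (to_shared) {
            int ret = m->listener->notify_populate(m->listener, i, 1);
            if (ret) {
                return ret;
            }
        } else {
            m->listener->notify_discard(m->listener, i, 1);
        }
        m->shared[i] = to_shared;
    }
    return 0;
}
```

In this model every block starts out private, so a listener has nothing
mapped until the guest converts a range to shared.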

Additionally, there is an ongoing patch set [2] that aims to add 1G page
support for guest_memfd. This patch set introduces in-place page
conversion, where private and shared memory share the same physical
pages as the backend. This development may impact our solution.

We presented our solution in the guest_memfd meeting to discuss its
compatibility with the new changes and potential future directions (see
[3] for more details). The conclusion was that, although our solution
may not be the most elegant (see the Limitation section), it is
sufficient for now and can be easily adapted to future changes.

We are re-posting the patch series with some cleanup and have removed
the RFC label from the main enabling patches (1-6). The newly added
patch 7 is still marked as RFC because it tries to resolve some
extensibility concerns about RamDiscardManager for future usage.

The overview of the patches:
- Patch 1: Export a helper to get the intersection of a
  MemoryRegionSection with a given range.
- Patch 2-6: Introduce a new object to manage the guest_memfd with
  RamDiscardManager, and notify the shared/private state change during
  conversion.
- Patch 7: Try to resolve a semantics concern related to
  RamDiscardManager, i.e. RamDiscardManager is used to manage the memory
  plug/unplug state instead of the shared/private state. It would affect
  future users of RamDiscardManager in confidential VMs. It is attached
  at the end as an RFC patch [4].
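For readers unfamiliar with the kind of helper exported in patch 1, the
following is a simplified sketch of clipping a section to a range. The
ModelSection type and model_section_intersect() are hypothetical; the
real helper works on a MemoryRegionSection, which also carries an
address-space offset that this model leaves out. The sketch assumes
offset + size does not overflow.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the part of a section that matters here. */
typedef struct ModelSection {
    uint64_t offset; /* start offset within the memory region */
    uint64_t size;   /* length in bytes */
} ModelSection;

/*
 * Clip *s to the range [offset, offset + size).
 * Returns true and updates *s if the two overlap; returns false and
 * leaves *s untouched if they do not.
 */
bool model_section_intersect(ModelSection *s, uint64_t offset, uint64_t size)
{
    uint64_t start = s->offset > offset ? s->offset : offset;
    uint64_t section_end = s->offset + s->size;
    uint64_t range_end = offset + size;
    uint64_t end = section_end < range_end ? section_end : range_end;

    if (end <= start) {
        return false; /* empty intersection */
    }
    s->offset = start;
    s->size = end - start;
    return true;
}
```

A caller that is notified about a converted range can use such a helper
to restrict the work to the part of the range that its own section
covers.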

Changes since last version:
- Add a patch to export some generic helper functions from virtio-mem
  code.
- Change the bitmap in guest_memfd_manager from default shared to
  default private. This keeps alignment with virtio-mem, where a set bit
  in the bitmap represents the populated state, and may help to export
  more generic code if necessary.
- Add helpers to initialize/uninitialize the guest_memfd_manager
  instance to make it clearer.
- Add a patch to distinguish between the shared/private state change
  and the memory plug/unplug state change in RamDiscardManager.
- RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-chenyi.qiang@intel.com/
---
Background
==========
Confidential VMs have two classes of memory: shared and private memory.
Shared memory is accessible from the host/VMM while private memory is
not. Confidential VMs can decide which memory is shared/private and
convert memory between shared/private at runtime.

"guest_memfd" is a new kind of fd whose primary goal is to serve guest
private memory. The key differences between guest_memfd and normal memfd
are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM,
and cannot be mapped, read, or written by userspace.
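To make the "spawned by a KVM ioctl" point concrete, here is a minimal
sketch of creating a guest_memfd from an existing VM file descriptor.
It assumes Linux UAPI headers that define KVM_CREATE_GUEST_MEMFD
(Linux 6.8 or later); the wrapper name create_guest_memfd_sketch() is
hypothetical, and on older headers it simply reports -ENOSYS.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Ask KVM for a guest_memfd of `size` bytes bound to the VM behind vm_fd.
 * Returns the new fd on success or a negative errno value on failure.
 */
int create_guest_memfd_sketch(int vm_fd, uint64_t size)
{
#ifdef KVM_CREATE_GUEST_MEMFD
    struct kvm_create_guest_memfd args;
    int fd;

    memset(&args, 0, sizeof(args)); /* flags and reserved fields stay zero */
    args.size = size;

    /* The ioctl is issued on the VM fd, which ties the memory to that VM. */
    fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
    return fd < 0 ? -errno : fd;
#else
    (void)vm_fd;
    (void)size;
    return -ENOSYS; /* headers too old to know about guest_memfd */
#endif
}
```

Because the fd comes from a VM-scoped ioctl rather than memfd_create(),
it is tied to that VM from the moment it is created.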