
Re: [PATCH 0/7] Enable shared device assignment


From: David Hildenbrand
Subject: Re: [PATCH 0/7] Enable shared device assignment
Date: Fri, 10 Jan 2025 09:26:02 +0100
User-agent: Mozilla Thunderbird

On 10.01.25 08:06, Chenyi Qiang wrote:


On 1/10/2025 9:42 AM, Alexey Kardashevskiy wrote:


On 9/1/25 19:49, Chenyi Qiang wrote:


On 1/9/2025 4:18 PM, Alexey Kardashevskiy wrote:


On 9/1/25 18:52, Chenyi Qiang wrote:


On 1/8/2025 7:38 PM, Alexey Kardashevskiy wrote:


On 8/1/25 17:28, Chenyi Qiang wrote:
Thanks Alexey for your review!

On 1/8/2025 12:47 PM, Alexey Kardashevskiy wrote:
On 13/12/24 18:08, Chenyi Qiang wrote:
Commit 852f0048f3 ("RAMBlock: make guest_memfd require
uncoordinated
discard") effectively disables device assignment when using
guest_memfd.
This poses a significant challenge as guest_memfd is essential for
confidential guests, thereby blocking device assignment to these
VMs.
The initial rationale for disabling device assignment was due to
stale
IOMMU mappings (see Problem section) and the assumption that TEE
I/O
(SEV-TIO, TDX Connect, COVE-IO, etc.) would solve the device-
assignment
problem for confidential guests [1]. However, this assumption has
proven
to be incorrect. TEE I/O relies on the ability to operate devices
against
"shared" or untrusted memory, which is crucial for device
initialization
and error recovery scenarios. As a result, the current
implementation
does
not adequately support device assignment for confidential guests,
necessitating
a reevaluation of the approach to ensure compatibility and
functionality.

This series enables shared device assignment by notifying VFIO of page conversions using an existing framework named RamDiscardListener. Additionally, there is an ongoing patch set [2] that aims to add 1G page support for guest_memfd. That patch set introduces in-place page conversion, where private and shared memory share the same physical pages as the backend. This development may impact our solution.
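For readers new to the framework, below is a minimal sketch of how a consumer can be wired up with QEMU's RamDiscardListener API. The sketch_* callbacks and their bodies are illustrative placeholders, not the actual VFIO implementation (which lives in hw/vfio/), and the sketch only compiles within the QEMU tree:

/* Sketch only: callback bodies are placeholders, not the real VFIO code. */
#include "qemu/osdep.h"
#include "exec/memory.h"

static int sketch_notify_populate(RamDiscardListener *rdl,
                                  MemoryRegionSection *section)
{
    /* Pages became populated (here: converted to shared) ->
     * (re)establish the IOMMU mapping for this section. */
    return 0;
}

static void sketch_notify_discard(RamDiscardListener *rdl,
                                  MemoryRegionSection *section)
{
    /* Pages were discarded (here: converted to private) ->
     * tear down the IOMMU mapping for this section. */
}

static void sketch_register(RamDiscardManager *rdm,
                            MemoryRegionSection *section,
                            RamDiscardListener *rdl)
{
    ram_discard_listener_init(rdl, sketch_notify_populate,
                              sketch_notify_discard, true);
    ram_discard_manager_register_listener(rdm, rdl, section);
}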

We presented our solution in the guest_memfd meeting to discuss its compatibility with the new changes and potential future directions (see [3] for more details). The conclusion was that, although our solution may not be the most elegant (see the Limitation section), it is sufficient for now and can be easily adapted to future changes.

We are re-posting the patch series with some cleanup and have removed the RFC label for the main enabling patches (1-6). The newly-added patch 7 is still marked as RFC, as it tries to resolve some extension concerns related to RamDiscardManager for future usage.

The overview of the patches:
- Patch 1: Export a helper to get the intersection of a MemoryRegionSection with a given range (a sketch of what such a helper can look like follows this list).
- Patches 2-6: Introduce a new object to manage the guest-memfd with RamDiscardManager, and notify listeners of the shared/private state change during conversion.
- Patch 7: Try to resolve a semantics concern related to RamDiscardManager, i.e. that RamDiscardManager is used to manage the memory plug/unplug state rather than the shared/private state, which would affect future users of RamDiscardManager in confidential VMs. It is attached at the end as an RFC patch [4].
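For illustration only, roughly what such an intersection helper looks like, modeled on virtio-mem's private helper; the name and exact signature below are placeholders, not necessarily what Patch 1 exports:

/* Sketch: clamp a MemoryRegionSection to [offset, offset + size).
 * Returns false if there is no overlap. Name/signature illustrative. */
#include "qemu/osdep.h"
#include "exec/memory.h"

static bool sketch_section_intersect_range(MemoryRegionSection *s,
                                           uint64_t offset, uint64_t size)
{
    uint64_t start = MAX(s->offset_within_region, offset);
    uint64_t end = MIN(s->offset_within_region + int128_get64(s->size),
                       offset + size);

    if (end <= start) {
        return false; /* no overlap */
    }

    /* Shrink the section to the intersection. */
    s->offset_within_address_space += start - s->offset_within_region;
    s->offset_within_region = start;
    s->size = int128_make64(end - start);
    return true;
}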

Changes since last version:
- Add a patch to export some generic helper functions from the virtio-mem code.
- Change the bitmap in guest_memfd_manager from default-shared to default-private. This keeps it aligned with virtio-mem, where a set bit in the bitmap represents the populated state, and may help with exporting more generic code if necessary.
- Add helpers to initialize/uninitialize the guest_memfd_manager instance, to make things clearer.
- Add a patch to distinguish between the shared/private state change and the memory plug/unplug state change in RamDiscardManager.
- RFC: https://lore.kernel.org/qemu-devel/20240725072118.358923-1-chenyi.qiang@intel.com/

---

Background
==========
Confidential VMs have two classes of memory: shared and private memory. Shared memory is accessible from the host/VMM while private memory is not. Confidential VMs can decide which memory is shared/private and convert memory between shared/private at runtime.

"guest_memfd" is a new kind of fd whose primary goal is to serve
guest
private memory. The key differences between guest_memfd and normal
memfd
are that guest_memfd is spawned by a KVM ioctl, bound to its owner
VM and
cannot be mapped, read or written by userspace.
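For concreteness, a minimal sketch of the creation path, assuming a kernel that provides KVM_CREATE_GUEST_MEMFD; error handling and the surrounding KVM_CREATE_VM setup are elided:

/* Sketch: create a guest_memfd bound to a VM fd. */
#include <sys/ioctl.h>
#include <linux/kvm.h>

int create_gmem(int vm_fd, __u64 size)
{
    struct kvm_create_guest_memfd gmem = {
        .size  = size,
        .flags = 0,
    };
    /* Returns a new fd; unlike a memfd, it cannot be mmap'ed, read,
     * or written by userspace (modulo the in-progress mmap support
     * discussed just below). */
    return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
}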

The "cannot be mapped" seems to be not true soon anymore (if not
already).

https://lore.kernel.org/all/20240801090117.3841080-1-
tabba@google.com/T/

Exactly, allowing guest_memfd to be mmap'ed is the direction. I mentioned it below together with in-place page conversion. Maybe I should move it up here to make it clearer.




In QEMU's implementation, shared memory is allocated with normal methods (e.g. mmap or fallocate) while private memory is allocated from guest_memfd. When a VM performs memory conversions, QEMU frees pages via madvise() or via PUNCH_HOLE on the memfd or guest_memfd on one side, and allocates new pages from the other side.
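Sketched minimally (addresses, offsets, and fds are placeholders; QEMU's actual discard path goes through ram_block_discard_range() and friends):

/* Sketch: discarding the "old" side of a conversion. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>

/* Shared -> private: drop the shared (mmap'ed) pages. */
static int discard_shared(void *hva, size_t len)
{
    return madvise(hva, len, MADV_DONTNEED);
}

/* Private -> shared: punch a hole in the guest_memfd backing. */
static int discard_private(int gmem_fd, off_t offset, off_t len)
{
    return fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}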


[...]


One limitation (also discussed in the guest_memfd meeting) is that VFIO expects the DMA mapping for a specific IOVA to be mapped and unmapped with the same granularity. The guest may perform partial conversions, such as converting a small region within a larger region. To prevent such invalid cases, all operations are performed with 4K granularity. The possible solutions we can think of are either to enable VFIO to support partial unmap
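To make the granularity constraint concrete, here is a hedged sketch of the unmap side against legacy VFIO type1. If I read vfio_iommu_type1 (v2) correctly, an unmap range that would bisect an existing mapping is rejected with -EINVAL, which is why mapping at a uniform 4K granularity keeps every later unmap exact; the fd and ranges below are placeholders:

/* Sketch: unmap a DMA range on a legacy VFIO type1 container. */
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int unmap_range(int container_fd, __u64 iova, __u64 size)
{
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .iova  = iova,
        .size  = size,
    };
    /* If [iova, iova + size) would split an existing mapping (e.g.
     * unmapping 4K out of a 2M mapping), type1 v2 rejects it, so the
     * series maps everything at 4K to keep unmaps exact. */
    return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}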

By the way, the old VFIO does not split mappings, but iommufd seems to be capable of it - there is iopt_area_split(). What happens if you try unmapping a smaller chunk that does not exactly match any mapped chunk? Thanks,

iopt_cut_iova() happens in iommufd's vfio_compat.c, which makes iommufd compatible with the old VFIO_TYPE1. IIUC, it happens with disable_large_page=true, which means large IOPTEs are also disabled in the IOMMU, so the split can be done easily. See the comment in iommufd_vfio_set_iommu().

iommufd's VFIO-compatible mode is a transition path from legacy VFIO to iommufd. Normal iommufd requires that the unmapped iova/length be a superset of a previously mapped range; if they don't match, an error is returned.
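For comparison, a sketch of the native iommufd unmap path (struct layout as I understand include/uapi/linux/iommufd.h; the fd and ids are placeholders):

/* Sketch: native iommufd unmap. Unlike the VFIO-compat path, the range
 * must line up with previously mapped area(s); it will not split one. */
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int ioas_unmap(int iommufd, __u32 ioas_id, __u64 iova, __u64 length)
{
    struct iommu_ioas_unmap unmap = {
        .size    = sizeof(unmap),
        .ioas_id = ioas_id,
        .iova    = iova,
        .length  = length,
    };
    return ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap);
}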


This is all true but this also means that "The former requires complex
changes in VFIO" is not entirely true - some code is already there.
Thanks,

Hmm, my statement was a little confusing. The bottleneck is that the IOMMU driver doesn't support splitting large pages. So if we want to enable large pages and still allow partial unmap, complex changes are required.

We won't need to split large pages (if we stick to 4K for now); we need to split large mappings (not large pages) to allow partial unmapping, and iopt_area_split() seems to be doing exactly that. Thanks,

You mean we can disable large pages in iommufd and then VFIO will be able to do partial unmaps? Yes, I think it is doable, and it would let us avoid a lot of ioctl context-switch overhead.

So, if I understand this correctly: disable_large_pages=true implies that we never have PMD mappings, so that we can atomically poke a hole in a mapping without temporarily having to remove a PMD mapping from the IOMMU table in order to insert a PTE table?

batch_iommu_map_small() seems to document that behavior.

It's interesting that the comment points out that this is purely "VFIO compatibility", and that it otherwise violates the iommufd invariant of "pairing map/unmap". So it goes against the real iommufd design ...

Back when working on virtio-mem support (RamDiscardManager), I thought there was no way to reliably do atomic partial unmaps.

--
Cheers,

David / dhildenb



