qemu-devel

Re: [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation


From: Jason Wang
Subject: Re: [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation
Date: Mon, 22 Jan 2024 12:29:51 +0800

On Mon, Jan 15, 2024 at 6:39 PM Zhenzhong Duan <zhenzhong.duan@intel.com> wrote:
>
> Hi,
>
> This series enables stage-1 translation support in intel_iommu, which
> we call "modern" mode. In this mode, we don't shadow the guest page
> table for passthrough devices but pass the stage-1 page table to the
> host side to construct a nested domain; we also support emulated
> devices by translating the stage-1 page table. There was some effort
> to enable this feature in the past, see [1] for details.
>
> The key design is to utilize the dual-stage IOMMU translation
> (also known as IOMMU nested translation) capability in the host IOMMU.
> As the diagram below shows, the guest I/O page table pointer in GPA
> (guest physical address) is passed to the host and used to perform
> the stage-1 address translation. Accordingly, modifications to
> present mappings in the guest I/O page table should be followed
> by an IOTLB invalidation.
>
>         .-------------.  .---------------------------.
>         |   vIOMMU    |  | Guest I/O page table      |
>         |             |  '---------------------------'
>         .----------------/
>         | PASID Entry |--- PASID cache flush --+
>         '-------------'                        |
>         |             |                        V
>         |             |           I/O page table pointer in GPA
>         '-------------'
>     Guest
>     ------| Shadow |---------------------------|--------
>           v        v                           v
>     Host
>         .-------------.  .------------------------.
>         |   pIOMMU    |  |  FS for GIOVA->GPA     |
>         |             |  '------------------------'
>         .----------------/  |
>         | PASID Entry |     V (Nested xlate)
>         '----------------\.----------------------------------.
>         |             |   | SS for GPA->HPA, unmanaged domain|
>         |             |   '----------------------------------'
>         '-------------'
> Where:
>  - FS = First stage page tables
>  - SS = Second stage page tables
> <Intel VT-d Nested translation>
>
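
To make the nested translation in the diagram concrete, here is a minimal sketch in C. It is a toy model, not the VT-d page table format: each stage is reduced to a flat array of page-granular mappings, and all names (ToyStage, toy_stage_xlate, toy_nested_xlate) are hypothetical. It also ignores that real hardware translates the stage-1 table walk itself through stage-2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1)
#define TOY_INVALID    UINT64_MAX

/* One page-granular mapping: input page frame -> output page frame. */
typedef struct ToyMapping {
    uint64_t in_pfn;
    uint64_t out_pfn;
} ToyMapping;

/* A "page table" reduced to a flat array of mappings (hypothetical). */
typedef struct ToyStage {
    const ToyMapping *maps;
    size_t count;
} ToyStage;

/* Translate one address through a single stage; TOY_INVALID on a miss. */
uint64_t toy_stage_xlate(const ToyStage *st, uint64_t addr)
{
    assert(st != NULL);
    for (size_t i = 0; i < st->count; i++) {
        if (st->maps[i].in_pfn == (addr >> TOY_PAGE_SHIFT)) {
            return (st->maps[i].out_pfn << TOY_PAGE_SHIFT) |
                   (addr & TOY_PAGE_MASK);
        }
    }
    return TOY_INVALID;
}

/* Nested translation: FS (GIOVA->GPA) first, then SS (GPA->HPA). */
uint64_t toy_nested_xlate(const ToyStage *fs, const ToyStage *ss,
                          uint64_t giova)
{
    uint64_t gpa = toy_stage_xlate(fs, giova);

    if (gpa == TOY_INVALID) {
        return TOY_INVALID; /* stage-1 fault */
    }
    return toy_stage_xlate(ss, gpa); /* may report a stage-2 fault */
}
```

In this model the guest owns the FS array and the host owns the SS array, which mirrors why a guest change to a present FS mapping has to be followed by an invalidation.
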
> There are some interactions between VFIO and vIOMMU:
> * vIOMMU registers PCIIOMMUOps to the PCI subsystem, which VFIO can
>   use to register/unregister an IOMMUDevice object.
> * VFIO registers an IOMMUFDDevice object with vIOMMU at the vfio device
>   realize stage; this is implemented as a prerequisite series[2].
> * vIOMMU calls the IOMMUFDDevice interface callbacks (IOMMUFDDeviceOps)
>   to bind/unbind the device to IOMMUFD-backed domains, either a nested
>   domain or not.
>
> See below diagram:
>
>         VFIO Device                                 Intel IOMMU
>     .-----------------.                         .-------------------.
>     |                 |                         |                   |
>     |       .---------|PCIIOMMUOps              |.-------------.    |
>     |       | IOMMUFD |(set_iommu_device)       || IOMMUFD     |    |
>     |       | Device  |------------------------>|| Device list |    |
>     |       .---------|(unset_iommu_device)     |.-------------.    |
>     |                 |                         |       |           |
>     |                 |                         |       V           |
>     |       .---------|         IOMMUFDDeviceOps|  .---------.      |
>     |       | IOMMUFD |            (attach_hwpt)|  | IOMMUFD |      |
>     |       | link    |<------------------------|  | Device  |      |
>     |       .---------|            (detach_hwpt)|  .---------.      |
>     |                 |                         |       |           |
>     |                 |                         |       ...         |
>     .-----------------.                         .-------------------.
>
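
The two callback directions in the diagram can be sketched as a pair of ops tables. This is only a toy model under stated assumptions: the callback names (set_iommu_device, unset_iommu_device, attach_hwpt, detach_hwpt) come from the diagram, but every type, signature and field below is hypothetical and does not reflect the actual QEMU definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToyIOMMUFDDevice ToyIOMMUFDDevice;

/* Callbacks the vIOMMU invokes on a device (modeled on IOMMUFDDeviceOps). */
typedef struct ToyIOMMUFDDeviceOps {
    int (*attach_hwpt)(ToyIOMMUFDDevice *idev, uint32_t hwpt_id);
    int (*detach_hwpt)(ToyIOMMUFDDevice *idev);
} ToyIOMMUFDDeviceOps;

struct ToyIOMMUFDDevice {
    const ToyIOMMUFDDeviceOps *ops;
    uint32_t hwpt_id;       /* 0 means "not attached" in this toy model */
    ToyIOMMUFDDevice *next; /* link in the vIOMMU's device list */
};

typedef struct ToyVIOMMU {
    ToyIOMMUFDDevice *devices;
} ToyVIOMMU;

/* Callbacks the device side invokes on the vIOMMU (modeled on PCIIOMMUOps). */
typedef struct ToyPCIIOMMUOps {
    int (*set_iommu_device)(ToyVIOMMU *s, ToyIOMMUFDDevice *idev);
    void (*unset_iommu_device)(ToyVIOMMU *s, ToyIOMMUFDDevice *idev);
} ToyPCIIOMMUOps;

/* Device side: bind to a hardware page table, refusing a double attach. */
int toy_attach_hwpt(ToyIOMMUFDDevice *idev, uint32_t hwpt_id)
{
    assert(idev != NULL);
    if (idev->hwpt_id != 0) {
        return -1;
    }
    idev->hwpt_id = hwpt_id;
    return 0;
}

int toy_detach_hwpt(ToyIOMMUFDDevice *idev)
{
    assert(idev != NULL);
    idev->hwpt_id = 0;
    return 0;
}

const ToyIOMMUFDDeviceOps toy_idev_ops = { toy_attach_hwpt, toy_detach_hwpt };

/* vIOMMU side: remember the device so it can be attached later. */
int toy_set_iommu_device(ToyVIOMMU *s, ToyIOMMUFDDevice *idev)
{
    idev->next = s->devices;
    s->devices = idev;
    return 0;
}

void toy_unset_iommu_device(ToyVIOMMU *s, ToyIOMMUFDDevice *idev)
{
    for (ToyIOMMUFDDevice **p = &s->devices; *p; p = &(*p)->next) {
        if (*p == idev) {
            *p = idev->next;
            idev->next = NULL;
            return;
        }
    }
}

const ToyPCIIOMMUOps toy_pci_iommu_ops = {
    toy_set_iommu_device, toy_unset_iommu_device
};
```

The point of the split is that the device registers itself once, and the vIOMMU then decides when and to which page table the device is attached.
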
> Based on Yi's suggestion, we reworked the design for managing ioas and
> hwpt so that it supports multiple iommufd objects and the ERRATA_772415
> case, while still sharing ioas and hwpt whenever possible.
>
> A stage-2 page table can be shared by different devices if there is
> no conflict and the devices link to the same iommufd object, i.e. devices
> under the same host IOMMU can share the same stage-2 page table. If there
> is a conflict, e.g. one device is in non-cache-coherent mode while the
> others are not, that device requires a separate stage-2 page table in
> non-CC mode.
>
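
The sharing rule above can be sketched as a lookup keyed on (iommufd object, cache coherency). This is a minimal illustration, assuming a flat table of stage-2 page tables; the type and function names are hypothetical and the real series may track more state than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stage-2 page table descriptor (hypothetical). */
typedef struct ToyS2Hwpt {
    int iommufd_id; /* which iommufd object owns this table */
    bool cc;        /* true if the table is in cache-coherent mode */
    int users;      /* number of devices sharing the table */
} ToyS2Hwpt;

/* Toy passthrough device (hypothetical). */
typedef struct ToyPassthruDev {
    int iommufd_id;
    bool cc;
} ToyPassthruDev;

/* A table is shareable only with the same iommufd and the same CC mode. */
bool toy_hwpt_compatible(const ToyS2Hwpt *h, const ToyPassthruDev *d)
{
    return h->iommufd_id == d->iommufd_id && h->cc == d->cc;
}

/*
 * Reuse a compatible table if one exists, otherwise create a new one.
 * Returns the table index, or -1 if the table array is full.
 */
int toy_get_hwpt(ToyS2Hwpt *tbl, size_t *count, size_t cap,
                 const ToyPassthruDev *d)
{
    assert(tbl != NULL && count != NULL && d != NULL);

    for (size_t i = 0; i < *count; i++) {
        if (toy_hwpt_compatible(&tbl[i], d)) {
            tbl[i].users++;
            return (int)i;
        }
    }
    if (*count == cap) {
        return -1;
    }

    size_t idx = *count;
    tbl[idx] = (ToyS2Hwpt){ d->iommufd_id, d->cc, 1 };
    *count = idx + 1;
    return (int)idx;
}
```

With two CC devices and one non-CC device on the same iommufd, this yields one shared CC table and one separate non-CC table.
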
> The SPR platform has ERRATA_772415, which requires that there be no
> read-only mappings in the stage-2 page table. This series supports creating
> a VTDIOASContainer with no read-only mappings. I'm not sure whether there is
> a rare case where only some IOMMUs on a multi-IOMMU host have ERRATA_772415,
> but this design can survive even that case.
>
> See below example diagram for a full view:
>
>       IntelIOMMUState
>              |
>              V
>     .------------------.    .------------------.    .-------------------.
>     | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
>     | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
>     .------------------.    .------------------.    .-------------------.
>              |                       |                              |
>              |                       .-->...                        |
>              V                                                      V
>       .-------------------.    .-------------------.          .---------------.
>       |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
>       .-------------------.    .-------------------.          .---------------.
>           |            |               |                            |
>           |            |               |                            |
>     .-----------.  .-----------.  .------------.              .------------.
>     | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
>     | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
>     | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
>     |           |  |           |  | (iommufd0) |              | (iommufd0) |
>     .-----------.  .-----------.  .------------.              .------------.
>
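
The top row of the diagram suggests that containers are keyed by the iommufd object and by whether read-only mappings are allowed. A minimal sketch of that selection follows, assuming a simple linked list; the names are hypothetical, and treating "device has the errata" as "needs an RW-only container" is an assumption read off the diagram rather than taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a VTDIOASContainer list node (hypothetical fields). */
typedef struct ToyContainer {
    int iommufd_id;
    bool rw_only; /* true: read-only mappings are never installed */
    struct ToyContainer *next;
} ToyContainer;

/* Find the container matching a device's iommufd and errata requirement. */
ToyContainer *toy_find_container(ToyContainer *head, int iommufd_id,
                                 bool needs_rw_only)
{
    for (ToyContainer *c = head; c != NULL; c = c->next) {
        if (c->iommufd_id == iommufd_id && c->rw_only == needs_rw_only) {
            return c;
        }
    }
    return NULL; /* caller would create a new container here */
}

/* Decide whether a mapping may be installed into a container. */
bool toy_container_accepts(const ToyContainer *c, bool readonly_mapping)
{
    assert(c != NULL);
    return !(c->rw_only && readonly_mapping);
}
```

Because the key includes the iommufd object, a host where only some IOMMUs have the errata simply ends up with both kinds of container side by side.
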
> This series is also prerequisite work for vSVA, i.e. sharing the
> guest application address space with passthrough devices.
>
> To enable "modern" mode, just add "x-scalable-mode=modern",
> e.g. -device intel-iommu,x-scalable-mode=modern,...
>
> A passthrough device should use the iommufd backend to work in "modern" mode,
> e.g. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>
> If the host doesn't support nested translation, qemu will fail
> with an unsupported report.
>
> Test done:
> - devices hotplug/unplug
> - different devices linked to different iommufds
>
> PATCH1-2:  Some preparing work to update header and IOMMUFD uAPI
> PATCH3-4:  Initialize vfio IOMMUFDDevice interface and pass to vIOMMU
> PATCH5:    Introduce a placeholder variable for scalable modern mode
> PATCH6:    Sync host cap/ecap with vIOMMU default cap/ecap in modern mode
> PATCH7-22: Implement first stage page table for passthrough and emulated device

Can we split the series and start from the emulated devices (and have
a qtest for that)? This might help for reviewing.

Thanks



