qemu-commits

From: Peter Maydell
Subject: [Qemu-commits] [qemu/qemu] 77ef8f: pci: Use PCI aliases when determining device IOMMU...
Date: Thu, 07 Nov 2019 06:45:05 -0800

  Branch: refs/heads/master
  Home:   https://github.com/qemu/qemu
  Commit: 77ef8f8db2b2dd9d646a47a6a4154e27a96c929a
      https://github.com/qemu/qemu/commit/77ef8f8db2b2dd9d646a47a6a4154e27a96c929a
  Author: Alex Williamson <address@hidden>
  Date:   2019-11-05 (Tue, 05 Nov 2019)

  Changed paths:
    M hw/pci/pci.c

  Log Message:
  -----------
  pci: Use PCI aliases when determining device IOMMU address space

PCIe requester IDs are used by modern IOMMUs to differentiate devices
in order to provide a unique IOVA address space per device.  These
requester IDs are composed of the bus/device/function (BDF) of the
requesting device.  Conventional PCI pre-dates this concept and is
simply a shared parallel bus where transactions are claimed by
decoding target ranges rather than the packetized, point-to-point
mechanisms of PCI-express.  In order to interface conventional PCI
to PCIe, the PCIe-to-PCI bridge creates and accepts packetized
transactions on behalf of all downstream devices, using one of two
potential forms of a requester ID relating to the bridge itself or its
subordinate bus.  All downstream devices are therefore aliased by the
bridge's requester ID and it's not possible for the IOMMU to create
unique IOVA spaces for devices downstream of such buses.
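
As a rough illustration (this is not QEMU code, and the helper names are
hypothetical), the requester ID is just the 16-bit BDF, and a conventional
bridge substitutes one of its two possible IDs for everything behind it:

#include <stdint.h>

/*
 * Requester-ID composition: bus in bits [15:8], device in [7:3],
 * function in [2:0].
 */
static inline uint16_t requester_id(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return ((uint16_t)bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x7);
}

/*
 * A PCIe-to-PCI bridge issues packetized transactions on behalf of its
 * downstream devices, so the IOMMU sees one of two IDs instead of the
 * endpoint's own BDF: the bridge's BDF, or "<downstream bus>:00.0".
 * Which form a given bridge uses is a property of the hardware.
 */
static inline uint16_t bridge_alias_id(uint8_t bridge_bus, uint8_t bridge_dev,
                                       uint8_t bridge_fn, uint8_t downstream_bus,
                                       int use_bus_form)
{
    return use_bus_form ? requester_id(downstream_bus, 0, 0)
                        : requester_id(bridge_bus, bridge_dev, bridge_fn);
}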

At least that's how it works on bare metal.  Until now we've
ignored this nuance of vIOMMU support in QEMU, creating a unique
AddressSpace per device regardless of the virtual bus topology.

Aside from simply being true to bare metal behavior, there are aspects
of a shared address space that we can use to our advantage when
designing a VM.  For instance, a PCI device assignment scenario where
we have the following IOMMU group on the host system:

  $ ls  /sys/kernel/iommu_groups/1/devices/
  0000:00:01.0  0000:01:00.0  0000:01:00.1

An IOMMU group is considered the smallest set of devices which are
fully DMA isolated from other devices by the IOMMU.  In this case the
root port at 00:01.0 does not guarantee that it prevents peer to peer
traffic between the endpoints on bus 01: and the devices are therefore
grouped together.  VFIO considers an IOMMU group to be the smallest
unit of device ownership and allows only a single shared IOVA space
per group due to the limitations of the isolation.

Therefore, if we attempt to create the following VM, we get an error:

qemu-system-x86_64 -machine q35... \
  -device intel-iommu,intremap=on \
  -device pcie-root-port,addr=1e.0,id=pcie.1 \
  -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
  -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1

qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
0000:01:00.1: group 1 used in multiple address spaces

VFIO only allows a single IOVA space (AddressSpace) for both devices,
but we've placed them into a topology where the vIOMMU expects a
separate AddressSpace for each device.  On bare metal we know that
a conventional PCI bus would provide the sort of aliasing we need
here, forcing the IOMMU to consider these devices to be part of a
single shared IOVA space.  The support provided here does the same
for QEMU, such that we can create a conventional PCI topology to
expose equivalent AddressSpace sharing requirements to the VM:

qemu-system-x86_64 -machine q35... \
  -device intel-iommu,intremap=on \
  -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
  -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
  -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
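
A rough sketch of the kind of topology walk this implies; the types and
helpers below are hypothetical and are not the actual hw/pci/pci.c change:

#include <stdint.h>

/*
 * Hypothetical, simplified bus model: walk upward from the device toward
 * the bus that provides the vIOMMU ops, and whenever we pass through a
 * conventional (non-express) bus, collapse the ID onto the bridge that
 * leads to it, since the IOMMU can only ever see the bridge's requester ID.
 */
typedef struct BusSketch BusSketch;
struct BusSketch {
    BusSketch *parent;        /* NULL at the host bridge              */
    int is_express;           /* conventional buses alias downstream  */
    int has_iommu_ops;        /* bus where the vIOMMU is attached     */
    uint8_t bridge_devfn;     /* devfn of this bus's bridge on parent */
};

/* Return the devfn the vIOMMU should key the AddressSpace on. */
static uint8_t iommu_alias_devfn(BusSketch *bus, uint8_t devfn)
{
    while (bus->parent && !bus->has_iommu_ops) {
        if (!bus->is_express) {
            /* Devices on a conventional bus share the bridge's ID. */
            devfn = bus->bridge_devfn;
        }
        bus = bus->parent;
    }
    return devfn;
}

With the pcie-pci-bridge topology above, both vfio-pci functions resolve to
the same aliased ID, so the vIOMMU hands them a single shared AddressSpace,
which is what VFIO requires for the group.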

There are pros and cons to this configuration; it's not necessarily
recommended.  It's simply a tool we can use to create configurations
which may provide additional functionality in spite of host hardware
limitations, or as a benefit to the guest configuration or resource
usage.  An incomplete list of pros and cons:

Cons:
 a) Extended PCI configuration space is unavailable to devices
    downstream of a conventional PCI bus.  The degree to which this
    is a drawback depends on the device and guest drivers.
 b) Applying this topology to devices which are already isolated by
    the host IOMMU (singleton IOMMU groups) will result in devices
    which appear to be non-isolated to the VM (non-singleton groups).
    This can limit configurations within the guest, such as userspace
    drivers or nested device assignment.

Pros:
 a) QEMU better emulates bare metal.
 b) Configurations as above are now possible.
 c) Host IOMMU resources and VM locked memory requirements are reduced
    in vIOMMU configurations due to shared IOMMU domains on the host
    and avoidance of duplicate locked memory accounting.

Reviewed-by: Peter Xu <address@hidden>
Signed-off-by: Alex Williamson <address@hidden>
Message-Id: <address@hidden>
Reviewed-by: Michael S. Tsirkin <address@hidden>
Signed-off-by: Michael S. Tsirkin <address@hidden>


  Commit: 977aff1045b01579e73020a271336e559cdd6b58
      https://github.com/qemu/qemu/commit/977aff1045b01579e73020a271336e559cdd6b58
  Author: Alex Williamson <address@hidden>
  Date:   2019-11-05 (Tue, 05 Nov 2019)

  Changed paths:
    M hw/i386/acpi-build.c

  Log Message:
  -----------
  hw/i386: AMD-Vi IVRS DMA alias support

When we account for DMA aliases in the PCI address space, we can no
longer use a single IVHD entry in the IVRS covering all devices.  We
instead need to walk the PCI bus and create alias ranges when we find
a conventional bus.  These alias ranges cannot overlap with a "Select
All" range (as currently implemented), so we also need to enumerate
each device with IVHD entries.

Importantly, the IVHD entries used here include a Device ID, which is
simply the PCI BDF (Bus/Device/Function).  The guest firmware is
responsible for programming bus numbers, so the final revision of this
table depends on the update mechanism (acpi_build_update) to be called
after guest PCI enumeration.
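
Loosely sketched (hypothetical helper names, not the actual
hw/i386/acpi-build.c code), the Device ID encoding and the aliasing of a
conventional bus might look like:

#include <stdint.h>
#include <stdio.h>

/* IVHD Device ID is the PCI BDF: bus[15:8], device[7:3], function[2:0]. */
static uint16_t ivhd_devid(uint8_t bus, uint8_t devfn)
{
    return ((uint16_t)bus << 8) | devfn;
}

/* Stand-ins for real IVHD entry emitters; here they only log the intent. */
static void emit_dev_select(uint16_t devid)
{
    printf("DEV_SELECT      devid: %02x:%02x.%x\n",
           devid >> 8, (devid >> 3) & 0x1f, devid & 0x7);
}

static void emit_dev_alias_range(uint16_t start, uint16_t end, uint16_t alias_to)
{
    printf("DEV_ALIAS_RANGE devid: %04x..%04x -> devid_to: %04x\n",
           start, end, alias_to);
}

/*
 * Express devices get one DEV_SELECT per BDF.  A conventional bus behind
 * a bridge instead gets every possible devfn on its bus range aliased to
 * a single requester ID chosen for the bridge.
 */
static void describe_conventional_bus(uint8_t secondary, uint8_t subordinate,
                                      uint16_t alias_to)
{
    emit_dev_alias_range(ivhd_devid(secondary, 0x00),
                         ivhd_devid(subordinate, 0xff), alias_to);
}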

For an example guest configuration of:

-+-[0000:40]---00.0-[41]----00.0  Intel Corporation 82574L Gigabit Network Connection
 \-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
             +-01.0  Device 1234:1111
             +-02.0-[01]----00.0  Intel Corporation 82574L Gigabit Network Connection
             +-02.1-[02]----00.0  Red Hat, Inc. QEMU XHCI Host Controller
             +-02.2-[03]--
             +-02.3-[04]--
             +-02.4-[05]--
             +-02.5-[06-09]----00.0-[07-09]--+-00.0-[08]--
             |                               \-01.0-[09]----00.0  Intel Corporation 82574L Gigabit Network Connection
             +-02.6-[0a-0c]----00.0-[0b-0c]--+-01.0-[0c]--
             |                               \-03.0  Intel Corporation 82540EM Gigabit Ethernet Controller
             +-02.7-[0d]----0e.0  Intel Corporation 82540EM Gigabit Ethernet Controller
             +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-04.0  Advanced Micro Devices, Inc. [AMD] Device 0020
             +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller
             +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode]
             \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller

Where we have:

00:02.7 PCI bridge: Intel Corporation 82801 PCI Bridge
 (dmi-to-pci-bridge)
00:03.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
 (pcie-expander-bus)
06:00.0 PCI bridge: Texas Instruments XIO3130 PCI Express Switch (Upstream)
 (pcie-switch-upstream-port)
07:00.0 PCI bridge: Texas Instruments XIO3130 PCI Express Switch (Downstream)
 (pcie-switch-downstream-port)
07:01.0 PCI bridge: Texas Instruments XIO3130 PCI Express Switch (Downstream)
 (pcie-switch-downstream-port)
0a:00.0 PCI bridge: Red Hat, Inc. Device 000e
 (pcie-to-pci-bridge)

The following IVRS table is produced:

AMD-Vi: Using IVHD type 0x10
AMD-Vi: device: 00:04.0 cap: 0040 seg: 0 flags: d1 info 0000
AMD-Vi:        mmio-addr: 00000000fed80000
AMD-Vi:   DEV_SELECT                     devid: 40:00.0 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START         devid: 41:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END          devid: 41:1f.7
AMD-Vi:   DEV_SELECT                     devid: 00:00.0 flags: 00
AMD-Vi:   DEV_SELECT                     devid: 00:01.0 flags: 00
AMD-Vi:   DEV_SELECT                     devid: 00:02.0 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START         devid: 01:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END          devid: 01:1f.7
AMD-Vi:   DEV_SELECT                     devid: 00:02.1 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START         devid: 02:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END          devid: 02:1f.7
AMD-Vi:   DEV_SELECT                     devid: 00:02.2 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START         devid: 03:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END          devid: 03:1f.7
AMD-Vi:   DEV_SELECT                     devid: 00:02.3 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START         devid: 04:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END          devid: 04:1f.7
AMD-Vi:   DEV_SELECT                     devid: 00:02.4 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START         devid: 05:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END          devid: 05:1f.7
AMD-Vi:   DEV_SELECT                     devid: 00:02.5 flags: 00
AMD-Vi:   DEV_SELECT                     devid: 06:00.0 flags: 00
AMD-Vi:   DEV_SELECT                     devid: 07:00.0 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START         devid: 08:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END          devid: 08:1f.7
AMD-Vi:   DEV_SELECT                     devid: 07:01.0 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START         devid: 09:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END          devid: 09:1f.7
AMD-Vi:   DEV_SELECT                     devid: 00:02.6 flags: 00
AMD-Vi:   DEV_SELECT                     devid: 0a:00.0 flags: 00
AMD-Vi:   DEV_ALIAS_RANGE                devid: 0b:00.0 flags: 00 devid_to: 0b:00.0
AMD-Vi:   DEV_RANGE_END          devid: 0c:1f.7
AMD-Vi:   DEV_SELECT                     devid: 00:02.7 flags: 00
AMD-Vi:   DEV_ALIAS_RANGE                devid: 0d:00.0 flags: 00 devid_to: 00:02.7
AMD-Vi:   DEV_RANGE_END          devid: 0d:1f.7
AMD-Vi:   DEV_SELECT                     devid: 00:03.0 flags: 00
AMD-Vi:   DEV_SELECT                     devid: 00:04.0 flags: 00
AMD-Vi:   DEV_SELECT                     devid: 00:1f.0 flags: 00
AMD-Vi:   DEV_SELECT                     devid: 00:1f.2 flags: 00
AMD-Vi:   DEV_SELECT                     devid: 00:1f.3 flags: 00

Reviewed-by: Peter Xu <address@hidden>
Signed-off-by: Alex Williamson <address@hidden>
Message-Id: <address@hidden>
Reviewed-by: Michael S. Tsirkin <address@hidden>
Signed-off-by: Michael S. Tsirkin <address@hidden>


  Commit: fcccb271e0894bc04078ababb29d3d5e06b79892
      https://github.com/qemu/qemu/commit/fcccb271e0894bc04078ababb29d3d5e06b79892
  Author: Stefan Hajnoczi <address@hidden>
  Date:   2019-11-06 (Wed, 06 Nov 2019)

  Changed paths:
    M hw/virtio/virtio-bus.c
    M hw/virtio/virtio.c
    M include/hw/virtio/virtio.h

  Log Message:
  -----------
  virtio: notify virtqueue via host notifier when available

Host notifiers are used in several cases:
1. Traditional ioeventfd where virtqueue notifications are handled in
   the main loop thread.
2. IOThreads (aio_handle_output) where virtqueue notifications are
   handled in an IOThread AioContext.
3. vhost where virtqueue notifications are handled by kernel vhost or
   a vhost-user device backend.

Most virtqueue notifications from the guest use the ioeventfd mechanism,
but there are corner cases where QEMU code calls virtio_queue_notify().
This currently honors the host notifier for the IOThreads
aio_handle_output case, but not for the vhost case.  The result is that
vhost does not receive virtqueue notifications from QEMU when
virtio_queue_notify() is called.

This patch extends virtio_queue_notify() to set the host notifier
whenever it is enabled instead of calling the vq->(aio_)handle_output()
function directly.  We track the host notifier state for each virtqueue
separately since some devices may use it only for certain virtqueues.

This fixes the vhost case although it does add a trip through the
eventfd for the traditional ioeventfd case.  I don't think it's worth
adding a fast path for the traditional ioeventfd case because calling
virtio_queue_notify() is rare when ioeventfd is enabled.
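
In outline, the notification path might be sketched as follows; the types
below are hypothetical stand-ins (an eventfd for the host notifier), not
the actual hw/virtio/virtio.c code:

#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical per-virtqueue state: a host-notifier file descriptor
 * (an eventfd shared with vhost or an IOThread) plus an enable flag
 * tracked per virtqueue, since a device may use it only for some queues.
 */
typedef struct VirtQueueSketch {
    int host_notifier_fd;
    bool host_notifier_enabled;
    void (*handle_output)(struct VirtQueueSketch *vq);
} VirtQueueSketch;

static void queue_notify(VirtQueueSketch *vq)
{
    if (vq->host_notifier_enabled) {
        /* Kick through the eventfd so vhost/IOThread consumers see it;
         * this adds one trip through the fd for plain ioeventfd setups. */
        uint64_t one = 1;
        ssize_t r = write(vq->host_notifier_fd, &one, sizeof(one));
        (void)r;
    } else if (vq->handle_output) {
        /* No host notifier wired up: call the handler directly. */
        vq->handle_output(vq);
    }
}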

Reported-by: Felipe Franciosi <address@hidden>
Signed-off-by: Stefan Hajnoczi <address@hidden>
Message-Id: <address@hidden>
Reviewed-by: Michael S. Tsirkin <address@hidden>
Signed-off-by: Michael S. Tsirkin <address@hidden>


  Commit: 1c5880e785807abcc715a7ee216706e02c1af689
      https://github.com/qemu/qemu/commit/1c5880e785807abcc715a7ee216706e02c1af689
  Author: Peter Maydell <address@hidden>
  Date:   2019-11-07 (Thu, 07 Nov 2019)

  Changed paths:
    M hw/i386/acpi-build.c
    M hw/pci/pci.c
    M hw/virtio/virtio-bus.c
    M hw/virtio/virtio.c
    M include/hw/virtio/virtio.h

  Log Message:
  -----------
  Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging

virtio, pci: fixes

A couple of bugfixes.

Signed-off-by: Michael S. Tsirkin <address@hidden>

# gpg: Signature made Wed 06 Nov 2019 12:00:19 GMT
# gpg:                using RSA key 281F0DB8D28D5469
# gpg: Good signature from "Michael S. Tsirkin <address@hidden>" [full]
# gpg:                 aka "Michael S. Tsirkin <address@hidden>" [full]
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17  0970 C350 3912 AFBE 8E67
#      Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA  8A0D 281F 0DB8 D28D 5469

* remotes/mst/tags/for_upstream:
  virtio: notify virtqueue via host notifier when available
  hw/i386: AMD-Vi IVRS DMA alias support
  pci: Use PCI aliases when determining device IOMMU address space

Signed-off-by: Peter Maydell <address@hidden>


Compare: https://github.com/qemu/qemu/compare/d0f90e1423b4...1c5880e78580


