[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[PULL 54/68] qom: new object to associate device to NUMA node
From: |
Michael S. Tsirkin |
Subject: |
[PULL 54/68] qom: new object to associate device to NUMA node |
Date: |
Tue, 12 Mar 2024 18:28:24 -0400 |
From: Ankit Agrawal <ankita@nvidia.com>
NVIDIA GPU's support MIG (Mult-Instance GPUs) feature [1], which allows
partitioning of the GPU device resources (including device memory) into
several (upto 8) isolated instances. Each of the partitioned memory needs
a dedicated NUMA node to operate. The partitions are not fixed and they
can be created/deleted at runtime.
Unfortunately Linux OS does not provide a means to dynamically create/destroy
NUMA nodes and such feature implementation is not expected to be trivial. The
nodes that OS discovers at the boot time while parsing SRAT remains fixed. So
we utilize the Generic Initiator (GI) Affinity structures that allows
association between nodes and devices. Multiple GI structures per BDF is
possible, allowing creation of multiple nodes by exposing unique PXM in each
of these structures.
Implement the mechanism to build the GI affinity structures as Qemu currently
does not. Introduce a new acpi-generic-initiator object to allow host admin
link a device with an associated NUMA node. Qemu maintains this association
and use this object to build the requisite GI Affinity Structure.
When multiple NUMA nodes are associated with a device, it is required to
create those many number of acpi-generic-initiator objects, each representing
a unique device:node association.
Following is one of a decoded GI affinity structure in VM ACPI SRAT.
[0C8h 0200 1] Subtable Type : 05 [Generic Initiator Affinity]
[0C9h 0201 1] Length : 20
[0CAh 0202 1] Reserved1 : 00
[0CBh 0203 1] Device Handle Type : 01
[0CCh 0204 4] Proximity Domain : 00000007
[0D0h 0208 16] Device Handle : 00 00 20 00 00 00 00 00 00 00 00
00 00 00 00 00
[0E0h 0224 4] Flags (decoded below) : 00000001
Enabled : 1
[0E4h 0228 4] Reserved2 : 00000000
[0E8h 0232 1] Subtable Type : 05 [Generic Initiator Affinity]
[0E9h 0233 1] Length : 20
An admin can provide a range of acpi-generic-initiator objects, each
associating a device (by providing the id through pci-dev argument)
to the desired NUMA node (using the node argument). Currently, only PCI
device is supported.
For the grace hopper system, create a range of 8 nodes and associate that
with the device using the acpi-generic-initiator object. While a configuration
of less than 8 nodes per device is allowed, such configuration will prevent
utilization of the feature to the fullest. The following sample creates 8
nodes per PCI device for a VM with 2 PCI devices and link them to the
respecitve PCI device using acpi-generic-initiator objects:
-numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
-numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
-numa node,nodeid=8 -numa node,nodeid=9 \
-device
vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
-object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \
-object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=3 \
-object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=4 \
-object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \
-object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=6 \
-object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \
-object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \
-object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9 \
-numa node,nodeid=10 -numa node,nodeid=11 -numa node,nodeid=12 \
-numa node,nodeid=13 -numa node,nodeid=14 -numa node,nodeid=15 \
-numa node,nodeid=16 -numa node,nodeid=17 \
-device
vfio-pci-nohotplug,host=0009:01:01.0,bus=pcie.0,addr=05.0,rombar=0,id=dev1 \
-object acpi-generic-initiator,id=gi8,pci-dev=dev1,node=10 \
-object acpi-generic-initiator,id=gi9,pci-dev=dev1,node=11 \
-object acpi-generic-initiator,id=gi10,pci-dev=dev1,node=12 \
-object acpi-generic-initiator,id=gi11,pci-dev=dev1,node=13 \
-object acpi-generic-initiator,id=gi12,pci-dev=dev1,node=14 \
-object acpi-generic-initiator,id=gi13,pci-dev=dev1,node=15 \
-object acpi-generic-initiator,id=gi14,pci-dev=dev1,node=16 \
-object acpi-generic-initiator,id=gi15,pci-dev=dev1,node=17 \
Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu [1]
Cc: Jonathan Cameron <qemu-devel@nongnu.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Message-Id: <20240308145525.10886-2-ankita@nvidia.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
qapi/qom.json | 17 ++++++
include/hw/acpi/acpi_generic_initiator.h | 22 ++++++++
hw/acpi/acpi_generic_initiator.c | 71 ++++++++++++++++++++++++
hw/acpi/meson.build | 1 +
4 files changed, 111 insertions(+)
create mode 100644 include/hw/acpi/acpi_generic_initiator.h
create mode 100644 hw/acpi/acpi_generic_initiator.c
diff --git a/qapi/qom.json b/qapi/qom.json
index 032c6fa037..baae3a183f 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -811,6 +811,21 @@
{ 'struct': 'IOMMUFDProperties',
'data': { '*fd': 'str' } }
+##
+# @AcpiGenericInitiatorProperties:
+#
+# Properties for acpi-generic-initiator objects.
+#
+# @pci-dev: PCI device ID to be associated with the node
+#
+# @node: NUMA node associated with the PCI device
+#
+# Since: 9.0
+##
+{ 'struct': 'AcpiGenericInitiatorProperties',
+ 'data': { 'pci-dev': 'str',
+ 'node': 'uint32' } }
+
##
# @RngProperties:
#
@@ -928,6 +943,7 @@
##
{ 'enum': 'ObjectType',
'data': [
+ 'acpi-generic-initiator',
'authz-list',
'authz-listfile',
'authz-pam',
@@ -999,6 +1015,7 @@
'id': 'str' },
'discriminator': 'qom-type',
'data': {
+ 'acpi-generic-initiator': 'AcpiGenericInitiatorProperties',
'authz-list': 'AuthZListProperties',
'authz-listfile': 'AuthZListFileProperties',
'authz-pam': 'AuthZPAMProperties',
diff --git a/include/hw/acpi/acpi_generic_initiator.h
b/include/hw/acpi/acpi_generic_initiator.h
new file mode 100644
index 0000000000..16de1d3d80
--- /dev/null
+++ b/include/hw/acpi/acpi_generic_initiator.h
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#ifndef ACPI_GENERIC_INITIATOR_H
+#define ACPI_GENERIC_INITIATOR_H
+
+#include "qom/object_interfaces.h"
+
+#define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator"
+
+typedef struct AcpiGenericInitiator {
+ /* private */
+ Object parent;
+
+ /* public */
+ char *pci_dev;
+ uint16_t node;
+} AcpiGenericInitiator;
+
+#endif
diff --git a/hw/acpi/acpi_generic_initiator.c b/hw/acpi/acpi_generic_initiator.c
new file mode 100644
index 0000000000..130d6ae8c1
--- /dev/null
+++ b/hw/acpi/acpi_generic_initiator.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include "qemu/osdep.h"
+#include "hw/acpi/acpi_generic_initiator.h"
+#include "hw/boards.h"
+#include "qemu/error-report.h"
+
+typedef struct AcpiGenericInitiatorClass {
+ ObjectClass parent_class;
+} AcpiGenericInitiatorClass;
+
+OBJECT_DEFINE_TYPE_WITH_INTERFACES(AcpiGenericInitiator,
acpi_generic_initiator,
+ ACPI_GENERIC_INITIATOR, OBJECT,
+ { TYPE_USER_CREATABLE },
+ { NULL })
+
+OBJECT_DECLARE_SIMPLE_TYPE(AcpiGenericInitiator, ACPI_GENERIC_INITIATOR)
+
+static void acpi_generic_initiator_init(Object *obj)
+{
+ AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+
+ gi->node = MAX_NODES;
+ gi->pci_dev = NULL;
+}
+
+static void acpi_generic_initiator_finalize(Object *obj)
+{
+ AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+
+ g_free(gi->pci_dev);
+}
+
+static void acpi_generic_initiator_set_pci_device(Object *obj, const char *val,
+ Error **errp)
+{
+ AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+
+ gi->pci_dev = g_strdup(val);
+}
+
+static void acpi_generic_initiator_set_node(Object *obj, Visitor *v,
+ const char *name, void *opaque,
+ Error **errp)
+{
+ AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+ uint32_t value;
+
+ if (!visit_type_uint32(v, name, &value, errp)) {
+ return;
+ }
+
+ if (value >= MAX_NODES) {
+ error_printf("%s: Invalid NUMA node specified\n",
+ TYPE_ACPI_GENERIC_INITIATOR);
+ exit(1);
+ }
+
+ gi->node = value;
+}
+
+static void acpi_generic_initiator_class_init(ObjectClass *oc, void *data)
+{
+ object_class_property_add_str(oc, "pci-dev", NULL,
+ acpi_generic_initiator_set_pci_device);
+ object_class_property_add(oc, "node", "int", NULL,
+ acpi_generic_initiator_set_node, NULL, NULL);
+}
diff --git a/hw/acpi/meson.build b/hw/acpi/meson.build
index 5441c9b1e4..fa5c07db90 100644
--- a/hw/acpi/meson.build
+++ b/hw/acpi/meson.build
@@ -1,5 +1,6 @@
acpi_ss = ss.source_set()
acpi_ss.add(files(
+ 'acpi_generic_initiator.c',
'acpi_interface.c',
'aml-build.c',
'bios-linker-loader.c',
--
MST
- [PULL 48/68] Revert "hw/i386/pc_sysfw: Inline pc_system_flash_create() and remove it", (continued)
- [PULL 48/68] Revert "hw/i386/pc_sysfw: Inline pc_system_flash_create() and remove it", Michael S. Tsirkin, 2024/03/12
- [PULL 44/68] pcie_sriov: Reset SR-IOV extended capability, Michael S. Tsirkin, 2024/03/12
- [PULL 53/68] hw/i386/pc: Inline pc_cmos_init() into pc_cmos_init_late() and remove it, Michael S. Tsirkin, 2024/03/12
- [PULL 51/68] hw/i386/pc: Avoid one use of the current_machine global, Michael S. Tsirkin, 2024/03/12
- [PULL 46/68] hw/pci: Always call pcie_sriov_pf_reset(), Michael S. Tsirkin, 2024/03/12
- [PULL 45/68] pcie_sriov: Do not reset NumVFs after disabling VFs, Michael S. Tsirkin, 2024/03/12
- [PULL 47/68] pc: q35: Bump max_cpus to 4096 vcpus, Michael S. Tsirkin, 2024/03/12
- [PULL 49/68] Revert "hw/i386/pc: Confine system flash handling to pc_sysfw", Michael S. Tsirkin, 2024/03/12
- [PULL 50/68] hw/i386/pc: Remove "rtc_state" link again, Michael S. Tsirkin, 2024/03/12
- [PULL 52/68] hw/i386/pc: Set "normal" boot device order in pc_basic_device_init(), Michael S. Tsirkin, 2024/03/12
- [PULL 54/68] qom: new object to associate device to NUMA node,
Michael S. Tsirkin <=
- [PULL 58/68] virtio-iommu: Change the default granule to the host page size, Michael S. Tsirkin, 2024/03/12
- [PULL 66/68] hmat acpi: Fix out of bounds access due to missing use of indirection, Michael S. Tsirkin, 2024/03/12
- [PULL 55/68] hw/acpi: Implement the SRAT GI affinity structure, Michael S. Tsirkin, 2024/03/12
- [PULL 57/68] virtio-iommu: Add a granule property, Michael S. Tsirkin, 2024/03/12
- [PULL 56/68] hw/i386/acpi-build: Add support for SRAT Generic Initiator structures, Michael S. Tsirkin, 2024/03/12
- [PULL 62/68] hw/i386/q35: Set virtio-iommu aw-bits default value to 39, Michael S. Tsirkin, 2024/03/12
- [PULL 60/68] virtio-iommu: Trace domain range limits as unsigned int, Michael S. Tsirkin, 2024/03/12
- [PULL 64/68] qemu-options.hx: Document the virtio-iommu-pci aw-bits option, Michael S. Tsirkin, 2024/03/12
- [PULL 67/68] hw/cxl: Fix missing reserved data in CXL Device DVSEC, Michael S. Tsirkin, 2024/03/12
- [PULL 68/68] docs/specs/pvpanic: document shutdown event, Michael S. Tsirkin, 2024/03/12