Re: [PATCH v2] hw/arm/virt: Expose empty NUMA nodes through ACPI
From: David Hildenbrand
Subject: Re: [PATCH v2] hw/arm/virt: Expose empty NUMA nodes through ACPI
Date: Wed, 17 Nov 2021 19:08:28 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.2.0
On 17.11.21 15:30, Jonathan Cameron wrote:
> On Tue, 16 Nov 2021 12:11:29 +0100
> David Hildenbrand <david@redhat.com> wrote:
>
>>>>
>>>> Examples include exposing HBM or PMEM to the VM. Just like on real HW,
>>>> this memory is exposed via cpu-less, special nodes. In contrast to real
>>>> HW, the memory is hotplugged later (I don't think HW supports hotplug
>>>> like that yet, but it might just be a matter of time).
>>>
>>> I suppose some of that may be covered by GENERIC_AFFINITY entries in SRAT
>>> some by MEMORY entries. Or nodes created dynamically like with normal
>>> hotplug memory.
>>>
>
Hi Jonathan,
> The naming of the define is unhelpful. GENERIC_AFFINITY here corresponds
> to Generic Initiator Affinity. So no good for memory. This is meant for
> representation of accelerators / network cards etc so you can get the NUMA
> characteristics for them accessing Memory in other nodes.
>
> My understanding of 'traditional' memory hotplug is that typically the
> PA into which memory is hotplugged is known at boot time whether or not
> the memory is physically present. As such, you present that in SRAT and rely
> on the EFI memory map / other information sources to know the memory isn't
> there. When it is hotplugged later the address is looked up in SRAT to
> identify the NUMA node.
In virtualized environments we use the SRAT only to indicate the hotpluggable
region (-> indicate the maximum possible PFN to the guest OS); the actual
present memory+PXM assignment is not done via SRAT. We differ quite a lot from
actual hardware here, I think.
>
> That model is less useful for more flexible entities like virtio-mem or
> indeed physical hardware such as CXL type 3 memory devices which typically
> need their own nodes.
>
> For the CXL type 3 option, the current proposal is to use the CXL table entries
> representing Physical Address space regions to work out how many NUMA nodes
> are needed and just create extra ones at boot.
> https://lore.kernel.org/linux-cxl/163553711933.2509508.2203471175679990.stgit@dwillia2-desk3.amr.corp.intel.com
>
> It's a heuristic as we might need more nodes to represent things well kernel
> side, but it's better than nothing and less effort than true dynamic node
> creation.
> If you chase through the earlier versions of Alison's patch you will find some
> discussion of that.
>
> I wonder if virtio-mem should just grow a CDAT instance via a DOE?
>
> That would make all this stuff discoverable via PCI config space rather than
> ACPI. CDAT is at:
> https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> but the table access protocol over PCI DOE is currently in the CXL 2.0 spec
> (nothing stops others using it though AFAIK).
>
> However, then we'd actually need either dynamic node creation in the OS, or
> some sort of reserved pool of extra nodes. Long term it may be the most
> flexible option.
I think for virtio-mem it's actually a bit simpler:
a) The user defined an empty node on the QEMU cmdline
b) The user assigned a virtio-mem device to a node, either when
coldplugging or hotplugging the device.
So we don't actually "hotplug" a new node, the (possible) node is already known
to QEMU right when starting up. It's just a matter of exposing that fact to the
guest OS -- similar to how we expose the maximum possible PFN to the guest OS.
It seems to boil down to an ACPI limitation.
Conceptually, virtio-mem on an empty node in QEMU is not that different from
hot/coldplugging a CPU to an empty node or hot/coldplugging a DIMM/NVDIMM to
an empty node. But I guess it all just doesn't work with QEMU as of now.
In current x86-64 code, we define the "hotpluggable region" in
hw/i386/acpi-build.c via
    build_srat_memory(table_data, machine->device_memory->base,
                      hotpluggable_address_space_size, nb_numa_nodes - 1,
                      MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
So we tell the guest OS "this range is hotpluggable" and "it belongs to
this node unless the device says something different". From both values we
can -- when under QEMU -- conclude the maximum possible PFN and the maximum
possible node. But the latter is not what Linux does: it simply maps the last
NUMA node (indicated in the memory entry) to a PXM
(-> drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init()).
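To make the "conclude from both values" idea concrete, here is a minimal sketch of what a guest could derive from the single hotpluggable entry QEMU emits today. It is not what Linux does. The struct, the function names and the 4 KiB page size are assumptions made for illustration only; none of them exist in QEMU or the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical, simplified view of one hotpluggable SRAT memory entry. */
struct hotplug_range {
    uint64_t base;  /* start of the hotpluggable region (bytes) */
    uint64_t len;   /* size of the region (bytes), assumed > 0 */
    uint32_t pxm;   /* proximity domain named in the entry */
};

/* PFN of the last byte covered by the hotpluggable region. */
static uint64_t max_possible_pfn(const struct hotplug_range *r)
{
    return (r->base + r->len - 1) >> SKETCH_PAGE_SHIFT;
}

/*
 * Under QEMU the entry names the last node, so a guest that knew this
 * convention could treat nodes 0..pxm as possible.
 */
static uint32_t possible_node_count(const struct hotplug_range *r)
{
    return r->pxm + 1;
}
```

For a 4 GiB region starting at 4 GiB and assigned to node 3, this yields a maximum PFN of 0x1fffff and four possible nodes. The second conclusion only holds because of QEMU's "last node" convention, which is exactly why a guest cannot rely on it in general.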
I do wonder if we could simply expose the same hotpluggable range via multiple
nodes:
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index a3ad6abd33..6c0ab442ea 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2084,6 +2084,22 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
      * providing _PXM method if necessary.
      */
     if (hotpluggable_address_space_size) {
+        /*
+         * For the guest to "know" about possible nodes, we'll indicate the
+         * same hotpluggable region to all empty nodes.
+         */
+        for (i = 0; i < nb_numa_nodes - 1; i++) {
+            if (machine->numa_state->nodes[i].node_mem > 0) {
+                continue;
+            }
+            build_srat_memory(table_data, machine->device_memory->base,
+                              hotpluggable_address_space_size, i,
+                              MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
+        }
+        /*
+         * Historically, we always indicated all hotpluggable memory to the
+         * last node -- if it was empty or not.
+         */
         build_srat_memory(table_data, machine->device_memory->base,
                           hotpluggable_address_space_size, nb_numa_nodes - 1,
                           MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
Of course, this won't make CPU hotplug to empty nodes happy if we don't have
memory hotplug enabled for a VM. I did not check in detail if that is valid
according to ACPI -- Linux might eat it (did not try yet, though).
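To see which entries the loop would produce for a given topology, here is a standalone sketch that models only the selection logic. The struct and emit_hotplug_entries() are hypothetical stand-ins for build_srat_memory() and the QEMU machine state, not real QEMU code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one emitted SRAT memory affinity entry. */
struct srat_mem_entry {
    uint64_t base;
    uint64_t len;
    int node;
};

/*
 * Model of the proposed logic: one hotpluggable entry per empty node
 * (all nodes except the last), followed by the historical entry for the
 * last node, which is emitted whether or not that node is empty.
 *
 * out[] must have room for nb_nodes entries. Returns the number of
 * entries written; 0 if there is no hotpluggable region.
 */
static size_t emit_hotplug_entries(const uint64_t *node_mem, int nb_nodes,
                                   uint64_t base, uint64_t len,
                                   struct srat_mem_entry *out)
{
    size_t n = 0;
    int i;

    if (len == 0 || nb_nodes <= 0) {
        return 0;
    }
    for (i = 0; i < nb_nodes - 1; i++) {
        if (node_mem[i] > 0) {
            continue; /* node already has boot memory, skip it */
        }
        out[n++] = (struct srat_mem_entry){ base, len, i };
    }
    out[n++] = (struct srat_mem_entry){ base, len, nb_nodes - 1 };
    return n;
}
```

With four nodes where only nodes 0 and 3 have boot memory, this yields three overlapping entries, for nodes 1, 2 and 3. Whether a guest accepts several entries covering the same range is the open question raised above.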
--
Thanks,
David / dhildenb