Re: [PATCH v2] hw/arm/virt: Expose empty NUMA nodes through ACPI


From: David Hildenbrand
Subject: Re: [PATCH v2] hw/arm/virt: Expose empty NUMA nodes through ACPI
Date: Fri, 19 Nov 2021 18:56:53 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.2.0

>> I'd really appreciate it if we could instead have something that makes virt
>> happy as well ("makes no sense in any physical system"), because virt is
>> most probably the biggest actual consumer of ACPI memory hotplug out
>> there (!).
> 
> No problem with finding such a solution - but it's an ASWG question
> (be it via a code-first discussion). I have no idea what other
> operating systems would do with overlapping nodes today.  We need to
> jump through the hoops to make sure any solution is mutually agreed.
> Maybe the solution is a new type of entry or flag that makes it clear
> the 'real' node mapping is not PA-range based?

Yeah, something like "we might see hotplug within this range to this
node" would clearly express what can currently happen in QEMU.

> 
>>
>> I mean, for virt as is we will never know which PA range will belong to
>> which node upfront. All we know is that there is a PA range that could
>> belong to node X-Z. Gluing a single range to a single node doesn't make
>> too much sense for virt, which is why we have just been using it to
>> indicate the maximum possible PFN with a fantasy node.
> 
> I'm not convinced that's true. The physical memory
> is coming from somewhere (assuming it is RAM backed).  I would assume the
> ideal, if going to the effort of passing NUMA into a VM, would be to convey
> the same NUMA characteristics to the VM.  So add it to the VM at
> the PA range that matches the appropriate host system NUMA node.

I think we only have real experience with vNUMA when passing through a
subset of real NUMA nodes -- performance-differentiated memory has so
far not been part of the bigger picture.

The issues start once you allow for more VM RAM than you have in your
hypervisor, simply because you can due to memory overcommit, file-backed
memory, ... all of which can mess with the PA assumptions.

As you say, with everything fully RAM backed (excluding swap) there is
no overcommit and no emulated RAM devices, so things are easier.
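
For that fully RAM-backed case, the host-side part is essentially "bind the
backing of each guest node to the matching host node". A rough standalone
sketch with libnuma (not QEMU code -- QEMU does this via memory backends --
and alloc_guest_node_ram is just a made-up helper name for illustration):

/* Build with: gcc sketch.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Allocate the RAM backing one guest NUMA node from a given host node.
 * Placement is best effort; libnuma may fall back if the node is full. */
static void *alloc_guest_node_ram(size_t size, int host_node)
{
    void *ram;

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this host\n");
        return NULL;
    }
    ram = numa_alloc_onnode(size, host_node);
    if (ram) {
        memset(ram, 0, size);   /* pre-fault so the placement is real */
    }
    return ram;
}

int main(void)
{
    size_t sz = 256UL << 20;    /* 256 MiB per guest node, for the example */
    void *node0_ram = alloc_guest_node_ram(sz, 0);  /* guest node 0 <- host node 0 */
    void *node1_ram = alloc_guest_node_ram(sz, 1);  /* guest node 1 <- host node 1 */

    if (!node0_ram || !node1_ram) {
        return EXIT_FAILURE;
    }
    /* ... map these regions into the guest as its node 0 / node 1 RAM ... */
    numa_free(node0_ram, sz);
    numa_free(node1_ram, sz);
    return EXIT_SUCCESS;
}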

>>
>> Overlapping regions would really simplify the whole thing, and I think
>> if we go down that path we should go one step further and indicate the
>> hotpluggable region to all nodes that might see hotplug (QEMU -> all
>> nodes). The ACPI clarification would then be that we can have
>> overlapping ranges and that on overlapping ranges all indicated nodes
>> would be a possible target later. That would make perfect sense to me
>> and make both phys and virt happy.
> 
> One alternative I mentioned briefly earlier is don't use ACPI at all.
> For the new interconnects like CXL the decision was made that it wasn't
> a suitable medium so they had CDAT (which is provided by the device)
> instead. It's an open question how that will be handled by the OS at the
> moment, but once solved (and it will need to be soon) that provides
> a means to specify all the same data you get from the ACPI NUMA description,
> and leaves the OS to figure out how to merge it with its internal
> representation of NUMA.
> 
> For virtio-mem / PCI at least it seems a fairly natural match.

Yes, for virtio-mem-pci it would be a natural match I guess. I have yet
to look into the details. I'd be happy to use any mechanism other than
ACPI to

a) Tell the OS early about the maximum possible PFN
b) Tell the OS early about possible nodes
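
Whatever transport ends up replacing ACPI here, the guest still only needs
those two facts early during boot. A toy sketch of what a) and b) reduce to,
assuming a made-up list of mem_range entries handed over by firmware or a
device -- this is not Linux or QEMU code, just an illustration:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define MAX_NODES  64

struct mem_range {
    uint64_t base;          /* start of the (possibly still empty) PA range */
    uint64_t length;
    uint32_t node;          /* node this range may belong to */
    bool     hotpluggable;  /* informational, e.g. from a hot-pluggable flag */
};

struct possible_topology {
    uint64_t max_possible_pfn;    /* a) highest PFN we may ever see */
    uint64_t possible_node_mask;  /* b) nodes that may ever get memory */
};

static struct possible_topology scan_ranges(const struct mem_range *r, size_t n)
{
    struct possible_topology t = { 0, 0 };

    for (size_t i = 0; i < n; i++) {
        uint64_t end_pfn = (r[i].base + r[i].length) >> PAGE_SHIFT;

        /* Ranges without memory behind them yet still grow the possible
         * PFN space and mark their node as a possible future target. */
        if (end_pfn > t.max_possible_pfn) {
            t.max_possible_pfn = end_pfn;
        }
        if (r[i].node < MAX_NODES) {
            t.possible_node_mask |= 1ull << r[i].node;
        }
    }
    return t;
}

int main(void)
{
    /* Example: 4 GiB of boot RAM on node 0, plus a 256 GiB hotpluggable
     * window that nodes 0-3 may all target later (overlapping on purpose). */
    struct mem_range ranges[] = {
        { 0x0,         4ull << 30,   0, false },
        { 4ull << 30,  256ull << 30, 0, true  },
        { 4ull << 30,  256ull << 30, 1, true  },
        { 4ull << 30,  256ull << 30, 2, true  },
        { 4ull << 30,  256ull << 30, 3, true  },
    };
    struct possible_topology t =
        scan_ranges(ranges, sizeof(ranges) / sizeof(ranges[0]));

    printf("max possible PFN: %#llx, possible nodes: %#llx\n",
           (unsigned long long)t.max_possible_pfn,
           (unsigned long long)t.possible_node_mask);
    return 0;
}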

>>
>>
>> Two ways to avoid overlapping regions, which aren't much better:
>>
>> 1) Split the hotpluggable region up into fantasy regions and assign one
>> fantasy region to each actual node.
>>
>> The fantasy regions will have nothing to do with reality later (just like
>> what we have right now with the last node getting assigned the whole
>> hotpluggable region) and devices might overlap, but we don't really
>> care, because the devices expose the actual node themselves.
>>
>>
>> 2) Duplicate the hotpluggable region across all nodes
>>
>> We would have one hotpluggable region with a dedicated PA space per
>> node, and hotplug the device into the respective node's PA space.
>>
>> That can be problematic, though, as we can easily run out of PA space.
>> For example, my Ryzen 9 cannot address anything above 1 TiB. So if we'd
>> have a hotpluggable region of 256 GiB, we'll already be in trouble with
>> more than 3 nodes.
> 
> My assumption was that the reason to do this is to pass through node
> mappings that line up with the underlying physical system.  If that's the case
> then the hotpluggable regions for each node could be made to match what is
> there.
> 
> Your Ryzen 9 would normally only have one node?

Yes. I reckon it would support NVDIMMs that one might want to expose via
a virtual NUMA node to the VM. I assume they would not be represented
via a dedicated NUMA node on my machine.

> 
> If the intent is to use these regions for more complex purposes (maybe
> file-backed memory devices?) then things get more interesting, but how
> useful is mapping them to conventional NUMA representations?

Emulated NVDIMMs and virtio-pmem are the interesting cases I guess. The
issue is rather that the PA layout of the real machine no longer holds.

-- 
Thanks,

David / dhildenb



