[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-stable] [Qemu-devel] [PATCH] pc: acpi: revert back to 1 SRAT e
From: |
Yu Zhang |
Subject: |
Re: [Qemu-stable] [Qemu-devel] [PATCH] pc: acpi: revert back to 1 SRAT entry for hotpluggable area |
Date: |
Fri, 24 Aug 2018 15:54:58 +0800 |
User-agent: |
NeoMutt/20180622-66-b94505 |
On Thu, Aug 23, 2018 at 02:34:31PM +0200, Igor Mammedov wrote:
> On Thu, 23 Aug 2018 17:01:33 +0800
> Yu Zhang <address@hidden> wrote:
>
> > On 8/23/2018 2:01 AM, Eduardo Habkost wrote:
> > > On Wed, Aug 22, 2018 at 03:05:36PM +0200, Igor Mammedov wrote:
> > >> On Wed, 22 Aug 2018 12:06:26 +0200
> > >> Laszlo Ersek <address@hidden> wrote:
> > >>
> > >>> On 08/22/18 11:46, Igor Mammedov wrote:
> > >>>> Commit
> > >>>> 10efd7e108 "pc: acpi: fix memory hotplug regression by reducing
> > >>>> stub SRAT entry size"
> > >>>> attemped to fix hotplug regression introduced by
> > >>>> 848a1cc1e "hw/acpi-build: build SRAT memory affinity structures for
> > >>>> DIMM devices"
> > >>>>
> > >>>> fixed issue for Windows/3.0+ linux kernels, however it regressed 2.6
> > >>>> based
> > >>>> kernels (RHEL6) to the point where guest might crash at boot.
> > >>>> Reason is that 2.6 kernel discards SRAT table due too small last entry
> > >>>> which down the road leads to crashes. Hack I've tried in 10efd7e108 is
> > >>>> also
> > >>>> not ACPI spec compliant according to which whole possible RAM should be
> > >>>> described in SRAT. Revert 10efd7e108 to fix regression for 2.6 based
> > >>>> kernels.
> > >>>>
> > >>>> With 10efd7e108 reverted, I've also tried splitting SRAT table
> > >>>> statically
> > >>>> in different ways %/node and %/slot but Windows still fails to online
> > >>>> 2nd pc-dimm hot-plugged into node 0 (as described in 10efd7e108) and
> > >>>> sometimes even coldplugged pc-dimms where affected with static SRAT
> > >>>> partitioning.
> > >>>> The only known so far way where Windows stays happy is when we have 1
> > >>>> SRAT entry in the last node covering all hotplug area.
> > >>>>
> > >>>> Revert 848a1cc1e until we come up with a way to avoid regression
> > >>>> on Windows with hotplug area split in several entries.
> > >>>> Tested this with 2.6/3.0 based kernels (RHEL6/7) and
> > >>>> WS20[08/12/12R2/16]).
> > >>>>
> > >>>> Signed-off-by: Igor Mammedov <address@hidden>
> > >>>> ---
> > >>>> CC: address@hidden
> > >>>> CC: address@hidden
> > >>>> CC: address@hidden
> > >>>> CC: address@hidden
> > >>>> CC: address@hidden
> > >>>> ---
> > >>>> hw/i386/acpi-build.c | 73
> > >>>> +++++++++-------------------------------------------
> > >>>> 1 file changed, 12 insertions(+), 61 deletions(-)
> > >>>>
> > >>>> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> > >>>> index e1ee8ae..1599caa 100644
> > >>>> --- a/hw/i386/acpi-build.c
> > >>>> +++ b/hw/i386/acpi-build.c
> > >>>> @@ -2251,64 +2251,6 @@ build_tpm2(GArray *table_data, BIOSLinker
> > >>>> *linker, GArray *tcpalog)
> > >>>> #define HOLE_640K_START (640 * KiB)
> > >>>> #define HOLE_640K_END (1 * MiB)
> > >>>>
> > >>>> -static void build_srat_hotpluggable_memory(GArray *table_data,
> > >>>> uint64_t base,
> > >>>> - uint64_t len, int
> > >>>> default_node)
> > >>>> -{
> > >>>> - MemoryDeviceInfoList *info_list = qmp_memory_device_list();
> > >>>> - MemoryDeviceInfoList *info;
> > >>>> - MemoryDeviceInfo *mi;
> > >>>> - PCDIMMDeviceInfo *di;
> > >>>> - uint64_t end = base + len, cur, size;
> > >>>> - bool is_nvdimm;
> > >>>> - AcpiSratMemoryAffinity *numamem;
> > >>>> - MemoryAffinityFlags flags;
> > >>>> -
> > >>>> - for (cur = base, info = info_list;
> > >>>> - cur < end;
> > >>>> - cur += size, info = info->next) {
> > >>>> - numamem = acpi_data_push(table_data, sizeof *numamem);
> > >>>> -
> > >>>> - if (!info) {
> > >>>> - /*
> > >>>> - * Entry is required for Windows to enable memory hotplug
> > >>>> in OS
> > >>>> - * and for Linux to enable SWIOTLB when booted with less
> > >>>> than
> > >>>> - * 4G of RAM. Windows works better if the entry sets
> > >>>> proximity
> > >>>> - * to the highest NUMA node in the machine at the end of
> > >>>> the
> > >>>> - * reserved space.
> > >>>> - * Memory devices may override proximity set by this
> > >>>> entry,
> > >>>> - * providing _PXM method if necessary.
> > >>>> - */
> > >>>> - build_srat_memory(numamem, end - 1, 1, default_node,
> > >>>> - MEM_AFFINITY_HOTPLUGGABLE |
> > >>>> MEM_AFFINITY_ENABLED);
> > >>>> - break;
> > >>>> - }
> > >>>> -
> > >>>> - mi = info->value;
> > >>>> - is_nvdimm = (mi->type == MEMORY_DEVICE_INFO_KIND_NVDIMM);
> > >>>> - di = !is_nvdimm ? mi->u.dimm.data : mi->u.nvdimm.data;
> > >>>> -
> > >>>> - if (cur < di->addr) {
> > >>>> - build_srat_memory(numamem, cur, di->addr - cur,
> > >>>> default_node,
> > >>>> - MEM_AFFINITY_HOTPLUGGABLE |
> > >>>> MEM_AFFINITY_ENABLED);
> > >>>> - numamem = acpi_data_push(table_data, sizeof *numamem);
> > >>>> - }
> > >>>> -
> > >>>> - size = di->size;
> > >>>> -
> > >>>> - flags = MEM_AFFINITY_ENABLED;
> > >>>> - if (di->hotpluggable) {
> > >>>> - flags |= MEM_AFFINITY_HOTPLUGGABLE;
> > >>>> - }
> > >>>> - if (is_nvdimm) {
> > >>>> - flags |= MEM_AFFINITY_NON_VOLATILE;
> > >>>> - }
> > >>>> -
> > >>>> - build_srat_memory(numamem, di->addr, size, di->node, flags);
> > >>>> - }
> > >>>> -
> > >>>> - qapi_free_MemoryDeviceInfoList(info_list);
> > >>>> -}
> > >>>> -
> > >>>> static void
> > >>>> build_srat(GArray *table_data, BIOSLinker *linker, MachineState
> > >>>> *machine)
> > >>>> {
> > >>>> @@ -2414,10 +2356,19 @@ build_srat(GArray *table_data, BIOSLinker
> > >>>> *linker, MachineState *machine)
> > >>>> build_srat_memory(numamem, 0, 0, 0, MEM_AFFINITY_NOFLAGS);
> > >>>> }
> > >>>>
> > >>>> + /*
> > >>>> + * Entry is required for Windows to enable memory hotplug in OS
> > >>>> + * and for Linux to enable SWIOTLB when booted with less than
> > >>>> + * 4G of RAM. Windows works better if the entry sets proximity
> > >>>> + * to the highest NUMA node in the machine.
> > >>>> + * Memory devices may override proximity set by this entry,
> > >>>> + * providing _PXM method if necessary.
> > >>>> + */
> > >>>> if (hotplugabble_address_space_size) {
> > >>>> - build_srat_hotpluggable_memory(table_data,
> > >>>> machine->device_memory->base,
> > >>>> -
> > >>>> hotplugabble_address_space_size,
> > >>>> - pcms->numa_nodes - 1);
> > >>>> + numamem = acpi_data_push(table_data, sizeof *numamem);
> > >>>> + build_srat_memory(numamem, machine->device_memory->base,
> > >>>> + hotplugabble_address_space_size,
> > >>>> pcms->numa_nodes - 1,
> > >>>> + MEM_AFFINITY_HOTPLUGGABLE |
> > >>>> MEM_AFFINITY_ENABLED);
> > >>>> }
> > >>>>
> > >>>> build_header(linker, table_data,
> > >>>>
> > >>> So this reverts both 10efd7e108 (which only moved the regression around)
> > >>> and the earlier 848a1cc1e (which had introduced the regression in the
> > >>> first place).
> > >>>
> > >>> If I understand correctly, the issue that 848a1cc1e was meant to solve
> > >>> was that "-device nvdimm,node=..." could be passed by the user such that
> > >>> it lead to "proximity domain mismatch". (What was the higher-level goal
> > >>> with that BTW?)
> > >>>
> > >>> However, that mismatch could as well be avoided by simply not passing
> > >>> such "node=..." properties. Is that right?
> > >>>
> > >>> If so, should we perhaps add another patch (temporarily), beyond this
> > >>> revert, that identifies the misconfig in question, and prints a warning
> > >>> about it?
> > >> not exactly, before the patch node=... influenced only _PXM which is
> > >> supposed
> > >> to be evaluated after SRAT table and override whatever was in static
> > >> table.
> > >>
> > >> The patch, as I understood it, was about ACPI spec compliance where
> > >> nvdimm SPA
> > >> in NFIT table should have a corresponding entry in SRAT table with
> > >> MEM_AFFINITY_NON_VOLATILE flag set.
> > >> Whether it solves any real issues aren't known to me and typically for
> > >> Intel contributed patches, author's email doesn't exists anymore so ...
> > >> And even if it fixes some new nvdimm issue, I'd rather have it broken
> > >> than regress memory hotplug that worked fine so far and wait for another
> > >> nvdimm patch that won't break existing features.
> > >>
> > >>> Anyway the revert seems justified to me.
> > >> I've killed quite enough time trying to find out a way to keep
> > >> everyone happy, but alas it's time to give up and let interested
> > >> party to deal with regressions while introducing new stuff for
> > >> nvdimm, hence this revert.
> > > If the original author isn't available, I agree the best option
> > > is to revert it to avoid the regression by now.
> >
> > Thanks Edurado, for informing us, and sorry for the delayed reply.
> > I'm not the author of the original patch(848a1cc1e), and is just trying
> > to understand
> > this part now...
> >
> > And my understanding is that the patch was cooked to make sure data in SRAT
> > align with the ones in NFIT.
> >
> > > Reviewed-by: Eduardo Habkost <address@hidden>
> > >
> > > However, have you considered keeping adding separate entries for
> > > NVDIMM devices only (so we follow the spec), but add a single
> > > (numa_nodes-1, MEM_AFFINITY_HOTPLUGGABLE|MEM_AFFINITY_ENABLED)
> > > entry to the rest?
> >
> > And I think that is what Igor did in patch 10efd7e108.
> >
> > But I feel puzzled why windows has such requirement. And I am also
> > curious, is
> > this a windows specific issue?
> It's is a Windows issue, but QEMU doesn't differentiate between guests
> OSes so in general it should work for all of them or at least do not break
> existing features.
>
Yes, agreed.
>
> > Reverting 848a1cc1e will lead the VM not able to figure out the correct
> > NUMA info,
> > e.g. when we are trying to assign different regions from NVDIMMs
> > residing on different
> > NUMA node.
> that's probably non issue as QEMU's impl. supports only 1:1 mapping,
> i.e. SPA for whole nvdimm goes to 1 node and SPA record already
> contains node ID it belongs to.
>
> The reason why I've acked 848a1cc1e is that it's correct per ACPI spec,
> but unfortunately QE found a test case where it regresses memory hotplug,
> hence a revert. We should find out a way to be spec compliant and
> not to break existing features.
>
I agree it's not easy to match all the requirements. So let's revert this
first. :)
B.R.
Yu
> Perhaps you'd be able to figure out how to accomplish both goals,
> I'd be happy to help you out if you'd need some assistance.
>
>
> >
> > B.R.
> > Yu
> >
> > > This way both use cases should still work as expected. If
> > > Windows break if using NVDIMM + Memory Hotplug at the same time,
> > > maybe that's just a Windows bug we can't work around.
> > >
>
>
- [Qemu-stable] [PATCH] pc: acpi: revert back to 1 SRAT entry for hotpluggable area, Igor Mammedov, 2018/08/22
- Re: [Qemu-stable] [PATCH] pc: acpi: revert back to 1 SRAT entry for hotpluggable area, Laszlo Ersek, 2018/08/22
- Re: [Qemu-stable] [Qemu-devel] [PATCH] pc: acpi: revert back to 1 SRAT entry for hotpluggable area, Igor Mammedov, 2018/08/22
- Re: [Qemu-stable] [Qemu-devel] [PATCH] pc: acpi: revert back to 1 SRAT entry for hotpluggable area, Eduardo Habkost, 2018/08/22
- Re: [Qemu-stable] [Qemu-devel] [PATCH] pc: acpi: revert back to 1 SRAT entry for hotpluggable area, Igor Mammedov, 2018/08/23
- Re: [Qemu-stable] [Qemu-devel] [PATCH] pc: acpi: revert back to 1 SRAT entry for hotpluggable area, Eduardo Habkost, 2018/08/23
- Re: [Qemu-stable] [Qemu-devel] [PATCH] pc: acpi: revert back to 1 SRAT entry for hotpluggable area, Igor Mammedov, 2018/08/24
- Re: [Qemu-stable] [Qemu-devel] [PATCH] pc: acpi: revert back to 1 SRAT entry for hotpluggable area, Eduardo Habkost, 2018/08/24
- Re: [Qemu-stable] [Qemu-devel] [PATCH] pc: acpi: revert back to 1 SRAT entry for hotpluggable area, Yu Zhang, 2018/08/23
- Re: [Qemu-stable] [Qemu-devel] [PATCH] pc: acpi: revert back to 1 SRAT entry for hotpluggable area, Igor Mammedov, 2018/08/23
- Re: [Qemu-stable] [Qemu-devel] [PATCH] pc: acpi: revert back to 1 SRAT entry for hotpluggable area,
Yu Zhang <=
Re: [Qemu-stable] [PATCH] pc: acpi: revert back to 1 SRAT entry for hotpluggable area, Eduardo Habkost, 2018/08/22