Re: [PATCH 00/33] hw/cpu/arm: Remove one use of qemu_get_cpu() in A7/A15

From:	Cédric Le Goater
Subject:	Re: [PATCH 00/33] hw/cpu/arm: Remove one use of qemu_get_cpu() in A7/A15 MPCore priv
Date:	Fri, 12 Jan 2024 09:41:40 +0100
User-agent:	Mozilla Thunderbird

On 1/10/24 07:03, Markus Armbruster wrote:

Peter Xu <peterx@redhat.com> writes:

On Tue, Jan 09, 2024 at 10:22:31PM +0100, Philippe Mathieu-Daudé wrote:

Hi Fabiano,

On 9/1/24 21:21, Fabiano Rosas wrote:

Cédric Le Goater <clg@kaod.org> writes:

On 1/9/24 18:40, Fabiano Rosas wrote:

Cédric Le Goater <clg@kaod.org> writes:

On 1/3/24 20:53, Fabiano Rosas wrote:

Philippe Mathieu-Daudé <philmd@linaro.org> writes:

+Peter/Fabiano

On 2/1/24 17:41, Cédric Le Goater wrote:

On 1/2/24 17:15, Philippe Mathieu-Daudé wrote:

Hi Cédric,

On 2/1/24 15:55, Cédric Le Goater wrote:

On 12/12/23 17:29, Philippe Mathieu-Daudé wrote:

Hi,

When an MPCore cluster is used, the Cortex-A cores belong to the
cluster container, not to the board/soc layer. This series moves
the creation of vCPUs to the MPCore private container.

Doing so we consolidate the QOM model, moving common code in a
central place (abstract MPCore parent).


Changing the QOM hierarchy has an impact on the state of the machine
and some fixups are then required to maintain migration compatibility.
This can become a real headache for KVM machines like virt for which
migration compatibility is a feature, less for emulated ones.


All changes are either moving properties (which are not migrated)
or moving non-migrated QOM members (i.e. pointers of ARMCPU, which
is still migrated elsewhere). So I don't see any obvious migration
problem, but I might be missing something, so I Cc'ed Juan :>


FWIW, I didn't spot anything problematic either.

I've run this through my migration compatibility series [1] and it
doesn't regress aarch64 migration from/to 8.2. The tests use '-M
virt -cpu max', so the cortex-a7 and cortex-a15 are not covered. I don't
think we even support migration of anything non-KVM on arm.


It happens we do.


Oh, sorry, I didn't mean TCG here. Probably meant to say something like
non-KVM-capable cpus, as in 32-bit. Nevermind.


Theoretically, we should be able to migrate to a TCG guest. Well, this
worked in the past for PPC. When I was doing more KVM related changes,
this was very useful for dev. Also, some machines are partially emulated.
Anyhow I agree this is not a strong requirement and we often break it.
Let's focus on KVM only.

1- https://gitlab.com/farosas/qemu/-/jobs/5853599533


yes it depends on the QOM hierarchy and virt seems immune to the changes.
Good.

However, changing the QOM topology clearly breaks migration compat,


Well, "clearly" is relative =) You've mentioned pseries and aspeed
already, do you have a pointer to one of those cases where we broke
migration


Regarding pseries, migration compat broke because of 5bc8d26de20c
("spapr: allocate the ICPState object from under sPAPRCPUCore") which
is similar to the changes proposed by this series, it impacts the QOM
hierarchy. Here is the workaround/fix from Greg : 46f7afa37096
("spapr: fix migration of ICPState objects from/to older QEMU") which
is quite a headache and this turned out to raise another problem some
months ago ... :/ That's why I sent [1] to prepare removal of old
machines and workarounds becoming a burden.


This feels like something that could be handled by the vmstate code
somehow. The state is there, just under a different path.


What, the QOM path is used in migration? ...


Hopefully not..


See recent discussions on "QOM path stability":
https://lore.kernel.org/qemu-devel/ZZfYvlmcxBCiaeWE@redhat.com/
https://lore.kernel.org/qemu-devel/87jzojbxt7.fsf@pond.sub.org/
https://lore.kernel.org/qemu-devel/87v883by34.fsf@pond.sub.org/


If I read it right, the commit 46f7afa37096 example is pretty special in
that the change affected more than the hierarchy (the QOM paths) itself:
it changed the existence of objects.


Let's see whether I got this...

We removed some useless objects, moved the useful ones to another home.
The move changed their QOM path.


The interrupt controller presenter objects were quite useful :)
From what I recall, we moved them from an array under the machine
to the CPU object, so the interrupt controller presenter states
previously under the machine were not there anymore and this broke
migration compatibility.

Sorry for the noise if this is not a problem anymore. It was at
the time and we found a way to work around it; I should probably
say we hacked our way around it. Nevertheless, this was kind of
a trauma too, because since then I never dared touch the QOM hierarchy
of a migratable machine again. Migration is complicated.

The problem was the removal of useless objects, because this also
removed their vmstate.

The fix was adding the vmstate back as a dummy.

The QOM path changes are *not* part of the problem.

Correct?
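
To make the summary above concrete, here is a minimal toy sketch in Python. It is not QEMU code and uses no QEMU API: the `Destination` class, the section names ("cpu", "legacy-icp") and the helper functions are all hypothetical. It only assumes a stream made of named sections where the destination rejects any section it has no handler for, and shows why registering a dummy handler for a removed object lets an old stream load again.

```python
# Toy model only: none of these names are QEMU APIs.

class Destination:
    """Receiver that dispatches each incoming section to a registered handler."""

    def __init__(self):
        self.handlers = {}  # section name -> callable consuming the payload

    def register(self, name, handler):
        self.handlers[name] = handler

    def load(self, stream):
        for name, payload in stream:
            if name not in self.handlers:
                # Removing an object also removed its handler, so the
                # section sent by an older source is now unknown.
                raise RuntimeError(f"unknown section {name!r}")
            self.handlers[name](payload)


def make_old_stream():
    # Hypothetical old source: still sends state for an object that the
    # newer version no longer creates.
    return [("cpu", {"pc": 0x1000}), ("legacy-icp", {"pending": 0})]


def new_destination(with_dummy):
    state = {}
    dest = Destination()
    dest.register("cpu", lambda payload: state.update(cpu=payload))
    if with_dummy:
        # Dummy handler: accept the removed object's section and discard it.
        dest.register("legacy-icp", lambda payload: None)
    return dest, state


def try_load(dest, stream):
    """Return True if the stream loads, False if a section is rejected."""
    try:
        dest.load(stream)
        return True
    except RuntimeError:
        return False
```

In this toy, `try_load(new_destination(False)[0], make_old_stream())` fails on the "legacy-icp" section, while the same stream loads once the dummy handler is registered. Nothing in the model depends on where the objects sit in a hierarchy, only on which sections exist.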

No one wants
to be policing QOM hierarchy changes in every single series that shows
up on the list.

Anyway, thanks for the pointers. I'll study that code a bit more, maybe
I can come up with some way to handle these cases.

Hopefully between the analyze-migration test and the compat tests we'll
catch the next bug of this kind before it gets merged.


Things like that might be able to be detected via vmstate-static-checker.py.
But I'm not 100% sure, also its coverage is limited.

For example, I don't think it can detect changes to objects that will only
be created dynamically, e.g., I think sometimes we create objects after
some guest behaviors (consider guest enables the device, then QEMU
emulation creates some objects on demand of device setup?),


Feels nuts to me.

In real hardware, software enabling a device that is disabled by default
doesn't create the device.  The device is always there, it just happens
to be inactive unless enabled.  We should model the device just like
that.


Yes. That's how we modeled the two interrupt modes in pseries. The
machine has two interrupt controller device models always present,
and the cpus have two interrupt presenters. SW negotiates with the
platform (QEMU) which mode to activate. This is the only way to
support migration with an OS that can choose such complex features.


For context, POWER9 introduced a new flavor of HW logic for
interrupts, which scaled better on large systems (16s), and guests
with a newer OS could dynamically switch the SW interface to choose
the new implementation.
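
The "always present, selected by negotiation" approach described above can be sketched as a toy model. This is plain Python, not QEMU code: `Controller`, `Machine`, the mode names "legacy" and "modern", and the `save`/`load` helpers are hypothetical names chosen for illustration. The point is that the saved state has the same shape whichever mode the guest picked, because only a mode field changes.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Controller:
    # Interrupts queued on this controller model.
    pending: list = field(default_factory=list)


@dataclass
class Machine:
    # Both controller models always exist; negotiation only flips `active`.
    legacy: Controller = field(default_factory=Controller)
    modern: Controller = field(default_factory=Controller)
    active: str = "legacy"

    def negotiate(self, mode):
        if mode not in ("legacy", "modern"):
            raise ValueError(f"unknown interrupt mode {mode!r}")
        self.active = mode

    def raise_irq(self, irq):
        # Route the interrupt to whichever model is currently active.
        getattr(self, self.active).pending.append(irq)


def save(machine):
    # The snapshot always holds both controllers plus the mode.
    return asdict(machine)


def load(snapshot):
    return Machine(
        legacy=Controller(**snapshot["legacy"]),
        modern=Controller(**snapshot["modern"]),
        active=snapshot["active"],
    )
```

For example, after `m = Machine(); m.negotiate("modern"); m.raise_irq(5)`, the result of `load(save(m))` compares equal to `m`, and a machine that never negotiated produces a snapshot with exactly the same keys.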

Thanks,

C.

[...] and since the
static checker shouldn't ever start the VM and run any code, they won't
trigger.
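
The coverage limit mentioned here can be illustrated with a toy comparison. This sketch does not reproduce vmstate-static-checker.py or its input format; `static_diff` and the section and field names are hypothetical. It assumes a checker that only compares descriptions declared up front, so anything that would only be registered after guest activity never reaches it.

```python
# Toy model only: not the real checker and not its real input format.

def static_diff(old_desc, new_desc):
    """Report sections or fields present in old_desc but absent in new_desc."""
    problems = []
    for name, fields in old_desc.items():
        if name not in new_desc:
            problems.append(f"section {name} missing")
            continue
        for fname in fields:
            if fname not in new_desc[name]:
                problems.append(f"field {name}.{fname} missing")
    return problems


# Statically declared state: visible without ever starting a VM.
old_static = {"cpu": ["pc", "sp"], "timer": ["count"]}
new_static = {"cpu": ["pc", "sp"], "timer": ["count"]}

# Hypothetical state that only exists once the guest has set up a device.
old_dynamic = {"dev-queue": ["head", "tail"]}
new_dynamic = {"dev-queue": ["head"]}

# What a purely static check sees: nothing changed.
static_report = static_diff(old_static, new_static)

# What only shows up if the dynamic state is part of the comparison.
runtime_report = static_diff(old_dynamic, new_dynamic)
```

Here `static_report` is empty even though `runtime_report` contains "field dev-queue.tail missing", which is the gap described above: the incompatible change sits in state the static comparison never gets to look at.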