
Re: [PATCH] spapr: Migrate CAS reboot flag


From: David Gibson
Subject: Re: [PATCH] spapr: Migrate CAS reboot flag
Date: Wed, 22 Jan 2020 17:50:28 +1100

On Tue, Jan 21, 2020 at 10:32:55AM +0100, Greg Kurz wrote:
> On Tue, 21 Jan 2020 14:43:32 +1100
> David Gibson <address@hidden> wrote:
> 
> > On Mon, Jan 20, 2020 at 09:04:38AM +0100, Greg Kurz wrote:
> > > On Fri, 17 Jan 2020 16:44:27 +0100
> > > Greg Kurz <address@hidden> wrote:
> > > 
> > > > On Fri, 17 Jan 2020 19:16:08 +1000
> > > > David Gibson <address@hidden> wrote:
> > > > 
> > > > > On Thu, Jan 16, 2020 at 07:29:02PM +0100, Greg Kurz wrote:
> > > > > > On Thu, 16 Jan 2020 13:14:35 +0100
> > > > > > Greg Kurz <address@hidden> wrote:
> > > > > > 
> > > > > > > On Thu, 16 Jan 2020 11:37:24 +0100
> > > > > > > Laurent Vivier <address@hidden> wrote:
> > > > > > > 
> > > > > > > > On 16/01/2020 09:48, Greg Kurz wrote:
> > > > > > > > > On Wed, 15 Jan 2020 19:10:37 +0100
> > > > > > > > > Laurent Vivier <address@hidden> wrote:
> > > > > > > > > 
> > > > > > > > >> Hi,
> > > > > > > > >>
> > > > > > > > >> On 15/01/2020 18:48, Greg Kurz wrote:
> > > > > > > > >>> Migration can potentially race with CAS reboot. If the 
> > > > > > > > >>> migration thread
> > > > > > > > >>> completes migration after CAS has set spapr->cas_reboot but 
> > > > > > > > >>> before the
> > > > > > > > >>> mainloop could pick up the reset request and reset the 
> > > > > > > > >>> machine, the
> > > > > > > > >>> guest is migrated unrebooted and the destination doesn't 
> > > > > > > > >>> reboot it
> > > > > > > > >>> either because it isn't aware a CAS reboot was needed (e.g., 
> > > > > > > > >>> because a
> > > > > > > > >>> device was added before CAS). This likely results in a 
> > > > > > > > >>> broken or hung
> > > > > > > > >>> guest.
> > > > > > > > >>>
> > > > > > > > >>> Even if it is small, the window between CAS and CAS reboot 
> > > > > > > > >>> is enough to
> > > > > > > > >>> re-qualify spapr->cas_reboot as state that we should 
> > > > > > > > >>> migrate. Add a new
> > > > > > > > >>> subsection for that and always send it when a CAS reboot is 
> > > > > > > > >>> pending.
> > > > > > > > >>> This may cause migration to older QEMUs to fail but it is 
> > > > > > > > >>> still better
> > > > > > > > >>> than ending up with a broken guest.
> > > > > > > > >>>
> > > > > > > > >>> The destination cannot honour the CAS reboot request from a 
> > > > > > > > >>> post load
> > > > > > > > >>> handler because this must be done after the guest is fully 
> > > > > > > > >>> restored.
> > > > > > > > >>> It is thus done from a VM change state handler.
> > > > > > > > >>>
> > > > > > > > >>> Reported-by: Lukáš Doktor <address@hidden>
> > > > > > > > >>> Signed-off-by: Greg Kurz <address@hidden>
> > > > > > > > >>> ---
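> > > > > > > > >>>
> > > > > > > > >>> For illustration, a minimal sketch of the two pieces
> > > > > > > > >>> (simplified; names may differ from the actual patch):
> > > > > > > > >>>
> > > > > > > > >>> static bool spapr_cas_reboot_needed(void *opaque)
> > > > > > > > >>> {
> > > > > > > > >>>     SpaprMachineState *spapr = opaque;
> > > > > > > > >>>
> > > > > > > > >>>     return spapr->cas_reboot;
> > > > > > > > >>> }
> > > > > > > > >>>
> > > > > > > > >>> static const VMStateDescription vmstate_spapr_cas_reboot = {
> > > > > > > > >>>     .name = "spapr_cas_reboot",
> > > > > > > > >>>     .version_id = 1,
> > > > > > > > >>>     .minimum_version_id = 1,
> > > > > > > > >>>     .needed = spapr_cas_reboot_needed,
> > > > > > > > >>>     .fields = (VMStateField[]) {
> > > > > > > > >>>         VMSTATE_BOOL(cas_reboot, SpaprMachineState),
> > > > > > > > >>>         VMSTATE_END_OF_LIST()
> > > > > > > > >>>     },
> > > > > > > > >>> };
> > > > > > > > >>>
> > > > > > > > >>> /* VM change state handler: issue the pending CAS reboot on the
> > > > > > > > >>>  * destination once the guest is fully restored and running */
> > > > > > > > >>> static void spapr_cas_reboot_state_change(void *opaque, int running,
> > > > > > > > >>>                                           RunState state)
> > > > > > > > >>> {
> > > > > > > > >>>     SpaprMachineState *spapr = opaque;
> > > > > > > > >>>
> > > > > > > > >>>     if (running && spapr->cas_reboot) {
> > > > > > > > >>>         qemu_system_reset_request(SHUTDOWN_CAUSE_SUBSYSTEM_RESET);
> > > > > > > > >>>     }
> > > > > > > > >>> }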
> > > > > > > > >>>
> > > > > > > > >>
> > > > > > > > >> I'm wondering if the problem can be related to the fact 
> > > > > > > > >> that
> > > > > > > > >> main_loop_should_exit() could release qemu_global_mutex in
> > > > > > > > >> pause_all_vcpus() in the reset case?
> > > > > > > > >>
> > > > > > > > >> 1602 static bool main_loop_should_exit(void)
> > > > > > > > >> 1603 {
> > > > > > > > >> ...
> > > > > > > > >> 1633     request = qemu_reset_requested();
> > > > > > > > >> 1634     if (request) {
> > > > > > > > >> 1635         pause_all_vcpus();
> > > > > > > > >> 1636         qemu_system_reset(request);
> > > > > > > > >> 1637         resume_all_vcpus();
> > > > > > > > >> 1638         if (!runstate_check(RUN_STATE_RUNNING) &&
> > > > > > > > >> 1639                 !runstate_check(RUN_STATE_INMIGRATE)) {
> > > > > > > > >> 1640             runstate_set(RUN_STATE_PRELAUNCH);
> > > > > > > > >> 1641         }
> > > > > > > > >> 1642     }
> > > > > > > > >> ...
> > > > > > > > >>
> > > > > > > > >> I already sent a patch for this kind of problem (it's in Juan's
> > > > > > > > >> current pull request):
> > > > > > > > >>
> > > > > > > > >> "runstate: ignore finishmigrate -> prelaunch transition"
> > > > > > > > >>
> > > > > > > > > 
> > > > > > > > > IIUC your patch avoids an invalid 'prelaunch' -> 
> > > > > > > > > 'postmigrate' runstate
> > > > > > > > > transition that can happen if the migration thread sets the 
> > > > > > > > > runstate to
> > > > > > > > > 'finishmigrate' when pause_all_vcpus() releases the main loop 
> > > > > > > > > mutex.
> > > > > > > > > 
> > > > > > > > > i.e. the symptom of the problem is QEMU aborting, correct? The 
> > > > > > > > > issue I'm
> > > > > > > > > trying to fix is a guest breakage caused by a discrepancy 
> > > > > > > > > between
> > > > > > > > > QEMU and the guest after migration has succeeded.
> > > > > > > > > 
> > > > > > > > >> but I don't know if it could fix this one.
> > > > > > > > >>
> > > > > > > > > 
> > > > > > > > > I don't think so and your patch kinda illustrates it. If the 
> > > > > > > > > runstate
> > > > > > > > > is 'finishmigrate' when returning from pause_all_vcpus(), 
> > > > > > > > > this means
> > > > > > > > > that the state was sent to the destination before we could 
> > > > > > > > > actually reset
> > > > > > > > > the machine.
> > > > > > > > 
> > > > > > > > Yes, you're right.
> > > > > > > > 
> > > > > > > > But the question behind my comment was: is it expected to have 
> > > > > > > > a pending
> > > > > > > > reset while we are migrating?
> > > > > > > > 
> > > > > > > 
> > > > > > > Nothing prevents qemu_system_reset_request() from being called while
> > > > > > > migration is active. 
> > > > > > > 
> > > > > > > > Perhaps H_CAS can return H_BUSY, wait for the end of the migration
> > > > > > > > and then be fully executed on the destination?
> > > > > > > > 
> > > > > > > 
> > > > > > > And so we would need to teach SLOF to try H_CAS again until it 
> > > > > > > stops
> > > > > > > returning H_BUSY? It seems safer to migrate the CAS reboot flag 
> > > > > > > IMHO.
> > > > > > > 
> > > > > > 
> > > > > > Ok I've tried that with a patched SLOF that sleeps 500ms and tries 
> > > > > > CAS
> > > > > > again if H_BUSY was returned. It fixes the issue but it looks a bit
> > > > > > ugly because of the polling with an arbitrary timeout in SLOF... I'm
> > > > > > not very comfortable either with calling migration_is_active() from
> > > > > > the CAS code in QEMU.
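> > > > > >
> > > > > > For the record, the QEMU side of that experiment boils down to an
> > > > > > early return at the top of h_client_architecture_support(), along
> > > > > > these lines (assuming migration_is_active() takes the current
> > > > > > MigrationState):
> > > > > >
> > > > > >     if (migration_is_active(migrate_get_current())) {
> > > > > >         return H_BUSY;
> > > > > >     }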
> > > > > > 
> > > > > > David,
> > > > > > 
> > > > > > Any suggestion ?
> > > > > 
> > > > > Yeah, I think looping in SLOF is a worse idea than migrating the
> > > > > cas_reboot flag.
> > > > > 
> > > > > But.. a better solution still might be to just remove the remaining
> > > > > causes for CAS reboot entirely.  CAS reboots pretty much suck when
> > > > > they happen, anyway.
> > > > > 
> > > > 
> > > > I agree.
> > > > 
> > > > > With the irq changeover condition removed, I think the remaining
> > > > > causes are more theoretical than practical at this point.
> > > > > 
> > > > 
> > > > FWIW, hotplugging a PCI device before CAS results in a hung guest (I
> > > > haven't investigated the details yet).
> > > 
> > > commit 10f12e6450407b18b4d5a6b50d3852dcfd7fff75
> > > Author: Daniel Henrique Barboza <address@hidden>
> > > Date:   Wed Aug 30 15:21:41 2017 -0300
> > > 
> > >     hw/ppc: CAS reset on early device hotplug
> > > 
> > > I'll have a look to see what can be done here.
> > 
> > Ah.. yes, that one might be a bit tricky.
> > 
> 
> So far it seems to be related to SLOF not being able to create
> new nodes in the DT when parsing the FDT returned by CAS. SLOF
> stops the parsing and returns an error. The guest ends up with
> a broken DT and eventually hangs later (in my case the kernel
> believes it is going to do hash while radix was negotiated with
> QEMU). I need to dig some more.

Uh... I don't think this is right.  I'm pretty sure SLOF *does* create
new nodes when parsing the CAS FDT; it needs to for
"ibm,dynamic-reconfiguration-memory" at least.

I've done some looking and I think the actual reasons here are a bit
more complicated (but also possibly easier to handle).

  1. We can't send hotplug events to the guest until after CAS,
     because before then we don't know if it can process them, or if
     it can, which interrupt it uses for them.

  2. Queueing hotplug events across CAS for delivery afterwards
     introduces other complications.

  3. We need to make sure that each device appears exactly once in
     either the initial device tree that the guest OS sees, *or* in a
     hotplug event, not both or neither.

Now that we rebuild the DT at CAS time, I think we might be able to
handle this by converting such devices to "cold plugged" at CAS time
(similarly to what we already do at reset).  Since they're in the
CAS-time DT which is what the OS will consume, cold plugged is
effectively how the OS will see them.
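
Roughly, a sketch of that conversion (the QOM walk and
TYPE_SPAPR_DR_CONNECTOR are real; drc_absorb_pending_hotplug() is a
made-up name for whatever ends up finalizing the attach):

    /* Walk every DRC at CAS time: a device hotplugged before CAS is
     * already in the rebuilt DT, so make its DRC look cold plugged
     * instead of leaving a hotplug event in flight. */
    static int spapr_cas_coldplug_one(Object *child, void *opaque)
    {
        SpaprDrc *drc =
            (SpaprDrc *) object_dynamic_cast(child, TYPE_SPAPR_DR_CONNECTOR);

        if (drc && drc->dev) {
            drc_absorb_pending_hotplug(drc);  /* assumed helper */
        }
        return 0;
    }

    static void spapr_cas_coldplug_pending(SpaprMachineState *spapr)
    {
        object_child_foreach_recursive(object_get_root(),
                                       spapr_cas_coldplug_one, spapr);
    }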

A remaining problem is that new PCI devices won't get BARs assigned by
SLOF in this scenario.  We'll probably get away with it because of the
linux,pci-probe-only property, but I don't know that we want to rely on
that.  Hotplugged PCI bridges introduce further complications, because
they won't get enumerated.
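
(For reference, linux,pci-probe-only is just a cell under /chosen, so
forcing it in the CAS-time tree would be a one-liner with the usual
spapr fdt helpers, assuming 'chosen' holds the node offset:

    _FDT(fdt_setprop_cell(fdt, chosen, "linux,pci-probe-only", 1));

though that still wouldn't help the bridge enumeration case.)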

> > > But I agree the other check is more theoretical:
> > > 
> > >     /* capabilities that have been added since CAS-generated guest reset.
> > >      * if capabilities have since been removed, generate another reset
> > >      */
> > >     spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas);
> > > 
> > > Unless changing kernels or tampering with the kernel command line, I don't
> > > see how some capabilities could change between the two CAS calls in practice.
> > 
> > Well, we want to be robust and it's at least theoretically possible
> > that the guest will request different things on subsequent reboots.
> 
> Yes but in the latter case a full machine reset occurs and

Not necessarily.  A guest could ask for something on one CAS cycle,
then reject it on another, without doing a full reboot.  It'd be a
pointless thing for the guest to do, but it's possible.

> spapr->ov5_cas gets cleared, ie. spapr_ovec_subset() returns
> true in the check above no matter what.

Well, also it could happen if the guest rejects something we put in
the initial value of ov5_cas (which is populated from spapr->ov5, so
it's not entirely empty).
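
Concretely (spapr_ovec_clone() is the real helper used to snapshot the
previous negotiation; OV5_MMU_RADIX_300 is just an example bit):

    /* snapshot taken before parsing the new CAS request */
    ov5_cas_old = spapr_ovec_clone(spapr->ov5_cas);
    ...
    /* any bit set in ov5_cas_old but missing from the freshly
     * negotiated spapr->ov5_cas (say, OV5_MMU_RADIX_300) makes the
     * subset test fail and triggers another CAS reboot */
    spapr->cas_reboot = !spapr_ovec_subset(ov5_cas_old, spapr->ov5_cas);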

> > However I believe that the original rationale for this check was that
> > while we could add things to the device tree for added capabilities,
> > we didn't have a way to roll back the changes for removed
> > capabilities.
> > 
> 
> IIUC this is specifically for "removed capabilities since last
> CAS". This can happen if:
> 1) we're already processing a CAS reboot or,
> 2) a freshly rebooted guest calls CAS twice without being rebooted
>    in between.
> 
> Since a freshly booted or rebooted guest can only trigger a CAS
> reboot because of a "hotplug-before-CAS", if we manage to get rid
> of this limitation, 1) cannot happen anymore.
> 
> The Linux kernel seems to call "ibm,client-architecture-support" only
> once during early boot, so 2) should _never_ happen. Do we care to support
> this scenario anyway?

I think you've missed some things in your reasoning.  But it doesn't
really matter because the full DT rebuild should handle it anyway.  I
have a draft patch which removes this cause for CAS reboots.

Still grappling with the hotplug-before-CAS case.

> > Now that we fully rebuild the device tree at CAS, I think this test
> > can probably just go, although there's some double checking to do.
> > 
> 
> I tend to agree.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
