Re: [Qemu-ppc] Problem with "savevm" on ppc64
From: Dr. David Alan Gilbert
Subject: Re: [Qemu-ppc] Problem with "savevm" on ppc64
Date: Fri, 21 Oct 2016 10:12:28 +0100
User-agent: Mutt/1.7.1 (2016-10-04)
* David Gibson (address@hidden) wrote:
> On Thu, Oct 20, 2016 at 03:17:12PM +0200, Thomas Huth wrote:
> > Hi all,
> >
> > I'm currently facing a strange problem with the "savevm" HMP command on
> > ppc64 with TCG and the pseries machine. Steps for reproduction:
> >
> > 1) Create a disk image:
> > qemu-img create -f qcow2 /tmp/test.qcow2 1M
> >
> > 2) Start QEMU (in TCG mode):
> > qemu-system-ppc64 -nographic -vga none -m 256 -hda /tmp/test.qcow2
> >
> > 3) Hit "CTRL-a c" to enter the HMP monitor
> >
> > 4) Type "savevm test1" to save a snapshot
> >
> > The savevm command then hangs forever and the test.qcow2 image keeps
> > growing and growing.
> >
> > It seems like qemu_savevm_state_iterate() does not make any more
> > progress because ram_save_iterate() keeps returning 0 ... but why can
> > that happen?
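For reference, the loop that ends up spinning here is, roughly, the one in
qemu_savevm_state() (a trimmed paraphrase of migration/savevm.c, not the
exact code):

    /* Sketch of qemu_savevm_state(); setup, locking and error handling
     * omitted. */
    static int qemu_savevm_state(QEMUFile *f, Error **errp)
    {
        /* header and "begin" phase run first (omitted here) */

        while (qemu_file_get_error(f) == 0) {
            /* 0 means "not finished yet": if the iterators keep reporting
             * 0 we never leave this loop, but each pass still writes
             * section data, which is why the qcow2 keeps growing. */
            if (qemu_savevm_state_iterate(f, false) > 0) {
                break;
            }
        }

        /* only after the loop exits do we run
         * qemu_savevm_state_complete_precopy(f, false) */
        return qemu_file_get_error(f);
    }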
>
> Hmm. You don't mention installing anything on the disk image, so I'm
> assuming the VM is just sitting in SLOF, unable to boot. And it's
> TCG, so there's no VRMA. Which means I'd expect the HPT to be
> completely empty.
>
> I strongly suspect that's what's triggering the problem, although I'm
> not quite sure of the mechanism.
>
> > I've tinkered with the code a little bit, and I can get it to work with
> > the following patch - which however is quite independent of the
> > ram_save_iterate() code:
> >
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index ddb7438..a7ac0bf 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -1290,7 +1290,7 @@ static int htab_save_setup(QEMUFile *f, void *opaque)
> > return 0;
> > }
> >
> > -static void htab_save_first_pass(QEMUFile *f, sPAPRMachineState *spapr,
> > +static int htab_save_first_pass(QEMUFile *f, sPAPRMachineState *spapr,
> > int64_t max_ns)
> > {
> > bool has_timeout = max_ns != -1;
> > @@ -1340,6 +1340,8 @@ static void htab_save_first_pass(QEMUFile *f, sPAPRMachineState *spapr,
> > spapr->htab_first_pass = false;
> > }
> > spapr->htab_save_index = index;
> > +
> > + return !spapr->htab_first_pass;
> > }
> >
> > static int htab_save_later_pass(QEMUFile *f, sPAPRMachineState *spapr,
> > @@ -1444,7 +1446,7 @@ static int htab_save_iterate(QEMUFile *f, void *opaque)
> > return rc;
> > }
> > } else if (spapr->htab_first_pass) {
> > - htab_save_first_pass(f, spapr, MAX_ITERATION_NS);
> > + rc = htab_save_first_pass(f, spapr, MAX_ITERATION_NS);
> > } else {
> > rc = htab_save_later_pass(f, spapr, MAX_ITERATION_NS);
> > }
> >
> > That means, if htab_save_iterate() does not initially return a 0, then
> > ram_save_iterate() also does not run into the condition of returning
> > zeroes later. But how are these related? And is it safe to return a
> > non-zero value in htab_save_first_pass()? Does anybody with more
> > experience in this area have a clue what's going on here?
>
> So, looking at this I think it's unsafe. htab_save_first_pass() never
> examines dirty bits, so we could get:
>   htab_save_first_pass() called once, saves part of HPT
>   guest dirties an HPTE in the already saved region
>   enter migration completion stage
>   htab_save_first_pass() saves the remainder of the HPT, returns 1
>
> That would trigger the code to think the HPT migration is complete,
> without ever saving the HPTE that got dirtied part way through.
>
> But.. then I looked further and got *really* confused.
Ewww, my head.....this is very confusing.
> The comment above qemu_savevm_state_iterate() and the logic in
> qemu_savevm_state() say that the iterate function returns >0 to
> indicate that it is done and we can move onto the completion phase.
>
> But both ram_save_iterate() and block_save_iterate() seem to have that
> inverted: they return >0 if they actually saved something.
I think you're right that the logic looks inverted somewhere, but be careful
to distinguish:
  a) The return value of qemu_savevm_state_iterate
  b) The return values of the individual iterators it calls
Both of those return values individually make sense *if* the logic
within qemu_savevm_state_iterate did the right thing; which I'm not
convinced of.
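For reference, the structure in question is roughly this (a trimmed
paraphrase of qemu_savevm_state_iterate() in migration/savevm.c, not the
exact code):

    int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
    {
        SaveStateEntry *se;
        int ret = 1;

        QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
            /* skipping of inactive handlers, rate limiting and section
             * headers/footers omitted */
            ret = se->ops->save_live_iterate(f, se->opaque);   /* (b) */
            if (ret <= 0) {
                /* Do not proceed to the next vmstate before this one
                 * reported completion of the current stage. */
                break;
            }
        }
        return ret;    /* (a): this is what qemu_savevm_state() loops on */
    }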
> I think the only reason migration has been working *at all* is that
> migration_thread() and qemu_savevm_state_complete_precopy() ignore the
> return value of the iterate() function (except for <0 errors).
>
> qemu_savevm_state(), however, does not.
However, there are two other reasons it works:
  a) We keep iterating in the migration code for as long as the amount of
     remaining iterable RAM/etc is large (i.e. the result of
     qemu_savevm_state_pending).
  b) Even if we never called the _iterate functions, the stream should
     still be correct, because the _complete function should ensure that
     anything left over is migrated. I don't *think* this can happen in a
     normal migration, but hmm, maybe if you set a very large downtime.
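The decision in migration_thread() looks roughly like this (heavily trimmed
and paraphrased from migration/migration.c; don't take the exact calls or
variable names as gospel):

    while (s->state == MIGRATION_STATUS_ACTIVE) {
        uint64_t pend_nonpost, pend_post, pending_size;

        qemu_savevm_state_pending(s->to_dst_file, max_size,
                                  &pend_nonpost, &pend_post);
        pending_size = pend_nonpost + pend_post;

        if (pending_size && pending_size >= max_size) {
            /* (a) plenty left to send: keep iterating; the return value
             * is only looked at for errors */
            qemu_savevm_state_iterate(s->to_dst_file, false);
        } else {
            /* (b) little enough left for the downtime budget: stop the
             * guest and let the _complete handlers flush the rest */
            migration_completion(s, current_active_state,
                                 &old_vm_running, &start_time);
            break;
        }
    }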
> So, I don't think this is actually a Power bug. But, it's masked on
> machine types other than pseries, because they will almost never have
> >1 iterable state handler.
Well, they do have more than one in the case of qemu block migration, but I
guess that's probably only used in migration, not savevm.
> I think what must be happening is that
> because we don't have time limits in the savevm case, the first call
> to ram_save_iterate() from qemu_savevm_state_iterate() is saving all
> of RAM. On x86 that would be the end of the QTAILQ_FOREACH(), we'd
> return non-zero and that would kick qemu_savevm_state() into the
> completion phase.
>
> But on Power, we save all RAM, then move onto the HPT iterator. It
> returns 0 on the first pass, and we loop again in
> qemu_savevm_state(). But on the next call, RAM is all saved, so
> ram_save_iterate now returns 0. That breaks us out of the FOREACH in
> qemu_savevm_state_iterate(), returning 0. Repeat forever.
>
> Juan, Dave, sound right?
I think so.
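To spell that sequence out as I read it (no rate limiting in the savevm
case):

    call #1: ram_save_iterate()  -> saves all of RAM, returns > 0
             htab_save_iterate() -> first pass, returns 0
             => qemu_savevm_state_iterate() returns 0, savevm loops again
    call #2: ram_save_iterate()  -> nothing left dirty, returns 0
             => break out of the QTAILQ_FOREACH, return 0
    call #3 onwards: same as call #2, forever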
> I suspect the right fix is to make ram_save_iterate() return void.
> Migration will use the bandwidth / downtime calculations to determine
> completion as always. savevm would skip the "live" phase entirely and
> move directly on to completion phase unconditionally.
Skipping the live phase sounds OK; I haven't got a clue whether any other
iterators will like that. But then, why did qemu_savevm_state() ever bother
calling the iterate functions?
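If we did go that way, the savevm side might end up looking something like
this (purely illustrative, untested, and not a proposed patch):

    /* Hypothetical shape of qemu_savevm_state() with no live phase:
     * everything is left to the _complete handlers, which would have to
     * save all remaining RAM, HPT entries, block data, etc. */
    static int qemu_savevm_state(QEMUFile *f, Error **errp)
    {
        /* header and "begin" phase as today (omitted here) */

        /* no iterate loop at all: go straight to completion */
        qemu_savevm_state_complete_precopy(f, false);

        return qemu_file_get_error(f);
    }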
However, there's the comment near the end of qemu_savevm_state_iterate,
which I don't think we're obeying (because of the disagreement on return
values), but which sounds like a reasonable goal:
    /* Do not proceed to the next vmstate before this one reported
       completion of the current stage. This serializes the migration
       and reduces the probability that a faster changing state is
       synchronized over and over again. */
I think the logic here is that if the return value worked the way the
comment at the top of the function (and the htab iterator) thinks it does,
then with a bandwidth limit you'd end up repeatedly calling ram_save_iterate
and never calling the block iterator until RAM was done; then you'd do some
block, and maybe come back to RAM, but the iteration would always prefer the
items near the top of the list of iterators.
Dave
> --
> David Gibson                   | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au | minimalist, thank you.  NOT _the_ _other_
>                                | _way_ _around_!
> http://www.ozlabs.org/~dgibson
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK