qemu-devel

Re: Ways to deal with broken machine types


From: Daniel P . Berrangé
Subject: Re: Ways to deal with broken machine types
Date: Tue, 23 Mar 2021 17:40:36 +0000
User-agent: Mutt/2.0.5 (2021-01-21)

On Tue, Mar 23, 2021 at 05:54:47PM +0100, Igor Mammedov wrote:
> Let me hijack this thread for beyond this case scope.
> 
> I agree that for this particular bug we've done all we could, but
> there is broader issue to discuss here.
> 
> We have machine versions to deal with hw compatibility issues and that
> covers most of the cases, but occasionally we notice a problem well after
> release(s), so users may be stuck with broken VM and need to manually fix
> configuration (and/or VM).
> Figuring out what's wrong and how to fix it is far from trivial. So let's
> discuss if we can help to ease this pain, yes it will be late for first
> victims but it's still better than never.

To summarize the problem situation:

 - We rely on a machine type version to encode a precise guest ABI.
 - Due to a bug, we are in a situation where the same machine type
   encodes two distinct guest ABIs, because of a mistake introduced
   between QEMU N-2 and N-1
 - We want to fix the bug in QEMU N
 - For incoming migration there is no way to distinguish between
   the ABIs used in N-2 and N-1, to pick the right one

So we're left with an unwinnable problem:

  - Not fixing the bug =>

       a) users migrating N-2 to N-1 have ABI change
       b) users migrating N-2 to N have ABI change
       c) users migrating N-1 to N are fine

    No mitigation for (a) or (b)

  - Fixing the bug =>

       a) users migrating N-2 to N-1 have ABI change
       b) users migrating N-2 to N are fine
       c) users migrating N-1 to N have ABI change

    Bad situations (a) and (c) are mitigated by
    backporting fix to N-1-stable too.
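
The two outcome lists above can be reproduced with a tiny model. This is
only a sketch: the labels "A" (original ABI) and "B" (the accidental ABI)
and all names in it are made up for illustration, they are not anything
QEMU exposes.

```python
# ABI exposed by one machine type in each release. "A" is the original ABI,
# "B" is the accidental one introduced between N-2 and N-1 (labels are
# illustrative assumptions, not QEMU terminology).
NOT_FIXED = {"N-2": "A", "N-1": "B", "N": "B"}
FIXED = {"N-2": "A", "N-1": "B", "N": "A"}

# The three migration paths considered in the lists above.
MIGRATIONS = [("N-2", "N-1"), ("N-2", "N"), ("N-1", "N")]


def abi_changes(abi_by_release):
    """Return the migrations whose source and target ABIs differ."""
    return [(src, dst) for src, dst in MIGRATIONS
            if abi_by_release[src] != abi_by_release[dst]]


if __name__ == "__main__":
    print("not fixing:", abi_changes(NOT_FIXED))
    print("fixing:    ", abi_changes(FIXED))
```

Either way two of the three paths see an ABI change; the choice only
moves which users are affected.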

Generally we have preferred to fix the bug, because we have
usually identified them fairly quickly after release, and
backporting the fix to stable has been sufficient mitigation
against ill effects. Basically the people left broken are a
relatively small set out of the total userbase.

The real challenge arises when we are slow to identify the
problem, such that we have a large number of people impacted.


> I'll try to sum up idea Michael suggested (here comes my unorganized
> brain-dump),
> 
> 1. We can keep in VM's config QEMU version it was created on
>    and as minimum warn user with a pointer to known issues if version in
>    config mismatches version of actually used QEMU, with a knob to silence
>    it for particular mismatch.
> 
> When an issue becomes known and resolved we know for sure how and what
> changed and embed instructions on what options to use for fixing up VM's
> config to preserve old HW config depending on QEMU version VM was
> installed on.

> some more ideas:
>    2. let mgmt layer to keep fixup list and apply them to config if available
>        (user would need to upgrade mgmt or update fixup list somehow)
>    3. let mgmt layer to pass VM's QEMU version to currently used QEMU, so
>       that QEMU could maintain and apply fixups based on QEMU version +
>       machine type.
>       The user will have to upgrade to newer QEMU to get/use new fixups.

The nice thing about machine type versioning is that we are treating the
versions as opaque strings which represent a specific ABI, regardless of
the QEMU version. This means that even if distros backport fixes for bugs
or even new features, the machine type compatibility check remains a
simple equality comparison.

As soon as you introduce the QEMU version though, we have created a
large matrix for compatibility. This matrix is expanded if a distro
chooses to backport fixes for any of the machine type bugs to their
stable streams. This can get particularly expensive when there are
multiple streams a distro is maintaining.
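
To make the contrast concrete, here is a rough sketch. The function names
are hypothetical and the way of counting cases is my own simplification,
not a measurement of any real test matrix.

```python
from itertools import product


def compatible_by_machine_type(src_machine, dst_machine):
    """Today's check: machine types are opaque strings, so equality suffices."""
    return src_machine == dst_machine


def version_aware_cases(machine_types, qemu_builds):
    """Enumerate (machine type, source build, target build) combinations.

    Once the QEMU build matters, every distinct build, including each
    distro stable stream carrying backports, adds a row and a column.
    """
    return list(product(machine_types, qemu_builds, qemu_builds))


if __name__ == "__main__":
    builds = ["N-2", "N-1", "N-1-stable", "N"]  # illustrative build labels
    print(compatible_by_machine_type("pc-i440fx-5.0", "pc-i440fx-5.0"))
    print(len(version_aware_cases(["pc-i440fx-5.0"], builds)))
```

The first check stays constant in size however many builds exist; the
second grows with the square of the number of builds per machine type.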

*IF* the original N-1 qemu has a property that could be queried by
the mgmt app to identify a machine type bug, then we could potentially
apply a fixup automatically.

eg query-machines command in QEMU version N could report against
"pc-i440fx-5.0", that there was a regression fix that has to be
applied if property "foo" had value "bar".

Now, the mgmt app wants to migrate from QEMU N-2 or N-1 to QEMU N.
It can query the value of "foo" on the source QEMU with qom-get.
It now knows whether it has to override this property "foo" when
spawning QEMU N on the target host.

Of course this doesn't help us if neither N-1 nor N-2 QEMU had a
property that can be queried to identify the bug - ie if the
property in question was newly introduced in QEMU N to fix the
bug.
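
A minimal sketch of what the mgmt app logic might look like. Note that
query-machines does not report any such fixup data today; the REPORTED
structure, its field names, the "/machine" path and the helper functions
are all assumptions for illustration. The qom_get callable stands in for
issuing the real qom-get QMP command (which takes a path and a property)
against the source QEMU.

```python
def overrides_for_target(machine_type, reported_fixups, qom_get):
    """Decide which properties to pin when spawning QEMU N on the target.

    reported_fixups: hypothetical data QEMU N could report per machine type.
    qom_get: callable(path, property) returning the value on the source
             QEMU, raising LookupError if the property does not exist there.
    Returns (overrides, undetermined property names).
    """
    overrides = {}
    undetermined = []
    for fixup in reported_fixups.get(machine_type, []):
        try:
            value = qom_get(fixup["path"], fixup["property"])
        except LookupError:
            # The source QEMU predates the property, so the bug cannot be
            # identified this way.
            undetermined.append(fixup["property"])
            continue
        if value == fixup["if_value"]:
            # Preserve the value the guest is actually running with.
            overrides[fixup["property"]] = value
    return overrides, undetermined


# Hypothetical fixup report for one machine type.
REPORTED = {
    "pc-i440fx-5.0": [
        {"path": "/machine", "property": "foo", "if_value": "bar"},
    ],
}


def fake_qom_get(state):
    """Build a stand-in for qom-get backed by a dict (for the demo only)."""
    def qom_get(path, prop):
        return state[(path, prop)]  # KeyError is a LookupError
    return qom_get


if __name__ == "__main__":
    source = fake_qom_get({("/machine", "foo"): "bar"})
    print(overrides_for_target("pc-i440fx-5.0", REPORTED, source))
```

The undetermined list covers the case where the source QEMU has no such
property to query.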

> In my opinion both would lead to explosion of 'possibly needed' properties
> for each change we introduce in hw/firmware (read ACPI) and very possibly a
> lot of conditional branches in QEMU code. And I'm afraid it will become hard
> to maintain QEMU => more bugs in future.
> Also it will lead to explosion of test matrix for downstreams who care about
> testing.
> 
> If we proactively gate changes on properties, we can just update fixup lists
> in mgmt, without need to update QEMU (aka Insite rules) at a cost of
> complexity on QEMU side.
> 
> Alternatively we can be conservative in spawning new properties, that means
> creating them only when issue is fixed and require users to update QEMU, so
> that fixups could be applied to VM.
> 
> Feel free to shoot the messenger down or suggest ways how we can deal with 
> the problem.

The best solution is of course to not have introduced the ABI change in
the first place. We have lots of testing, but upstream at least, I don't
think we have anything that is explicitly recording the ABI associated
with each machine type and validating that it hasn't changed. We rely on
the developers to follow the coding practices wrt setting machine type
defaults for back compat, and while we're good, we inevitably screw up
every now & then.

Downstreams do have some of this ABI testing - several problems like the
one we have here have been identified when RHEL downstream QE did
migration tests and found a change in RHEL machine types, which was
then traced back to upstream.

I feel like we need some standard tool which can be run inside a VM
that dumps all the possible ABI relevant information about the virtual
machine in a nice data format.

We would have to run this for each machine type, and save the
results to git immediately after release. Then for every change to
master, we would have to run the test again for every historic
machine type version and compare to the recorded ABI record.
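
As a rough sketch of the comparison step, assuming the in-guest tool emits
JSON: no such tool or record format exists yet, so the field names in the
sample data below are invented.

```python
import json


def diff_abi(recorded, current, prefix=""):
    """Return human-readable differences between two nested ABI records."""
    diffs = []
    for key in sorted(set(recorded) | set(current)):
        path = f"{prefix}{key}"
        if key not in current:
            diffs.append(f"{path}: missing (was {recorded[key]!r})")
        elif key not in recorded:
            diffs.append(f"{path}: added ({current[key]!r})")
        elif isinstance(recorded[key], dict) and isinstance(current[key], dict):
            # Recurse so the report names the exact field that changed.
            diffs.extend(diff_abi(recorded[key], current[key], path + "."))
        elif recorded[key] != current[key]:
            diffs.append(f"{path}: {recorded[key]!r} -> {current[key]!r}")
    return diffs


if __name__ == "__main__":
    # Invented sample records: one saved to git at release time, one
    # dumped from a build of current master for the same machine type.
    recorded = json.loads('{"pci": {"slot3": "e1000"}, "acpi_rev": 1}')
    current = json.loads('{"pci": {"slot3": "e1000"}, "acpi_rev": 2}')
    for line in diff_abi(recorded, current):
        print(line)
```

An empty result would mean the machine type still matches its recorded
ABI; anything else would fail the test run.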

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



