
Re: [PATCH 6/6] spapr: Model DR connectors as simple objects


From: David Gibson
Subject: Re: [PATCH 6/6] spapr: Model DR connectors as simple objects
Date: Mon, 8 Feb 2021 17:30:23 +1100

On Wed, Jan 06, 2021 at 07:15:36PM +0100, Greg Kurz wrote:
> On Mon, 28 Dec 2020 19:28:39 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Fri, Dec 18, 2020 at 11:34:00AM +0100, Greg Kurz wrote:
> > > Modeling DR connectors as individual devices raises some
> > > concerns, as already discussed a year ago in this thread:
> > > 
> > > https://patchew.org/QEMU/20191017205953.13122-1-cheloha@linux.vnet.ibm.com/
> > > 
> > > First, high maxmem settings create too many DRC devices,
> > > which causes scalability issues. Boot time increases severely
> > > because the multiple traversals of the DRC list performed
> > > during machine setup are quadratic operations. This is a
> > > direct consequence of DRCs being modeled as individual
> > > devices and added to the composition tree.
> > > 
> > > Second, DR connectors are really an internal concept of
> > > PAPR. They aren't something that the user or management
> > > layer can manipulate in any way. We already don't allow
> > > their creation with device_add by clearing user_creatable.
> > > 
> > > DR connectors don't even need to be modeled as actual
> > > devices since they don't sit on a bus. They just need
> > > to be associated with an 'owner' object and to have the
> > > equivalent of realize/unrealize functions.
> > > 
> > > Downgrade them to simple objects. Convert the existing
> > > realize() and unrealize() functions into methods of the DR
> > > connector base class. Also have the base class inherit from
> > > the vmstate_if interface directly. The get_id() hook simply
> > > returns NULL, just as device_vmstate_if_get_id() does for
> > > devices that don't sit on a bus. The DR connector is no
> > > longer made a child object, which means it must be explicitly
> > > freed when no longer needed. This is actually only required
> > > for PHBs and PCI bridges: have them free the DRC with
> > > spapr_dr_connector_free() instead of object_unparent().
> > > 
> > > No longer add the DRCs to the QOM composition tree. Track
> > > them with a glib hash table using the global DRC index as
> > > the key instead. This makes traversal a linear operation.
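For reference, a minimal sketch of the shape the commit message
describes; everything beyond the names it gives (field layout, helper
bodies, the hash table setup) is an illustrative guess, not the actual
diff:

static char *spapr_drc_vmstate_if_get_id(VMStateIf *obj)
{
    /* same answer device_vmstate_if_get_id() gives for bus-less devices */
    return NULL;
}

static void spapr_dr_connector_class_init(ObjectClass *klass, void *data)
{
    VMStateIfClass *vc = VMSTATE_IF_CLASS(klass);

    vc->get_id = spapr_drc_vmstate_if_get_id;
}

static const TypeInfo spapr_dr_connector_info = {
    .name          = TYPE_SPAPR_DR_CONNECTOR,
    .parent        = TYPE_OBJECT,           /* downgraded from TYPE_DEVICE */
    .instance_size = sizeof(SpaprDrc),
    .class_init    = spapr_dr_connector_class_init,
    .interfaces    = (InterfaceInfo[]) {
        { TYPE_VMSTATE_IF },
        { }
    },
};

/* global index -> DRC, replacing traversals of the composition tree */
static GHashTable *drc_table;  /* g_hash_table_new(NULL, NULL) at machine init */

SpaprDrc *spapr_drc_by_index(uint32_t index)
{
    return g_hash_table_lookup(drc_table, GUINT_TO_POINTER(index));
}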
> > 
> > I have some reservations about this one.  The main thing is that
> > attaching migration state to something that's not a device seems a bit
> > odd to me.  AFAICT exactly one other non-device implements
> > TYPE_VMSTATE_IF, and what it does isn't very clear to me.
> > 
> 
> Even with your proposal below, the current SpaprDrc type, which is
> used all over the place, will stop being a TYPE_DEVICE, but we still
> need to support migration with existing machine types for which DRCs
> are devices.

Ah... that's a good point.

> Implementing TYPE_VMSTATE_IF is essentially a hack that
> allows us to do that without keeping the current TYPE_DEVICE
> based implementation around.

Ok, that makes things clearer.

> > As I might have mentioned to you I had a different idea for how to
> > address this problem: still use a TYPE_DEVICE, but have it manage a
> > whole array of DRCs as one unit, rather than just a single one.
> > Specifically I was thinking:
> > 
> > * one array per PCI bus (DRCs for each function on the bus)
> > * one array for each block of memory (so one for base memory, one for
> >   each DIMM)
> > * one array for all the cpus
> > * one array for all the PHBs
> > 
> > It has some disadvantages compared to your scheme: it still leaves
> > (fewer) devices which can't be user managed, which is a bit ugly.  On
> > the other hand, each of those arrays can reasonably be dense, so we
> > can use direct indexing rather than a hash table, which is a bit
> > nicer.
> > 
> > Thoughts?
> > 
> 
> I find it a bit overkill to introduce a new TYPE_DEVICE (let's
> call it a DRC manager) for something that:
> - doesn't sit on a bus
> - can't be user managed
> - unlike all other devices, isn't directly represented to the
>   guest as a full node in the DT, but only as indexes in some
>   properties of the actual DR-capable devices.
> 
> Given that the DRC index space is global and this is what
> the guest passes to DR RTAS calls, we can't do direct
> indexing, strictly speaking. We need at least some logic
> to dispatch operations on individual DRC states to the
> appropriate DRC manager. This logic belongs in the machine
> IMHO.
> 
> This shouldn't be too complex for CPUs and PHBs since they
> sit directly under the machine and have a 1:1 relation with
> the attached device. It just boils down to instantiating
> some DRC managers during machine init (a dispatch sketch
> follows the ranges below):
> 
> - range [ 0x10000000 ... 0x10000000 + ${max_cpus} )
>   for CPUs
> - range [ 0x20000000 ... 0x20000000 + 31 )
>   for PHBs
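A sketch of what that dispatch could look like, relying on the
existing (type << DRC_INDEX_TYPE_SHIFT) | id encoding of global DRC
indexes; the SpaprDrcManager type and the *_drc_mgr machine fields
are hypothetical:

static SpaprDrcManager *spapr_drc_manager_for_index(SpaprMachineState *spapr,
                                                    uint32_t index)
{
    uint32_t id = index & DRC_INDEX_ID_MASK;

    switch (index >> DRC_INDEX_TYPE_SHIFT) {
    case SPAPR_DR_CONNECTOR_TYPE_SHIFT_CPU:
        return id < MACHINE(spapr)->smp.max_cpus ? spapr->cpu_drc_mgr : NULL;
    case SPAPR_DR_CONNECTOR_TYPE_SHIFT_PHB:
        return id < 31 ? spapr->phb_drc_mgr : NULL;  /* per the range above */
    case SPAPR_DR_CONNECTOR_TYPE_SHIFT_LMB:
        return spapr->lmb_drc_mgr;
    case SPAPR_DR_CONNECTOR_TYPE_SHIFT_PCI:
        /* PHB internals: needs the registration scheme discussed below */
        return spapr_pci_drc_manager_lookup(spapr, id);
    default:
        return NULL;
    }
}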
> 
> For memory, the code currently generates DRC indexes in the range:
> 
> [ 0x80000000 + ${base_ram_size}/256M ... 0x80000000 + ${max_ram_size}/256M )
> 
> ie. it doesn't generate DRC indexes for the base memory AFAICT. Also,
> each DIMM can be of arbitrary size, ie. consume an arbitrary number
> of DRC indexes. So the machine would instantiate SPAPR_MAX_RAM_SLOTS (32)
> DRC managers, each capable of managing the full set of LMB DRCs, just
> in case? Likely a lot of zeroes with high maxmem settings, but I guess
> we can live with it.

Actually, I was thinking of just a single manager for all the
(pluggable) LMB DRCs, a single manager for all CPU DRCs, a single
manager for all PHB DRCs, and one per bus for PCI DRCs.  I'm not
assuming a 1:1 correspondence between managers and user-side hotplug
operations.

Although... actually the "manager" could be an interface rather than
an object, in which case the DRC manager would be the machine itself
for LMBs, CPUs, and PHBs, and the parent bus for each PCI slot.
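Roughly, with entirely hypothetical naming, the interface variant
could look like this; the machine would implement it for CPU, LMB
and PHB DRCs, and each PCI bus (or its PHB) for its slots:

#define TYPE_SPAPR_DRC_MANAGER "spapr-drc-manager"

typedef struct SpaprDrcManagerClass {
    InterfaceClass parent_class;

    /* map a type-local DRC id to the DRC state this manager owns */
    SpaprDrc *(*get_drc)(Object *obj, uint32_t id);
} SpaprDrcManagerClass;

static const TypeInfo spapr_drc_manager_info = {
    .name       = TYPE_SPAPR_DRC_MANAGER,
    .parent     = TYPE_INTERFACE,
    .class_size = sizeof(SpaprDrcManagerClass),
};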

> PCI buses would need some extra care though, since the machine
> doesn't know about them. This would require being able to
> register/unregister DRC managers for SPAPR_DR_CONNECTOR_TYPE_PCI
> indexes, so that the dispatching logic knows about the ranges
> they cover (PHB internals).

Right, but that wouldn't really be any different from the dynamic
creation of DRCs we do in add_drcs() / remove_drcs() right now, except
that it would create/destroy one object instead of a bunch.

> And finally comes migration: I cannot think of a way to generate
> the VMState sections used by existing machine types out of a set
> of arrays of integers... We could keep the current implementation
> around and use it with older machine types, but that certainly
> looks terrible from a maintenance perspective. Did you have any
> suggestion for handling that?

Ugh, yeah.. that could be difficult.

> I seem to remember that one of the motivations for having arrays
> of DRCs is to avoid the inflation of VMState sections that
> we currently get with high maxmem settings, and that it was
> considered preferable to stream sparse arrays. This could be
> achieved by building these arrays out of the global DRC hash table
> in a machine pre-save handler and migrating them in a subsection
> for the default machine type. Older machine types would continue
> with the current VMState sections thanks to the TYPE_VMSTATE_IF hack.
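A sketch of what that pre-save could look like, assuming hypothetical
drc_states[] / nr_drc_states fields on the machine state and reusing
today's per-DRC "needs migrating" test (spapr_drc_transient()):

static int spapr_drc_states_pre_save(void *opaque)
{
    SpaprMachineState *spapr = opaque;
    GHashTableIter iter;
    gpointer value;
    uint32_t n = 0;

    g_hash_table_iter_init(&iter, drc_table);
    while (g_hash_table_iter_next(&iter, NULL, &value)) {
        SpaprDrc *drc = value;

        if (spapr_drc_transient(drc)) {  /* skip DRCs in their default state */
            spapr->drc_states[n].index = spapr_drc_index(drc);
            spapr->drc_states[n].state = drc->state;
            n++;
        }
    }
    spapr->nr_drc_states = n;
    return 0;
}

static bool spapr_drc_states_needed(void *opaque)
{
    return true;   /* would be gated on the machine class for older types */
}

static const VMStateDescription vmstate_spapr_drc_states = {
    .name = "spapr/drc_states",
    .version_id = 1,
    .minimum_version_id = 1,
    .pre_save = spapr_drc_states_pre_save,
    .needed = spapr_drc_states_needed,
    .fields = (VMStateField[]) {
        VMSTATE_UINT32(nr_drc_states, SpaprMachineState),
        /* plus a VARRAY of (index, state) pairs keyed on nr_drc_states */
        VMSTATE_END_OF_LIST()
    },
};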
> 
> Does this seem like a reasonable trade-off to be able to support
> older and newer machine types with the same implementation?

Hrm, maybe, yeah.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
