[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-ppc] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridg
From: |
David Gibson |
Subject: |
Re: [Qemu-ppc] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type |
Date: |
Fri, 7 Oct 2016 20:17:54 +1100 |
User-agent: |
Mutt/1.7.0 (2016-08-17) |
On Fri, Oct 07, 2016 at 04:34:59PM +1100, Alexey Kardashevskiy wrote:
> On 07/10/16 16:10, David Gibson wrote:
> > On Fri, Oct 07, 2016 at 02:57:43PM +1100, Alexey Kardashevskiy wrote:
> >> On 06/10/16 14:03, David Gibson wrote:
> >>> The 'spapr-pci-host-bridge' represents the virtual PCI host bridge (PHB)
> >>> for a PAPR guest. Unlike on x86, it's routine on Power (both bare metal
> >>> and PAPR guests) to have numerous independent PHBs, each controlling a
> >>> separate PCI domain.
> >>>
> >>> There are two ways of configuring the spapr-pci-host-bridge device: first
> >>> it can be done fully manually, specifying the locations and sizes of all
> >>> the IO windows. This gives the most control, but is very awkward with 6
> >>> mandatory parameters. Alternatively just an "index" can be specified
> >>> which essentially selects from an array of predefined PHB locations.
> >>> The PHB at index 0 is automatically created as the default PHB.
> >>>
> >>> The current set of default locations causes some problems for guests with
> >>> large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big nVidia
> >>> GPGPU cards via VFIO). Obviously, for migration we can only change the
> >>> locations on a new machine type, however.
> >>>
> >>> This is awkward, because the placement is currently decided within the
> >>> spapr-pci-host-bridge code, so it breaks abstraction to look inside the
> >>> machine type version.
> >>>
> >>> So, this patch delegates the "default mode" PHB placement from the
> >>> spapr-pci-host-bridge device back to the machine type via a public method
> >>> in sPAPRMachineClass. It's still a bit ugly, but it's about the best we
> >>> can do.
> >>>
> >>> For now, this just changes where the calculation is done. It doesn't
> >>> change the actual location of the host bridges, or any other behaviour.
> >>>
> >>> Signed-off-by: David Gibson <address@hidden>
> >>> ---
> >>> hw/ppc/spapr.c | 34 ++++++++++++++++++++++++++++++++++
> >>> hw/ppc/spapr_pci.c | 22 ++++++++--------------
> >>> include/hw/pci-host/spapr.h | 11 +----------
> >>> include/hw/ppc/spapr.h | 4 ++++
> >>> 4 files changed, 47 insertions(+), 24 deletions(-)
> >>>
> >>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>> index 03e3803..f6e9c2a 100644
> >>> --- a/hw/ppc/spapr.c
> >>> +++ b/hw/ppc/spapr.c
> >>> @@ -2370,6 +2370,39 @@ static HotpluggableCPUList
> >>> *spapr_query_hotpluggable_cpus(MachineState *machine)
> >>> return head;
> >>> }
> >>>
> >>> +static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
> >>> + uint64_t *buid, hwaddr *pio, hwaddr
> >>> *pio_size,
> >>> + hwaddr *mmio, hwaddr *mmio_size,
> >>> + unsigned n_dma, uint32_t *liobns, Error
> >>> **errp)
> >>> +{
> >>> + const uint64_t base_buid = 0x800000020000000ULL;
> >>> + const hwaddr phb0_base = 0x10000000000ULL; /* 1 TiB */
> >>> + const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
> >>> + const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
> >>> + const hwaddr pio_offset = 0x80000000; /* 2 GiB */
> >>> + const uint32_t max_index = 255;
> >>> +
> >>> + hwaddr phb_base;
> >>> + int i;
> >>> +
> >>> + if (index > max_index) {
> >>> + error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> >>> + max_index);
> >>> + return;
> >>> + }
> >>> +
> >>> + *buid = base_buid + index;
> >>> + for (i = 0; i < n_dma; ++i) {
> >>> + liobns[i] = SPAPR_PCI_LIOBN(index, i);
> >>> + }
> >>> +
> >>> + phb_base = phb0_base + index * phb_spacing;
> >>> + *pio = phb_base + pio_offset;
> >>> + *pio_size = SPAPR_PCI_IO_WIN_SIZE;
> >>> + *mmio = phb_base + mmio_offset;
> >>> + *mmio_size = SPAPR_PCI_MMIO_WIN_SIZE;
> >>> +}
> >>> +
> >>> static void spapr_machine_class_init(ObjectClass *oc, void *data)
> >>> {
> >>> MachineClass *mc = MACHINE_CLASS(oc);
> >>> @@ -2406,6 +2439,7 @@ static void spapr_machine_class_init(ObjectClass
> >>> *oc, void *data)
> >>> mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
> >>> fwc->get_dev_path = spapr_get_fw_dev_path;
> >>> nc->nmi_monitor_handler = spapr_nmi;
> >>> + smc->phb_placement = spapr_phb_placement;
> >>> }
> >>>
> >>> static const TypeInfo spapr_machine_info = {
> >>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >>> index 4f00865..c0fc964 100644
> >>> --- a/hw/ppc/spapr_pci.c
> >>> +++ b/hw/ppc/spapr_pci.c
> >>> @@ -1311,7 +1311,8 @@ static void spapr_phb_realize(DeviceState *dev,
> >>> Error **errp)
> >>> sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
> >>>
> >>> if (sphb->index != (uint32_t)-1) {
> >>> - hwaddr windows_base;
> >>> + sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> >>> + Error *local_err = NULL;
> >>>
> >>> if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] !=
> >>> (uint32_t)-1)
> >>> || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported
> >>> == 2)
> >>> @@ -1322,21 +1323,14 @@ static void spapr_phb_realize(DeviceState *dev,
> >>> Error **errp)
> >>> return;
> >>> }
> >>>
> >>> - if (sphb->index > SPAPR_PCI_MAX_INDEX) {
> >>> - error_setg(errp, "\"index\" for PAPR PHB is too large (max
> >>> %u)",
> >>> - SPAPR_PCI_MAX_INDEX);
> >>> + smc->phb_placement(spapr, sphb->index,
> >>> + &sphb->buid, &sphb->io_win_addr,
> >>> &sphb->io_win_size,
> >>> + &sphb->mem_win_addr, &sphb->mem_win_size,
> >>> + windows_supported, sphb->dma_liobn,
> >>> &local_err);
> >>> + if (local_err) {
> >>> + error_propagate(errp, local_err);
> >>> return;
> >>> }
> >>> -
> >>> - sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
> >>> - for (i = 0; i < windows_supported; ++i) {
> >>> - sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
> >>> - }
> >>> -
> >>> - windows_base = SPAPR_PCI_WINDOW_BASE
> >>> - + sphb->index * SPAPR_PCI_WINDOW_SPACING;
> >>> - sphb->mem_win_addr = windows_base + SPAPR_PCI_MMIO_WIN_OFF;
> >>> - sphb->io_win_addr = windows_base + SPAPR_PCI_IO_WIN_OFF;
> >>> }
> >>>
> >>> if (sphb->buid == (uint64_t)-1) {
> >>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> >>> index 30dbd46..8c9ebfd 100644
> >>> --- a/include/hw/pci-host/spapr.h
> >>> +++ b/include/hw/pci-host/spapr.h
> >>> @@ -79,18 +79,9 @@ struct sPAPRPHBState {
> >>> uint32_t numa_node;
> >>> };
> >>>
> >>> -#define SPAPR_PCI_MAX_INDEX 255
> >>> -
> >>> -#define SPAPR_PCI_BASE_BUID 0x800000020000000ULL
> >>> -
> >>> #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
> >>>
> >>> -#define SPAPR_PCI_WINDOW_BASE 0x10000000000ULL
> >>> -#define SPAPR_PCI_WINDOW_SPACING 0x1000000000ULL
> >>> -#define SPAPR_PCI_MMIO_WIN_OFF 0xA0000000
> >>> -#define SPAPR_PCI_MMIO_WIN_SIZE (SPAPR_PCI_WINDOW_SPACING - \
> >>> - SPAPR_PCI_MEM_WIN_BUS_OFFSET)
> >>> -#define SPAPR_PCI_IO_WIN_OFF 0x80000000
> >>> +#define SPAPR_PCI_MMIO_WIN_SIZE 0xf80000000
> >>> #define SPAPR_PCI_IO_WIN_SIZE 0x10000
> >>>
> >>> #define SPAPR_PCI_MSI_WINDOW 0x40000000000ULL
> >>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >>> index 39dadaa..9e1c5a5 100644
> >>> --- a/include/hw/ppc/spapr.h
> >>> +++ b/include/hw/ppc/spapr.h
> >>> @@ -40,6 +40,10 @@ struct sPAPRMachineClass {
> >>> bool dr_lmb_enabled; /* enable dynamic-reconfig/hotplug of
> >>> LMBs */
> >>> bool use_ohci_by_default; /* use USB-OHCI instead of XHCI */
> >>> const char *tcg_default_cpu; /* which (TCG) CPU to simulate by
> >>> default */
> >>> + void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
> >>> + uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> >>> + hwaddr *mmio, hwaddr *mmio_size,
> >>> + unsigned n_dma, uint32_t *liobns, Error
> >>> **errp);
> >>
> >>
> >> I still like idea of getting rid of index better. In order to reduce the
> >> command line length, you could not enable IO and MMIO32 by default, the
> >> default size for MMIO64 could be set to 1TB or something like this, BUID
> >> could be just a raw MMIO64 address value.
> >
> > Hm, interesting idea. I do like the idea of making the BUID equal to
> > the MMIO64 address. Obviously we'd still need addresses for the
> > default PHB, including 32-bit and PIO windows.
> >
> > Trickier, we'd still need to support 'index' for compatibility with
> > older command lines. I guess we could reserve index purely for the
> > "legacy" locations - 2.8+ machine types would create the default
> > bridge with explicit windows, rather than just using index 0.
> >
> >> So essentially you would only have to specify MMIO64 and 2xLIOBN in the
> >> command line which does not seem that much of overkill assuming that a
> >> second (third,...) PHB is almost never used and even if it is, it will be
> >> used for SRIOV VF (which can do 64bit) or virtio (the same).
> >
> > Hm, maybe. 3 values is still a bit of a PITA. I guess liobn64 could
> > theoretically be optional, although you're going to want it for
> > basically every case where you're creating an additional PHB.
> >
> > Or.. what if we insisted the mmio64 window had at least 4G alignment,
> > then we might be able to use mmio64 >> 32 (== buid >> 32) as the
> > liobn. Hmm.. guess it would need to be 8G alignment, so we have one
> > extra bit.
>
>
> As you mentioned in 4/4, guests cannot access above 64 TiB, we could use
> upper bits to store the window number.
*Some* guests can't access above 64 TiB, that's not an inherent
limitation. So I'd rather not do that.
>
>
> >
> > ..and I wonder if this might be easier to manage if we introduced a
> > new spapr-pci-host-bridge-2 subtype.
> >
> > Let me think about this..
> >
> >> With index and phb_placement(), QEMU will still auto-generate these IO/MMIO
> >> addresses from the command line parameters which did hit us with migration
> >> before as the QEMU command line on the source side and the destination side
> >> is not exactly the same, and this might hit again...
> >
> > Uh.. I'm not quite following you. What's the situation which would
> > break migration with the proposed scheme?
>
> Well, I cannot think of any other device available on SPAPR which we could
> map to the CPU address space but if it ever appears, we will have a problem
> unless this new imaginary device does not pick an address from the CPU
> space automatically.
I followed that even less that the previous one, I can't even parse that.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
signature.asc
Description: PGP signature
[Qemu-ppc] [RFC 3/4] spapr_pci: Add a 64-bit MMIO window, David Gibson, 2016/10/05
[Qemu-ppc] [RFC 4/4] spapr: Improved placement of PCI host bridges in guest memory map, David Gibson, 2016/10/05
[Qemu-ppc] [RFC 2/4] spapr: Adjust placement of PCI host bridge to allow > 1TiB RAM, David Gibson, 2016/10/05
Re: [Qemu-ppc] [Qemu-devel] [RFC 0/4] Improve PCI IO window orgnaization for pseries, no-reply, 2016/10/10