Re: [PATCH v3 3/3] spapr: Add Hcalls to support PAPR NVDIMM device

qemu-ppc
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH v3 3/3] spapr: Add Hcalls to support PAPR NVDIMM device

From:	David Gibson
Subject:	Re: [PATCH v3 3/3] spapr: Add Hcalls to support PAPR NVDIMM device
Date:	Fri, 22 Nov 2019 16:11:21 +1100
User-agent:	Mutt/1.12.1 (2019-06-15)
On Mon, Oct 14, 2019 at 01:38:16PM -0500, Shivaprasad G Bhat wrote:
> This patch implements few of the necessary hcalls for the nvdimm support.
> 
> PAPR semantics is such that each NVDIMM device is comprising of multiple
> SCM(Storage Class Memory) blocks. The guest requests the hypervisor to
> bind each of the SCM blocks of the NVDIMM device using hcalls. There can
> be SCM block unbind requests in case of driver errors or unplug(not
> supported now) use cases. The NVDIMM label read/writes are done through
> hcalls.
> 
> Since each virtual NVDIMM device is divided into multiple SCM blocks,
> the bind, unbind, and queries using hcalls on those blocks can come
> independently. This doesn't fit well into the qemu device semantics,
> where the map/unmap are done at the (whole)device/object level granularity.
> The patch doesnt actually bind/unbind on hcalls but let it happen at the
> device_add/del phase itself instead.
> 
> The guest kernel makes bind/unbind requests for the virtual NVDIMM device
> at the region level granularity. Without interleaving, each virtual NVDIMM

It's not clear to me what a "region" means in this context.

> device is presented as separate region. There is no way to configure the
> virtual NVDIMM interleaving for the guests today. So, there is no way a
> partial bind/unbind request can come for the vNVDIMM in a hcall for a
> subset of SCM blocks of a virtual NVDIMM. Hence it is safe to do
> bind/unbind everything during the device_add/del.
> 
> Signed-off-by: Shivaprasad G Bhat <address@hidden>
> ---
> ---
>  hw/ppc/spapr_hcall.c   |  300 
> ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/hw/ppc/spapr.h |    8 +
>  2 files changed, 307 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 23e4bdb829..4e9ad96f7c 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c

This is large enough and sufficiently non-core that I'd suggest
putting it in its own file.

> @@ -18,6 +18,10 @@
>  #include "hw/ppc/spapr_ovec.h"
>  #include "mmu-book3s-v3.h"
>  #include "hw/mem/memory-device.h"
> +#include "hw/ppc/spapr_drc.h"
> +#include "hw/mem/nvdimm.h"
> +#include "qemu/range.h"
> +#include "qemu/nvdimm-utils.h"
>  
>  static bool has_spr(PowerPCCPU *cpu, int spr)
>  {
> @@ -1961,6 +1965,295 @@ static target_ulong h_update_dt(PowerPCCPU *cpu, 
> SpaprMachineState *spapr,
>      return H_SUCCESS;
>  }
>  
> +static target_ulong h_scm_read_metadata(PowerPCCPU *cpu,
> +                                        SpaprMachineState *spapr,
> +                                        target_ulong opcode,
> +                                        target_ulong *args)
> +{
> +    uint32_t drc_index = args[0];
> +    uint64_t offset = args[1];
> +    uint64_t numBytesToRead = args[2];
> +    SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +    NVDIMMDevice *nvdimm;
> +    NVDIMMClass *ddc;
> +    __be64 data_be = 0;
> +    uint64_t data = 0;
> +
> +    if (drc && spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {

I'm pretty sure you want if (!drc || ... ) here.  Otherwise if the
given index is totally bogus you'll crash a few lines down.

> +        return H_PARAMETER;
> +    }
> +
> +    if (numBytesToRead != 1 && numBytesToRead != 2 &&
> +        numBytesToRead != 4 && numBytesToRead != 8) {
> +        return H_P3;
> +    }
> +
> +    nvdimm = NVDIMM(drc->dev);

What if the drc slot is not plugged at present (so drc->dev is NULL)?

> +    if ((offset + numBytesToRead < offset) ||
> +        (nvdimm->label_size < numBytesToRead + offset)) {
> +        return H_P2;
> +    }
> +
> +    ddc = NVDIMM_GET_CLASS(nvdimm);
> +    ddc->read_label_data(nvdimm, &data_be, numBytesToRead, offset);
> +
> +    switch (numBytesToRead) {
> +    case 1:
> +        data = data_be & 0xff;

The read_label_data above is only filling in the first byte of
data_be.  That only corresponds to data_be & 0xff if the host is
little endian which you shouldn't rely on.

Also, I'm not sure in what sense "data_be" is big-endian as the name
suggests.  AFAICT the label data is a byte-addressable array, which
means it doesn't really have an endianness.

I think what you're trying to do here is put the 0th byte of label
into the LSB of the return value ... the 7th byte of label data into
the MSB of the return value.

From the point of view of the 64-bit value you're returning in a
register, that makes the label data effectively *LE* (LSB first), not
BE.  Since you're initializing data_be to 0, you can just load into it
and le64_to_cpu() without reference to numBytesToRead to accomplish
this.

> +        break;
> +    case 2:
> +        data = be16_to_cpu(data_be & 0xffff);
> +        break;
> +    case 4:
> +        data = be32_to_cpu(data_be & 0xffffffff);
> +        break;
> +    case 8:
> +        data = be64_to_cpu(data_be);
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    args[0] = data;
> +
> +    return H_SUCCESS;
> +}
> +
> +static target_ulong h_scm_write_metadata(PowerPCCPU *cpu,
> +                                         SpaprMachineState *spapr,
> +                                         target_ulong opcode,
> +                                         target_ulong *args)
> +{
> +    uint32_t drc_index = args[0];
> +    uint64_t offset = args[1];
> +    uint64_t data = args[2];
> +    uint64_t numBytesToWrite = args[3];
> +    SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +    NVDIMMDevice *nvdimm;
> +    DeviceState *dev;
> +    NVDIMMClass *ddc;
> +    __be64 data_be = 0;
> +
> +    if (drc && spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {

Same error for !drc as noted above.

> +        return H_PARAMETER;
> +    }
> +
> +    if (numBytesToWrite != 1 && numBytesToWrite != 2 &&
> +        numBytesToWrite != 4 && numBytesToWrite != 8) {
> +        return H_P4;
> +    }
> +
> +    dev = drc->dev;
> +    nvdimm = NVDIMM(dev);
> +
> +    switch (numBytesToWrite) {
> +    case 1:
> +        if (data & 0xffffffffffffff00) {
> +            return H_P2;
> +        }
> +        data_be = data & 0xff;
> +        break;
> +    case 2:
> +        if (data & 0xffffffffffff0000) {
> +            return H_P2;
> +        }
> +        data_be = cpu_to_be16(data & 0xffff);
> +        break;
> +    case 4:
> +        if (data & 0xffffffff00000000) {
> +            return H_P2;
> +        }
> +        data_be = cpu_to_be32(data & 0xffffffff);
> +        break;
> +    case 8:
> +        data_be = cpu_to_be64(data);
> +        break;
> +    default: /* lint */
> +            break;
> +    }

Euch, you can definitely find a less repetitive way to make these
checks than a big switch.  Also in mirror to the read function, AFAICT
what you actually want here is a cpu_to_le64().

> +
> +    ddc = NVDIMM_GET_CLASS(nvdimm);
> +    ddc->write_label_data(nvdimm, &data_be, numBytesToWrite, offset);
> +
> +    return H_SUCCESS;
> +}
> +
> +static target_ulong h_scm_bind_mem(PowerPCCPU *cpu, SpaprMachineState *spapr,
> +                                   target_ulong opcode, target_ulong *args)
> +{
> +    uint32_t drc_index = args[0];
> +    uint64_t starting_idx = args[1];
> +    uint64_t no_of_scm_blocks_to_bind = args[2];
> +    uint64_t target_logical_mem_addr = args[3];
> +    uint64_t continue_token = args[4];
> +    uint64_t size;
> +    uint64_t total_no_of_scm_blocks;
> +    SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +    hwaddr addr;
> +    DeviceState *dev;
> +    PCDIMMDevice *dimm;
> +    Error *local_err = NULL;
> +
> +    if (drc && spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {

Check !drc case again.

> +        return H_PARAMETER;
> +    }
> +
> +    dev = drc->dev;

And the !drc->dev case.

> +    dimm = PC_DIMM(dev);
> +
> +    size = object_property_get_uint(OBJECT(dimm),
> +                                    PC_DIMM_SIZE_PROP, &local_err);
> +    if (local_err) {
> +        error_report_err(local_err);
> +        return H_PARAMETER;
> +    }
> +
> +    total_no_of_scm_blocks = size / SPAPR_MINIMUM_SCM_BLOCK_SIZE;
> +
> +    if ((starting_idx > total_no_of_scm_blocks) ||
> +        (no_of_scm_blocks_to_bind > total_no_of_scm_blocks)) {

I think the second check is redundant with the check below.

> +        return H_P2;
> +    }
> +
> +    if (((starting_idx + no_of_scm_blocks_to_bind) < starting_idx) ||
> +        ((starting_idx + no_of_scm_blocks_to_bind) > 
> total_no_of_scm_blocks)) {
> +        return H_P3;
> +    }
> +
> +    /* Currently qemu assigns the address. */
> +    if (target_logical_mem_addr != 0xffffffffffffffff) {
> +        return H_OVERLAP;
> +    }
> +
> +    /*
> +     * Currently continue token should be zero qemu has already bound
> +     * everything and this hcall doesnt return H_BUSY.
> +     */
> +    if (continue_token > 0) {
> +        return H_P5;
> +    }
> +
> +    addr = object_property_get_uint(OBJECT(dimm),
> +                                    PC_DIMM_ADDR_PROP, &local_err);
> +    if (local_err) {
> +        error_report_err(local_err);
> +        return H_PARAMETER;
> +    }
> +
> +    addr += starting_idx * SPAPR_MINIMUM_SCM_BLOCK_SIZE;
> +
> +    /* Already bound, Return target logical address in R4 */
> +    args[1] = addr;
> +    args[2] = no_of_scm_blocks_to_bind;
> +
> +    return H_SUCCESS;
> +}
> +
> +static target_ulong h_scm_unbind_mem(PowerPCCPU *cpu, SpaprMachineState 
> *spapr,
> +                                     target_ulong opcode, target_ulong *args)
> +{
> +    uint32_t drc_index = args[0];
> +    uint64_t starting_scm_logical_addr = args[1];
> +    uint64_t no_of_scm_blocks_to_unbind = args[2];
> +    uint64_t continue_token = args[3];
> +    uint64_t size_to_unbind;
> +    Range blockrange = range_empty;
> +    Range nvdimmrange = range_empty;
> +    SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +    DeviceState *dev;
> +    PCDIMMDevice *dimm;
> +    uint64_t size, addr;
> +
> +    if (drc && spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {

And again.

> +        return H_PARAMETER;
> +    }
> +
> +    /* Check if starting_scm_logical_addr is block aligned */
> +    if (!QEMU_IS_ALIGNED(starting_scm_logical_addr,
> +                         SPAPR_MINIMUM_SCM_BLOCK_SIZE)) {
> +        return H_P2;
> +    }
> +
> +    dev = drc->dev;

And again the drc->dev case.

> +    dimm = PC_DIMM(dev);
> +    size = object_property_get_int(OBJECT(dimm), PC_DIMM_SIZE_PROP, NULL);
> +    addr = object_property_get_int(OBJECT(dimm), PC_DIMM_ADDR_PROP, NULL);
> +
> +    range_init_nofail(&nvdimmrange, addr, size);
> +
> +    size_to_unbind = no_of_scm_blocks_to_unbind * 
> SPAPR_MINIMUM_SCM_BLOCK_SIZE;

Check for overflow in this multiply.

> +
> +
> +    range_init_nofail(&blockrange, starting_scm_logical_addr, 
> size_to_unbind);
> +
> +    if (!range_contains_range(&nvdimmrange, &blockrange)) {
> +        return H_P3;
> +    }
> +
> +    /* continue_token should be zero as this hcall doesn't return H_BUSY. */
> +    if (continue_token > 0) {
> +        return H_P3;
> +    }
> +
> +    args[1] = no_of_scm_blocks_to_unbind;
> +
> +    /* let unplug take care of actual unbind */
> +    return H_SUCCESS;
> +}
> +
> +#define H_UNBIND_SCOPE_ALL 0x1
> +#define H_UNBIND_SCOPE_DRC 0x2
> +
> +static target_ulong h_scm_unbind_all(PowerPCCPU *cpu, SpaprMachineState 
> *spapr,
> +                                     target_ulong opcode, target_ulong *args)
> +{
> +    uint64_t target_scope = args[0];
> +    uint32_t drc_index = args[1];
> +    uint64_t continue_token = args[2];
> +    NVDIMMDevice *nvdimm;
> +    uint64_t size;
> +    uint64_t no_of_scm_blocks_unbound = 0;
> +
> +    if (target_scope == H_UNBIND_SCOPE_DRC) {
> +        DeviceState *dev;
> +        SpaprDrc *drc = spapr_drc_by_index(drc_index);
> +
> +        if (drc && spapr_drc_type(drc) != SPAPR_DR_CONNECTOR_TYPE_PMEM) {
> +            return H_P2;
> +        }
> +
> +        dev = drc->dev;
> +        nvdimm = NVDIMM(dev);
> +        size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP, 
> NULL);
> +
> +        no_of_scm_blocks_unbound = size / SPAPR_MINIMUM_SCM_BLOCK_SIZE;
> +    } else if (target_scope ==  H_UNBIND_SCOPE_ALL) {
> +        GSList *list, *dimms;
> +
> +        dimms = nvdimm_get_device_list();
> +        for (list = dimms; list; list = list->next) {
> +            nvdimm = list->data;
> +            size = object_property_get_int(OBJECT(nvdimm), PC_DIMM_SIZE_PROP,
> +                                           NULL);
> +
> +            no_of_scm_blocks_unbound += size / SPAPR_MINIMUM_SCM_BLOCK_SIZE;
> +        }
> +        g_slist_free(dimms);
> +    } else {
> +        return H_PARAMETER;
> +    }
> +
> +    /* continue_token should be zero as this hcall doesn't return H_BUSY. */
> +    if (continue_token > 0) {
> +        return H_P4;
> +    }

Usually better to do simple checks like this before more complex logic
like you have above.

> +
> +    args[1] = no_of_scm_blocks_unbound;
> +
> +    /* let unplug take care of actual unbind */
> +    return H_SUCCESS;
> +}
> +
>  static spapr_hcall_fn papr_hypercall_table[(MAX_HCALL_OPCODE / 4) + 1];
>  static spapr_hcall_fn kvmppc_hypercall_table[KVMPPC_HCALL_MAX - 
> KVMPPC_HCALL_BASE + 1];
>  static spapr_hcall_fn svm_hypercall_table[(SVM_HCALL_MAX - SVM_HCALL_BASE) / 
> 4 + 1];
> @@ -2079,6 +2372,13 @@ static void hypercall_register_types(void)
>      /* qemu/KVM-PPC specific hcalls */
>      spapr_register_hypercall(KVMPPC_H_RTAS, h_rtas);
>  
> +    /* qemu/scm specific hcalls */
> +    spapr_register_hypercall(H_SCM_READ_METADATA, h_scm_read_metadata);
> +    spapr_register_hypercall(H_SCM_WRITE_METADATA, h_scm_write_metadata);
> +    spapr_register_hypercall(H_SCM_BIND_MEM, h_scm_bind_mem);
> +    spapr_register_hypercall(H_SCM_UNBIND_MEM, h_scm_unbind_mem);
> +    spapr_register_hypercall(H_SCM_UNBIND_ALL, h_scm_unbind_all);
> +
>      /* ibm,client-architecture-support support */
>      spapr_register_hypercall(KVMPPC_H_CAS, h_client_architecture_support);
>  
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index a8cb3513d0..e1933e877d 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -286,6 +286,7 @@ struct SpaprMachineState {
>  #define H_P7              -60
>  #define H_P8              -61
>  #define H_P9              -62
> +#define H_OVERLAP         -68
>  #define H_UNSUPPORTED_FLAG -256
>  #define H_MULTI_THREADS_ACTIVE -9005
>  
> @@ -493,8 +494,13 @@ struct SpaprMachineState {
>  #define H_INT_ESB               0x3C8
>  #define H_INT_SYNC              0x3CC
>  #define H_INT_RESET             0x3D0
> +#define H_SCM_READ_METADATA     0x3E4
> +#define H_SCM_WRITE_METADATA    0x3E8
> +#define H_SCM_BIND_MEM          0x3EC
> +#define H_SCM_UNBIND_MEM        0x3F0
> +#define H_SCM_UNBIND_ALL        0x3FC
>  
> -#define MAX_HCALL_OPCODE        H_INT_RESET
> +#define MAX_HCALL_OPCODE        H_SCM_UNBIND_ALL
>  
>  /* The hcalls above are standardized in PAPR and implemented by pHyp
>   * as well.
> 

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
signature.asc
Description: PGP signature
[Prev in Thread]
Current Thread
[Next in Thread]
Re: [PATCH v3 3/3] spapr: Add Hcalls to support PAPR NVDIMM device, David Gibson <=
Prev by Date: Re: [PATCH v2] ppc/pnv: Create BMC devices at machine init
Next by Date: Re: [PATCH v3 2/3] spapr: Add NVDIMM device support
Previous by thread: [PATCH v2] ppc/pnv: Create BMC devices at machine init
Next by thread: Re: [PATCH v3 2/3] spapr: Add NVDIMM device support
Index(es):
- Date
- Thread