[RFC] cxl: Multi-headed device design


From:	Gregory Price
Subject:	[RFC] cxl: Multi-headed device design
Date:	Tue, 21 Mar 2023 21:50:33 -0400
Originally I was planning to kick this off with a patch set, but I've
decided my current prototype does not fit the extensibility requirements
to go from SLD to MH-SLD to MH-MLD.


So instead I'd like to kick off by just discussing the data structures
and laugh/cry a bit about some of the frustrating ambiguities for MH-SLDs
when it comes to the specification.

I apologize for the sheer length of this email, but it really is just
that complex.


=============================================================
 What does the specification say about Multi-headed Devices? 
=============================================================

Defining each relevant component according to the specification:

>
> VCS - Virtual CXL Switch
> * Includes entities within the physical switch belonging to a
>   single VH. It is identified using the VCS ID.
> 
> 
> VH - Virtual Hierarchy.
> * Everything from the CXL RP down.
> 
> 
> LD - Logical Device
> * Entity that represents a CXL Endpoint that is bound to a VCS.
>   An SLD device contains one LD.  An MLD contains multiple LDs.
> 
> 
> SLD - Single Logical Device
> * That's it, that's the definition.
> 
> 
> MLD - Multi Logical Device
> * Multi-Logical Device. CXL component that contains multiple LDs,
>   out of which one LD is reserved for configuration via the FM API,
>   and each remaining LD is suitable for assignment to a different
>   host. Currently MLDs are architected only for Type 3 LDs.
> 
> 
> MH-SLD - Multi-Headed SLD
> * CXL component that contains multiple CXL ports, each presenting an
>   SLD. The ports must correctly operate when connected to any
>   combination of common or different hosts.
> 
> 
> MH-MLD - Multi-Headed MLD
> * CXL component that contains multiple CXL ports, each presenting an MLD
>   or SLD. The ports must correctly operate when connected to any
>   combination of common or different hosts. The FM-API is used to
>   configure each LD as well as the overall MH-MLD.
> 
>   MH-MLDs are considered a specialized type of MLD and, as such, are
>   subject to all functional and behavioral requirements of MLDs.
> 

Ambiguity #1:

* An SLD contains 1 Logical Device.
* An MH-SLD presents multiple SLDs, one per head.

Ergo an MH-SLD contains multiple LDs which makes it an MLD according to the
definition of LD, but not according to the definition of MLD, or MH-MLD.

Now is the winter of my discontent.

The specification says this about MH-SLDs in other sections:

> 2.4.3 Pooled and Shared FAM
> 
> LD-FAM includes several device variants.
> 
> A multi-headed Single Logical Device (MH-SLD) exposes multiple LDs, each with
> a dedicated link.
> 
>
> 2.5 Multi-Headed Device
> 
> There are two types of Multi-Headed Devices that are distinguished by how
> they present themselves on each head:
> *  MH-SLD, which present SLDs on all heads
> *  MH-MLD, which may present MLDs on any of their heads
>
>
> Management of heads in Multi-Headed Devices follows the model defined for
> the device presented by that head:
> *  Heads that present SLDs may support the port management and control
>     features that are available for SLDs
> *  Heads that present MLDs may support the port management and control
>    features that are available for MLDs
>

I want to take very close note of this.  SLDs are managed like SLDs,
MLDs are managed like MLDs.  MH-SLDs, according to this, should be
managed like SLDs from the perspective of each host.

That's pretty straightforward.

>
> Management of memory resources in Multi-Headed Devices follows the model
> defined for MLD components because both MH-SLDs and MH-MLDs must support
> the isolation of memory resources, state, context, and management on a
> per-LD basis.  LDs within the device are mapped to a single head.
> 
> *  In MH-SLDs, there is a 1:1 mapping between heads and LDs.
> *  In MH-MLDs, multiple LDs are mapped to at most one head.
> 
> 
> Multi-Headed Devices expose a dedicated Component Command Interface (CCI),
> the LD Pool CCI, for management of all LDs within the device. The LD Pool
> CCI may be exposed as an MCTP-based CCI or can be accessed via the Tunnel
> Management Command command through a head’s Mailbox CCI, as detailed in
> Section 7.6.7.3.1.

2.5.1 continues on to describe "LD Management in MH-MLDs" but just ignores
that MH-SLDs (may) exist.  That's frustrating to say the least, but I
suppose we can gather from context that MH-SLDs *MAY NOT* have LD
management controls.

Let's see if that assumption holds.

> 7.6.7.3 MLD Port Command Set
>
> 7.6.7.3.1 Tunnel Management Command (Opcode 5300h)

The referenced section at the end of 2.5 seems to also suggest that
MH-SLDs do not (or don't have to?) implement the tunnel management
command set.  It sends us to the MLD command set, and SLDs don't get
managed like MLDs - ergo it's not relevant?

The final mention of MH-SLDs is in section 9.13.3:

> 9.13.3 Dynamic Capacity Device
> ...
>  MH-SLD or MH-MLD based DCD shall forcefully release shared Dynamic
>  Capacity associated with all associated hosts upon a Conventional Reset
>  of a head.
>

From this we can gather that the specification foresaw someone making a
memory pool from an MH-SLD... but without LD management. We can fill in
some blanks and assume that if someone wanted to, they could make a
shared memory device and implement pooling via software controls.

That'd be a neat bodge/hack.  But that's not important right now.


Finally, we look at what the mailbox command-set requirements are for
multi-headed devices:

> 7.6.7.5 Multi-Headed Device Command Set
> The Multi-Headed device command set includes commands for querying the
> Head-to-LD mapping in a Multi-Headed device. Support for this command
> set is required on the LD Pool CCI of a Multi-Headed device.
>

Ambiguity #2: Ok, now we're not sure whether an MH-SLD is supposed to
expose an LD Pool CCI or not.  Also, is an MH-SLD supposed to show up
as something other than just an SLD?  This is really confusing.

Going back to the MLD Port Command set, we see

> Valid targets for the tunneled commands include switch MLD Ports,
> valid LDs within an MLD, and the LD Pool CCI in a Multi-Headed device.

Whatever the case, there's only a single command in the MHD command set:

> 7.6.7.5.1 Get Multi-Headed Info (Opcode 5500h)

This command is pretty straightforward: it just tells you what the head
to LD mapping is for each of the LDs in the device.  Presumably this is
what gets modified by the FM-APIs when LDs are attached to VCS ports.

For the simplest MH-SLD device, these fields would be immutable, and
there would be a single LD for each head, where head_id == ld_id.
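
To make the static case concrete, here is a minimal sketch of the lookup
a Get Multi-Headed Info handler might perform.  It assumes ldmap[] is
indexed by LD ID and holds the head that LD is bound to.  The names
(mhd_info, mhd_info_init_identity, mhd_lds_for_head) and the MHD_MAX_LDS
bound are made up for illustration; they are neither QEMU nor spec
identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MHD_MAX_LDS 16 /* arbitrary bound for this sketch */

/* Hypothetical stand-in for the data behind Get Multi-Headed Info. */
struct mhd_info {
    uint8_t nr_heads;
    uint8_t nr_lds;
    uint8_t ldmap[MHD_MAX_LDS]; /* assumption: ldmap[ld_id] == head_id */
};

/* Fill in the immutable mapping of the simplest MH-SLD: head_id == ld_id. */
static void mhd_info_init_identity(struct mhd_info *info, uint8_t nr_heads)
{
    assert(nr_heads <= MHD_MAX_LDS);
    info->nr_heads = nr_heads;
    info->nr_lds = nr_heads;
    for (uint8_t ld = 0; ld < nr_heads; ld++) {
        info->ldmap[ld] = ld;
    }
}

/* Collect the LDs bound to `head` into `out`; returns how many were found. */
static size_t mhd_lds_for_head(const struct mhd_info *info, uint8_t head,
                               uint8_t *out, size_t out_len)
{
    size_t n = 0;

    for (uint8_t ld = 0; ld < info->nr_lds && n < out_len; ld++) {
        if (info->ldmap[ld] == head) {
            out[n++] = ld;
        }
    }
    return n;
}
```

For an MH-SLD every head should get exactly one LD back; an MH-MLD could
return several for the same head.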



So summarizing, what I took away from this was the following:

In the simplest form of MH-SLD, there is neither a switch nor any
LD management.  So, presumably, we don't HAVE to implement the
MHD commands to say we "have MH-SLD support".


========
 Design
========

Ok... that's a lot to break down.  Here's what I think the roadmap
toward multi-headed multi-logical device support should look like:

1. SLD - we have this.  This is struct CXLType3Dev

2. MH-SLD No Switch, No Pool CCI.

3. MH-SLD w/ Pool CCI  (Implementing only Get Multi-Headed Info)

4. MH-SLD w/ Switch (Implementing remap-ability of LD to Head)

5. MH-MLD - the whole kit and kaboodle.


Let's talk about what the first MH-SLD might look like.


=================================
2. MH-SLD No Switch, No Pool CCI.
=================================

1. The device has a "memory pool" that "backs" each Logical Device, and
   the specification does not limit whether this memory is discrete
   or may be shared between heads.

   In QEMU, we can represent this with a shared or file memory backend:

-object memory-backend-file,id=mem0,mem-path=/tmp/mem0,size=4G,share=true


2. Each QEMU instance has a discrete SLD that amounts to its own private
   CXLType3Dev.  However, each "Head" maps back to the same common
   memory backend:

-device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0


And that's it.  In fact, you can do this now, no changes needed!


But it's also not very useful.  You can only use the memory in devdax
mode, since it's a shared memory region. You could already do this via
the /dev/shm interface, so it's not even new functionality.

In theory you could build a pooling service in software-only on top of
memory blocks. That's an exercise left to the reader.
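
For the curious, a very rough sketch of what such a software-only pool
could look like: a table of block owners that every host agrees to
consult before touching a block.  Everything here is hypothetical
(the sw_pool names, the 32-block split, the 0xff "free" sentinel), and
it deliberately ignores the hard part, which is synchronizing the table
between hosts.

```c
#include <assert.h>
#include <stdint.h>

#define POOL_NR_BLOCKS  32   /* e.g. a 4G backend cut into 128M blocks */
#define POOL_BLOCK_FREE 0xff /* sentinel: no head owns this block */

/* Hypothetical ownership table that all heads would agree to consult. */
struct sw_pool {
    uint8_t owner[POOL_NR_BLOCKS]; /* head id, or POOL_BLOCK_FREE */
};

/* Mark every block as unowned. */
void sw_pool_init(struct sw_pool *p)
{
    for (int i = 0; i < POOL_NR_BLOCKS; i++) {
        p->owner[i] = POOL_BLOCK_FREE;
    }
}

/*
 * Claim the first free block for `head`.  Returns the block index, or -1
 * if the pool is exhausted.  Not safe against concurrent claimers: a real
 * version would need atomics or a lock that works across hosts.
 */
int sw_pool_claim(struct sw_pool *p, uint8_t head)
{
    assert(head != POOL_BLOCK_FREE);
    for (int i = 0; i < POOL_NR_BLOCKS; i++) {
        if (p->owner[i] == POOL_BLOCK_FREE) {
            p->owner[i] = head;
            return i;
        }
    }
    return -1;
}

/* Release `block`; only its current owner may do so.  Returns 0 or -1. */
int sw_pool_release(struct sw_pool *p, uint8_t head, int block)
{
    if (block < 0 || block >= POOL_NR_BLOCKS || p->owner[block] != head) {
        return -1;
    }
    p->owner[block] = POOL_BLOCK_FREE;
    return 0;
}
```

The table itself would have to live somewhere all hosts can see, which
is the same shared-state problem the later sections run into.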


================================================================
3. MH-SLD w/ Pool CCI  (Implementing only Get Multi-Headed Info)
================================================================

This is a little more complicated: we have our first bit of shared
state.  Originally I had considered a shared memory region in
CXLType3Dev, but this is a backwards abstraction (an MH-SLD contains
multiple SLDs, an SLD does not contain an MHD State).

diff --git a/include/hw/cxl/cxl_device.h b/include/hw/cxl/cxl_device.h
index 7b72345079..1a9f2708e1 100644
--- a/include/hw/cxl/cxl_device.h
+++ b/include/hw/cxl/cxl_device.h
@@ -356,16 +356,6 @@ typedef struct CXLPoison {
 typedef QLIST_HEAD(, CXLPoison) CXLPoisonList;
 #define CXL_POISON_LIST_LIMIT 256

+struct CXLMHDState {
+    uint8_t nr_heads;
+    uint8_t nr_lds;
+    uint8_t ldmap[];
+};
+
 struct CXLType3Dev {
     /* Private */
     PCIDevice parent_obj;
@@ -377,15 +367,6 @@ struct CXLType3Dev {
     HostMemoryBackend *lsa;
     uint64_t sn;

+
+    /* Multi-headed device settings */
+    struct {
+        bool active;
+        uint32_t headid;
+        uint32_t shmid;
+        struct CXLMHDState *state;
+    } mhd;
+


The way you would instantiate this would be via a separate process
that initializes the shared memory region:

shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`
./cxl_mhd_init 4 $shmid1
-device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd=true,mhd_head=0,mhd_shmid=$shmid1

./cxl_mhd_init would simply set up the nr_heads/nr_lds fields and such
and set ldmap[0-3] to the values [0-3].  i.e. the head-to-ld mappings
are static (head_id==ld_id).
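
As a sketch of what the initialization step might boil down to, assuming
the region has already been attached (for example with
shmat(shmid, NULL, 0)): the function below only fills in the header and
the identity ldmap.  The names mhd_shared_state and
mhd_shared_state_init are placeholders, and the real tool may well look
different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as the CXLMHDState proposed in the diff above. */
struct mhd_shared_state {
    uint8_t nr_heads;
    uint8_t nr_lds;
    uint8_t ldmap[]; /* assumption: ldmap[ld_id] == head_id */
};

/*
 * Initialize an already-mapped region for `nr_heads` heads with a static
 * head_id == ld_id mapping.  Returns 0 on success, or -1 if nr_heads is 0
 * or the region is too small to hold the header plus one byte per LD.
 */
int mhd_shared_state_init(void *region, size_t region_size, uint8_t nr_heads)
{
    size_t needed = sizeof(struct mhd_shared_state) + nr_heads;
    struct mhd_shared_state *s = region;

    assert(region != NULL);
    if (nr_heads == 0 || region_size < needed) {
        return -1;
    }

    memset(region, 0, needed);
    s->nr_heads = nr_heads;
    s->nr_lds = nr_heads; /* MH-SLD: exactly one LD per head */
    for (uint8_t ld = 0; ld < nr_heads; ld++) {
        s->ldmap[ld] = ld;
    }
    return 0;
}
```

With the 4096-byte segment from ipcmk there is plenty of room; the size
check matters more once the state grows beyond a byte per LD.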



But like I said, this is a backwards abstraction, so realistically we
should flip this around such that we have the following:

struct CXLMHD_SharedState {
        uint8_t nr_heads;
        uint8_t nr_lds;
        uint8_t ldmap[];
};

struct CXLMH_SLD {
        uint32_t headid;
        uint32_t shmid;
        struct CXLMHD_SharedState *state;
        struct CXLType3Dev sld;
};

The shared state would be instantiated the same way as above.

With this we'd basically just create a new memory device:

hw/mem/cxl_mh_sld.c


This is pretty straightforward - we just expose some of cxl_type3.c
functions in order to instantiate the device accordingly, the rest of it
just becomes passthrough because... it's just a cxl_type3.c device.
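
A toy illustration of the "wrapper plus passthrough" shape, in plain C
with stand-in types.  fake_type3_dev is NOT CXLType3Dev, and all of the
function names are invented for this sketch; in QEMU proper this would
presumably be expressed through QOM rather than raw offsetof().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for CXLType3Dev; the real struct is much larger. */
struct fake_type3_dev {
    uint64_t sn;
};

/* Stand-in for the shared state, with a fixed-size map for simplicity. */
struct fake_mhd_shared {
    uint8_t nr_heads;
    uint8_t nr_lds;
    uint8_t ldmap[8]; /* assumption: ldmap[ld_id] == head_id */
};

/* Mirrors the proposed CXLMH_SLD: the head wraps an ordinary SLD. */
struct fake_mh_sld {
    uint32_t headid;
    struct fake_mhd_shared *state;
    struct fake_type3_dev sld;
};

/* Recover the wrapping head from a pointer to its embedded SLD. */
static struct fake_mh_sld *mh_sld_from_sld(struct fake_type3_dev *sld)
{
    return (struct fake_mh_sld *)((char *)sld -
                                  offsetof(struct fake_mh_sld, sld));
}

/* Passthrough: plain SLD behavior is served by the embedded device. */
static uint64_t mh_sld_get_sn(const struct fake_mh_sld *mhd)
{
    return mhd->sld.sn;
}

/* MHD-specific: which LD does this head present?  Returns -1 if none. */
static int mh_sld_ld_for_head(const struct fake_mh_sld *mhd)
{
    assert(mhd->state->nr_lds <= 8);
    for (uint8_t ld = 0; ld < mhd->state->nr_lds; ld++) {
        if (mhd->state->ldmap[ld] == mhd->headid) {
            return ld;
        }
    }
    return -1;
}
```

The point is only that the SLD code never needs to know it is embedded;
the multi-headed bits live entirely in the wrapper.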


This ultimately manifests as:

shmid1=`ipcmk -M 4096 | grep -o -E '[0-9]+' | head -1`

./cxl_mhd_init 4 $shmid1

-device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=0,mhd_shmid=$shmid1


Note: This is the patch set I'm working towards, but I presume there
might be some (strong) opinions, so I didn't want to get too far into
development before posting this.


==============================================================
4. MH-SLD w/ Switch (Implementing LD management in an SLD)
==============================================================

Is it even rational to try to build such a device?

MH-SLDs have a 1-to-1 mapping of Head:Logical Device.

Presumably each SLD maps the entirety of the "pooled" memory,
but the specification does not state that is true.  You could, for
example, set up each Logical Device to map to a particular portion of the
shared/pooled memory area:

-object memory-backend-file,id=mem0,mem-path=/tmp/mem0,size=4G,share=true

QEMU #1
-device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=0,mhd_shmid=shmid,dpa_base=0,dpa_limit=1G

QEMU #2
-device cxl-mhd-sld,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,mhd_head=1,mhd_shmid=shmid,dpa_base=1G,dpa_limit=1G

... and so on.

At least in theory, this would involve implementing something that
changes which SLD is mapped to a QEMU instance - but functionally this
is just changing the base and limit of each SLD.

It's interesting from a functional testing perspective: there's a bunch
of CCI/Tunnel commands that could be implemented, and presumably this
would require a separate process to manage/serialize appropriately.

=======================================
5. MH-MLD - the whole kit and kaboodle.
=======================================

If we implemented MH-SLD w/ Switching, then presumably it's just one step
further to create an MLD:

struct CXLMH_MLD {
        uint32_t headid;
        uint32_t shmid;
        struct CXLMHD_SharedState *state;
        struct CXLType3Dev ldmap[];
};

But I'm greatly oversimplifying here.  It's much more expressive to
describe an MLD in terms of a multi-tiered switch in the QEMU topology,
similar to what can be done right now:

-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12 \
-device cxl-rp,id=rp0,port=0,bus=cxl.0,chassis=0,slot=0 \
-device cxl-rp,id=rp1,port=1,bus=cxl.0,chassis=0,slot=1 \
-device cxl-upstream,bus=rp0,id=us0 \
-device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
-device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \
-device cxl-type3,bus=swport0,volatile-memdev=mem0,id=cxl-mem0 \
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k


But in order to make this multi-headed, some amount of this state would need
to be encapsulated in a shared memory region (or would it? I don't know, I
haven't finished this thought experiment yet).


=====
 FIN 
=====

I realize this was long.  If you made it to the end of this email,
thank you for reading my TED talk.  I greatly appreciate any comments,
even if it's just "You've gone too deep, Gregory." ;]

Regards,
~Gregory