qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v8] introduce vfio-user protocol specification


From: John Levon
Subject: Re: [PATCH v8] introduce vfio-user protocol specification
Date: Mon, 10 May 2021 22:25:41 +0000

On Mon, May 10, 2021 at 05:57:37PM +0100, Stefan Hajnoczi wrote:

> On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> 
> Elena A: I CCed you in case you want to review the

Sorry, we should have included Elena already.

> > +VFIO sparse mmap
> > +^^^^^^^^^^^^^^^^
> > +
> > ++------------------+----------------------------------+
> > +| Name             | Value                            |
> > ++==================+==================================+
> > +| id               | VFIO_REGION_INFO_CAP_SPARSE_MMAP |
> > ++------------------+----------------------------------+
> > +| version          | 0x1                              |
> > ++------------------+----------------------------------+
> > +| next             | <next>                           |
> > ++------------------+----------------------------------+
> > +| sparse mmap info | VFIO region info sparse mmap     |
> > ++------------------+----------------------------------+
> > +
> > +This capability is defined when only a subrange of the region supports
> > +direct access by the client via mmap(). The VFIO sparse mmap area is 
> > defined in
> > +``<linux/vfio.h>`` (``struct vfio_region_sparse_mmap_area``).
> 
> It's a little early to reference struct vfio_region_sparse_mmap_area
> here because the parent struct vfio_region_info_cap_sparse_mmap is only
> referenced below. I suggest combining the two:
> 
>   The VFIO sparse mmap area is defined in ``<linux/vfio.h>`` (``struct
>   vfio_region_info_cap_sparse_mmap`` and ``struct
>   vfio_region_sparse_mmap_area``).

Good idea.

> > +Region IO FD info format
> > +^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > ++-------------+--------+------+
> > +| Name        | Offset | Size |
> > ++=============+========+======+
> > +| argsz       | 16     | 4    |
> > ++-------------+--------+------+
> > +| flags       | 20     | 4    |
> > ++-------------+--------+------+
> > +| index       | 24     | 4    |
> > ++-------------+--------+------+
> > +| count       | 28     | 4    |
> > ++-------------+--------+------+
> > +| sub-regions | 32     | ...  |
> > ++-------------+--------+------+
> > +
> > +* *argsz* is the size of the region IO FD info structure plus the
> > +  total size of the sub-region array. Thus, each array entry "i" is at 
> > offset
> > +  i * ((argsz - 32) / count). Note that currently this is 40 bytes for 
> > both IO
> > +  FD types, but this is not to be relied on.
> > +* *flags* must be zero
> > +* *index* is the index of memory region being queried
> > +* *count* is the number of sub-regions in the array
> > +* *sub-regions* is the array of Sub-Region IO FD info structures
> > +
> > +The client must set ``flags`` to zero and specify the region being queried 
> > in
> > +the ``index``.
> > +
> > +The client sets the ``argsz`` field to indicate the maximum size of the 
> > response
> > +that the server can send, which must be at least the size of the response 
> > header
> > +plus space for the sub-region array. If the full response size exceeds 
> > ``argsz``,
> > +then the server must respond only with the response header and the Region 
> > IO FD
> > +info structure, setting in ``argsz`` the buffer size required to store the 
> > full
> > +response. In this case, no file descriptors are passed back.  The client 
> > then
> > +retries the operation with a larger receive buffer.
> > +
> > +The reply message will additionally include at least one file descriptor 
> > in the
> > +ancillary data. Note that more than one sub-region may share the same file
> > +descriptor.
> 
> How does this interact with the maximum number of file descriptors,
> max_fds? It is possible that there are more sub-regions than max_fds
> allows...

I think this would just be a matter of the client advertising a reasonably large
enough size for max_msg_fds. Do we need to worry about this?

> > +Each sub-region given in the response has one of two possible structures,
> > +depending whether *type* is `VFIO_USER_IO_FD_TYPE_IOEVENTFD` or
> > +`VFIO_USER_IO_FD_TYPE_IOREGIONFD`:
> > +
> > +Sub-Region IO FD info format (ioeventfd)
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > ++-----------+--------+------+
> > +| Name      | Offset | Size |
> > ++===========+========+======+
> > +| offset    | 0      | 8    |
> > ++-----------+--------+------+
> > +| size      | 8      | 8    |
> > ++-----------+--------+------+
> > +| fd_index  | 16     | 4    |
> > ++-----------+--------+------+
> > +| type      | 20     | 4    |
> > ++-----------+--------+------+
> > +| flags     | 24     | 4    |
> > ++-----------+--------+------+
> > +| padding   | 28     | 4    |
> > ++-----------+--------+------+
> > +| datamatch | 32     | 8    |
> > ++-----------+--------+------+
> > +
> > +* *offset* is the offset of the start of the sub-region within the region
> > +  requested ("physical address offset" for the region)
> > +* *size* is the length of the sub-region. This may be zero if the access 
> > size is
> > +  not relevant, which may allow for optimizations
> > +* *fd_index* is the index in the ancillary data of the FD to use for 
> > ioeventfd
> > +  notification; it may be shared.
> > +* *type* is `VFIO_USER_IO_FD_TYPE_IOEVENTFD`
> > +* *flags* is any of:
> > +  * `KVM_IOEVENTFD_FLAG_DATAMATCH`
> > +  * `KVM_IOEVENTFD_FLAG_PIO`
> 
> The client must not trust the server, so care must be taken to validate
> flags and offsets. Failure to do so would allow the server to hijack I/O
> dispatch to addresses outside its regions (e.g. MMIO vs PIO or an offset
> beyond the end of the region).
> 
> It would help to mention this explicitly in the spec so that client
> implementors take care.

I'll add a note.

> > +  * `KVM_IOREGION_PIO`
> > +  * `KVM_IOREGION_POSTED_WRITES`
> > +* *user_data* is an opaque value passed back to the server via a message 
> > on the
> > +  file descriptor
> > +
> > +For further information on the ioregionfd-specific fields, see:
> > +https://lore.kernel.org/kvm/cover.1613828726.git.eafanasova@gmail.com/
> > +
> > +(FIXME: update with final API docs.)
> 
> I suggest postponing the ioregionfd part of the spec until the KVM code
> lands. In general the approach makes sense to me though, so I think it
> will be possible to merge it later with minimal changes.

I think it's useful to have it now, still, at least until we hit 1.0 and can't
make incompatible changes. We've already had several useful review comments from
yourself and others.
 
I agree it shouldn't be in place in any final version, though. (Ideally, we'd
have an implementation as well.)

> > +VFIO IRQ info format
> > +^^^^^^^^^^^^^^^^^^^^
> > +
> > ++-------+--------+---------------------------+
> > +| Name  | Offset | Size                      |
> > ++=======+========+===========================+
> > +| argsz | 16     | 4                         |
> > ++-------+--------+---------------------------+
> > +| flags | 20     | 4                         |
> > ++-------+--------+---------------------------+
> > +|       | +-----+--------------------------+ |
> > +|       | | Bit | Definition               | |
> > +|       | +=====+==========================+ |
> > +|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
> > +|       | +-----+--------------------------+ |
> > +|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
> > +|       | +-----+--------------------------+ |
> > +|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
> > +|       | +-----+--------------------------+ |
> > +|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
> > +|       | +-----+--------------------------+ |
> > ++-------+--------+---------------------------+
> > +| index | 24     | 4                         |
> > ++-------+--------+---------------------------+
> > +| count | 28     | 4                         |
> > ++-------+--------+---------------------------+
> > +
> > +* *argsz* is the size of the VFIO IRQ info structure.
> > +* *flags* defines IRQ attributes:
> > +
> > +  * *VFIO_IRQ_INFO_EVENTFD* indicates the IRQ type can support server 
> > eventfd
> > +    signalling.
> > +  * *VFIO_IRQ_INFO_MASKABLE* indicates that the IRQ type supports the MASK 
> > and
> > +    UNMASK actions in a VFIO_USER_DEVICE_SET_IRQS message.
> > +  * *VFIO_IRQ_INFO_AUTOMASKED* indicates the IRQ type masks itself after 
> > being
> > +    triggered, and the client must send an UNMASK action to receive new
> > +    interrupts.
> > +  * *VFIO_IRQ_INFO_NORESIZE* indicates VFIO_USER_SET_IRQS operations setup
> > +    interrupts as a set, and new sub-indexes cannot be enabled without 
> > disabling
> > +    the entire type.
> > +
> > +* index is the index of IRQ type being queried, it is the only field that 
> > is
> > +  required to be set in the command message.
> > +* count describes the number of interrupts of the queried type.
> 
> Is count an output-only field since the previous sentence says index is
> the only field required in the command message?
> 
> I find it confusing that the spec shows the input/output structs without
> explicitly documenting that fields are input, output, or input & output.

I agree. I'll take care of this.

https://github.com/nutanix/libvfio-user/issues/486

> > +VFIO IRQ set format
> > +^^^^^^^^^^^^^^^^^^^
> > +
> > ++-------+--------+------------------------------+
> > +| Name  | Offset | Size                         |
> > ++=======+========+==============================+
> > +| argsz | 16     | 4                            |
> > ++-------+--------+------------------------------+
> > +| flags | 20     | 4                            |
> > ++-------+--------+------------------------------+
> > +|       | +-----+-----------------------------+ |
> > +|       | | Bit | Definition                  | |
> > +|       | +=====+=============================+ |
> > +|       | | 0   | VFIO_IRQ_SET_DATA_NONE      | |
> > +|       | +-----+-----------------------------+ |
> > +|       | | 1   | VFIO_IRQ_SET_DATA_BOOL      | |
> > +|       | +-----+-----------------------------+ |
> > +|       | | 2   | VFIO_IRQ_SET_DATA_EVENTFD   | |
> > +|       | +-----+-----------------------------+ |
> > +|       | | 3   | VFIO_IRQ_SET_ACTION_MASK    | |
> > +|       | +-----+-----------------------------+ |
> > +|       | | 4   | VFIO_IRQ_SET_ACTION_UNMASK  | |
> > +|       | +-----+-----------------------------+ |
> > +|       | | 5   | VFIO_IRQ_SET_ACTION_TRIGGER | |
> > +|       | +-----+-----------------------------+ |
> > ++-------+--------+------------------------------+
> > +| index | 24     | 4                            |
> > ++-------+--------+------------------------------+
> > +| start | 28     | 4                            |
> > ++-------+--------+------------------------------+
> > +| count | 32     | 4                            |
> > ++-------+--------+------------------------------+
> > +| data  | 36     | variable                     |
> > ++-------+--------+------------------------------+
> > +
> > +* *argsz* is the size of the VFIO IRQ set structure, including any *data* 
> > field.
> > +* *flags* defines the action performed on the interrupt range. The DATA 
> > flags
> > +  describe the data field sent in the message; the ACTION flags describe 
> > the
> > +  action to be performed. The flags are mutually exclusive for both sets.
> > +
> > +  * *VFIO_IRQ_SET_DATA_NONE* indicates there is no data field in the 
> > command.
> > +    The action is performed unconditionally.
> > +  * *VFIO_IRQ_SET_DATA_BOOL* indicates the data field is an array of 
> > boolean
> > +    bytes. The action is performed if the corresponding boolean is true.
> > +  * *VFIO_IRQ_SET_DATA_EVENTFD* indicates an array of event file 
> > descriptors
> > +    was sent in the message meta-data. These descriptors will be signalled 
> > when
> 
> signalled...by the client or by the server?

Either.

> For example, does VFIO_IRQ_SET_ACTION_TRIGGER +
> VFIO_IRQ_SET_DATA_EVENTFD provide an eventfd that the server will signal
> when the device raises the interrupt?

The server can trigger it, but so can the client (for whatever reason) via a
combination of VFIO_IRQ_SET_ACTION_TRIGGER+VFIO_IRQ_SET_DATA_BOOL/NONE.
> 
> On the other hand, VFIO_IRQ_SET_ACTION_MASK + VFIO_IRQ_SET_DATA_EVENTFD
> seems to be the other way around. The server reads from the eventfd to
> respond when the irq is masked.
> 
> > +    the action defined by the action flags occurs. In AF_UNIX sockets, the
> > +    descriptors are sent as SCM_RIGHTS type ancillary data.
> > +    If no file descriptors are provided, this de-assigns the specified
> > +    previously configured interrupts.
> > +  * *VFIO_IRQ_SET_ACTION_MASK* indicates a masking event. It can be used 
> > with
> > +    VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to mask an interrupt, 
> > or
> > +    with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when the guest 
> > masks
> > +    the interrupt.
> > +  * *VFIO_IRQ_SET_ACTION_UNMASK* indicates an unmasking event. It can be 
> > used
> > +    with VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to unmask an
> > +    interrupt, or with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when 
> > the
> > +    guest unmasks the interrupt.
> > +  * *VFIO_IRQ_SET_ACTION_TRIGGER* indicates a triggering event. It can be 
> > used
> > +    with VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE to trigger an
> > +    interrupt, or with VFIO_IRQ_SET_DATA_EVENTFD to generate an event when 
> > the
> > +    server triggers the interrupt.

I believe (could be wrong) we inherited these semantics (and IMO rather
unfortunate naming) from vfio.

> Maybe the text can be restructured to make this clear.

Yes, I think it probably can, let me take a pass.

> > +VFIO_USER_REGION_READ
> > +---------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name         | Value                  |
> > ++==============+========================+
> > +| Message ID   | <ID>                   |
> > ++--------------+------------------------+
> > +| Command      | 9                      |
> > ++--------------+------------------------+
> > +| Message size | 32 + data size         |
> > ++--------------+------------------------+
> > +| Flags        | Reply bit set in reply |
> > ++--------------+------------------------+
> > +| Error        | 0/errno                |
> > ++--------------+------------------------+
> > +| Read info    | REGION read/write data |
> > ++--------------+------------------------+
> > +
> > +This command message is sent from the client to the server to read from 
> > server
> > +memory.  In the command messages, there is no data, and the count is the 
> > amount
> > +of data to be read. The reply message must include the data read, and its 
> > count
> > +field is the amount of data read.
> 
> There is no data in command messages, but Message size is still 32 +
> data size?

It is not. Spec was unclear here, I've added some text to this and
VFIO_DMA_READ.

> > +VFIO_USER_VM_INTERRUPT
> > +----------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++----------------+------------------------+
> > +| Name           | Value                  |
> > ++================+========================+
> > +| Message ID     | <ID>                   |
> > ++----------------+------------------------+
> > +| Command        | 13                     |
> > ++----------------+------------------------+
> > +| Message size   | 20                     |
> > ++----------------+------------------------+
> > +| Flags          | Reply bit set in reply |
> > ++----------------+------------------------+
> > +| Error          | 0/errno                |
> > ++----------------+------------------------+
> > +| Interrupt info | <interrupt>            |
> > ++----------------+------------------------+
> > +
> > +This command message is sent from the server to the client to signal the 
> > device
> > +has raised an interrupt.
> 
> Except if the client set up irq eventfds?

Clarified.

> > +Interrupt info format
> > +^^^^^^^^^^^^^^^^^^^^^
> > +
> > ++-----------+--------+------+
> > +| Name      | Offset | Size |
> > ++===========+========+======+
> > +| Sub-index | 16     | 4    |
> > ++-----------+--------+------+
> > +
> > +* *Sub-index* is relative to the IRQ index, e.g., the vector number used 
> > in PCI
> > +  MSI/X type interrupts.
> 
> Hmm...this is weird. The server tells the client to raise an MSI-X
> interrupt but does not include the MSI message that resides in the MSI-X
> table BAR device region? Or should MSI-X interrupts be delivered to the
> client via VFIO_USER_DMA_WRITE instead?
> 
> (Basically it's not clear to me how MSI-X interrupts would work with
> vfio-user. Reading how they work in kernel VFIO might let me infer it,
> but it's probably worth explaining this clearly in the spec.)

It doesn't. We don't have an implementation, and the qemu patches don't get this
right either - it treats the sub-index as the IRQ index AKA IRQ type.

I'd be inclined to just remove this for now, until we have an implementation.
Thoughts?

> > +VFIO_USER_DEVICE_RESET
> > +----------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name         | Value                  |
> > ++==============+========================+
> > +| Message ID   | <ID>                   |
> > ++--------------+------------------------+
> > +| Command      | 14                     |
> > ++--------------+------------------------+
> > +| Message size | 16                     |
> > ++--------------+------------------------+
> > +| Flags        | Reply bit set in reply |
> > ++--------------+------------------------+
> > +| Error        | 0/errno                |
> > ++--------------+------------------------+
> > +
> > +This command message is sent from the client to the server to reset the 
> > device.
> 
> Any requirements for how long VFIO_USER_DEVICE_RESET takes to complete?
> In some cases a reset involves the server communicating with other
> systems or components and this can take an unbounded amount of time.
> Therefore this message could hang. For example, if a vfio-user NVMe
> device was accessing data on a hung NFS export and there were I/O
> requests in flight that need to be aborted.

I'm not sure this is something we could put in the generic spec. Perhaps a
caveat?

> > +VFIO_USER_DIRTY_PAGES
> > +---------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------------+------------------------+
> > +| Name               | Value                  |
> > ++====================+========================+
> > +| Message ID         | <ID>                   |
> > ++--------------------+------------------------+
> > +| Command            | 15                     |
> > ++--------------------+------------------------+
> > +| Message size       | 16                     |
> > ++--------------------+------------------------+
> > +| Flags              | Reply bit set in reply |
> > ++--------------------+------------------------+
> > +| Error              | 0/errno                |
> > ++--------------------+------------------------+
> > +| VFIO Dirty bitmap  | <dirty bitmap>         |
> > ++--------------------+------------------------+
> > +
> > +This command is analogous to VFIO_IOMMU_DIRTY_PAGES. It is sent by the 
> > client
> > +to the server in order to control logging of dirty pages, usually during a 
> > live
> > +migration. The VFIO dirty bitmap structure is defined in ``<linux/vfio.h>``
> > +(``struct vfio_iommu_type1_dirty_bitmap``).
> 
> Do all vfio-user servers need to implement VFIO_USER_DIRTY_PAGES? It's
> common for some device implementations to omit migration support because
> it increases implementation complexity and is not needed in certain use
> cases.

Added a note that this is optional.

thanks
john



reply via email to

[Prev in Thread] Current Thread [Next in Thread]