Re: [PATCH v8] introduce vfio-user protocol specification
From: Stefan Hajnoczi
Subject: Re: [PATCH v8] introduce vfio-user protocol specification
Date: Tue, 11 May 2021 11:09:53 +0100

On Mon, May 10, 2021 at 10:25:41PM +0000, John Levon wrote:
> On Mon, May 10, 2021 at 05:57:37PM +0100, Stefan Hajnoczi wrote:
> > On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> > > +Region IO FD info format
> > > +^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > ++-------------+--------+------+
> > > +| Name        | Offset | Size |
> > > ++=============+========+======+
> > > +| argsz       | 16     | 4    |
> > > ++-------------+--------+------+
> > > +| flags       | 20     | 4    |
> > > ++-------------+--------+------+
> > > +| index       | 24     | 4    |
> > > ++-------------+--------+------+
> > > +| count       | 28     | 4    |
> > > ++-------------+--------+------+
> > > +| sub-regions | 32     | ...  |
> > > ++-------------+--------+------+
> > > +
> > > +* *argsz* is the size of the region IO FD info structure plus the
> > > +  total size of the sub-region array. Thus, each array entry "i" is at offset
> > > +  i * ((argsz - 32) / count). Note that currently this is 40 bytes for both IO
> > > +  FD types, but this is not to be relied on.
> > > +* *flags* must be zero
> > > +* *index* is the index of memory region being queried
> > > +* *count* is the number of sub-regions in the array
> > > +* *sub-regions* is the array of Sub-Region IO FD info structures
> > > +
> > > +The client must set ``flags`` to zero and specify the region being queried in
> > > +the ``index``.
> > > +
> > > +The client sets the ``argsz`` field to indicate the maximum size of the response
> > > +that the server can send, which must be at least the size of the response header
> > > +plus space for the sub-region array. If the full response size exceeds ``argsz``,
> > > +then the server must respond only with the response header and the Region IO FD
> > > +info structure, setting in ``argsz`` the buffer size required to store the full
> > > +response. In this case, no file descriptors are passed back. The client then
> > > +retries the operation with a larger receive buffer.
> > > +
> > > +The reply message will additionally include at least one file descriptor in the
> > > +ancillary data. Note that more than one sub-region may share the same file
> > > +descriptor.
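
The entry-offset arithmetic in the *argsz* bullet above can be sketched as follows. This is only an illustration of the quoted formula, not code from any vfio-user implementation: the function names are hypothetical, and treating the result as an offset from the start of the sub-region array is an assumption, since the text does not say what the offset is relative to.

```python
# Offset of the sub-region array within the message, per the table above.
SUB_REGIONS_OFFSET = 32


def sub_region_entry_size(argsz: int, count: int) -> int:
    """Per-entry size implied by the quoted formula: (argsz - 32) / count."""
    if count <= 0:
        raise ValueError("count must be positive")
    array_bytes = argsz - SUB_REGIONS_OFFSET
    if array_bytes < 0 or array_bytes % count != 0:
        raise ValueError("argsz does not describe a whole number of entries")
    return array_bytes // count


def sub_region_entry_offset(argsz: int, count: int, i: int) -> int:
    """Offset of entry i: i * ((argsz - 32) / count).

    Assumption: the offset is relative to the start of the sub-region array.
    """
    if not 0 <= i < count:
        raise IndexError("sub-region index out of range")
    return i * sub_region_entry_size(argsz, count)
```

Deriving the entry size from ``argsz`` and ``count`` rather than hard-coding 40 follows the note that the current size is not to be relied on.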
> >
> > How does this interact with the maximum number of file descriptors,
> > max_fds? It is possible that there are more sub-regions than max_fds
> > allows...
>
> I think this would just be a matter of the client advertising a reasonably large
> enough size for max_msg_fds. Do we need to worry about this?

vhost-user historically only supported passing 8 fds and it became a
problem there.

I can imagine devices having 10s to 100s of sub-regions (e.g. 64 queue
doorbells). Probably not 1000s.

If I were implementing a server I would check the negotiated max_fds and
refuse to start the vfio-user connection if the device has been
configured to require more sub-regions than that. Failing early and
printing an error would allow users to troubleshoot the issue and
re-configure the client/server.

This seems okay but the spec doesn't mention it explicitly so I wanted
to check what you had in mind.

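
A minimal sketch of the fail-early check described above, assuming the server knows how many sub-regions the device is configured with once max_fds has been negotiated. The names `ConfigError` and `check_fd_budget` are hypothetical and not part of the spec or of any existing server. The check is deliberately conservative: it assumes one fd per sub-region, even though sub-regions may share a file descriptor.

```python
class ConfigError(Exception):
    """Raised when the device cannot be served over this connection."""


def check_fd_budget(num_sub_regions: int, max_fds: int) -> None:
    """Refuse to start if a single reply could need more fds than negotiated.

    Worst case assumed here: every sub-region has its own file descriptor.
    """
    if num_sub_regions > max_fds:
        raise ConfigError(
            f"device needs up to {num_sub_regions} fds per message but only "
            f"{max_fds} were negotiated; raise max_fds on the client or "
            f"reduce the number of sub-regions"
        )
```

Called once at connection setup, e.g. `check_fd_budget(64, 8)` for a device with 64 queue doorbells and a vhost-user-style limit of 8, this raises with a message the user can act on instead of failing later inside a region query.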
> > > +Interrupt info format
> > > +^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > ++-----------+--------+------+
> > > +| Name      | Offset | Size |
> > > ++===========+========+======+
> > > +| Sub-index | 16     | 4    |
> > > ++-----------+--------+------+
> > > +
> > > +* *Sub-index* is relative to the IRQ index, e.g., the vector number used in PCI
> > > +  MSI/X type interrupts.
> >
> > Hmm...this is weird. The server tells the client to raise an MSI-X
> > interrupt but does not include the MSI message that resides in the MSI-X
> > table BAR device region? Or should MSI-X interrupts be delivered to the
> > client via VFIO_USER_DMA_WRITE instead?
> >
> > (Basically it's not clear to me how MSI-X interrupts would work with
> > vfio-user. Reading how they work in kernel VFIO might let me infer it,
> > but it's probably worth explaining this clearly in the spec.)
>
> It doesn't. We don't have an implementation, and the qemu patches don't get this
> right either - it treats the sub-index as the IRQ index AKA IRQ type.
>
> I'd be inclined to just remove this for now, until we have an implementation.
> Thoughts?

I don't remember the details of kernel VFIO irqs but it has an interface
where VFIO notifies KVM of configured irqs so that KVM can set up Posted
Interrupts. I think vfio-user would use KVM irqfd eventfds for efficient
interrupt injection instead since we're not trying to map a host
interrupt to a guest interrupt.

Fleshing out irqs sounds like a 1.0 milestone to me. It will definitely
be necessary but for now this can be dropped.

> > > +VFIO_USER_DEVICE_RESET
> > > +----------------------
> > > +
> > > +Message format
> > > +^^^^^^^^^^^^^^
> > > +
> > > ++--------------+------------------------+
> > > +| Name         | Value                  |
> > > ++==============+========================+
> > > +| Message ID   | <ID>                   |
> > > ++--------------+------------------------+
> > > +| Command      | 14                     |
> > > ++--------------+------------------------+
> > > +| Message size | 16                     |
> > > ++--------------+------------------------+
> > > +| Flags        | Reply bit set in reply |
> > > ++--------------+------------------------+
> > > +| Error        | 0/errno                |
> > > ++--------------+------------------------+
> > > +
> > > +This command message is sent from the client to the server to reset the device.
> >
> > Any requirements for how long VFIO_USER_DEVICE_RESET takes to complete?
> > In some cases a reset involves the server communicating with other
> > systems or components and this can take an unbounded amount of time.
> > Therefore this message could hang. For example, if a vfio-user NVMe
> > device was accessing data on a hung NFS export and there were I/O
> > requests in flight that need to be aborted.
>
> I'm not sure this is something we could put in the generic spec. Perhaps a
> caveat?

It's up to you whether you want to discuss this in the spec or let
client implementors figure it out themselves. Any vfio-user message can
take an unbounded amount of time and we could assume that readers will
think of this.

VFIO_USER_DEVICE_RESET is just particularly likely to be called by
clients from a synchronous code path. QEMU moved the monitor (RPC
interface) fd into a separate thread in order to stay responsive when
the main event loop is blocked for any reason, so the issue came to
mind.
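
One way a client could avoid blocking a synchronous code path on a slow reset is to issue it from a worker thread and bound the wait. The sketch below is an assumption about client design, not something the spec or QEMU prescribes; `send_reset` is a hypothetical callable that sends VFIO_USER_DEVICE_RESET and returns once the reply arrives.

```python
import concurrent.futures
from typing import Callable


def reset_with_timeout(send_reset: Callable[[], None], timeout_s: float) -> bool:
    """Run send_reset in a worker thread and wait at most timeout_s seconds.

    Returns True if the reset completed in time, False if the wait timed out.
    On timeout the reset is NOT cancelled: the worker keeps waiting for the
    server's reply, only the caller stops blocking.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(send_reset)
    try:
        future.result(timeout=timeout_s)  # re-raises any error from send_reset
        return True
    except concurrent.futures.TimeoutError:
        return False
    finally:
        # Do not block here waiting for a possibly hung worker.
        executor.shutdown(wait=False)
```

A caller that gets False still has to decide what to do with the device (keep waiting, report an error, or drop the connection); the timeout only keeps the calling thread responsive.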