qemu-devel

From: Stefan Hajnoczi
Subject: Re: [PATCH v5 17/18] vfio-user: register handlers to facilitate migration
Date: Tue, 1 Feb 2022 09:37:37 +0000

On Tue, Feb 01, 2022 at 03:49:40AM +0000, Jag Raman wrote:
> 
> 
> > On Jan 28, 2022, at 3:29 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > 
> > On Thu, Jan 27, 2022 at 05:04:26PM +0000, Jag Raman wrote:
> >> 
> >> 
> >>> On Jan 25, 2022, at 10:48 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>> 
> >>> On Wed, Jan 19, 2022 at 04:42:06PM -0500, Jagannathan Raman wrote:
> >>>> +     * The client subsequetly asks the remote server for any data that
> >>> 
> >>> subsequently
> >>> 
> >>>> +static void vfu_mig_state_running(vfu_ctx_t *vfu_ctx)
> >>>> +{
> >>>> +    VfuObject *o = vfu_get_private(vfu_ctx);
> >>>> +    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(OBJECT(o));
> >>>> +    static int migrated_devs;
> >>>> +    Error *local_err = NULL;
> >>>> +    int ret;
> >>>> +
> >>>> +    /**
> >>>> +     * TODO: move to VFU_MIGR_STATE_RESUME handler. Presently, the
> >>>> +     * VMSD data from source is not available at RESUME state.
> >>>> +     * Working on a fix for this.
> >>>> +     */
> >>>> +    if (!o->vfu_mig_file) {
> >>>> +        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_load, false);
> >>>> +    }
> >>>> +
> >>>> +    ret = qemu_remote_loadvm(o->vfu_mig_file);
> >>>> +    if (ret) {
> >>>> +        VFU_OBJECT_ERROR(o, "vfu: failed to restore device state");
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    qemu_file_shutdown(o->vfu_mig_file);
> >>>> +    o->vfu_mig_file = NULL;
> >>>> +
> >>>> +    /* VFU_MIGR_STATE_RUNNING begins here */
> >>>> +    if (++migrated_devs == k->nr_devs) {
> >>> 
> >>> When is this counter reset so migration can be tried again if it
> >>> fails/cancels?
> >> 
> >> Detecting cancellation is a pending item. We will address it in the
> >> next rev. Will check with you if we get stuck during the process
> >> of implementing it.
> >> 
> >>> 
> >>>> +static ssize_t vfu_mig_read_data(vfu_ctx_t *vfu_ctx, void *buf,
> >>>> +                                 uint64_t size, uint64_t offset)
> >>>> +{
> >>>> +    VfuObject *o = vfu_get_private(vfu_ctx);
> >>>> +
> >>>> +    if (offset > o->vfu_mig_buf_size) {
> >>>> +        return -1;
> >>>> +    }
> >>>> +
> >>>> +    if ((offset + size) > o->vfu_mig_buf_size) {
> >>>> +        warn_report("vfu: buffer overflow - check pending_bytes");
> >>>> +        size = o->vfu_mig_buf_size - offset;
> >>>> +    }
> >>>> +
> >>>> +    memcpy(buf, (o->vfu_mig_buf + offset), size);
> >>>> +
> >>>> +    o->vfu_mig_buf_pending -= size;
> >>> 
> >>> This assumes that the caller increments offset by size each time. If
> >>> that assumption is okay, then we can just trust offset and don't need to
> >>> do arithmetic on vfu_mig_buf_pending. If that assumption is not correct,
> >>> then the code needs to be extended to safely update vfu_mig_buf_pending
> >>> when offset jumps around arbitrarily between calls.
> >> 
> >> Going by the definition of vfu_migration_callbacks_t in the library, I assumed
> >> that read_data advances the offset by size bytes.
> >>
> >> Will add a comment to explain that.
> >> 
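
For what it's worth, if the offset-always-advances assumption holds, the
callback could stop tracking a separate running count and derive what is
left from offset instead. A rough, untested sketch only, reusing the
patch's fields:

static ssize_t vfu_mig_read_data(vfu_ctx_t *vfu_ctx, void *buf,
                                 uint64_t size, uint64_t offset)
{
    VfuObject *o = vfu_get_private(vfu_ctx);

    if (offset > o->vfu_mig_buf_size) {
        return -1;
    }

    /* Clamp the read to the end of the buffer. */
    if ((offset + size) > o->vfu_mig_buf_size) {
        size = o->vfu_mig_buf_size - offset;
    }

    memcpy(buf, o->vfu_mig_buf + offset, size);

    /* Remaining bytes follow from offset; no separate arithmetic needed. */
    o->vfu_mig_buf_pending = o->vfu_mig_buf_size - (offset + size);

    return size;
}
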
> >>> 
> >>>> +uint64_t vmstate_vmsd_size(PCIDevice *pci_dev)
> >>>> +{
> >>>> +    DeviceClass *dc = DEVICE_GET_CLASS(DEVICE(pci_dev));
> >>>> +    const VMStateField *field = NULL;
> >>>> +    uint64_t size = 0;
> >>>> +
> >>>> +    if (!dc->vmsd) {
> >>>> +        return 0;
> >>>> +    }
> >>>> +
> >>>> +    field = dc->vmsd->fields;
> >>>> +    while (field && field->name) {
> >>>> +        size += vmstate_size(pci_dev, field);
> >>>> +        field++;
> >>>> +    }
> >>>> +
> >>>> +    return size;
> >>>> +}
> >>> 
> >>> This function looks incorrect because it ignores subsections as well as
> >>> runtime behavior during save(). Although VMStateDescription is partially
> >>> declarative, there is still a bunch of imperative code that can write to
> >>> the QEMUFile at save() time so there's no way of knowing the size ahead
> >>> of time.
> >> 
> >> I see your point, it would be a problem for any field which has the
> >> (VMS_BUFFER | VMS_ALLOC) flags set.
> >> 
> >>> 
> >>> I asked this in a previous revision of this series but I'm not sure if
> >>> it was answered: is it really necessary to know the size of the vmstate?
> >>> I thought the VFIO migration interface is designed to support
> >>> streaming reads/writes. We could choose a fixed size like 64KB and
> >>> stream the vmstate in 64KB chunks.
> >> 
> >> The library exposes the migration data to the client as a device BAR whose
> >> size is fixed at boot time, even when using vfu_migration_callbacks_t
> >> callbacks.
> >> 
> >> I don’t believe the library supports streaming vmstate/migration-data - see
> >> the following comment in migration_region_access() defined in the library:
> >> 
> >> * Does this mean that partial reads are not allowed?
> >> 
> >> Thanos or John,
> >> 
> >>    Could you please clarify this?
> >> 
> >> Stefan,
> >>    We attempted to answer the migration cancellation and vmstate size
> >>    questions previously also, in the following email:
> >> 
> >> https://lore.kernel.org/all/F48606B1-15A4-4DD2-9D71-2FCAFC0E671F@oracle.com/
> > 
> >> libvfio-user has the vfu_migration_callbacks_t interface that allows the
> >> device to save/load more data regardless of the size of the migration
> >> region. I don't see the issue here since the region doesn't need to be
> >> sized to fit the savevm data?
> > 
> > The answer didn't make sense to me:
> > 
> > "In both scenarios at the server end - whether using the migration BAR or
> > using callbacks, the migration data is transported to the other end using
> > the BAR. As such we need to specify the BAR’s size during initialization.
> > 
> > In the case of the callbacks, the library translates the BAR access to 
> > callbacks."
> > 
> > The BAR and the migration region within it need a size but my
> > understanding is that VFIO migration is designed to stream the device
> > state, allowing it to be broken up into multiple reads/writes without
> > knowing the device state's size upfront. Here is the description from
> > <linux/vfio.h>:
> > 
> >  * The sequence to be followed while in pre-copy state and stop-and-copy state
> >  * is as follows:
> >  * a. Read pending_bytes, indicating the start of a new iteration to get device
> >  *    data. Repeated read on pending_bytes at this stage should have no side
> >  *    effects.
> >  *    If pending_bytes == 0, the user application should not iterate to get data
> >  *    for that device.
> >  *    If pending_bytes > 0, perform the following steps.
> >  * b. Read data_offset, indicating that the vendor driver should make data
> >  *    available through the data section. The vendor driver should return this
> >  *    read operation only after data is available from (region + data_offset)
> >  *    to (region + data_offset + data_size).
> >  * c. Read data_size, which is the amount of data in bytes available through
> >  *    the migration region.
> >  *    Read on data_offset and data_size should return the offset and size of
> >  *    the current buffer if the user application reads data_offset and
> >  *    data_size more than once here.
> >  * d. Read data_size bytes of data from (region + data_offset) from the
> >  *    migration region.
> >  * e. Process the data.
> >  * f. Read pending_bytes, which indicates that the data from the previous
> >  *    iteration has been read. If pending_bytes > 0, go to step b.
> >  *
> >  * The user application can transition from the _SAVING|_RUNNING
> >  * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the
> >  * number of pending bytes. The user application should iterate in _SAVING
> >  * (stop-and-copy) until pending_bytes is 0.
> > 
> > This means you can report pending_bytes > 0 until the entire vmstate has
> > been read and can pick a fixed chunk size like 64KB for the migration
> > region. There's no need to size the migration region to fit the entire
> > vmstate.
> 
> Thank you for the pointer to generic VFIO migration, Stefan! Makes sense.
> 
> So I understand that the VFIO migration region carves out a section to
> stream/shuttle device data between the app (QEMU client in this case) and the
> driver (QEMU server). This section starts at data_offset within the region
> and spans data_size bytes.
> 
> We could change the server to stream the data as outlined above. Do you have a
> preference for the section size? Does qemu_target_page_size() work? I just
> tested and am able to stream with a fixed BAR size such as
> qemu_target_page_size().

The VFIO migration API requires that data is written in the same chunk
sizes as it was read, so there is no way to merge or split chunks for
performance reasons once they have been read.

4KB may result in lots of chunks and that means more network traffic and
read()/write() calls. I think it's too small.

Something large like 1MB might create issues with responsiveness because
a 1MB chunk hogs the migration stream and read()/write() latency could
stall the event loop.

I'd go for 64KB. Dave and Juan might also have a suggestion for the size.
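
Roughly, on the save side that could look like the following - an untested
sketch only, not the patch itself, assuming the get_pending_bytes and
prepare_data callbacks of vfu_migration_callbacks_t keep their current
signatures, that vfu_mig_buf holds the serialized vmstate with
vfu_mig_buf_pending initialized to vfu_mig_buf_size, and with a made-up
VFU_MIG_CHUNK_SIZE:

#define VFU_MIG_CHUNK_SIZE (64 * 1024)

static uint64_t vfu_mig_get_pending_bytes(vfu_ctx_t *vfu_ctx)
{
    VfuObject *o = vfu_get_private(vfu_ctx);

    /* Stays > 0 until every chunk has been read (steps a/f above). */
    return o->vfu_mig_buf_pending;
}

static int vfu_mig_prepare_data(vfu_ctx_t *vfu_ctx, uint64_t *offset,
                                uint64_t *size)
{
    VfuObject *o = vfu_get_private(vfu_ctx);

    /* The next chunk starts where the previous iteration left off. */
    *offset = o->vfu_mig_buf_size - o->vfu_mig_buf_pending;

    /* size may be NULL on the resume (write) path. */
    if (size) {
        *size = MIN(o->vfu_mig_buf_pending, VFU_MIG_CHUNK_SIZE);
    }

    return 0;
}

read_data would then keep copying from vfu_mig_buf within that window and
dropping vfu_mig_buf_pending by the amount read, much like it already does,
so at most 64KB is exposed per iteration.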

Stefan
