From: Tarun Gupta (SW-GPU)
Subject: Re: [PATCH v2 1/1] docs/devel: Add VFIO device migration documentation
Date: Tue, 16 Mar 2021 19:04:52 +0530
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.7.1


On 3/12/2021 8:43 AM, Tian, Kevin wrote:

From: Tarun Gupta <targupta@nvidia.com>
Sent: Thursday, March 11, 2021 3:20 AM

Document interfaces used for VFIO device migration. Added flow of state changes
during live migration with VFIO device. Tested by building docs with the new
vfio-migration.rst file.

v2:
- Included the new vfio-migration.rst file in index.rst
- Updated dirty page tracking section, also added details about
   'pre-copy-dirty-page-tracking' opt-out option.
- Incorporated comments around wording of doc.

Signed-off-by: Tarun Gupta <targupta@nvidia.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
  MAINTAINERS                   |   1 +
  docs/devel/index.rst          |   1 +
  docs/devel/vfio-migration.rst | 135 ++++++++++++++++++++++++++++++++++
  3 files changed, 137 insertions(+)
  create mode 100644 docs/devel/vfio-migration.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index 738786146d..a2a80eee59 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1801,6 +1801,7 @@ M: Alex Williamson <alex.williamson@redhat.com>
  S: Supported
  F: hw/vfio/*
  F: include/hw/vfio/
+F: docs/devel/vfio-migration.rst

  vfio-ccw
  M: Cornelia Huck <cohuck@redhat.com>
diff --git a/docs/devel/index.rst b/docs/devel/index.rst
index ae664da00c..5330f1ca1d 100644
--- a/docs/devel/index.rst
+++ b/docs/devel/index.rst
@@ -39,3 +39,4 @@ Contents:
     qom
     block-coroutine-wrapper
     multi-process
+   vfio-migration
diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
new file mode 100644
index 0000000000..6196fb132c
--- /dev/null
+++ b/docs/devel/vfio-migration.rst
@@ -0,0 +1,135 @@
+=====================
+VFIO device Migration
+=====================
+
+VFIO devices use an iterative approach for migration because certain VFIO
+devices (e.g. GPU) have a large amount of data to be transferred. The iterative
+pre-copy phase of migration allows for the guest to continue whilst the VFIO
+device state is transferred to the destination; this helps to reduce the total
+downtime of the VM. VFIO devices can choose to skip the pre-copy phase of
+migration by returning pending_bytes as zero during the pre-copy phase.
+
+A detailed description of the UAPI for VFIO device migration can be found in
+the comment for the ``vfio_device_migration_info`` structure in the header
+file linux-headers/linux/vfio.h.
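For orientation, the UAPI structure referenced above looks roughly like the
sketch below. This is a paraphrase of the v1 migration interface, not a copy;
linux-headers/linux/vfio.h remains the authoritative definition of the layout
and of the device_state flag bits::

    struct vfio_device_migration_info {
        __u32 device_state;   /* flag bits: _RUNNING, _SAVING, _RESUMING */
        __u32 reserved;
        __u64 pending_bytes;  /* data the vendor driver still has to save */
        __u64 data_offset;    /* offset of device data in the migration region */
        __u64 data_size;      /* size of the current chunk of device data */
    };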
+
+VFIO device hooks for iterative approach:
+
+* A ``save_setup`` function that sets up the migration region, sets _SAVING
+  flag in the VFIO device state and informs the VFIO IOMMU module to start
+  dirty page tracking.
+
+* A ``load_setup`` function that sets up the migration region on the
+  destination and sets _RESUMING flag in the VFIO device state.
+
+* A ``save_live_pending`` function that reads pending_bytes from the vendor
+  driver, which indicates the amount of data that the vendor driver has yet to
+  save for the VFIO device.
+
+* A ``save_live_iterate`` function that reads the VFIO device's data from the
+  vendor driver through the migration region during iterative phase.
+
+* A ``save_live_complete_precopy`` function that resets _RUNNING flag from the
+  VFIO device state, saves the device config space, if any, and iteratively

and if any,

I didn't get this. I intended to say that it will save the device config space
only if it is present. So, I used "saves device config space, if any".


+  copies the remaining data for the VFIO device until the vendor driver
+  indicates that no data remains (pending_bytes is zero).
+
+* A ``load_state`` function that loads the config section and the data
+  sections that are generated by the save functions above.
+
+* ``cleanup`` functions for both save and load that perform any migration
+  related cleanup, including unmapping the migration region (a sketch of how
+  these hooks are wired up follows the list).
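As a rough illustration of how the hooks above are wired together, a minimal
sketch follows. The handler and variable names (vfio_save_setup, vbasedev, ...)
are stand-ins loosely modelled on hw/vfio/migration.c rather than quoted from
it::

    /* Sketch only: mapping the hooks above onto QEMU's SaveVMHandlers. */
    static SaveVMHandlers savevm_vfio_handlers = {
        .save_setup                 = vfio_save_setup,   /* set _SAVING, start dirty tracking */
        .save_cleanup               = vfio_save_cleanup,
        .save_live_pending          = vfio_save_pending, /* report pending_bytes */
        .save_live_iterate          = vfio_save_iterate, /* pre-copy data transfer */
        .save_live_complete_precopy = vfio_save_complete_precopy, /* clear _RUNNING, final copy */
        .load_setup                 = vfio_load_setup,   /* set _RESUMING */
        .load_cleanup               = vfio_load_cleanup,
        .load_state                 = vfio_load_state,   /* consume config and data sections */
    };

    /* Registered once per VFIO device when its migration support is set up: */
    register_savevm_live("vfio", VMSTATE_INSTANCE_ID_ANY, 1,
                         &savevm_vfio_handlers, vbasedev);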
+
+A VM state change handler is registered to change the VFIO device state when
+the VM state changes.
+
+Similarly, a migration state change notifier is registered to get a
+notification on migration state change. These states are translated to the
+VFIO device state and conveyed to the vendor driver.
+
+System memory dirty pages tracking
+----------------------------------
+
+A ``log_sync`` memory listener callback marks as dirty those system memory
+pages that are used for DMA by the VFIO device. The dirty pages bitmap is
+queried per container. All pages pinned by the vendor driver through the
+vfio_pin_pages() external API have to be marked as dirty during migration.

why mention kernel internal functions in a userspace doc?

I'll remove the mention of vfio_pin_pages() and just mention "external API".
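For context on where ``log_sync`` plugs in, here is a minimal, hedged sketch of
the listener registration; the names mirror hw/vfio/common.c but the body is
reduced to a comment::

    static void vfio_listener_log_sync(MemoryListener *listener,
                                       MemoryRegionSection *section)
    {
        /* Query the per-container dirty bitmap from the VFIO IOMMU module for
         * the IOVA range covered by this section and mark the corresponding
         * guest pages dirty in QEMU's migration bitmap. */
    }

    static const MemoryListener vfio_memory_listener = {
        .log_sync = vfio_listener_log_sync,
        /* .region_add, .region_del, etc. omitted here */
    };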


+When there are CPU writes, CPU dirty page tracking can identify dirtied pages,
+but any page pinned by the vendor driver can also be written by the device.
+There is currently no device which has hardware support for dirty page
+tracking.

no device or IOMMU support

Right, will update it.


+So all pages which are pinned by the vendor driver are considered as dirty.

Similarly, why do we care about how the kernel identifies whether a page is
dirty? It could be dirtied due to pinning, due to the IOMMU dirty bit, or due
to an IOMMU page fault. Here we'd better just focus on the user-tangible
effect, e.g. a large/non-converging dirty map might be returned and how to
handle such a situation...

Since the VFIO migration feature is not implemented only in userspace but also
involves the kernel, I have documented here what is implemented as of now.


+
+By default, dirty pages are tracked when the device is in the pre-copy as well
+as the stop-and-copy phase. So, a page pinned by the vendor driver using
+vfio_pin_pages() will be copied to the destination in both phases. Copying
+dirty pages in pre-copy phase helps QEMU to predict if it can achieve its
+downtime tolerances.

worthy of some elaboration on the last sentence.

How about below?
"Copying dirty pages in pre-copy phase helps QEMU to predict if it can achieve its downtime tolerances. If QEMU during pre-copy phase keeps finding dirty pages continuously,then it understands that even in stop-and-copy phase, it is likely to find dirty pages and can predict the downtime accordingly."


+
+QEMU also provides a per device opt-out option ``pre-copy-dirty-page-tracking``
+to disable dirty page tracking during pre-copy phase. If it is set to off, all

IIUC dirty page tracking is always enabled in vfio_save_setup. What this option
does is to skip syncing the dirty bitmap in vfio_listener_log_sync.

I'll update it as below?

"QEMU also provides a per device opt-out option ``pre-copy-dirty-page-tracking`` which disables querying dirty bitmap during pre-copy phase. If it is set to off, all dirty pages will be copied to destination in stop-and-copy phase only."


+pinned pages will be copied to destination in stop-and-copy phase only.
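It might also help readers to see how the opt-out is spelled on the command
line. The example below is an assumption based on the text: the actual property
may carry an experimental ``x-`` prefix, and the host BDF is a placeholder::

    -device vfio-pci,host=0000:65:00.0,pre-copy-dirty-page-tracking=off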
+
+System memory dirty pages tracking when vIOMMU is enabled
+---------------------------------------------------------
+
+With vIOMMU, an IO virtual address range can get unmapped while in pre-copy
+phase of migration. In that case, the unmap ioctl returns any pinned pages in
+that range and QEMU reports corresponding guest physical pages dirty.

pinned pages -> dirty pages

Currently, all pinned pages are dirty pages.
But, agreed that dirty pages might be more accurate here, will update it.

Thanks,
Tarun


+During stop-and-copy phase, an IOMMU notifier is used to get a callback for
+mapped pages and then the dirty pages bitmap is fetched from the VFIO IOMMU
+module for those mapped ranges.
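A hedged sketch of the unmap-with-bitmap step described above, loosely modelled
on the container code in hw/vfio/common.c; the variables (container_fd, iova,
size) are placeholders and error handling is omitted::

    struct vfio_iommu_type1_dma_unmap *unmap;
    struct vfio_bitmap *bitmap;

    unmap = g_malloc0(sizeof(*unmap) + sizeof(*bitmap));
    unmap->argsz = sizeof(*unmap) + sizeof(*bitmap);
    unmap->flags = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP;
    unmap->iova  = iova;                  /* IOVA range being unmapped */
    unmap->size  = size;

    bitmap = (struct vfio_bitmap *)&unmap->data;
    bitmap->pgsize = qemu_real_host_page_size;
    bitmap->size   = ROUND_UP(size / bitmap->pgsize, sizeof(__u64) * 8) / 8;
    bitmap->data   = g_malloc0(bitmap->size);   /* filled in by the kernel */

    if (ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, unmap) == 0) {
        /* walk bitmap->data and mark the corresponding guest pages dirty */
    }

    g_free(bitmap->data);
    g_free(unmap);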
+
+Flow of state changes during Live migration
+===========================================
+
+Below is the flow of state changes during live migration.
+The values in the brackets represent the VM state, the migration state, and
+the VFIO device state, respectively.
+
+Live migration save path
+------------------------
+
+::
+
+                        QEMU normal running state
+                        (RUNNING, _NONE, _RUNNING)
+                                  |
+                     migrate_init spawns migration_thread
+                Migration thread then calls each device's .save_setup()
+                    (RUNNING, _SETUP, _RUNNING|_SAVING)
+                                  |
+                    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
+             If device is active, get pending_bytes by .save_live_pending()
+          If total pending_bytes >= threshold_size, call .save_live_iterate()
+                  Data of VFIO device for pre-copy phase is copied
+        Iterate till total pending bytes converge and are less than threshold
+                                  |
+  On migration completion, vCPU stops and calls .save_live_complete_precopy for
+   each active device. The VFIO device is then transitioned into _SAVING state
+                   (FINISH_MIGRATE, _DEVICE, _SAVING)
+                                  |
+     For the VFIO device, iterate in .save_live_complete_precopy until
+                         pending data is 0
+                   (FINISH_MIGRATE, _DEVICE, _STOPPED)
+                                  |
+                 (FINISH_MIGRATE, _COMPLETED, _STOPPED)
+             Migration thread schedules cleanup bottom half and exits
+
+Live migration resume path
+--------------------------
+
+::
+
+              Incoming migration calls .load_setup for each device
+                       (RESTORE_VM, _ACTIVE, _STOPPED)
+                                 |
+       For each device, .load_state is called for that device section data
+                       (RESTORE_VM, _ACTIVE, _RESUMING)
+                                 |
+    At the end, .load_cleanup is called for each device and vCPUs are started
+                       (RUNNING, _NONE, _RUNNING)
+
+Postcopy
+========
+
+Postcopy migration is not supported for VFIO devices.
--
2.27.0

Thanks
Kevin



