Re: [PATCH RFC 2/2] migration: abort on destination if switchover limit exceeded

From: Peter Xu
Subject: Re: [PATCH RFC 2/2] migration: abort on destination if switchover limit exceeded
Date: Wed, 21 Aug 2024 15:45:26 -0400
On Fri, Jul 26, 2024 at 12:41:57AM -0700, Elena Ufimtseva wrote:
> On Wed, Jun 26, 2024 at 02:41:34PM -0400, Peter Xu wrote:
> > On Wed, Jun 26, 2024 at 12:04:43PM +0100, Joao Martins wrote:
> > > Are you thinking of something specific?
> >
> > Not really.  I don't think I have any idea on how to make it better,
> > unfortunately, but we did some measurements quite some time ago too, and I
> > can share some below.
>
>
> Hello Peter
>
> I apologize for such a long delay with the reply.
I apologize back.. :)
>
> >
> > >
> > > Many "variables" affect this from the point we decide switchover, and in
> > > the worst (likely) case it means having qemu subsystems declare empirical
> > > values for how long it takes to suspend/resume/transfer state, to feed the
> > > migration's expected-downtime prediction equation.  Part of the reason for
> > > having headroom within downtime-limit was that it is a simple 'catch-all'
> > > (from our PoV) in terms of maintainability, while giving the user
> > > something to fall back on for characterizing its SLA.
> >
> > Yes, I think this might be a way to go, by starting with something that can
> > catch all.
>
>
> Possibly the title "strict SLA" is not the best choice of
> words, as it creates the impression that the guarantees will be met.
> But essentially this switchover limit is a safeguard against the unknowns
> that can contribute to the downtime during the stop-copy and that are not
> easy to account for (or even impossible, due to hardware
> implementation or other issues).
>
> To show what kind of statistics we see in our environments and what
> the main contributors are, please see below.
>
> Example 1: host migration, default downtime-limit of 300 ms:
>
> Checkpoints analysis:
>
> checkpoint=src-downtime-start -> checkpoint=src-vm-stopped: 74244 (us)
> checkpoint=src-vm-stopped -> checkpoint=src-iterable-saved: 154493 (us)
> checkpoint=src-iterable-saved -> checkpoint=src-non-iterable-saved: 4746 (us)
> checkpoint=src-non-iterable-saved -> checkpoint=dst-precopy-loadvm-completed: 224981 (us)
This is the first cross-host time diff; I assume the two hosts are
synchronized and the clock skew is pretty low.

I guess a part like this is not accounted for in the switchover decision by
the src QEMU, with the hope that after the src QEMU has sent everything at
the estimated bandwidth, the dest QEMU should have (mostly) loaded
everything.

It might be interesting to know when this can grow much larger than we
expect.  More below..
> checkpoint=dst-precopy-loadvm-completed -> checkpoint=dst-precopy-bh-enter: 36 (us)
> checkpoint=dst-precopy-bh-enter -> checkpoint=dst-precopy-bh-announced: 7859 (us)
> checkpoint=dst-precopy-bh-announced -> checkpoint=dst-precopy-bh-vm-started: 15995 (us)
> checkpoint=dst-precopy-bh-vm-started -> checkpoint=src-downtime-end: 236 (us)
>
> Iterable device analysis:
>
> Device SAVE of ram: 0 took 151054 (us)
> Device LOAD of ram: 0 took 146855 (us)
> Device SAVE of 0000:20:04.0:00.0:00.0/vfio: 0 took 2127 (us)
> Device LOAD of 0000:20:04.0:00.0:00.0/vfio: 0 took 144202 (us)
>
> Non-iterable device analysis:
>
> Device LOAD of 0000:20:04.0:00.0:00.0/vfio: 0 took 67470 (us)
> Device LOAD of 0000:00:01.0/vga: 0 took 7527 (us)
> Device LOAD of 0000:00:02.0/e1000e: 0 took 1715 (us)
> Device LOAD of kvm-tpr-opt: 0 took 1697 (us)
> Device LOAD of 0000:00:03.0/virtio-blk: 0 took 1340 (us)
> Device SAVE of 0000:00:02.0/e1000e: 0 took 1036 (us)
> Device LOAD of 0000:00:00.0/mch: 0 took 1035 (us)
> Device LOAD of 0000:20:04.0:00.0/pcie-root-port: 0 took 976 (us)
> Device LOAD of 0000:00:1f.0/ICH9LPC: 0 took 851 (us)
> Device LOAD of 0000:00:1f.2/ich9_ahci: 0 took 578 (us)
>
> (qemu) info migrate
> globals:
> store-global-state: on
> only-migratable: off
> send-configuration: on
> send-section-footer: on
> decompress-error-check: on
> clear-bitmap-shift: 18
> Migration status: completed
> total time: 5927 ms
> downtime: 483 ms
> setup: 78 ms
> transferred ram: 883709 kbytes
> throughput: 1237.71 mbps
> remaining ram: 0 kbytes
> total ram: 33571656 kbytes
> duplicate: 8192488 pages
> skipped: 0 pages
> normal: 201300 pages
> normal bytes: 805200 kbytes
> dirty sync count: 3
> page size: 4 kbytes
> multifd bytes: 0 kbytes
> pages-per-second: 958776
> precopy ram: 480464 kbytes
> downtime ram: 398313 kbytes
> vfio device transferred: 4496 kbytes
>
> Example 2: different system than above, live migration over a 100Gbit/s
> connection with 2 vfio virtual functions (the guest has no workload, and the
> vfio devices are not engaged in the VM and have the same amount of data to
> live migrate).
>
> Displayed are the outliers that are larger than 3 us.
>
> Save:
> 252812@1721976657.700972:vmstate_downtime_checkpoint src-downtime-start
> 252812@1721976657.829180:vmstate_downtime_checkpoint src-vm-stopped
> 252812@1721976657.967987:vmstate_downtime_save type=iterable idstr=ram instance_id=0 downtime=138005
> 252812@1721976658.093218:vmstate_downtime_save type=iterable idstr=0000:00:02.0/vfio instance_id=0 downtime=125188
> 252812@1721976658.318101:vmstate_downtime_save type=iterable idstr=0000:00:03.0/vfio instance_id=0 downtime=224857
> 252812@1721976658.318125:vmstate_downtime_checkpoint src-iterable-saved
> ...
>
> Load:
> 353062@1721976488.995582:vmstate_downtime_load type=iterable idstr=ram instance_id=0 downtime=117294
> 353062@1721976489.235227:vmstate_downtime_load type=iterable idstr=0000:00:02.0/vfio instance_id=0 downtime=239586
> 353062@1721976489.449736:vmstate_downtime_load type=iterable idstr=0000:00:03.0/vfio instance_id=0 downtime=214462
> 353062@1721976489.463260:vmstate_downtime_load type=non-iterable idstr=0000:00:01.0/vga instance_id=0 downtime=7522
> 353062@1721976489.575383:vmstate_downtime_load type=non-iterable idstr=0000:00:02.0/vfio instance_id=0 downtime=112113
> 353062@1721976489.686961:vmstate_downtime_load type=non-iterable idstr=0000:00:03.0/vfio instance_id=0 downtime=111545
> ...
>
> (qemu) info migrate
> globals:
> store-global-state: on
> only-migratable: off
> send-configuration: on
> send-section-footer: on
> decompress-error-check: on
> clear-bitmap-shift: 18
> Migration status: completed
> total time: 23510 ms
> downtime: 1018 ms
> setup: 380 ms
> transferred ram: 5317587 kbytes
> throughput: 1883.34 mbps
> remaining ram: 0 kbytes
> total ram: 209732424 kbytes
> duplicate: 51628634 pages
> skipped: 0 pages
> normal: 1159697 pages
> normal bytes: 4638788 kbytes
> dirty sync count: 4
> page size: 4 kbytes
> multifd bytes: 4653988 kbytes
> pages-per-second: 1150272
> precopy ram: 453652 kbytes
> downtime ram: 118 kbytes
> vfio device transferred: 209431 kbytes
>
> As can be seen from the above, the downtime limit gets violated and the main
> contributors are the vfio devices.  It can also vary depending on the
> firmware version.
So have you figured out how VFIO affected this?  I thought VFIO will also
report pending information just like RAM, in ->state_pending_exact().

However I don't think such reports involve things like read()/write() over
the VFIO fds.  I assume that can be a major issue that won't be captured by
any form of pending-data report.  It might be useful to first identify /
verify this issue, then think about how to improve the estimation of
downtime, e.g. by reporting the "extra time" (in this case, the time to do
read() / write() from the device driver) needed for the final dump stage
for either iterative or non-iterative data.
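
To illustrate the kind of thing I mean, a hypothetical hook could look
roughly like below.  This is only a sketch: the switchover_time_hint()
callback and the helper are made up for the example and don't exist in the
tree; state_pending_exact() is the only real piece, and it's simplified here.

#include <stdint.h>

/*
 * Hypothetical extension next to the existing pending-bytes callbacks:
 * a device whose final save()/load() spends a fixed cost in fd
 * read()/write() could report that cost, so the switchover decision can
 * account for it on top of pending_bytes / bandwidth.
 */
typedef struct SwitchoverHints {
    /* Existing idea: how many bytes remain to be transferred. */
    void (*state_pending_exact)(void *opaque, uint64_t *must_precopy,
                                uint64_t *can_postcopy);
    /*
     * New, hypothetical: microseconds this device expects to spend in
     * its final dump stage regardless of how many bytes are pending.
     */
    uint64_t (*switchover_time_hint)(void *opaque);
} SwitchoverHints;

/* Sketch of how the core could fold the hints into its estimate. */
static uint64_t estimated_switchover_us(uint64_t pending_bytes,
                                        uint64_t bw_bytes_per_us,
                                        uint64_t sum_of_hints_us)
{
    uint64_t transfer_us = bw_bytes_per_us ? pending_bytes / bw_bytes_per_us : 0;
    return transfer_us + sum_of_hints_us;
}

The switchover decision would then compare estimated_switchover_us() against
downtime_limit instead of only the pending-bytes / bandwidth ratio.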
> I have to note that in one out of 10 tests, the ram downtime is much
> larger and becomes the outlier.  This is currently being investigated.
>
> This switchover overshoot is usually reported as timed-out queries.
> And to comment on precision: yes, this overshoot safeguard is not precise.
> In fact, the current implementation has the granularity of the savevm
> handlers, as it only checks for a downtime overshoot after each handler has
> completed.  Maybe this part can be improved by delegating to some of the
> known abusers the job of observing the downtime overshoot on their own.
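
Right, and if I read the granularity point correctly, it is roughly the
following (a sketch only, not the actual RFC code): the destination can only
compare elapsed time against the limit between section loads, so a single
long-running handler is only caught after it has already finished.

#include <stdint.h>

typedef struct LoadedSection {
    const char *idstr;
    int (*load)(void *opaque);   /* may block in read() for a long time */
    void *opaque;
} LoadedSection;

static int load_sections_with_limit(LoadedSection *secs, int n,
                                    uint64_t start_us, uint64_t limit_us,
                                    uint64_t (*now_us)(void))
{
    for (int i = 0; i < n; i++) {
        int ret = secs[i].load(secs[i].opaque);
        if (ret < 0) {
            return ret;
        }
        /* The check only happens here, after the handler has completed. */
        if (now_us() - start_us > limit_us) {
            return -1;   /* abort: switchover limit exceeded */
        }
    }
    return 0;
}

Pushing the check down into the "known abusers", as you suggest, would be the
only way to catch an overshoot while a single handler is still running.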
>
> >
> > > Personally, I think there's a tiny bit of a disconnect between what the
> > > user desires when setting downtime-limit vs what it really does.
> > > downtime-limit right now looks to be best viewed as
> > > 'precopy-ram-downtime-limit' :)
> >
> > That's fair to say indeed.. QEMU can try to do better on this, it's just
> > not yet straightforward to know how.
>
> Could the better-known part of the downtime (predictable, given that the
> bandwidth estimate is accurate), i.e. what downtime-limit currently covers,
> serve as a starting point and be named ram-downtime-limit, with everything
> else getting a switchover allowance of downtime_limit - ram-downtime-limit?
> The correct value of ram-downtime-limit would take a few iterations after
> the dirty sync to get established.
>
> I think that is somewhat similar to what Joao is thinking below in 1) and 2)?
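
If I read the proposal right, the split would be roughly the following (a
sketch with made-up names; nothing like ram-downtime-limit exists today):

#include <stdbool.h>
#include <stdint.h>

struct DowntimeBudget {
    uint64_t downtime_limit_ms;      /* what the user asked for */
    uint64_t ram_downtime_limit_ms;  /* learned over a few dirty syncs */
};

static bool ram_can_switchover(const struct DowntimeBudget *b,
                               uint64_t pending_bytes,
                               uint64_t bw_bytes_per_ms)
{
    /* Allowance reserved for the non-RAM part of the switchover. */
    uint64_t allowance_ms = b->downtime_limit_ms - b->ram_downtime_limit_ms;
    (void)allowance_ms;  /* would be enforced against device states elsewhere */

    /* RAM convergence is checked against the RAM share of the budget only. */
    return pending_bytes <= bw_bytes_per_ms * b->ram_downtime_limit_ms;
}

That could work, though the allowance would still need to absorb the unknowns
you listed above (e.g. firmware-dependent VFIO costs).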
I'm personally curious whether you can nail the current issues down to
VFIO, and if so, whether there is something we can do already to improve
downtime estimation when VFIO is involved, irrespective of the rest.

For a generic idea of improving downtime, I think as long as we keep the
downtime_limit parameter untouched (reflecting the total & max downtime),
then we can add some new ideas on top.
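
One of the ideas mentioned again below is what OpenStack already does at the
management layer: start the migration with a small max-downtime value and
keep raising it while the job fails to converge, up to the user's real limit.
A rough libvirt-based sketch (untested; the starting value and doubling
policy here are arbitrary):

#include <stdio.h>
#include <unistd.h>
#include <libvirt/libvirt.h>

static void ramp_max_downtime(virDomainPtr dom,
                              unsigned long long user_limit_ms)
{
    unsigned long long cur = 50;             /* start low, e.g. 50 ms */

    while (cur < user_limit_ms) {
        sleep(5);                            /* let it try to converge */

        virDomainJobInfo info;
        if (virDomainGetJobInfo(dom, &info) < 0 ||
            info.type != VIR_DOMAIN_JOB_UNBOUNDED) {
            return;                          /* migration finished or failed */
        }

        cur = cur * 2 > user_limit_ms ? user_limit_ms : cur * 2;
        virDomainMigrateSetMaxDowntime(dom, cur, 0);
        printf("raised max downtime to %llu ms\n", cur);
    }
}

This would run alongside the actual migration call (e.g. on a separate
thread), which is exactly why it lives outside QEMU today; the question is
whether QEMU should grow something equivalent internally.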
Thanks,
>
> Thanks!
>
> Elena
> >
> > > Unless the accuracy work you're thinking of is just having a better
> > > migration algorithm for obtaining the best possible downtime for
> > > outstanding-data/RAM *even if* downtime-limit is set at a high limit,
> > > like giving 1) a grace period in the beginning of migration post first
> > > dirty sync
> >
> > Can you elaborate on this one a bit?
>
> >
> > > or 2) a measured value with continually incrementing target downtime
> > > limit until max downtime-limit set by user hits ... before defaulting to
> > > the current behaviour of migrating as soon as expected downtime is within
> > > the downtime-limit. As discussed in the last response, this could create
> > > the 'downtime headroom' for getting the enforcement/SLA better
> > > honored. Is this maybe your line of thinking?
> >
>
> > Not what I was referring to, but I think such logic has existed for years, it
> > was just not implemented in QEMU.  I know at least OpenStack implemented
> > exactly that: instead of QEMU keeping an internal, smaller downtime_limit
> > and increasing it, OpenStack will keep adjusting the downtime_limit
> > parameter from time to time, starting with a relatively low value.
> >
> > That is also what I would suggest to most people who care about downtime,
> > because QEMU treats it pretty simply: if QEMU thinks it can switch over
> > within the downtime specified, QEMU will just do it, even if it's not the
> > best it can do.
> >
> > Do you think such an idea should instead be implemented in QEMU, too?  Note
> > that this will also not be about "making downtime accurate" but about
> > "reducing downtime"; it may depend on how we define downtime_limit in this
> > context, where in OpenStack's case it simply won't directly feed that
> > parameter with the real max downtime the user allows.
> >
> > Since that wasn't my original purpose, what I meant is simply to see ways to
> > make downtime_limit accurate, by analyzing the current downtimes (as
> > you mentioned, using the downtime tracepoints; and I'd say kudos to you for
> > suggesting that in a formal patch).
> >
> > Here's something we collected by our QE team, for example, on a pretty
> > loaded system of 384 cores + 12TB:
> >
> > Checkpoints analysis:
> >
> > downtime-start -> vm-stopped: 267635.2 (us)
> > vm-stopped -> iterable-saved: 3558506.2 (us)
> > iterable-saved -> non-iterable-saved: 270352.2 (us)
> > non-iterable-saved -> downtime-end: 144264.2 (us)
> > total downtime: 4240758.0 (us)
> >
> > Iterable device analysis:
> >
> > Device SAVE of ram: 0 took 3470420 (us)
> >
> > Non-iterable device analysis:
> >
> > Device SAVE of cpu:121 took 118090 (us)
> > Device SAVE of apic:167 took 6899 (us)
> > Device SAVE of cpu:296 took 3795 (us)
> > Device SAVE of 0000:00:02.2:00.0/virtio-blk: 0 took 638 (us)
> > Device SAVE of cpu:213 took 630 (us)
> > Device SAVE of 0000:00:02.0:00.0/virtio-net: 0 took 534 (us)
> > Device SAVE of cpu:374 took 517 (us)
> > Device SAVE of cpu: 31 took 503 (us)
> > Device SAVE of cpu:346 took 497 (us)
> > Device SAVE of 0000:00:02.0:00.1/virtio-net: 0 took 492 (us)
> > (1341 vmsd omitted)
> >
> > In this case we also see the SLA violations since we specified something
> > much lower than 4.2sec as downtime_limit.
> >
> > This might not be a good example, as I think when capturing the traces we
> > still had the issue of inaccurate bandwidth estimation, which is why I
> > introduced the switchover-bandwidth parameter.  I hoped the result would be
> > closer to downtime_limit after that, but we never tried to test it again.
> > I am not sure either whether that's the best way to address this.
> >
> > But let's just ignore the huge iterable save() delays (which can be
> > explained, and hopefully will still be covered by the downtime_limit
> > calculation when it can try to get closer to right), and we can also see
> > at least a few things we didn't account for:
> >
> > - stop vm: 268ms
> > - non-iterables: 270ms
> > - dest load until complete: 144ms
> >
> > For the last one, we did see another outlier that can only be seen from the
> > dest:
> >
> > Non-iterable device analysis:
> >
> > Device LOAD of kvm-tpr-opt: 0 took 123976 (us) <----- this one
> > Device LOAD of 0000:00:02.0/pcie-root-port: 0 took 6362 (us)
> > Device LOAD of 0000:00:02.0:00.0/virtio-net: 0 took 4583 (us)
> > Device LOAD of 0000:00:02.0:00.1/virtio-net: 0 took 4440 (us)
> > Device LOAD of 0000:00:01.0/vga: 0 took 3740 (us)
> > Device LOAD of 0000:00:00.0/mch: 0 took 3557 (us)
> > Device LOAD of 0000:00:02.2:00.0/virtio-blk: 0 took 3530 (us)
> > Device LOAD of 0000:00:02.1:00.0/xhci: 0 took 2712 (us)
> > Device LOAD of 0000:00:02.1/pcie-root-port: 0 took 2046 (us)
> > Device LOAD of 0000:00:02.2/pcie-root-port: 0 took 1890 (us)
> >
> > So we found either cpu save() taking 100+ms, or kvm-tpr-opt load() taking
> > 100+ms. None of them sounds normal, and I didn't look into them.
> >
>
>
>
> > Now, with a global ratio that reflects "how much of downtime_limit we
> > should account for data transfer", we'll also need to answer how the user
> > should set that ratio value, and maybe there's a sane way to calculate it
> > from the VM setup?  I'm not sure, but those questions may need to be
> > answered together in the next post, so that such a parameter can be
> > consumable.
>
> >
> > The answer doesn't need to be accurate, but I hope it can be based on
> > some analysis similar to the above (which I didn't do well, as I don't
> > think I looked into any of the issues, and maybe they're fixable), just
> > to show what I meant.  It'll also be great if, while doing the analysis,
> > we find issues that are fixable; then it'll be great to fix those issues
> > instead.  That's the part where I mentioned "I still prefer fixing
> > downtime_limit itself".
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >
>
--
Peter Xu