[Qemu-devel] Live migration debugging
From: Paul Boven
Subject: [Qemu-devel] Live migration debugging
Date: Tue, 29 Jul 2014 13:31:46 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.0
Hi folks,
Recently there have been several patches to fix kvmclock issues during
migrations, which were subsequently reverted. I hope the observations
below can be helpful in pinning down the actual issues, so that live
migration works again in the future.
Live migration has been broken since at least release 1.4.0 (as shipped
with Ubuntu 13.04), and still has the same problems in 2.1.0-rc2, but
briefly worked in 2.0-git-20140609.
The problem is that once the live migration is complete and the guest
gets started on the destination server, it hangs, consuming 100% CPU.
The hang can last mere seconds, but I have also observed hangs as long
as 11 minutes. Then the guest suddenly starts to respond again as if
nothing had happened, but its clock has not progressed at all while the
machine was hanging.
What I have observed is that the time spent hanging corresponds exactly
to the accumulated drift between the host's clock and the 'real' (NTP)
time. If you multiply the time since the previous migration by the PPM
offset as determined by NTP (see /var/lib/ntp/ntp.drift), you get
exactly the number of seconds the guest will spend at 100% CPU before
becoming responsive again. I have observed this on two different pairs
of KVM servers. Each of the servers has a negative PPM value according
to NTP.
Example: a guest with nearly 9 days of uptime and (according to NTP) a
clock rate of -34 ppm froze for 27 seconds when I migrated it. I have
done quite a few test migrations, and this relationship holds quite
precisely.
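To make the relationship explicit, here is a minimal sketch (just an
illustration, not part of any QEMU or NTP tooling; the function name is
made up) that computes the expected freeze duration from the time since
the previous migration and the drift value in /var/lib/ntp/ntp.drift:

#!/usr/bin/env python3
# Sketch: expected freeze = (seconds since previous migration) * |drift in ppm| * 1e-6
# The drift file path comes from the observation above; everything else is illustrative.

def expected_freeze_seconds(seconds_since_migration, drift_ppm):
    # Accumulated offset between the host clock and NTP time, in seconds.
    return seconds_since_migration * abs(drift_ppm) * 1e-6

with open('/var/lib/ntp/ntp.drift') as f:
    drift_ppm = float(f.read().strip())

# Worked example from above: ~9 days since the previous migration at -34 ppm
# gives roughly 9 * 86400 * 34e-6 = 26.4 seconds, matching the observed 27 s freeze.
print(expected_freeze_seconds(9 * 86400, drift_ppm))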
As the duration of the freeze is proportional to the time since the
previous migration, debugging is a bit difficult: you have to wait a
while before you can demonstrate the problem. This is probably also why
the problem is underreported: it is barely noticeable if you migrate
right after starting the VM, but it looks like a complete crash after a
few months of uptime.
With the 2.0 sources from 2014-06-09, the problem does *not* occur. A
side-effect of that version is that the guest clock has a lot of jitter
until the first migration, but it behaves normally (and without hangs)
on subsequent migrations.
Is there a way for me to directly read the kvmclock from the guest or
the host, so we can compare them before and after migration and see
precisely what goes wrong?
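In the meantime, the closest I can get from inside the guest is watching
the POSIX clocks across a migration. A minimal sketch (this only observes
CLOCK_REALTIME/CLOCK_MONOTONIC, not kvmclock itself):

#!/usr/bin/env python3
# Sketch: log the guest's monotonic and wall clocks once per second.
# During the hang the guest stops running this loop; once it resumes, comparing
# the logged CLOCK_REALTIME values against an outside reference (e.g. NTP) shows
# by how much the guest clock fell behind. This does not read kvmclock directly.
import time

while True:
    mono = time.clock_gettime(time.CLOCK_MONOTONIC)
    real = time.clock_gettime(time.CLOCK_REALTIME)
    print("mono=%.3f real=%.3f" % (mono, real), flush=True)
    time.sleep(1)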
See also https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1297218
Regards, Paul Boven.
--
Paul Boven <address@hidden> +31 (0)521-596547
Unix/Linux/Networking specialist
Joint Institute for VLBI in Europe - www.jive.nl
VLBI - It's a fringe science