From: Oliver Francke
Subject: Re: [Qemu-devel] [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Bug 1207686]
Date: Sun, 4 Aug 2013 15:36:52 +0200
Hi Mike,
you might be the person StefanHa was referring to on the qemu-devel mailing list.
I just ran some more tests, so…
On 02.08.2013 at 23:47, Mike Dawson <address@hidden> wrote:
> Oliver,
>
> We've had a similar situation occur. For about three months, we've run
> several Windows 2008 R2 guests with virtio drivers that record video
> surveillance. We have long suffered an issue where the guest appears to hang
> indefinitely (or until we intervene). For the sake of this conversation, we
> call this state "wedged", because it appears something (rbd, qemu, virtio,
> etc) gets stuck in a deadlock. When a guest gets wedged, we see the following:
>
> - the guest will not respond to pings
When the hung_task message shows up, I can still ping the guest and establish
new ssh sessions; only the session running the while loop no longer accepts
any keyboard input.
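(For reference, the guest-side symptom is the standard kernel hung-task
warning; assuming the default sysctl name, the timeout it is checked against
can be read like this:

  $ sysctl kernel.hung_task_timeout_secs
  kernel.hung_task_timeout_secs = 120
  $ dmesg | grep "blocked for more than"

so the message is only the watchdog firing, the guest itself stays reachable.)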
> - the qemu-system-x86_64 process drops to 0% cpu
> - graphite graphs show the interface traffic dropping to 0bps
> - the guest will stay wedged forever (or until we intervene)
> - strace of qemu-system-x86_64 shows QEMU is making progress [1][2]
>
nothing special here:
5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=6, events=POLLIN}, {fd=19,
events=POLLIN}, {fd=15, events=POLLIN}, {fd=4, events=POLLIN}], 11, -1) = 1
([{fd=12, revents=POLLIN}])
[pid 11793] read(5, 0x7fff16b61f00, 16) = -1 EAGAIN (Resource temporarily
unavailable)
[pid 11793] read(12,
"\2\0\0\0\0\0\0\0\0\0\0\0\0\361p\0\252\340\374\373\373!gH\10\0E\0\0Yq\374"...,
69632) = 115
[pid 11793] read(12, 0x7f0c1737fcec, 69632) = -1 EAGAIN (Resource temporarily
unavailable)
[pid 11793] poll([{fd=27, events=POLLIN|POLLERR|POLLHUP}, {fd=26,
events=POLLIN|POLLERR|POLLHUP}, {fd=24, events=POLLIN|POLLERR|POLLHUP}, {fd=12,
events=POLLIN|POLLERR|POLLHUP}, {fd=3, events=POLLIN|POLLERR|POLLHUP}, {fd=
and that for many, many threads.
Inside the VM I see 75% I/O wait, but I can restart the spew-test in a second
session.
All of this was tested with rbd_cache=false and cache=none.
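For completeness, such a drive definition is roughly of this form (pool/image
name and conf path are placeholders only, not my actual values):

  -drive format=raw,if=virtio,cache=none,file=rbd:rbd/vm-disk:rbd_cache=false:conf=/etc/ceph/ceph.conf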
I also test every qemu version with a 2-CPU, 2 GiB Windows 7 VM under fairly
high load and have hit no problem so far; it runs smooth and fast.
> We can "un-wedge" the guest by opening a NoVNC session or running a 'virsh
> screenshot' command. After that, the guest resumes and runs as expected. At
> that point we can examine the guest. Each time we'll see:
>
> - No Windows error logs whatsoever while the guest is wedged
> - A time sync typically occurs right after the guest gets un-wedged
> - Scheduled tasks do not run while wedged
> - Windows error logs do not show any evidence of suspend, sleep, etc
>
> We had so many issues with guests becoming wedged, we wrote a script to 'virsh
> screenshot' them via cron. Then we installed some updates and had a month or
> so of higher stability (wedging happened maybe 1/10th as often). Until today
> we couldn't figure out why.
>
> Yesterday, I realized qemu was starting the instances without specifying
> cache=writeback. We corrected that, and let them run overnight. With RBD
> writeback re-enabled, wedging came back as often as we had seen in the past.
> I've counted ~40 occurrences in the past 12-hour period. So I feel like
> writeback caching in RBD certainly makes the deadlock more likely to occur.
>
> Joshd asked us to gather RBD client logs:
>
> "joshd> it could very well be the writeback cache not doing a callback at
> some point - if you could gather logs of a vm getting stuck with debug rbd =
> 20, debug ms = 1, and debug objectcacher = 30 that would be great"
>
> We'll do that over the weekend. If you could as well, we'd love the help!
>
> [1] http://www.gammacode.com/kvm/wedged-with-timestamps.txt
> [2] http://www.gammacode.com/kvm/not-wedged.txt
>
As I wrote above, I'm running without cache so far, so I'm omitting the verbose
debugging for the moment. But I will do it if requested.
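My understanding is that this would mean something like the following in the
[client] section of the client's ceph.conf (the log file path is just an
example):

  [client]
      debug rbd = 20
      debug ms = 1
      debug objectcacher = 30
      log file = /var/log/ceph/qemu-guest-$pid.log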
Thanks for your report,
Oliver.
> Thanks,
>
> Mike Dawson
> Co-Founder & Director of Cloud Architecture
> Cloudapt LLC
> 6330 East 75th Street, Suite 170
> Indianapolis, IN 46250
>
> On 8/2/2013 6:22 AM, Oliver Francke wrote:
>> Well,
>>
>> I believe I'm the winner of buzzword bingo for today.
>>
>> But seriously speaking... since I don't see this particular problem with
>> qcow2 on kernel 3.2, nor with qemu-1.2.2, nor with newer kernels, I hope
>> I'm not alone here?
>> We have a rising number of tickets from people reinstalling from ISOs with
>> the 3.2 kernel.
>>
>> The quick fallback is to start all VMs with qemu-1.2.2, but we then lose
>> some features à la the latency-free RBD cache ;)
>>
>> I just opened a bug for qemu per:
>>
>> https://bugs.launchpad.net/qemu/+bug/1207686
>>
>> with all dirty details.
>>
>> Installing a 3.9.x backport kernel or upgrading the Ubuntu kernel to 3.8.x
>> "fixes" it. So I assume we have a bad combination for all distros with a
>> 3.2 kernel and rbd as the storage backend.
>>
>> Any similar findings?
>> Any idea of tracing/debugging ( Josh? ;) ) very welcome,
>>
>> Oliver.
>>