qemu-devel

Re: [PULL 11/13] ui/console: Remove dpy_cursor_define_supported()

From:	Hank Knox
Subject:	Re: [PULL 11/13] ui/console: Remove dpy_cursor_define_supported()
Date:	Thu, 9 Jan 2025 09:34:41 -0500
User-agent:	Mozilla Thunderbird

Dear Michael W et al,

Thank you for following up on this. While waiting for a fix, I have been using a build I made myself that applies the patch listed at the bottom of this email to the v9.2.0 branch of the code (which has dpy_cursor_define_supported() removed) and it has been working fine ever since. I also have the display scaled to 150%. I don't have any other insights or clues into causes or fixes.

Since I last responded to the bug report I filed on the Debian bug tracker (#1084199), I have read on the Debian User Forums (where I also went looking for help) that several other people are having trouble with this issue. They found, like you did, that reverting to 9.0.2 fixed the problem. You can find the thread at https://forums.debian.net/viewtopic.php?t=160631 but I don't see any new clues there.

I also am very happy to test any possible patches and to gather any other information, to the limits of my tech ability.


Best,

Hank


On 1/9/25 08:58, Michael Weiser wrote:

Hello Hank, Michael, Phil,

Remove dpy_cursor_define_supported() as it brings no benefit today and
it has a few inherent problems.

commit 4bba839808bb1c4f500a11462220a687b4d9ab25
Author: Akihiko Odaki <akihiko.odaki@daynix.com>
Date:   Mon Jul 15 14:25:45 2024 +0900

     ui/console: Remove dpy_cursor_define_supported()

Apparently this commit made Windows 10 guests freeze.  There's a rather
hairy bug report at https://bugs.debian.org/1084199 .  Also there's an
interesting bug filed against qemu,
https://gitlab.com/qemu-project/qemu/-/issues/1628 ,
which seems to be relevant.

Thanks for looking into this! I am now also affected by this bug and
highly motivated to resolve it. :)

I recently updated my Gentoo Linux system which included an update of
qemu from 9.0.2 to 9.2.0. After that I began to experience the issue
reported by Hank with a Windows 10 VM in libvirt using QXL graphics with
SPICE in virt-viewer. The Windows is fully updated and I've tried
installing the most recent guest drivers to no avail
(virtio-win-0.1.266.iso).

I've reconfirmed the issue with a freshly installed Windows 11, fully
updated and the same driver ISO.

Downgrading to 9.0.4 makes it go away. Downgrading to 9.1.2 does not.
Reverting above commit off of 9.2.0 as a custom patch to the Gentoo
package makes it go away as well.

At this point I grabbed the git repo and started another bisect between
9.0.0 and 9.1.0. During that I found a good "reproducer" to be to
frantically click on all the application icons on my desktop as fast as
possible (Firefox, Edge, LibreOffice, Chrome and PuTTY, FWIW). Apart
from a lot of CPU load, disk I/O and memory pressure it also causes
frequent cursor changes from pointer to spinning wheel to pointer with
spinning wheel. If nothing else, it helps pass the time to the freeze.
:)

With that I ended up at exactly the same commit as Hank found above.
Reverting that commit off of current devel HEAD makes the problem go
away as well. With vanilla devel HEAD the freezes persist/come back.

I can also confirm that the issue has to do with scaling of Windows UI
elements. At 100% the freezes do not appear (or at least not so I can
trigger them with my "reproducer"). At 150% or 200% scaling I can
trigger them quite quickly (< 30s).

Also, identically to Hank's findings, the VM continues to respond to
ICMP requests (ping) as well as agent requests from virsh (e.g.
guestinfo). A shutdown command however hangs/times out.

On Tue, Oct 29, 2024 at 03:04:29PM +0100, Phil Dennis-Jordan wrote:

Can we get the user to set qxl->debug to a value > 1 and see if the freeze
coincides with logging from here? (Possibly tricky to intercept the fprintf
output from Qemu run via libvirt though.)

How would I do that? On the source level or is there an environment
variable/command line option?

Given that "The time before the freeze seems to be random, from a few
seconds to a couple of minutes." there is a possibility of a false
negative during the bisect. (i.e. commit marked GOOD that should be BAD
because it happened to not hit the freeze in the usual time)

I went to the commit before this one and the issue disappeared. Also the
positive effect of reverting it off of HEAD seems to suggest that if not
the main culprit it at least makes the possibly underlying issue
surface.
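
A rough way to put a number on the false-negative risk raised above is sketched below. It assumes, purely for illustration, that the time until the freeze follows an exponential distribution with a known mean; nothing in this thread establishes that, and the function names and the 30-second mean are hypothetical.

```python
import math


def false_good_probability(observe_seconds: float, mean_seconds_to_freeze: float) -> float:
    """Chance that a BAD commit survives the whole observation window.

    Assumption (not from the thread): time to freeze is exponentially
    distributed with the given mean.
    """
    return math.exp(-observe_seconds / mean_seconds_to_freeze)


def observation_time_needed(mean_seconds_to_freeze: float, max_false_good: float) -> float:
    """Observation window needed to keep the false-GOOD chance below a bound."""
    return -mean_seconds_to_freeze * math.log(max_false_good)


# Hypothetical figures: a mean of 30 s to freeze and a 1% tolerated risk.
window = observation_time_needed(30.0, 0.01)
print(f"observe each bisect step for at least {window:.0f} s")
print(f"risk after only 30 s: {false_good_probability(30.0, 30.0):.2f}")
```

Under that assumption, a bisect step should only be marked GOOD after the build has survived a window several times longer than the typical time to freeze.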

We could ask the user to check whether there's any connection with mouse
cursor changes, e.g. whether he can more easily provoke the issue by
performing actions that rapidly change the mouse cursor. (For example by
visiting https://developer.mozilla.org/en-US/docs/Web/CSS/cursor in the
guest and moving back and forth over the test area.)

I've extracted the IFrame URL
https://interactive-examples.mdn.mozilla.net/pages/css/cursor.html from
this and played with it for some time.

On an idling system the cursor changes do not seem to be enough to
trigger the issue. Once I start to put load on the system by starting
applications as per my "reproducer" I can no longer be sure if and how
cursor changes play into it because lots of windows start popping up.
All hangs I can remember have been showing the segmented spinning blue
wheel animated cursor though.

Is there an easy way to take a sampling profile on Linux that will show us
stack traces of all the threads in the frozen Qemu process? On macOS this
is easy with the Activity Monitor GUI or iprofiler on the command line.
That ought to confirm whether it's a deadlock or indefinite wait in some
Qemu subsystem.

The stuck qemu still does things at about 3% CPU load.
I can attach to it with gdb and pull the thread list below.
Do any of those look interesting to you?

(gdb) info threads
  Id   Target Id                                            Frame
* 1    Thread 0x7f0eada740c0 (LWP 741887) "qemu-system-x86" 0x00007f0eaf73e656 in ppoll () from /usr/lib64/libc.so.6
  2    Thread 0x7f0ccdffb6c0 (LWP 742004) "worker"          0x00007f0eaf6d785e in ?? () from /usr/lib64/libc.so.6
  3    Thread 0x7f0cceffd6c0 (LWP 742002) "worker"          0x00007f0eaf6d785e in ?? () from /usr/lib64/libc.so.6
  4    Thread 0x7f0ced7fa6c0 (LWP 741998) "worker"          0x00007f0eaf6d785e in ?? () from /usr/lib64/libc.so.6
  5    Thread 0x7f0cee7fc6c0 (LWP 741996) "worker"          0x00007f0eaf6d785e in ?? () from /usr/lib64/libc.so.6
  6    Thread 0x7f0ceffff6c0 (LWP 741993) "worker"          0x00007f0eaf6d785e in ?? () from /usr/lib64/libc.so.6
  7    Thread 0x7f0d0d7fa6c0 (LWP 741991) "worker"          0x00007f0eaf6d785e in ?? () from /usr/lib64/libc.so.6
  8    Thread 0x7f0d0e7fc6c0 (LWP 741989) "worker"          0x00007f0eaf6d785e in ?? () from /usr/lib64/libc.so.6
  9    Thread 0x7f0d0f7fe6c0 (LWP 741987) "worker"          0x00007f0eaf6d785e in ?? () from /usr/lib64/libc.so.6
  10   Thread 0x7f0d2d7fa6c0 (LWP 741984) "worker"          0x00007f0eaf6d785e in ?? () from /usr/lib64/libc.so.6
  11   Thread 0x7f0d2f7fe6c0 (LWP 741980) "worker"          0x00007f0eaf6d785e in ?? () from /usr/lib64/libc.so.6
  12   Thread 0x7f0d2ffff6c0 (LWP 741917) "worker"          0x00007f0eaf6d785e in ?? () from /usr/lib64/libc.so.6
  13   Thread 0x7f0d514f76c0 (LWP 741915) "worker"          0x00007f0eaf6d785e in ?? () from /usr/lib64/libc.so.6
  14   Thread 0x7f0d52cfa6c0 (LWP 741912) "worker"          0x00007f0eaf6d785e in ?? () from /usr/lib64/libc.so.6
  15   Thread 0x7f0d534fb6c0 (LWP 741911) "worker"          0x00007f0eaf6d785e in ?? () from /usr/lib64/libc.so.6
  16   Thread 0x7f0d53cfc6c0 (LWP 741910) "worker"          0x00007f0eaf6d785e in ?? () from /usr/lib64/libc.so.6
  17   Thread 0x7f0d591986c0 (LWP 741905) "worker"          0x00007f0eaf6d785e in ?? () from /usr/lib64/libc.so.6
  18   Thread 0x7f0d6531e6c0 (LWP 741902) "worker"          0x00007f0eaf6d785e in ?? () from /usr/lib64/libc.so.6
  19   Thread 0x7f0d9afff6c0 (LWP 741900) "SPICE Worker"    0x00007f0eaf73e656 in ppoll () from /usr/lib64/libc.so.6
  20   Thread 0x7f0ea89ff6c0 (LWP 741898) "CPU 1/KVM"       0x00007f0eaf74534f in ioctl () from /usr/lib64/libc.so.6
  21   Thread 0x7f0ea95a96c0 (LWP 741897) "CPU 0/KVM"       0x00007f0eaf74534f in ioctl () from /usr/lib64/libc.so.6
  22   Thread 0x7f0ea9daa6c0 (LWP 741896) "IO mon_iothread" 0x00007f0eaf73e656 in ppoll () from /usr/lib64/libc.so.6
  23   Thread 0x7f0eada740c0 (LWP 741895) "vhost-741887"    0x0000000000000000 in ?? ()
  24   Thread 0x7f0eada740c0 (LWP 741894) "kvm-nx-lpage-re" 0x0000000000000000 in ?? ()
  25   Thread 0x7f0eab8356c0 (LWP 741892) "qemu-system-x86" 0x00007f0eaf74776d in syscall () from /usr/lib64/libc.so.6

This is right after the display gets stuck. The workers die down over time.
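
For reference, full backtraces of every thread (rather than only the innermost frame that `info threads` prints) can be captured non-interactively. The sketch below wraps gdb's standard `-p`, `-batch` and `-ex` options around the `thread apply all bt` command. The helper names are hypothetical, and it assumes gdb is installed and the caller is allowed to attach to the process; without debug symbols many frames will still show as `??`.

```python
import subprocess


def build_gdb_command(pid: int) -> list[str]:
    """Build a gdb invocation that prints all thread backtraces, then exits."""
    return [
        "gdb",
        "-p", str(pid),                # attach to the running process
        "-batch",                      # run the -ex commands, then quit (detaching)
        "-ex", "set pagination off",   # never stop to wait for a keypress
        "-ex", "thread apply all bt",  # backtrace of every thread
    ]


def dump_all_backtraces(pid: int) -> str:
    """Run gdb against the given PID and return whatever it printed to stdout."""
    result = subprocess.run(
        build_gdb_command(pid),
        capture_output=True,
        text=True,
        check=False,  # gdb may exit non-zero, e.g. when attaching is not permitted
    )
    return result.stdout


# Example: dump_all_backtraces(741887) for the main QEMU thread's PID listed above.
```

Taking two or three such dumps a few seconds apart would show whether any thread is making progress or whether all of them sit in the same wait.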

Michael, what's the situation with the patch you suggested in your comment
on the Qemu bug:
https://gitlab.com/qemu-project/qemu/-/issues/1628#note_2144606625 ? Is
there any chance we can get the Debian user to try that?

This patch on top of current devel HEAD (as well as directly on top of
the commit in question) makes it worse: The freezes start happening
immediately after the desktop shell is started. I think I've even seen it
freeze when the boot logo and spinner were still showing, possibly when
the (also scaled) login screen tries to initialise.

I'm out of my depth further narrowing down the cause and standing by to
try whatever you tell me.


--
Hank Knox, FRSC
Schulich School of Music of
McGill University (retired)
Montreal, QC
