Re: Unpredictable performance degradation in QEMU KVMs
From: Frantisek Rysanek
Date: Thu, 07 Oct 2021 08:12:07 +0200
A couple more points:
How many CPUs (sockets) does your motherboard have?
Multi-socket machines are more or less in NUMA territory.
Suboptimal process scheduling / memory allocation "decisions" (on
the part of the CPU/process scheduler) can have "interesting effects".
Think of processes repeatedly migrating between NUMA nodes (a NUMA
node being a handful of CPU cores coupled to a local memory
controller), with their memory left behind on the old node. Yummy.
Anti-patterns like this should not really happen though.
A possible keyword here is the CPU-process *affinity*.
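A minimal Python sketch of the affinity idea, assuming a Linux host with the usual sysfs layout (/sys/devices/system/node/nodeN/cpulist). It lists the NUMA nodes, reads which CPUs belong to a node, and pins every thread of a given process, e.g. a QEMU instance, to them. The helper names are hypothetical. Note that this only constrains where the threads run; memory already allocated on another node stays where it is.

```python
import os

SYSFS_NODES = "/sys/devices/system/node"

def parse_cpulist(text):
    """Parse a sysfs cpulist such as "0-3,8-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty cpulist
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus

def numa_nodes(sysfs=SYSFS_NODES):
    """Ids of the NUMA nodes known to the kernel; more than one means NUMA."""
    return sorted(int(name[4:]) for name in os.listdir(sysfs)
                  if name.startswith("node") and name[4:].isdigit())

def node_cpus(node, sysfs=SYSFS_NODES):
    """CPUs belonging to one NUMA node (hypothetical helper)."""
    with open(f"{sysfs}/node{node}/cpulist") as f:
        return parse_cpulist(f.read())

def pin_to_node(pid, node):
    """Restrict every thread of a process (e.g. QEMU) to one node's CPUs."""
    cpus = node_cpus(node)
    # sched_setaffinity() acts on a single thread id, and QEMU runs several
    # threads (one per vCPU plus I/O threads), so walk them all.
    for tid in os.listdir(f"/proc/{pid}/task"):
        os.sched_setaffinity(int(tid), cpus)
```

Something like pin_to_node(qemu_pid, 0) would then keep that guest's threads on node 0 (given sufficient privileges). In practice, numactl with --cpunodebind/--membind at launch time, or libvirt's vcpupin and numatune settings, also take care of the memory side, which this sketch does not.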
Next, I might mention the emulated peripherals, i.e. the hardware other
than the CPU instruction set: the VGA adaptor, NICs, storage
controllers... Frankly, I've never seen a problem with these that
would match your problem description. If an emulated peripheral doesn't
work with your guest OS, the guest generally hangs already during boot,
i.e. it just doesn't work at all. If it does work, it tends to be
blazing fast and efficient compared to historical real hardware :-)
Talk of "unexpected performance degradations" makes me think of
memory garbage-collection runs. Typically I'd expect these in the
runtime environments of modern "interpreted" programming languages:
Java, .NET, probably Python as well. Not all garbage-collection
mechanisms do periodic cleanup, though. GC based on reference
counting works continuously, as the running user program creates and
drops references...
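To illustrate the difference, here is a small sketch that is specific to CPython, which combines reference counting with an occasional cycle collector (PyPy, the JVM or .NET behave differently). An acyclic object is freed the moment its last reference goes away, whereas a reference cycle lingers until a collection run. The class and function names are made up for the example.

```python
import gc
import weakref

class Node:
    """A trivial object that can reference another Node."""
    def __init__(self):
        self.other = None

def freed_by_refcount():
    """Drop the only reference to an acyclic object: CPython frees it at once."""
    freed = []
    obj = Node()
    weakref.finalize(obj, freed.append, True)
    del obj                      # refcount hits zero, object freed immediately
    return bool(freed)

def cycle_needs_collector():
    """A reference cycle survives 'del' and is only freed by the cycle collector."""
    freed = []
    was_enabled = gc.isenabled()
    gc.disable()                 # keep automatic collections out of the way
    try:
        a, b = Node(), Node()
        a.other, b.other = b, a  # a <-> b: refcounts never drop to zero by themselves
        weakref.finalize(a, freed.append, True)
        del a, b
        before = bool(freed)     # still alive: nothing has collected the cycle yet
        gc.collect()             # an explicit "collection run"
        after = bool(freed)
    finally:
        if was_enabled:
            gc.enable()
    return before, after
```

In CPython, the gc.callbacks hook can be used to timestamp the start and stop of each collection, which is one way to check whether such runs line up with observed slowdowns.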
In theory, something along the lines of GC can happen in the kernel
too, in the "virtual memory management" department - it's called
"compaction".
https://www.kernel.org/doc/html/latest/admin-guide/mm/concepts.html#compaction
https://pingcap.com/blog/linux-kernel-vs-memory-fragmentation-2
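Compaction moves pages around to defragment physical RAM, so that
larger contiguous free blocks become available. That makes high-order
allocations possible (for user-space processes and the kernel itself),
such as transparent huge pages, and huge pages in turn mean fewer TLB
misses and shorter page-table walks, i.e. faster "page table lookups".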
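One way to watch physical memory fragmentation, and hence how much work is left for compaction, is /proc/buddyinfo: per node and zone, it lists the number of free blocks of each order (2^order contiguous pages). The sketch below, with hypothetical helper names, computes the share of free memory sitting in blocks large enough for a huge page. It assumes 4 KiB base pages, where order 9 corresponds to a 2 MiB huge page as on x86-64.

```python
def parse_buddyinfo(text):
    """Parse /proc/buddyinfo into (node, zone, free_blocks_per_order) tuples."""
    rows = []
    for line in text.splitlines():
        if "zone" not in line:
            continue
        # Lines look like: "Node 0, zone   Normal   1046    527 ..."
        head, _, tail = line.partition("zone")
        node = int(head.replace("Node", "").strip().rstrip(","))
        fields = tail.split()
        rows.append((node, fields[0], [int(n) for n in fields[1:]]))
    return rows

def high_order_fraction(free_blocks, min_order=9):
    """Share of free pages in blocks of at least 2**min_order pages.

    With 4 KiB base pages, order 9 = 512 pages = one 2 MiB huge page.
    """
    total = sum(n << order for order, n in enumerate(free_blocks))
    high = sum(n << order for order, n in enumerate(free_blocks)
               if order >= min_order)
    return high / total if total else 0.0
```

For example, iterating over parse_buddyinfo(open("/proc/buddyinfo").read()) and printing high_order_fraction(counts) per zone: a value drifting towards zero over time suggests that free memory is fragmented into small pieces.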
If a "compaction run" gets triggered, it's hard for me to tell how
long it can take to finish. Remember that it takes place entirely in
RAM, so it shouldn't be nearly as bad as, e.g., disk IO stalling due
to insufficient IOPS, or thermal throttling.
How do you tell that it's compaction hampering your performance,
while it is going on? Hmm. I'd look at the current CPU consumption of
the kcompactd kernel thread(s):
https://lwn.net/Articles/817905/
https://lore.kernel.org/lkml/20190126200005.GB27513@amd/T/
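A rough Python sketch of that, assuming Linux procfs and a vantage point that can see the host's kernel threads (e.g. not from inside a container's PID namespace). It finds the kcompactd threads by name, reads their utime and stime from /proc/<pid>/stat, and turns the difference between two samples into a fraction of one CPU. Function names are hypothetical.

```python
import os
import time

def stat_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # comm may contain spaces or parentheses, so split after the LAST ')'
    rest = stat_line[stat_line.rfind(")") + 2:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15
    return int(rest[11]) + int(rest[12])

def kcompactd_ticks(proc="/proc"):
    """Map pid -> CPU ticks for every kcompactd thread visible in /proc."""
    result = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(f"{proc}/{entry}/comm") as f:
                if not f.read().startswith("kcompactd"):
                    continue
            with open(f"{proc}/{entry}/stat") as f:
                result[int(entry)] = stat_cpu_ticks(f.read())
        except OSError:
            continue  # the process went away while we were looking
    return result

def kcompactd_cpu_share(interval_s=5.0):
    """Fraction of one CPU used by each kcompactd thread over interval_s."""
    before = kcompactd_ticks()
    time.sleep(interval_s)
    after = kcompactd_ticks()
    hz = os.sysconf("SC_CLK_TCK")
    return {pid: (ticks - before.get(pid, ticks)) / hz / interval_s
            for pid, ticks in after.items()}
```

A share close to 1.0 for a kcompactd thread during a slowdown would be a strong hint. Direct compaction, however, runs in the context of the process doing the allocation and is not accounted to kcompactd; the compact_stall counter in /proc/vmstat covers that case.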
Now... how this works in a virtualized environment is a good
question to me :-) Do the whole virtual memory allocation clockwork
(multiple layers of page tables) and the compaction "GC" work in two
layers, once for the host and once for the guest? Is there possibly
some host-to-guest coordination? But that would break the rules of
virtualization, right? Or does the host just allocate a single "huge
page" to the guest anyway, and not care anymore? Does it, actually? I
recall earlier debates about sparse allocation / overprovisioning
(host to guest), ballooning and all that jazz... Good questions.
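This does not answer the coordination question, but on a Linux host one can at least see how much of a guest's RAM is currently backed by transparent huge pages. A minimal sketch, assuming the guest RAM is ordinary anonymous memory of the QEMU process (the default without a file or hugetlbfs memory backend; hugetlbfs backing is accounted separately) and a kernel new enough (4.14+) to provide /proc/<pid>/smaps_rollup. Helper names are hypothetical.

```python
def parse_kb_fields(text):
    """Parse 'Name:   1234 kB' lines (smaps_rollup, meminfo) into {name: kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        parts = value.split()
        # Only keep "<number> kB" values; header and unit-less lines are skipped.
        if sep and len(parts) == 2 and parts[1] == "kB":
            fields[name.strip()] = int(parts[0])
    return fields

def thp_share(rollup_text):
    """Fraction of resident memory backed by transparent huge pages."""
    fields = parse_kb_fields(rollup_text)
    rss = fields.get("Rss", 0)
    return fields.get("AnonHugePages", 0) / rss if rss else 0.0

def read_thp_share(pid):
    """Hypothetical helper: THP share of a running process, e.g. a QEMU guest."""
    with open(f"/proc/{pid}/smaps_rollup") as f:
        return thp_share(f.read())
```

A read_thp_share(qemu_pid) value close to 1.0 would mean the host is mostly handing that guest huge pages; a low or shrinking value suggests host-side fragmentation, and possibly compaction, may be in play. The same parse_kb_fields() also works on /proc/meminfo, e.g. for its AnonHugePages line.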
Maybe look at swapping activity and kcompactd CPU consumption in the
host instance and in the guest instances, separately?
What about swapping, triggered by a dynamically occurring low-memory
condition in the guest VM? That would show up in the disk IO stats
(IOPS skyrocketing).
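To compare host and guests, a small sampler along these lines could be run on the host and inside each guest at the same time. It reads /proc/vmstat twice and reports how much the swap counters (pswpin/pswpout, in pages) and two compaction counters (compact_stall for direct compaction, compact_daemon_wake for kcompactd wakeups) increased in between. A sketch only: the selection of counters is an assumption, and counters missing on older kernels are treated as zero.

```python
import time

# Counters to compare (an assumption; add others from /proc/vmstat as needed).
WATCHED = ("pswpin", "pswpout", "compact_stall", "compact_daemon_wake")

def parse_vmstat(text):
    """Parse /proc/vmstat ('name value' per line) into {name: int}."""
    counters = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if value.strip().lstrip("-").isdigit():
            counters[name] = int(value)
    return counters

def vmstat_delta(before, after, names=WATCHED):
    """Increase of each watched counter between two samples (missing = 0)."""
    return {n: after.get(n, 0) - before.get(n, 0) for n in names}

def sample_vmstat(interval_s=10.0, path="/proc/vmstat"):
    """Read /proc/vmstat twice, interval_s seconds apart, and diff the samples."""
    with open(path) as f:
        before = parse_vmstat(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_vmstat(f.read())
    return vmstat_delta(before, after)
```

A guest showing a burst of pswpout while the host stays quiet (or vice versa) would tell you which layer is short of memory.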
Speaking of disk IO, back in the heyday of RAID built on top of
spinning rust, I remember instances where an individual physical
drive in a RAID would start to struggle. In our RAID boxes, which had
an activity LED (and a failure LED) per physical drive, this showed
quite clearly: while the struggling drive hadn't died yet, the whole
RAID would merely slow down (noticeably), the healthy drives would
just blink their activity LEDs every now and then, but the culprit
drive's LED would indicate constant busy activity :-)
This pattern showed up under heavy sequential load (synthetic/deliberate)
intended to use whatever bandwidth the RAID had to offer.
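The activity-LED trick has a software counterpart where Linux can see the member disks (e.g. md software RAID, as opposed to a hardware RAID controller that hides them). A sketch, with hypothetical helper names and thresholds, that computes per-device IOPS and busy percentage from two /proc/diskstats samples (the 13th column is the time spent doing I/O, in ms) and flags members much busier than their siblings:

```python
import statistics
import time

def parse_diskstats(text):
    """Map device name -> (completed I/Os, ms spent doing I/O) from /proc/diskstats."""
    stats = {}
    for line in text.splitlines():
        cols = line.split()
        if len(cols) < 14:
            continue
        ios = int(cols[3]) + int(cols[7])      # reads completed + writes completed
        stats[cols[2]] = (ios, int(cols[12]))  # 13th column: ms spent doing I/Os
    return stats

def disk_activity(before, after, interval_s):
    """Per-device (IOPS, busy %) between two parse_diskstats() samples."""
    activity = {}
    for dev, (ios1, busy1) in after.items():
        ios0, busy0 = before.get(dev, (ios1, busy1))
        iops = (ios1 - ios0) / interval_s
        busy_pct = 100.0 * (busy1 - busy0) / (interval_s * 1000.0)
        activity[dev] = (iops, busy_pct)
    return activity

def busy_outliers(activity, members, factor=3.0, min_busy=10.0):
    """RAID members much busier than their siblings: the 'busy LED' drive."""
    typical = statistics.median_low(activity[d][1] for d in members)
    return [d for d in members
            if activity[d][1] > min_busy and activity[d][1] > factor * typical]

def sample_disks(interval_s=5.0, path="/proc/diskstats"):
    """Read /proc/diskstats twice, interval_s seconds apart."""
    with open(path) as f:
        before = parse_diskstats(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_diskstats(f.read())
    return disk_activity(before, after, interval_s)
```

Busy percentage is most meaningful for spinning drives that serve one request at a time; for SSD/NVMe devices with deep queues, 100% busy does not necessarily mean saturated. The IOPS figures also show the "IOPS skyrocketing" pattern of a swapping guest.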
And that's about it for now... :-)
Frank