[PATCH v5 0/6] Poisoned memory recovery on reboot
From: William Roche <william.roche@oracle.com>
Subject: [PATCH v5 0/6] Poisoned memory recovery on reboot
Date: Fri, 10 Jan 2025 21:13:59 +0000
Hello David,
I'm keeping the description of the patch set you already reviewed:
---
This set of patches fixes several problems with hardware memory errors
impacting hugetlbfs memory backed VMs and the generic memory recovery
on VM reset.
When using hugetlbfs large pages, a hardware memory error anywhere
within a large page results in poisoning the entire page, suddenly
making a large chunk of the VM memory unusable.
The main problem that currently exists in QEMU is the lack of backend
file repair before resetting the VM memory, which leaves the impacted
memory silently unusable even after a VM reboot.
In order to fix this issue, we take into account the page size of the
impacted memory block when dealing with the associated poisoned page
location.
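As a rough illustration of what "taking the page size into account" means,
here is a minimal sketch that rounds a poisoned offset down to the start of
its backing page. The helper name poisoned_page_start() is hypothetical and
is not taken from the patches.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the patch's code): given an offset inside a RAM
 * block and the page size of its backing memory, return the start of the
 * page that contains the offset.
 */
static uint64_t poisoned_page_start(uint64_t offset, uint64_t page_size)
{
    /* Page sizes are powers of two (4 KiB, 2 MiB, 1 GiB, ...). */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return offset & ~(page_size - 1);
}
```

With a 2 MiB hugetlbfs page, an error reported at offset 0x201234 maps to the
page starting at 0x200000, so the whole 2 MiB range has to be handled.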
Using the page size information, we also try to regenerate the memory
by calling ram_block_discard_range() on VM reset when running
qemu_ram_remap(), so that poisoned memory backed by a hugetlbfs
file is regenerated by punching a hole in this file. A new page is
loaded when the location is first touched.
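For readers unfamiliar with hole punching, the sketch below shows the
general Linux technique with fallocate(2). It is only an illustration under
stated assumptions: punch_hole() is a hypothetical name, and the alignment
check is my own addition, not a description of what
ram_block_discard_range() does internally.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for fallocate() */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdint.h>
#include <sys/types.h>

/*
 * Hypothetical helper: drop the backing storage of [offset, offset + length)
 * in the file behind fd, keeping the file size unchanged. The range must be
 * aligned to the page size of the backing memory.
 * Returns 0 on success or a negative errno value.
 */
static int punch_hole(int fd, uint64_t offset, uint64_t length,
                      uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    /* Reject empty or partially covered pages before touching the file. */
    if (length == 0 || ((offset | length) & (page_size - 1))) {
        return -EINVAL;
    }
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  (off_t)offset, (off_t)length) != 0) {
        return -errno;
    }
    return 0;
}
```

After a successful call, the next access to the punched range gets a freshly
allocated page from the filesystem instead of the poisoned one.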
In case of a discard failure we fall back to remapping the memory
location. We also have to reset the memory settings and honor the
'prealloc' attribute.
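The discard-then-remap order can be sketched as follows. This is a
simplified illustration, not the qemu_ram_remap() code: discard_fn,
remap_anonymous() and recover_range() are hypothetical names, only anonymous
memory is handled (the v5 changelog notes that the file remapping case was
removed from the fall back function), and re-applying memory settings or
'prealloc' is left out.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical discard hook: returns 0 on success, non-zero on failure. */
typedef int (*discard_fn)(void *addr, size_t len);

/* Replace [addr, addr + len) in place with fresh zero-filled memory. */
static int remap_anonymous(void *addr, size_t len)
{
    void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
                   MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == addr ? 0 : -1;
}

/* Try to discard the range first; remap it only if the discard fails. */
static int recover_range(void *addr, size_t len, discard_fn discard)
{
    assert(addr != NULL && len > 0);
    if (discard != NULL && discard(addr, len) == 0) {
        return 0;
    }
    return remap_anonymous(addr, len);
}
```

Passing NULL as the discard hook forces the fallback path, which is a simple
way to exercise the remapping on its own.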
This memory setting is performed by a new remap notification mechanism
that calls the host_memory_backend_ram_remapped() function when a region
of a memory block is remapped.
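The general shape of such a notification mechanism is a list of callbacks
invoked with the remapped range. The sketch below is a standalone
illustration: RemapNotifier, remap_notifier_add(), notify_remap() and
DemoBackend are hypothetical names and do not reproduce the notifier
structures touched by the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical notifier: one callback plus a singly linked list hook. */
typedef struct RemapNotifier RemapNotifier;
struct RemapNotifier {
    void (*remapped)(RemapNotifier *n, void *host, size_t offset, size_t size);
    RemapNotifier *next;
};

static RemapNotifier *remap_notifiers;

/* Register a listener at the head of the list. */
static void remap_notifier_add(RemapNotifier *n)
{
    n->next = remap_notifiers;
    remap_notifiers = n;
}

/* Tell every listener that [offset, offset + size) of a block was remapped. */
static void notify_remap(void *host, size_t offset, size_t size)
{
    for (RemapNotifier *n = remap_notifiers; n != NULL; n = n->next) {
        if (n->remapped != NULL) {
            n->remapped(n, host, offset, size);
        }
    }
}

/* Hypothetical listener standing in for a memory backend. */
typedef struct {
    RemapNotifier notifier; /* must stay the first member for the cast below */
    size_t remapped_bytes;
} DemoBackend;

static void demo_backend_remapped(RemapNotifier *n, void *host,
                                  size_t offset, size_t size)
{
    DemoBackend *backend = (DemoBackend *)n;
    (void)host;
    (void)offset;
    /* A real backend would re-apply its memory settings to the range here. */
    backend->remapped_bytes += size;
}
```

A backend registers once with remap_notifier_add() and is then called back
for every remapped region, which is where settings such as 'prealloc' can be
applied again.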
We also enrich the messages used to report a memory error relayed to
the VM, identifying the memory page and its size when a large page is
impacted.
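To give an idea of the kind of information such a message can carry, here is
a small sketch. The function name format_memory_error() and the message
wording are made up for illustration and are not the strings used in the
patches.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical formatter: report the faulting address and, when the backing
 * page is larger than the base page size, the size and start of the lost page.
 * Returns the value of snprintf().
 */
static int format_memory_error(char *buf, size_t buflen, uint64_t addr,
                               uint64_t page_size, uint64_t base_page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    if (page_size > base_page_size) {
        uint64_t start = addr & ~(page_size - 1);
        return snprintf(buf, buflen,
                        "memory error at 0x%" PRIx64
                        ": large page of %" PRIu64
                        " KiB lost, starting at 0x%" PRIx64,
                        addr, page_size / 1024, start);
    }
    return snprintf(buf, buflen, "memory error at 0x%" PRIx64, addr);
}
```

For an error at 0x40201000 on a 2 MiB page, the message names the 2048 KiB
page starting at 0x40200000 rather than only the single faulting address.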
----
v4->v5
. Updated commit messages (for patches 1, 5 and 6)
. Fixed comment typo of patch 2
. Changed the fall back function parameters to match the
  ram_block_discard_range() function.
. Removed the unused case of remapping a file in this function
. Added the assert(block->fd < 0) in this function too
. I merged my patch 7 with your patch 6 (we only have 6 patches now)
This code is scripts/checkpatch.pl clean
'make check' runs clean on both x86 and ARM.
David Hildenbrand (3):
  numa: Introduce and use ram_block_notify_remap()
  hostmem: Factor out applying settings
  hostmem: Handle remapping of RAM

William Roche (3):
  system/physmem: handle hugetlb correctly in qemu_ram_remap()
  system/physmem: poisoned memory discard on reboot
  accel/kvm: Report the loss of a large memory page

accel/kvm/kvm-all.c | 2 +-
backends/hostmem.c | 189 +++++++++++++++++++++++---------------
hw/core/numa.c | 11 +++
include/exec/cpu-common.h | 3 +-
include/exec/ramlist.h | 3 +
include/system/hostmem.h | 1 +
system/physmem.c | 82 ++++++++++++-----
target/arm/kvm.c | 13 +++
target/i386/kvm/kvm.c | 18 +++-
9 files changed, 218 insertions(+), 104 deletions(-)
--
2.43.5