Re: [PATCH v5 3/6] accel/kvm: Report the loss of a large memory page


From: David Hildenbrand
Subject: Re: [PATCH v5 3/6] accel/kvm: Report the loss of a large memory page
Date: Tue, 28 Jan 2025 19:45:20 +0100
User-agent: Mozilla Thunderbird

Yes, we can collect the information from the block associated with this
ram_addr. But instead of duplicating the necessary code into both i386
and ARM, I went back to adding the change to the
kvm_hwpoison_page_add() function, which is called from both the i386-
and ARM-specific code.

I also needed a way to retrieve the information while we are handling
the SIGBUS signal, so I created a new function to gather the
information from the RAMBlock:
qemu_ram_block_location_info_from_addr(ram_addr_t ram_addr,
                                         struct RAMBlockInfo *b_info)
with the associated struct, so that we can take RCU_READ_LOCK_GUARD()
and retrieve all the data.

Makes sense.
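
For reference, a rough sketch of what such a helper could look like (the
RAMBlockInfo fields, the includes and the internal block lookup below are
assumptions on my side, not necessarily what your patch does):

#include "qemu/osdep.h"
#include "qemu/cutils.h"       /* pstrcpy() */
#include "qemu/rcu.h"          /* RCU_READ_LOCK_GUARD() */
#include "exec/ramblock.h"     /* RAMBlock internals */

/* Assumed layout; the actual field set is whatever the patch defines. */
struct RAMBlockInfo {
    char idstr[256];        /* id string of the RAMBlock */
    ram_addr_t offset;      /* offset of the error inside the block */
    uint64_t fd_offset;     /* offset in the backing file, if any */
    size_t page_size;       /* backing page size (e.g. a 1 GiB hugepage) */
};

void qemu_ram_block_location_info_from_addr(ram_addr_t ram_addr,
                                            struct RAMBlockInfo *b_info)
{
    RAMBlock *rb;

    /* Hold the RCU read lock for the whole lookup so that this is safe
     * to call from the SIGBUS handling path. */
    RCU_READ_LOCK_GUARD();
    rb = qemu_get_ram_block(ram_addr);  /* internal lookup in physmem.c */

    pstrcpy(b_info->idstr, sizeof(b_info->idstr), rb->idstr);
    b_info->offset = ram_addr - rb->offset;
    b_info->fd_offset = rb->fd_offset;
    b_info->page_size = rb->page_size;
}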



Note about ARM failing on large pages:
--------------------------------------
I was able to verify that ARM VMs impacted by a memory error on a large
underlying memory page can end up looping on reporting the error:
a VM encountering such an error has a high probability of crashing and
may then try to save a vmcore during a kdump phase.

Yeah, that's what I thought. If you rip out 1 GiB of memory, your VM is going to have a bad time :/


This fix introduces QEMU messages reporting the errors when they are
relayed to the VM.
On ARM, a large page poisoned by an error can make a VM loop in the
vmcore collection phase, and the console shows messages like the
following appearing every 10 seconds (before the change):

   vvv
           Starting Kdump Vmcore Save Service...
[    3.095399] kdump[445]: Kdump is using the default log level(3).
[    3.173998] kdump[481]: saving to
/sysroot/var/crash/127.0.0.1-2025-01-27-20:17:40/
[    3.189683] kdump[486]: saving vmcore-dmesg.txt to
/sysroot/var/crash/127.0.0.1-2025-01-27-20:17:40/
[    3.213584] kdump[492]: saving vmcore-dmesg.txt complete
[    3.220295] kdump[494]: saving vmcore
[   10.029515] EDAC MC0: 1 UE unknown on unknown memory ( page:0x116c60
offset:0x0 grain:1 - APEI location: )
[   10.033647] [Firmware Warn]: GHES: Invalid address in generic error
data: 0x116c60000
[   10.036974] {2}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 0
[   10.040514] {2}[Hardware Error]: event severity: recoverable
[   10.042911] {2}[Hardware Error]:  Error 0, type: recoverable
[   10.045310] {2}[Hardware Error]:   section_type: memory error
[   10.047666] {2}[Hardware Error]:   physical_address: 0x0000000116c60000
[   10.050486] {2}[Hardware Error]:   error_type: 0, unknown
[   20.053205] EDAC MC0: 1 UE unknown on unknown memory ( page:0x116c60
offset:0x0 grain:1 - APEI location: )
[   20.057416] [Firmware Warn]: GHES: Invalid address in generic error
data: 0x116c60000
[   20.060781] {3}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 0
[   20.065472] {3}[Hardware Error]: event severity: recoverable
[   20.067878] {3}[Hardware Error]:  Error 0, type: recoverable
[   20.070273] {3}[Hardware Error]:   section_type: memory error
[   20.072686] {3}[Hardware Error]:   physical_address: 0x0000000116c60000
[   20.075590] {3}[Hardware Error]:   error_type: 0, unknown
   ^^^

With the fix, we now get a flood of messages like:

   vvv
qemu-system-aarch64: Memory Error on large page from
ram-node1:d5e00000+0 +200000
qemu-system-aarch64: Guest Memory Error at QEMU addr 0xffff35c79000 and
GUEST addr 0x115e79000 of type BUS_MCEERR_AR injected
qemu-system-aarch64: Memory Error on large page from
ram-node1:d5e00000+0 +200000
qemu-system-aarch64: Guest Memory Error at QEMU addr 0xffff35c79000 and
GUEST addr 0x115e79000 of type BUS_MCEERR_AR injected
qemu-system-aarch64: Memory Error on large page from
ram-node1:d5e00000+0 +200000
qemu-system-aarch64: Guest Memory Error at QEMU addr 0xffff35c79000 and
GUEST addr 0x115e79000 of type BUS_MCEERR_AR injected
   ^^^
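
(Purely illustrative: the first message above could come from an
error_report() call of roughly this shape, reusing the RAMBlockInfo data
sketched earlier; the exact format string is an assumption, not the
actual patch.)

/* needs "qemu/error-report.h"; b_info as in the RAMBlockInfo sketch */
error_report("Memory Error on large page from %s:%" PRIx64 "+%" PRIx64
             " +%zx", b_info.idstr, (uint64_t)b_info.offset,
             b_info.fd_offset, b_info.page_size);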


In both cases, this situation loops indefinitely!

I'm just pointing out this change of behavior; fixing the looping itself
would most probably require VM kernel modifications, or a workaround in
QEMU when errors are reported too often, but that is out of the scope of
the current QEMU fix.

Agreed. I think one problem is that kdump cannot really cope with new memory errors (it tries not to touch pages that had a memory error in the old kernel).

Maybe this is also due to the fact that we inform the kernel only about a single page vanishing, whereas actually a whole 1 GiB is vanishing.
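
For reference, the current ARM BUS_MCEERR_AR path only records the single
faulting guest physical address via GHES, along the lines of the sketch
below (simplified from memory; exact function names can differ between
QEMU versions):

/* Simplified sketch of target/arm/kvm.c:kvm_arch_on_sigbus_vcpu() */
void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
{
    ram_addr_t ram_addr;
    hwaddr paddr;

    if (code == BUS_MCEERR_AR || code == BUS_MCEERR_AO) {
        ram_addr = qemu_ram_addr_from_host(addr);
        if (ram_addr != RAM_ADDR_INVALID &&
            kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
            kvm_hwpoison_page_add(ram_addr);
            if (code == BUS_MCEERR_AR) {
                kvm_cpu_synchronize_state(c);
                /* Only this one guest physical address is recorded,
                 * even when the lost backing page is 1 GiB. */
                if (!acpi_ghes_record_errors(ACPI_HEST_SRC_ID_SEA, paddr)) {
                    kvm_inject_arm_sea(c);
                }
            }
        }
    }
}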

--
Cheers,

David / dhildenb



