Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon
From: David Hildenbrand
Subject: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon
Date: Mon, 30 May 2022 09:41:22 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.0

On 27.05.22 08:32, zhenwei pi wrote:
> On 5/27/22 02:37, Peter Xu wrote:
>> On Wed, May 25, 2022 at 01:16:34PM -0700, Jue Wang wrote:
>>> The hypervisor _must_ emulate poisons identified in guest physical
>>> address space (could be transported from the source VM), this is to
>>> prevent silent data corruption in the guest. With a paravirtual
>>> approach like this patch series, the hypervisor can clear some of the
>>> poisoned HVAs knowing for certain that the guest OS has isolated the
>>> poisoned page. I wonder how much value it provides to the guest if the
>>> guest and workload are _not_ in a pressing need for the extra KB/MB
>>> worth of memory.
>>
>> I'm curious the same on how unpoisoning could help here. The reasoning
>> behind would be great material to be mentioned in the next cover letter.
>>
>> Shouldn't we consider migrating serious workloads off the host already
>> where there's a sign of more severe hardware issues, instead?
>>
>> Thanks,
>>
>
> I'm maintaining 1,000,000+ virtual machines; from my experience:
> UE is quite unusual and occurs randomly, and I did not hit UE storm case
> in the past years. The memory also has no obvious performance drop after
> hitting UE.
>
> I hit several CE storm case, the performance memory drops a lot. But I
> can't find obvious relationship between UE and CE.
>
> So from the point of my view, to fix the corrupted page for VM seems
> good enough. And yes, unpoisoning several pages does not help
> significantly, but it is still a chance to make the virtualization better.
>
I'm curious why we should care about resurrecting a handful of poisoned
pages in a VM. The cover letter doesn't touch on that.

IOW, I'm missing the motivation why we should add additional
code+complexity to unpoison pages at all.

If we're talking about individual 4k pages, it's certainly sub-optimal,
but does it matter in practice? I could understand if we're losing
megabytes of memory. But then, I assume the workload might be seriously
harmed either way already?

I assume when talking about "the performance memory drops a lot", you
imply that this patch set can mitigate that performance drop?

But why do you see a performance drop? Because we might lose some
possible THP candidates (in the host or the guest) and you want to plug
those holes? I assume you'll see a performance drop simply because
poisoning memory is expensive, including migrating pages around on CE.

If you have some numbers to share, especially before/after this change,
that would be great.

--
Thanks,
David / dhildenb