From: Michal Prívozník
Subject: Re: [PATCH] hostmem: Round up memory size for qemu_madvise() in host_memory_backend_memory_complete()
Date: Mon, 4 Dec 2023 15:12:32 +0100
User-agent: Mozilla Thunderbird
On 12/4/23 10:21, David Hildenbrand wrote:
> On 01.12.23 10:07, Michal Prívozník wrote:
>> On 11/27/23 14:55, David Hildenbrand wrote:
>>> On 27.11.23 14:37, David Hildenbrand wrote:
>>>> On 27.11.23 13:32, Michal Privoznik wrote:
>>>>> Simple reproducer:
>>>>> qemu.git $ ./build/qemu-system-x86_64 \
>>>>> -m size=8389632k,slots=16,maxmem=25600000k \
>>>>> -object '{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"/hugepages2M/","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}' \
>>>>> -numa node,nodeid=0,cpus=0,memdev=ram-node0
>>>>>
>>>>> With current master I get:
>>>>>
>>>>> qemu-system-x86_64: cannot bind memory to host NUMA nodes: Invalid argument
>>>>>
>>>>> The problem is that memory size (8193MiB) is not an integer
>>>>> multiple of underlying pagesize (2MiB) which triggers a check
>>>>> inside of madvise(), since we can't really set a madvise() policy
>>>>> just to a fraction of a page.
>>>>
>>>> I thought we would just always fail to create something that doesn't
>>>> really make any sense.
>>>>
>>>> Why would we want to support that case?
>>>>
>>>> Let me dig, I thought we would have had some check there at some point
>>>> that would make that fail (especially: RAM block not aligned to the
>>>> pagesize).
>>>
>>>
>>> At least memory-backend-memfd properly fails for that case:
>>>
>>> $ ./build/qemu-system-x86_64 -object
>>> memory-backend-memfd,hugetlb=on,size=3m,id=tmp
>>> qemu-system-x86_64: failed to resize memfd to 3145728: Invalid argument
>>>
>>> memory-backend-file ends up creating a new file:
>>>
>>> $ ./build/qemu-system-x86_64 -object
>>> memory-backend-file,share=on,mem-path=/dev/hugepages/tmp,size=3m,id=tmp
>>>
>>> $ stat /dev/hugepages/tmp
>>> File: /dev/hugepages/tmp
>>> Size: 4194304    Blocks: 0    IO Block: 2097152    regular file
>>>
>>> ... and ends up sizing it properly aligned to the huge page size.
>>>
>>>
>>> Seems to be due to:
>>>
>>> if (memory < block->page_size) {
>>>     error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be equal to "
>>>                "or larger than page size 0x%zx",
>>>                memory, block->page_size);
>>>     return NULL;
>>> }
>>>
>>> memory = ROUND_UP(memory, block->page_size);
>>>
>>> /*
>>> * ftruncate is not supported by hugetlbfs in older
>>> * hosts, so don't bother bailing out on errors.
>>> * If anything goes wrong with it under other filesystems,
>>> * mmap will fail.
>>> *
>>> * Do not truncate the non-empty backend file to avoid corrupting
>>> * the existing data in the file. Disabling shrinking is not
>>> * enough. For example, the current vNVDIMM implementation stores
>>> * the guest NVDIMM labels at the end of the backend file. If the
>>> * backend file is later extended, QEMU will not be able to find
>>> * those labels. Therefore, extending the non-empty backend file
>>> * is disabled as well.
>>> */
>>> if (truncate && ftruncate(fd, offset + memory)) {
>>>     perror("ftruncate");
>>> }
>>>
>>> So we create a bigger file and map the bigger file and also have a
>>> RAMBlock that is bigger. So we'll also consume more memory.
>>>
>>> ... but the memory region is smaller and we tell the VM that it has
>>> less memory. Lot of work with no obvious benefit, and only some
>>> memory waste :)
>>>
>>>
>>> We better should have just rejected such memory backends right from
>>> the start. But now it's likely too late.
>>>
>>> I suspect other things like
>>> * qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE);
>>> * qemu_madvise(ptr, sz, QEMU_MADV_DONTDUMP);
>>>
>>> fail, but we don't care for hugetlb at least regarding merging
>>> and don't even log an error.
>>>
>>> But QEMU_MADV_DONTDUMP might also be broken, because that
>>> qemu_madvise() call will just fail.
>>>
>>> Your fix would be correct. But I do wonder if we want to just let that
>>> case fail and warn users that they are doing something that doesn't
>>> make too much sense.
>>>
>>
>> Yeah, what's suspicious is: if the size is smaller than the page size we
>> error out, but if it's larger (yet still not aligned) we accept that.
>> I'm failing to see the reasoning there. Looks like the ROUND_UP() was added
>> in v0.13.0-rc0~1201^2~4 (though it's done with some bit magic) and the
>> check itself was added in v2.8.0-rc0~30^2~23. So it's a bit late, yes.
>
> Yeah.
>
>>
>> OTOH - if users want to waste resources, should we stop them? For
>
> It's all inconsistent, including memfd handling or what you noted above.
>
> For example, having a 1025 MiB guest on gigantic pages, consuming 2 GiB,
> really is just absolutely stupid.
>
> Likely the user wants to know about such mistakes instead of making QEMU
> silence all side effects of that. :)
Agreed. As I said, consistency should win here.
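(To put numbers on it with the reproducer above: 8590983168 bytes is exactly
8193 MiB, so on 2 MiB hugepages ROUND_UP() grows the backing file to 8194 MiB,
i.e. one extra 2 MiB page that the guest can never use.)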
>
>
>> instance, when user requests more vCPUs than physical CPUs a warning is
>> printed:
>>
>> $ ./build/qemu-system-x86_64 -accel kvm -smp cpus=128
>> qemu-system-x86_64: -accel kvm: warning: Number of SMP cpus requested
>> (128) exceeds the recommended cpus supported by KVM (8)
>
> But that case is still reasonable for testing guest behavior with many
> vCPUs, or migrating from a machine with more vCPUs.
>
> Here, the guest will actually see all vCPUs. In comparison, the wasted
> memory will never be consumable by the VM.
Good point.
>
>>
>> but that's about it. So maybe the error can be demoted to just a warning?
>
> The question is what we want to do, for example, with the
> qemu_madvise(QEMU_MADV_DONTDUMP). It will similarly simply fail.
It will. But the retval of qemu_madvise() is not checked here, and in
qemu_ram_setup_dump() it's just a warning.
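FWIW, the rounding approach would cover both qemu_madvise() calls in
host_memory_backend_memory_complete() in one go. Roughly something like the
sketch below; the variable names are assumed rather than copied from the
actual code:

    /* Sketch only -- names like 'backend' are assumed, not the real hunk. */
    void *ptr = memory_region_get_ram_ptr(&backend->mr);
    uint64_t sz = memory_region_size(&backend->mr);

    /* madvise() works on whole pages; round the length up so the trailing,
     * partially used page of the mapping is covered as well. */
    sz = ROUND_UP(sz, qemu_ram_pagesize(backend->mr.ram_block));

    if (backend->merge) {
        qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE);
    }
    if (!backend->dump) {
        qemu_madvise(ptr, sz, QEMU_MADV_DONTDUMP);
    }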
>
> I'm curious, are there real customers running into that?
>
>
No, I haven't seen any bug report from a customer; it was just our QE that
ran into this issue: https://issues.redhat.com/browse/RHEL-1127 (I've asked
for the issue to be made public).
> We could fix all that but always warn when something like that is
> being done.
>
Fair enough. After all of this, I'm inclined to turn this into a proper
error and reject sizes that are not page aligned. There's no real benefit
in allowing them and, furthermore, the original bug report is about a
cryptic error message.
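Something along these lines (again only a sketch; the exact wording and the
place where the check lives are to be decided):

    /* Sketch only -- variable names assumed. Reject unaligned sizes early. */
    uint64_t sz = memory_region_size(&backend->mr);
    size_t pagesize = qemu_ram_pagesize(backend->mr.ram_block);

    if (!QEMU_IS_ALIGNED(sz, pagesize)) {
        error_setg(errp, "memory size 0x%" PRIx64 " has to be a multiple of "
                   "the page size 0x%zx", sz, pagesize);
        return;
    }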
Michal