From: David Hildenbrand
Subject: Re: [PATCH v7 09/15] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux
Date: Tue, 4 May 2021 13:04:17 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1

On 04.05.21 12:32, Daniel P. Berrangé wrote:
On Tue, May 04, 2021 at 12:21:25PM +0200, David Hildenbrand wrote:
On 04.05.21 12:09, Daniel P. Berrangé wrote:
On Wed, Apr 28, 2021 at 03:37:48PM +0200, David Hildenbrand wrote:
Let's support RAM_NORESERVE via MAP_NORESERVE on Linux. The flag has no
effect on most shared mappings - except for hugetlbfs and anonymous memory.

Linux man page:
    "MAP_NORESERVE: Do not reserve swap space for this mapping. When swap
    space is reserved, one has the guarantee that it is possible to modify
    the mapping. When swap space is not reserved one might get SIGSEGV
    upon a write if no physical memory is available. See also the discussion
    of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before
    2.6, this flag had effect only for private writable mappings."

Note that the "guarantee" part is wrong with memory overcommit in Linux.

Also, in Linux hugetlbfs is treated differently - we configure reservation
of huge pages from the pool, not reservation of swap space (huge pages
cannot be swapped).

The rough behavior is [1]:
a) !Hugetlbfs:

    1) Without MAP_NORESERVE *or* with memory overcommit under Linux
       disabled ("/proc/sys/vm/overcommit_memory == 2"), the following
       accounting/reservation happens:
        For a file backed map
         SHARED or READ-only - 0 cost (the file is the map not swap)
         PRIVATE WRITABLE - size of mapping per instance

        For an anonymous or /dev/zero map
         SHARED   - size of mapping
         PRIVATE READ-only - 0 cost (but of little use)
         PRIVATE WRITABLE - size of mapping per instance

    2) With MAP_NORESERVE, no accounting/reservation happens.

b) Hugetlbfs:

    1) Without MAP_NORESERVE, huge pages are reserved.

    2) With MAP_NORESERVE, no huge pages are reserved.

Note: With "/proc/sys/vm/overcommit_memory == 0", we were already able
to configure it for !hugetlbfs globally; this toggle now allows
configuring it in a more fine-grained way, rather than only for the whole system.
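For illustration only -- this is not part of the patch and the size below is
arbitrary -- case a) 2) boils down to something like the following at the
mmap() level for anonymous memory:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        const size_t size = 1ULL * 1024 * 1024 * 1024; /* 1 GiB, arbitrary */

        /*
         * With MAP_NORESERVE, the mapping is not charged against the commit
         * accounting (Committed_AS in /proc/meminfo) as long as memory
         * overcommit is not disabled; a later write may fail as described
         * in the man page if no physical memory is available.
         */
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Pages are populated on demand as they are written. */
        memset(p, 0, 4096);

        munmap(p, size);
        return 0;
    }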

The target use case is virtio-mem, which dynamically exposes memory
inside a large, sparse memory area to the VM.

Can you explain this use case in more real-world terms, as I'm not
understanding what a mgmt app would actually do with this in
practice?

Let's consider huge pages for simplicity. Assume you have 128 free huge
pages in your hypervisor that you want to dynamically assign to VMs.

Further assume you have two VMs running. A workflow could look like

1. Assign all huge pages to VM 0
2. Reassign 64 huge pages to VM 1
3. Reassign another 32 huge pages to VM 1
4. Reassign 16 huge pages to VM 0
5. ...

Basically what we're used to doing with "ordinary" memory.

What does this look like in terms of the memory backend configuration
when you boot VM 0 and VM 1 ?

Are you saying that we boot both VMs with

    -object hostmem-memfd,size=128G,hugetlb=yes,hugetlbsize=1G,reserve=off

and then we have another property set on 'virtio-mem' to tell it
how much/little of that 128 G to actually give to the guest?
How do we change that at runtime?

Roughly, yes. We only special-case memory backends managed by virtio-mem 
devices.

An advanced example for a single VM could look like this:

sudo build/qemu-system-x86_64 \
        ... \
        -m 4G,maxmem=64G \
        -smp sockets=2,cores=2 \
        -object hostmem-memfd,id=bmem0,size=2G,hugetlb=yes,hugetlbsize=2M \
        -numa node,nodeid=0,cpus=0-1,memdev=bmem0 \
        -object hostmem-memfd,id=bmem1,size=2G,hugetlb=yes,hugetlbsize=2M \
        -numa node,nodeid=1,cpus=2-3,memdev=bmem1 \
        ... \
        -object hostmem-memfd,id=mem0,size=30G,hugetlb=yes,hugetlbsize=2M,reserve=off \
        -device virtio-mem-pci,id=vmem0,memdev=mem0,node=0,requested-size=0G \
        -object hostmem-memfd,id=mem1,size=30G,hugetlb=yes,hugetlbsize=2M,reserve=off \
        -device virtio-mem-pci,id=vmem1,memdev=mem1,node=1,requested-size=0G \
        ... \

We can request a size change by adjusting the "requested-size" property (e.g.,
via qom-set) and observe the current size by reading the "size" property (e.g.,
via qom-get). Think of it as an advanced, device-local memory balloon mixed
with the concept of memory hotplug.
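For completeness -- the exact management flow is up to the libvirt series
referenced below -- a runtime resize of the "vmem0" device from the example
above via QMP could look roughly like this (the value is in bytes;
17179869184 is 16 GiB, and "size" eventually converges to the requested size):

    { "execute": "qom-set",
      "arguments": { "path": "/machine/peripheral/vmem0",
                     "property": "requested-size", "value": 17179869184 } }

    { "execute": "qom-get",
      "arguments": { "path": "/machine/peripheral/vmem0", "property": "size" } }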


I suggest taking a look at the libvirt virtio-mem implementation
-- I don't think it's upstream yet:

https://lkml.kernel.org/r/cover.1615982004.git.mprivozn@redhat.com

I'm CCing Michal -- I already gave him a heads-up about which additional
properties we might see for memory backends (e.g., reserve, managed-size)
and virtio-mem devices (e.g., iothread, prealloc, reserve, prot).



For that to work with virtio-mem, you'll have to disable reservation of huge
pages for the virtio-mem managed memory region.

(preallocation of huge pages in virtio-mem to protect from user mistakes is a
separate work item)

reserve=off will be the default for virtio-mem, and actual
reservation/preallocation will be done within virtio-mem. There could be use
for "reserve=off" for virtio-balloon use cases as well, but I'd like to
exclude that from the discussion for now.

The hostmem backend defaults are independent of frontend usage, so when you
say reserve=off is the default for virtio-mem, are you expecting the mgmt
app like libvirt to specify that?

Sorry, yes exactly; only for the memory backend managed by a virtio-mem device.

--
Thanks,

David / dhildenb



