Re: [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESE

From:	David Hildenbrand
Subject:	Re: [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE
Date:	Wed, 3 Mar 2021 11:14:10 +0100
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.0

On 02.03.21 22:44, Peter Xu wrote:

On Tue, Mar 02, 2021 at 08:01:11PM +0100, David Hildenbrand wrote:

On 02.03.21 18:51, Peter Xu wrote:

On Tue, Feb 09, 2021 at 02:49:38PM +0100, David Hildenbrand wrote:

+#define OVERCOMMIT_MEMORY_PATH "/proc/sys/vm/overcommit_memory"
+static bool map_noreserve_effective(int fd, bool shared)
+{
+#if defined(__linux__)
+    gchar *content = NULL;
+    const char *endptr;
+    unsigned int tmp;
+
+    /* hugetlbfs behaves differently */
+    if (qemu_fd_getpagesize(fd) != qemu_real_host_page_size) {
+        return true;
+    }
+
+    /* only private shared mappings are accounted (ignoring /dev/zero) */
+    if (fd != -1 && shared) {
+        return true;
+    }

[1]

+
+    if (g_file_get_contents(OVERCOMMIT_MEMORY_PATH, &content, NULL, NULL) &&
+        !qemu_strtoui(content, &endptr, 0, &tmp) &&
+        (!endptr || *endptr == '\n')) {
+        if (tmp == 2) {
+            error_report("Skipping reservation of swap space is not supported: 
"
+                         " \"" OVERCOMMIT_MEMORY_PATH "\" is \"2\"");
+            return false;
+        }
+        return true;
+    }
+    /* this interface has been around since Linux 2.6 */
+    error_report("Skipping reservation of swap space is not supported: "
+                 " Could not read: \"" OVERCOMMIT_MEMORY_PATH "\"");
+    return false;
+#else
+    return true;
+#endif
+}


I feel like this helper wants to fail gracefully for some conditions.  Could
you elaborate one example and attach to the commit log?


Sure. The case is "/proc/sys/vm/overcommit_memory == 2" (never overcommit)

MAP_NORESERVE is without effect and sparse memory regions are somewhat
impossible.


I'm also wondering whether it would worth to check the global value.  Even if
overcommit is globally disabled, do we (as an application process) need to care
about it?  I think the MAP_NORESERVE would simply be silently ignored by the
kernel and that seems to be design of it, otherwise would all apps who uses > 
MAP_NORESERVE would need to do similar things too?


Right, I want to catch the "gets silently ignored" part, because someone
requested "reserved=off" (!default) but does not actually get what he asked
for.

As one example, glibc manages heaps via:

a) Creating a new heap: mmap(PROT_NONE, MAP_NORESERVE) the maximum size,
then mprotect(PROT_READ|PROT_WRITE) the initial heap size. Even if
MAP_NORESERVE is ignored, only !PROT_NONE memory ever gets committed
("reserve swap space") in Linux.

b) Growing the heap via mprotect(PROT_READ|PROT_WRITE) within the existing
mmap. This will commit memory in case MAP_NORESERVE got ignored.

c) Shrinking the heap ("discard memory") via MADV_DONTNEED *unless*
"/proc/sys/vm/overcommit_memory == 2" - the only way to undo
mprotect(PROT_READ|PROT_WRITE) and to un-commit memory is by doing a
mmap(PROT_NONE, MAP_FIXED) over the problematic region.

If you're interested, you can take a look at:

malloc/arena.c
sysdeps/unix/sysv/linux/malloc-sysdep.h:check_may_shrink_heap()


Thanks for the context.  It's interesting to know libc has such special heap
operations.

Glibc shrinks heap to save memory for the no-over-commit case, however in our
case currently we'd like to fail some users using global_overcommit=2 but
reserve=off - it means even if we don't fail the user, mmap() could also fail
if it's overcommitted. Even if this mmap() didn't fail, it'll fail very easily
later on iiuc, right?


Here is the issue I want to catch.

Assume you create a VM with a virtio-mem device that can grow big in thefuture. You specify reserved=off. mmap() succeeds and you assume thevery sparse memory region does not actually commit memory. But itcommits all memory in the sparse mmap.

Assume you want to create another VM. It just fails although you stillhave plenty of free memory - because the previous MAP_NORESERVE gotignored.

I want to warn the user right away that the configuration is messed upand that "reserved=off" is not effective.

For anonymous memory, "reserved=off" will start really being useful whenhaving a way to dynamically reserve swap space.


I think it's fine to have that early failure, it just seems less helpful than
what glibc was doing which shrinks active memory for real, meanwhile there
seems to encode some very detailed OS information into this helper, so just
less charming.

It's not nice, but the messed-up Linux implementation is to blame. Justread the Linux man page of the mmap where there is an explicit link to"/proc/sys/vm/overcommit_memory" for this very reason.


Btw above [1] "fd != -1 && shared" looks weird to me.

As we never have shared anonymous memory (and that's good, because it issuper weird) in QEMU, we can simplify to


"if (shared) { return true; }"


Firstly it'll bypass overcommit_memory==2 check and return true directly, is
that right?  I thought the global will be meaningful for all memories except
hugetlbfs (in do_mmap() of Linux).

See the description in the patch. On basically all (except anonymousshared) shared memory we don't ever reserve swap space.


See mmap_region():

if (accountable_mapping(file, vm_flags)) {
        charged = len >> PAGE_SHIFT;
        if (security_vm_enough_memory_mm(mm, charged))
                return -ENOMEM;
        vm_flags |= VM_ACCOUNT;
}

whereby

accountable_mapping():

if (file && is_file_hugepages(file))
        return 0;
return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;


in do_mmap(), we only affect if VM_NORESERVE is set or not.

And this makes perfect sense: Most* shared mappings don't need swapspace because they can just be written back to the original backingstorage (file). This is different to private mappings, which cannot bewritten back to the file - so we need swap space.


Meanwhile, I don't see why file-backed share memories is so special too..  From

Just think about 100000 processes mapping the same file shared. Wouldyou want to reserve 100000 the same swap space for the same file?

Although you never* ever really need swap space because you can justwriteback to the file instead of swapping.

your commit message, I'm not sure whether you wanted to return false instead,
however that's still not the case IIUC, since e.g. /dev/shmem still does
accounting iiuc, while MAP_NORESERVE will skip it.

*except /dev/shmem. It actually needs swap space because the actual fileresides in tmpfs->RAM but Linux will never ever commit that memory.


Main reason is: because you don't know when/how much to commit.

a) When creating/truncating the file we don't know if it's going to be
   sparse. So how much should we commit?
b) When mapping the file, the same comment as above applies: you don't
   want to reserve per-mapper but instead per-file.

You can create and map gigantic memfd/shmem files even with"/proc/sys/vm/overcommit_memory == 2", because we will never ever commitany of that memory. Yes, shmem is dangerous.

Shared mappings behave like always having MAP_NORESERVE, thus the"return true;"

It's interesting that you also raise this point: I also want to proposedynamic reservation of swap space for shmem in the future. Instead ofbeing per process, this would have to be per file and similar allow tocoordinate with the kernel how much memory we are actually intending touse (->commit) such that the kernel can properly account and reject ifit wouldn't be possible. If only a single day would have more than 24hours :)


--
Thanks,

David / dhildenb

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE, Peter Xu, 2021/03/02
- Re: [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE, David Hildenbrand, 2021/03/02
  - Re: [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE, Peter Xu, 2021/03/02
    - Re: [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE, David Hildenbrand <=
    - Re: [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE, Peter Xu, 2021/03/03
    - Re: [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE, David Hildenbrand, 2021/03/04

Prev by Date: Re: [PATCH 2/3] qapi, audio: respect build time conditions in audio schema
Next by Date: Re: [RFC PATCH v2 1/9] net: Pad short frames to minimum size before send from SLiRP/TAP
Previous by thread: Re: [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE
Next by thread: Re: [PATCH v1 8/9] util/mmap-alloc: support RAM_NORESERVE via MAP_NORESERVE
Index(es):
- Date
- Thread