Re: [PATCH] coroutine: cap per-thread local pool size


From: Kevin Wolf
Subject: Re: [PATCH] coroutine: cap per-thread local pool size
Date: Tue, 19 Mar 2024 17:54:38 +0100

Am 19.03.2024 um 14:43 hat Daniel P. Berrangé geschrieben:
> On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > The coroutine pool implementation can hit the Linux vm.max_map_count
> > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > or "failed to set up stack guard page" during coroutine creation.
> > 
> > This happens because per-thread pools can grow to tens of thousands of
> > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> 
> This sounds quite alarming. What usage scenario justifies creating so
> many coroutines?

Basically we try to allow pooling coroutines for as many requests as
there can be in flight at the same time. That is, adding a virtio-blk
device increases the maximum pool size by num_queues * queue_size. If
you have a guest with many CPUs, the default num_queues is relatively
large (the bug referenced by Stefan had 64), and queue_size is 256 by
default. That's 16k potential requests in flight per disk.
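
To put rough numbers on that, here is some purely illustrative arithmetic
(the values are the defaults mentioned above, not constants taken from the
QEMU source):

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative only: defaults from the bug report discussed above. */
        unsigned int num_queues = 64;    /* default num_queues on a many-vCPU guest */
        unsigned int queue_size = 256;   /* virtio-blk default queue size */
        unsigned int max_requests = num_queues * queue_size;

        printf("potential requests in flight per disk: %u\n", max_requests);
        /* Each pooled coroutine stack contributes 2 VMAs (mapping plus guard page). */
        printf("worst-case VMAs for their stacks:      %u\n", max_requests * 2);
        return 0;
    }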

Another part of it is just that our calculation didn't make a lot of
sense. Instead of applying this number to the pool size of the iothread
that would actually get the requests, we applied it to _every_ iothread.
This is fixed with this patch: it's a global number applied to a global
pool now.
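
Conceptually it's something like this minimal sketch, assuming a single
shared counter; this only illustrates the "one global number" idea and is
not the code from the patch (the real implementation is more involved and
needs proper locking or atomics):

    #include <stdbool.h>

    /* Sketch: one global budget shared by all threads instead of a
     * per-thread multiplier. */
    static unsigned int global_pool_max;   /* e.g. derived from vm.max_map_count */
    static unsigned int global_pool_size;  /* coroutines currently in the pool */

    static bool coroutine_pool_try_put(void)
    {
        if (global_pool_size >= global_pool_max) {
            return false;   /* pool is full: caller frees the coroutine instead */
        }
        global_pool_size++;
        return true;
    }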

> IIUC, coroutine stack size is 1 MB, and so tens of thousands of
> coroutines implies 10's of GB of memory just on stacks alone.

That's only virtual memory, though. Not sure how much of it is actually
used in practice.
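
For illustration (plain mmap, nothing QEMU-specific): an anonymous mapping
reserves address space up front, but physical memory is only consumed for
the pages that are actually written, so a mostly idle 1 MB stack costs far
less RAM than its virtual size suggests.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t stack_size = 1024 * 1024;   /* 1 MB, like a coroutine stack */
        char *stack = mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (stack == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        memset(stack + stack_size - 4096, 0, 4096);   /* touch one 4 KiB page */
        printf("reserved %zu bytes of address space, touched 4096 of them\n",
               stack_size);

        munmap(stack, stack_size);
        return 0;
    }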

> > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> 
> On my system max_map_count is 1048576, quite a lot higher than
> tens of thousands. Hitting that would imply ~500,000 coroutines and
> ~500 GB of stacks!

Did you change the configuration some time in the past, or is this just
a newer default? I get 65530, and that's the same default number I've
seen in the bug reports.

> > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > index 5fd2dbaf8b..2790959eaf 100644
> > --- a/util/qemu-coroutine.c
> > +++ b/util/qemu-coroutine.c
> 
> > +static unsigned int get_global_pool_hard_max_size(void)
> > +{
> > +#ifdef __linux__
> > +    g_autofree char *contents = NULL;
> > +    int max_map_count;
> > +
> > +    /*
> > +     * Linux processes can have up to max_map_count virtual memory areas
> > +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
> > +     * must limit the coroutine pool to a safe size to avoid running out of
> > +     * VMAs.
> > +     */
> > +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > +                            NULL) &&
> > +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > +        /*
> > +         * This is a conservative upper bound that avoids exceeding
> > +         * max_map_count. Leave half for non-coroutine users like library
> > +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > +         * halve the amount again.
> > +         */
> > +        return max_map_count / 4;
> 
> That's 256,000 coroutines, which still sounds incredibly large
> to me.

The whole purpose of the limit is to make sure that you never get -ENOMEM
back, which would likely crash your VM. Even if this hard limit is high,
that doesn't mean that it's fully used. Your setting of 1048576 probably
means that you would never have hit the crash anyway.

Even the benchmarks that used to hit the problem don't get anywhere close
to this hard limit any more, because the actual number of coroutines stays
much smaller after applying this patch.
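
For comparison, this is what the max_map_count / 4 bound from the hunk
above works out to for the two values mentioned in this thread
(illustrative arithmetic only):

    #include <stdio.h>

    int main(void)
    {
        /* Hard cap = max_map_count / 4: half is left for non-coroutine users,
         * and the rest is halved again because each coroutine needs 2 VMAs. */
        int max_map_counts[] = { 65530, 1048576 };

        for (int i = 0; i < 2; i++) {
            printf("max_map_count %7d -> hard pool limit %6d coroutines\n",
                   max_map_counts[i], max_map_counts[i] / 4);
        }
        return 0;
    }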

> > +    }
> > +#endif
> > +
> > +    return UINT_MAX;
> 
> Why UINT_MAX as a default?  If we can't read procfs, we should
> assume some much smaller sane default IMHO, one that corresponds to
> what the current Linux default max_map_count would be.

I don't think we should artificially limit the pool size, and with it
potentially the performance, when the host could do more if we only
allowed it to. If we can't read the value from procfs, then it's your
responsibility as a user to make sure that it's large enough for your VM
configuration.

Kevin



