qemu-devel

Re: [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options


From: Josh Kunz
Subject: Re: [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options
Date: Wed, 8 Jul 2020 17:16:17 -0700

Sorry for the late reply, response inline. Also, I noticed that a couple
of mails ago I seem to have dropped the devel list and maintainers.
I've re-added them to the CC line.

On Wed, Jun 24, 2020 at 3:17 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Josh Kunz <jkz@google.com> writes:
>
> > On Tue, Jun 23, 2020, 1:21 AM Alex Bennée <alex.bennee@linaro.org> wrote:
> >
> > (snip)
> >
> >> >> > * Non-standard libc extension to allow creating TLS images independent
> >> >> >   of threads. This would allow us to just `clone` the child directly
> >> >> >   instead of this complicated maneuver. Though we probably would still
> >> >> >   need the cleanup logic. For libcs, TLS image allocation is tightly
> >> >> >   connected to thread stack allocation, which is also arch-specific. I
> >> >> >   do not have enough experience with libc development to know if
> >> >> >   maintainers of any popular libcs would be open to supporting such an
> >> >> >   API. Additionally, since it will probably take years before a libc
> >> >> >   fix would be widely deployed, we need an interim solution anyway.
> >> >>
> >> >> We could consider a custom lib stub that intercepts calls to the guest's
> >> >> original libc and replaces it with a QEMU aware one?
> >> >
> >> > Unfortunately the problem here is host libc, rather than guest libc.
> >> > We need to make TLS variables in QEMU itself work, so intercepting
> >> > guest libc calls won't help much. Or am I misunderstanding the point?
> >>
> >> Hold up - I'm a little confused now. Why does the host TLS affect the
> >> guest TLS? We have complete control over the guest's view of the world so
> >> we should be able to control its TLS storage.
> >
> > Guest TLS is unaffected, just like in the existing case for guest
> > threads. Guest TLS is handled by the guest libc and the CPU emulation.
> > Just to be clear: This series changes nothing about guest TLS.
> >
> > The complexity of this series is to deal with *host* usage of TLS.
> > That is to say: use of thread local variables in QEMU itself. Host TLS
> > is needed to allow the subprocess created with `clone(CLONE_VM, ...)`
> > to run at all. TLS variables are used in QEMU for the RCU
> > implementation, parts of the TCG, and all over the place to access the
> > CPU/TaskState for the running thread. Host TLS is managed by the host
> > libc, and TLS is only set up for host threads created via
> > `pthread_create`. Subprocesses created with `clone(CLONE_VM)` share a
> > virtual memory map *and* TLS data with their parent[1], since libcs
> > provide no special handling of TLS when `clone(CLONE_VM)` is used.
> > Without the workaround used in this patch, both the parent and child
> > process's thread local variables reference the same memory locations.
> > This just doesn't work, since thread local data is assumed to actually
> > be thread local.
> >
> > The "alternative" proposed was to make the host libc support TLS for
> > processes created using clone (there are several ways to go about
> > this, each with different tradeoffs). You mentioned that "We could
> > consider a custom lib stub that intercepts calls to the guest's
> > original libc..." in your comment. Since *guest* libc is not involved
> > here I was a bit confused about how this could help, and wanted to
> > clarify.
> >
> >> >> Have you considered a daemon which could co-ordinate between the
> >> >> multiple processes that are sharing some state?
> >> >
> >> > Not really for the `CLONE_VM` support added in this patch series. I
> >> > have considered trying to pull TCG out of the guest process, but not
> >> > very seriously, since it seems like a pretty heavyweight approach.
> >> > Especially compared to the solution included in this series. Do you
> >> > think there's a simpler approach that involves using a daemon to do
> >> > coordination?
> >>
> >> I'm getting a little lost now. Exactly what state are we trying to share
> >> between two QEMU guests which are now in separate execution contexts?
> >
> > Since this series only deals with `clone(CLONE_VM)` we always want to
> > share guest virtual memory between the execution contexts. There is
> > also some extra state that needs to be shared depending on which flags
> > are provided to `clone()`. E.g., signal handler tables for
> > CLONE_SIGHAND, file descriptor tables for CLONE_FILES, etc.
> >
> > The problem is that since QEMU and the guest live in the same virtual
> > memory map, keeping the mappings the same between the guest parent and
> > guest child means that the mappings also stay the same between the
> > host (QEMU) parent and host child. Two host execution contexts can live
> > in the same virtual memory map, like we do right now with threads, but
> > *only* with valid TLS for each thread/process. That's why we bend over
> > backwards to set up TLS for emulation in the child process.
>
> OK, thanks for that. I'd obviously misunderstood on my first read-through.
> So while hiding the underlying bits of QEMU from the guest is
> relatively easy, it's quite hard to hide QEMU from itself in this
> CLONE_VM case.

Yes exactly.

> The other approach would be to suppress CLONE_VM for the actual process
> (thereby allowing QEMU to safely have a new instance and no clashing
> shared data) but emulate CLONE_VM for the guest itself (making the guest
> portions of memory shared and visible to each other). The trouble then
> would be co-ordination of mapping operations and other things that
> should be visible in a real CLONE_VM setup. This is the sort of
> situation where I envisioned a co-ordination daemon might be useful.

Ah. This is interesting. Effectively the inverse of this patch. I had
not considered this approach. Thinking more about it, a "no shared
memory" approach does seem more straightforward implementation-wise.
Unfortunately I think there would be a few substantial drawbacks:

1. Memory overhead. Every guest thread would need a full copy of QEMU
memory, including the translated guest binary.
2. Performance overhead. To keep virtual memory maps consistent across
tasks, a heavyweight two-phase commit scheme, or similar, would be
needed for every `mmap`. That could have substantial performance
overhead for the guest. This could be a huge problem for processes
that use a large number of threads *and* do a lot of memory mapping or
frequently change page permissions.
3. There would be lots of similarly-fiddly bits that need to be shared
and coordinated in addition to guest memory. At least the signal
handler tables and fd_trans tables, but there are likely others I'm
missing.

The performance drawbacks could be largely mitigated by using the
current thread-only `CLONE_VM` support, but having *any* threads in
the process at all would lead to deadlocks after fork() or similar
non-CLONE_VM clone() calls. This could be worked around with a "stop
the world" button somewhat like `start_exclusive`, but expanded to
include all emulator threads. That will substantially slow down
fork().

Given all this I think the approach used in this series is probably at
least as "good" as a "no shared memory" approach. It has its own
complexities and drawbacks, but doesn't have obvious performance
issues. If you or other maintainers disagree, I'd be happy to write up
an RFC comparing the approaches in more detail (or we can just use
this thread), just let me know. Until then I'll keep pursuing this
patch.

> > [1] At least on x86_64, because TLS references are defined in terms of
> > the %fs segment, which is inherited on Linux. Theoretically it's up to
> > the architecture to specify how TLS is inherited across execution
> > contexts. It's possible that the child actually ends up with no valid
> > TLS rather than using the parent TLS data. But that's not really
> > relevant here. The important thing is that the child ends up with
> > *valid* TLS, not invalid or inherited TLS.
>
>
> --
> Alex Bennée

--
Josh Kunz


