bug-libsigsegv

Re: [bug-libsigsegv] [Qemu-devel] [PATCH 00/21] RFC: userfaultfd v3


From: Eric Blake
Subject: Re: [bug-libsigsegv] [Qemu-devel] [PATCH 00/21] RFC: userfaultfd v3
Date: Fri, 06 Mar 2015 08:29:35 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0

[adding libsigsegv project]

On 03/05/2015 10:17 AM, Andrea Arcangeli wrote:
> Hello everyone,
> 
> This is an RFC for the userfaultfd syscall API v3 that addresses the
> feedback received on the previous v2 submission.
> 
> The main change from v2 is that MADV_USERFAULT/NOUSERFAULT
> disappeared (they're replaced by the UFFDIO_REGISTER/UNREGISTER
> ioctls). In short, userfaults are now only possible through the
> userfaultfd. The remap_anon_pages syscall also disappeared, replaced by
> the UFFDIO_REMAP ioctl, which is in turn mostly obsoleted by the newer
> UFFDIO_COPY and UFFDIO_ZEROPAGE ioctls, which are more efficient
> because they never have to flush the TLB. The suggestion to copy the
> data instead of moving it, in order to resolve the userfault, was
> immediately agreed to.
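
To make the ioctl flow described above concrete, here is a rough sketch in C. It assumes the userfaultfd API as it was later merged in Linux 4.3, which may differ in detail from this RFC v3, and it assumes kernel headers that ship <linux/userfaultfd.h>. The function name uffd_fill_demo, the 0x5a fill byte, and the return-code convention are made up for illustration. To stay single-threaded, the sketch fills the registered pages proactively instead of waiting for a fault; a real handler would typically read fault events from the descriptor in another thread and answer them with the same ioctls.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical demo: returns 0 on success, 1 if userfaultfd is not usable
 * in this environment, -1 if an ioctl failed or the contents are wrong. */
int uffd_fill_demo(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    size_t psz = (size_t)ps;
    struct uffdio_api api = { .api = UFFD_API };
    struct uffdio_register reg;
    struct uffdio_copy copy;
    struct uffdio_zeropage zero;
    char *area = MAP_FAILED, *src = NULL;
    int uffd = -1, rc = 1;

#ifdef UFFD_USER_MODE_ONLY
    /* Linux 5.11+: lets unprivileged callers handle user-mode faults. */
    uffd = (int)syscall(SYS_userfaultfd,
                        O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
#endif
    if (uffd < 0)
        uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0)
        return 1;
    if (ioctl(uffd, UFFDIO_API, &api) != 0)   /* API handshake */
        goto out;

    area = mmap(NULL, 2 * psz, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    src = malloc(psz);
    if (area == MAP_FAILED || src == NULL)
        goto out;
    memset(src, 0x5a, psz);

    /* Ask for "missing page" faults on the two-page range. */
    memset(&reg, 0, sizeof reg);
    reg.range.start = (unsigned long)area;
    reg.range.len = 2 * psz;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) != 0)
        goto out;
    /* reg.ioctls advertises which ioctls work on this range. */
    if (!(reg.ioctls & ((__u64)1 << _UFFDIO_COPY)) ||
        !(reg.ioctls & ((__u64)1 << _UFFDIO_ZEROPAGE)))
        goto out;

    /* Page 0: copy data in. Page 1: map a zero page. */
    memset(&copy, 0, sizeof copy);
    copy.dst = (unsigned long)area;
    copy.src = (unsigned long)src;
    copy.len = psz;
    memset(&zero, 0, sizeof zero);
    zero.range.start = (unsigned long)area + psz;
    zero.range.len = psz;
    rc = -1;
    if (ioctl(uffd, UFFDIO_COPY, &copy) == 0 &&
        ioctl(uffd, UFFDIO_ZEROPAGE, &zero) == 0 &&
        area[0] == 0x5a && area[psz - 1] == 0x5a && area[psz] == 0)
        rc = 0;   /* both pages are populated, so no fault was raised */
out:
    if (area != MAP_FAILED)
        munmap(area, 2 * psz);
    free(src);
    close(uffd);
    return rc;
}
```

Calling uffd_fill_demo() from main is enough to try it. A return value of 1 only means the syscall is blocked or missing on that machine (for example under a restrictive seccomp profile).
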
> 
> The latest code can also be cloned here:
> 
> git clone --reference linux -b userfault git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
> 
> 
> Userfaults make it possible to implement on-demand paging from
> userland and, more generally, they let userland take control of
> various types of page faults more efficiently.
> 
> For example, userfaults allow a proper and more efficient
> implementation of the PROT_NONE+SIGSEGV trick.

Which is what GNU libsigsegv currently uses.  Anyone interested in
adding code to libsigsegv to take advantage of this proposed new kernel
interface?
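
For anyone unfamiliar with the PROT_NONE+SIGSEGV trick, here is a minimal sketch in C, assuming Linux and a handler installed with sigaction. It is not libsigsegv's actual code; the names lazy_region_demo and on_segv and the 0x2a fill byte are made up for illustration. The region starts out inaccessible, the first touch of each page raises SIGSEGV, and the handler makes that page accessible and fills it before the faulting instruction is retried.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *region;                 /* start of the lazily filled region */
static size_t region_len;
static long page_size;
static volatile sig_atomic_t fault_count;

/* SIGSEGV handler: unprotect the faulting page and fill it in. */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t start = (uintptr_t)region;
    if (addr < start || addr >= start + region_len)
        _exit(1);                    /* a genuine crash, not our fault */
    char *page = (char *)(addr & ~((uintptr_t)page_size - 1));
    /* mprotect is not formally async-signal-safe; the trick relies on
     * it working from a handler in practice. */
    if (mprotect(page, (size_t)page_size, PROT_READ | PROT_WRITE) != 0)
        _exit(2);
    memset(page, 0x2a, (size_t)page_size);
    fault_count++;
}

/* Hypothetical demo: touches npages pages twice each and returns the
 * number of SIGSEGV faults taken, or -1 on error. */
int lazy_region_demo(size_t npages)
{
    struct sigaction sa, old;
    int ok = 1;

    page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);
    region_len = npages * (size_t)page_size;
    region = mmap(NULL, region_len, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(region, region_len);
        return -1;
    }

    fault_count = 0;
    for (size_t i = 0; i < npages; i++) {
        volatile char *p = region + i * (size_t)page_size;
        if (p[0] != 0x2a)            /* first touch: faults, then retried */
            ok = 0;
        if (p[1] != 0x2a)            /* second touch: page is present */
            ok = 0;
    }

    sigaction(SIGSEGV, &old, NULL);  /* restore the previous handler */
    munmap(region, region_len);
    return ok ? (int)fault_count : -1;
}
```

With this sketch, lazy_region_demo(4) takes one fault per page. Each of those faults costs a signal delivery plus an mprotect call, which is the overhead a userfaultfd-based implementation would aim to avoid.
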

> 
> There has been interest from multiple users for different use cases:
> 
> 1) KVM postcopy live migration (one form of cloud memory
>    externalization). KVM postcopy live migration is the primary driver
>    of this work:
>    http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/
>    http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html
> 
> 2) KVM postcopy live snapshotting (which allows limiting/throttling
>    the memory usage, unlike fork, and avoids the fork overhead in the
>    first place).
> 
>    The syscall API already anticipates wrprotect fault tracking and
>    it's generic enough to allow its later implementation in a
>    backwards compatible fashion.
> 
> 3) KVM userfaults on shared memory. The UFFDIO_COPY lowlevel method
>    should be extended to work also on tmpfs and then the
>    uffdio_register.ioctls will notify userland that UFFDIO_COPY is
>    available even when the registered virtual memory range is tmpfs
>    backed.
> 
> 4) alternate mechanism to notify web browsers or apps on embedded
>    devices that volatile pages have been reclaimed. This basically
>    avoids the need to run a syscall before the app can touch the
>    virtual regions marked volatile. This also requires point 3)
>    to be fulfilled, as volatile pages happily apply to tmpfs.
> 
> 5) postcopy live migration of binaries inside linux containers.
> 
> Even though no real use case has requested it yet, the new API also
> makes it possible to implement distributed shared memory in a way that
> readonly shared mappings can exist simultaneously on different hosts
> and can become exclusive at the first wrprotect fault.
> 
> The UFFDIO_REMAP method is still present in the patchset but it's
> provided primarily to remove (not add) memory from the userfault
> range. The addition of the UFFDIO_REMAP method is intentionally kept
> at the end of the patchset. The postcopy live migration qemu code will
> only use UFFDIO_COPY and UFFDIO_ZEROPAGE. UFFDIO_REMAP isn't intended
> to be merged upstream in the short term, and it can be dropped later
> if there's an agreement it's a bad idea to keep it around in the
> patchset.
> 
> David ran some KVM postcopy live migration benchmarks on an 8-way CPU
> system and he measured that using UFFDIO_COPY instead of UFFDIO_REMAP
> resulted in roughly a 20% reduction in latency, which is good. The
> standard deviation of the latency measurement decreased
> significantly as well (because the number of CPUs that required IPI
> delivery was variable, while the copy always takes roughly the same
> time). A bigger improvement is expected if measured on a larger host
> with more CPUs.
> 
> All UFFDIO_COPY/ZEROPAGE/REMAP methods already support CRIU postcopy
> live migration and the UFFD can be passed to a manager process through
> unix domain sockets to satisfy point 5).
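
Passing a descriptor to a manager process as described above is normally done with an SCM_RIGHTS control message. The following C sketch shows the mechanism under a few assumptions: the helper names send_fd, recv_fd and fd_passing_roundtrip are made up for illustration, and the round trip passes a pipe end within one process as a stand-in for a real userfaultfd descriptor crossing a process boundary.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send descriptor fd over the unix socket sock; 0 on success. */
int send_fd(int sock, int fd)
{
    char byte = 0;                   /* at least one data byte is sent */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;

    assert(fd >= 0);
    memset(&u, 0, sizeof u);
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock; returns it, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    int fd;

    memset(&u, 0, sizeof u);
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET ||
        c->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;                       /* a new descriptor in the receiver */
}

/* Hypothetical round trip: pass a pipe's write end through a socketpair
 * and check that the received copy reaches the same pipe. */
int fd_passing_roundtrip(void)
{
    int sv[2], p[2], got, rc = -1;
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (pipe(p) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (send_fd(sv[0], p[1]) == 0 && (got = recv_fd(sv[1])) >= 0) {
        if (write(got, "x", 1) == 1 && read(p[0], &c, 1) == 1 && c == 'x')
            rc = 0;
        close(got);
    }
    close(p[0]);
    close(p[1]);
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

In a real setup the sender would call send_fd with the userfaultfd descriptor and the manager would call recv_fd on its end of the socket, then service faults through the received descriptor.
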
> 
> I look forward to discussing this further next week at the LSF/MM
> summit. If you're attending the summit, see you soon!
> 
> Comments welcome, thanks,
> Andrea
> 
> Credits: partially funded by the Orbit EU project.
> 
> PS. There is one TODO detail worth mentioning for completeness that
> affects usage 2) and UFFDIO_REMAP if used to remove memory from the
> userfault range: handle_userfault() is only effective if
> FAULT_FLAG_ALLOW_RETRY is set... but that is only set at the first
> attempted page fault. If by accident some thread was already faulting
> in the range and the first page fault attempt returned VM_FAULT_RETRY
> and UFFDIO_REMAP or UFFDIO_WP jumps in to arm the userfault just
> before the second attempt starts, a SIGBUS would be raised by the page
> fault. Stopping all thread access to the userfault ranges during
> UFFDIO_REMAP/WP, while possible, isn't optimal. Currently (excluding
> real filebacked mappings and handle_userfault() itself which is
> clearly no problem) only tmpfs or a swapin can return
> VM_FAULT_RETRY. To close this SIGBUS window for all usages, the
> simplest solution would be that if FAULT_FLAG_TRIED is set
> VM_FAULT_RETRY can still be returned (but only by handle_userfault
> that has a legitimate reason for insisting a second time in a row with
> VM_FAULT_RETRY). That would require some change to the FAULT_FLAG
> semantics. Again userland could cope with this detail but it'd be
> inefficient to solve it in userland. This would be a fully backwards
> compatible change and it's only strictly required by the wrprotect
> tracking mode, so it's no problem to solve this later. Because of its
> inherent racy nature, nobody could possibly depend on a racy SIGBUS
> being raised now, when it won't be raised anymore later.
> 
> Andrea Arcangeli (21):
>   userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key
>   userfaultfd: linux/Documentation/vm/userfaultfd.txt
>   userfaultfd: uAPI
>   userfaultfd: linux/userfaultfd_k.h
>   userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct
>   userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP
>   userfaultfd: call handle_userfault() for userfaultfd_missing() faults
>   userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx
>   userfaultfd: prevent khugepaged to merge if userfaultfd is armed
>   userfaultfd: add new syscall to provide memory externalization
>   userfaultfd: buildsystem activation
>   userfaultfd: activate syscall
>   userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI
>   userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE
>     preparation
>   userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE
>   userfaultfd: remap_pages: rmap preparation
>   userfaultfd: remap_pages: swp_entry_swapcount() preparation
>   userfaultfd: UFFDIO_REMAP uABI
>   userfaultfd: remap_pages: UFFDIO_REMAP preparation
>   userfaultfd: UFFDIO_REMAP
>   userfaultfd: add userfaultfd_wp mm helpers
> 
>  Documentation/ioctl/ioctl-number.txt   |    1 +
>  Documentation/vm/userfaultfd.txt       |   97 +++
>  arch/powerpc/include/asm/systbl.h      |    1 +
>  arch/powerpc/include/asm/unistd.h      |    2 +-
>  arch/powerpc/include/uapi/asm/unistd.h |    1 +
>  arch/x86/syscalls/syscall_32.tbl       |    1 +
>  arch/x86/syscalls/syscall_64.tbl       |    1 +
>  fs/Makefile                            |    1 +
>  fs/userfaultfd.c                       | 1128 ++++++++++++++++++++++++++++++++
>  include/linux/mm.h                     |    4 +-
>  include/linux/mm_types.h               |   11 +
>  include/linux/swap.h                   |    6 +
>  include/linux/syscalls.h               |    1 +
>  include/linux/userfaultfd_k.h          |  112 ++++
>  include/linux/wait.h                   |    5 +-
>  include/uapi/linux/userfaultfd.h       |  150 +++++
>  init/Kconfig                           |   11 +
>  kernel/fork.c                          |    3 +-
>  kernel/sched/wait.c                    |    7 +-
>  kernel/sys_ni.c                        |    1 +
>  mm/Makefile                            |    1 +
>  mm/huge_memory.c                       |  217 +++++-
>  mm/madvise.c                           |    3 +-
>  mm/memory.c                            |   16 +
>  mm/mempolicy.c                         |    4 +-
>  mm/mlock.c                             |    3 +-
>  mm/mmap.c                              |   39 +-
>  mm/mprotect.c                          |    3 +-
>  mm/rmap.c                              |    9 +
>  mm/swapfile.c                          |   13 +
>  mm/userfaultfd.c                       |  793 ++++++++++++++++++++++
>  net/sunrpc/sched.c                     |    2 +-
>  32 files changed, 2593 insertions(+), 54 deletions(-)
>  create mode 100644 Documentation/vm/userfaultfd.txt
>  create mode 100644 fs/userfaultfd.c
>  create mode 100644 include/linux/userfaultfd_k.h
>  create mode 100644 include/uapi/linux/userfaultfd.h
>  create mode 100644 mm/userfaultfd.c
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
