qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3] virtiofsd: add container-friendly -o sandbox=chroot optio


From: Vivek Goyal
Subject: Re: [PATCH v3] virtiofsd: add container-friendly -o sandbox=chroot option
Date: Thu, 22 Oct 2020 15:24:42 -0400

On Thu, Oct 22, 2020 at 08:19:54PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > virtiofsd cannot run in a container because CAP_SYS_ADMIN is required to
> > create namespaces.
> > 
> > Introduce a weaker sandbox mode that is sufficient in container
> > environments because the container runtime already sets up namespaces.
> > Use chroot to restrict path traversal to the shared directory.
> > 
> > virtiofsd loses the following:
> > 
> > 1. Mount namespace. The process chroots to the shared directory but
> >    leaves the mounts in place. Seccomp rejects mount(2)/umount(2)
> >    syscalls.
> > 
> > 2. Pid namespace. This should be fine because virtiofsd is the only
> >    process running in the container.
> > 
> > 3. Network namespace. This should be fine because seccomp already
> >    rejects the connect(2) syscall, but an additional layer of security
> >    is lost. Container runtime-specific network security policies can be
> >    used drop network traffic (except for the vhost-user UNIX domain
> >    socket).
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> 
> I've just tripped over another case where this probably helps (but not
> yet tested...); pivot_root doesn't work if your current / isn't a
> mountpoint - so you can't currently run the existing virtiofsd inside
> a chroot.

Can we avoid that issue simply by doing a bind mount of directory
before chroot().

Vivek

> 
> (pivot_root is awful for telling you this - it has 6 different manpage
> listed reasons it might return EINVAL and leaves you to figure out how
> you offended it).
> 
> Dave
> 
> > ---
> > v3:
> >  * Rebased onto David Gilbert's latest migration & virtiofsd pull
> >    request
> > 
> >  tools/virtiofsd/helper.c         |  8 +++++
> >  tools/virtiofsd/passthrough_ll.c | 57 ++++++++++++++++++++++++++++++--
> >  docs/tools/virtiofsd.rst         | 32 ++++++++++++++----
> >  3 files changed, 88 insertions(+), 9 deletions(-)
> > 
> > diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> > index 85770d63f1..2e181a49b5 100644
> > --- a/tools/virtiofsd/helper.c
> > +++ b/tools/virtiofsd/helper.c
> > @@ -166,6 +166,14 @@ void fuse_cmdline_help(void)
> >             "                               enable/disable readirplus\n"
> >             "                               default: readdirplus except 
> > with "
> >             "cache=none\n"
> > +           "    -o sandbox=namespace|chroot\n"
> > +           "                               sandboxing mode:\n"
> > +           "                               - namespace: mount, pid, and 
> > net\n"
> > +           "                                 namespaces with 
> > pivot_root(2)\n"
> > +           "                                 into shared directory\n"
> > +           "                               - chroot: chroot(2) into 
> > shared\n"
> > +           "                                 directory (use in 
> > containers)\n"
> > +           "                               default: namespace\n"
> >             "    -o timeout=<number>        I/O timeout (seconds)\n"
> >             "                               default: depends on cache= 
> > option.\n"
> >             "    -o writeback|no_writeback  enable/disable writeback 
> > cache\n"
> > diff --git a/tools/virtiofsd/passthrough_ll.c 
> > b/tools/virtiofsd/passthrough_ll.c
> > index ff53df4451..5b9064278a 100644
> > --- a/tools/virtiofsd/passthrough_ll.c
> > +++ b/tools/virtiofsd/passthrough_ll.c
> > @@ -137,8 +137,14 @@ enum {
> >      CACHE_ALWAYS,
> >  };
> >  
> > +enum {
> > +    SANDBOX_NAMESPACE,
> > +    SANDBOX_CHROOT,
> > +};
> > +
> >  struct lo_data {
> >      pthread_mutex_t mutex;
> > +    int sandbox;
> >      int debug;
> >      int writeback;
> >      int flock;
> > @@ -163,6 +169,12 @@ struct lo_data {
> >  };
> >  
> >  static const struct fuse_opt lo_opts[] = {
> > +    { "sandbox=namespace",
> > +      offsetof(struct lo_data, sandbox),
> > +      SANDBOX_NAMESPACE },
> > +    { "sandbox=chroot",
> > +      offsetof(struct lo_data, sandbox),
> > +      SANDBOX_CHROOT },
> >      { "writeback", offsetof(struct lo_data, writeback), 1 },
> >      { "no_writeback", offsetof(struct lo_data, writeback), 0 },
> >      { "source=%s", offsetof(struct lo_data, source), 0 },
> > @@ -2660,6 +2672,41 @@ static void setup_capabilities(char *modcaps_in)
> >      pthread_mutex_unlock(&cap.mutex);
> >  }
> >  
> > +/*
> > + * Use chroot as a weaker sandbox for environments where the process is
> > + * launched without CAP_SYS_ADMIN.
> > + */
> > +static void setup_chroot(struct lo_data *lo)
> > +{
> > +    lo->proc_self_fd = open("/proc/self/fd", O_PATH);
> > +    if (lo->proc_self_fd == -1) {
> > +        fuse_log(FUSE_LOG_ERR, "open(\"/proc/self/fd\", O_PATH): %m\n");
> > +        exit(1);
> > +    }
> > +
> > +    /*
> > +     * Make the shared directory the file system root so that FUSE_OPEN
> > +     * (lo_open()) cannot escape the shared directory by opening a symlink.
> > +     *
> > +     * The chroot(2) syscall is later disabled by seccomp and the
> > +     * CAP_SYS_CHROOT capability is dropped so that tampering with the 
> > chroot
> > +     * is not possible.
> > +     *
> > +     * However, it's still possible to escape the chroot via 
> > lo->proc_self_fd
> > +     * but that requires first gaining control of the process.
> > +     */
> > +    if (chroot(lo->source) != 0) {
> > +        fuse_log(FUSE_LOG_ERR, "chroot(\"%s\"): %m\n", lo->source);
> > +        exit(1);
> > +    }
> > +
> > +    /* Move into the chroot */
> > +    if (chdir("/") != 0) {
> > +        fuse_log(FUSE_LOG_ERR, "chdir(\"/\"): %m\n");
> > +        exit(1);
> > +    }
> > +}
> > +
> >  /*
> >   * Lock down this process to prevent access to other processes or files 
> > outside
> >   * source directory.  This reduces the impact of arbitrary code execution 
> > bugs.
> > @@ -2667,8 +2714,13 @@ static void setup_capabilities(char *modcaps_in)
> >  static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
> >                            bool enable_syslog)
> >  {
> > -    setup_namespaces(lo, se);
> > -    setup_mounts(lo->source);
> > +    if (lo->sandbox == SANDBOX_NAMESPACE) {
> > +        setup_namespaces(lo, se);
> > +        setup_mounts(lo->source);
> > +    } else {
> > +        setup_chroot(lo);
> > +    }
> > +
> >      setup_seccomp(enable_syslog);
> >      setup_capabilities(g_strdup(lo->modcaps));
> >  }
> > @@ -2815,6 +2867,7 @@ int main(int argc, char *argv[])
> >      struct fuse_session *se;
> >      struct fuse_cmdline_opts opts;
> >      struct lo_data lo = {
> > +        .sandbox = SANDBOX_NAMESPACE,
> >          .debug = 0,
> >          .writeback = 0,
> >          .posix_lock = 0,
> > diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst
> > index 7ecee49834..65f8e76569 100644
> > --- a/docs/tools/virtiofsd.rst
> > +++ b/docs/tools/virtiofsd.rst
> > @@ -17,13 +17,24 @@ This program is designed to work with QEMU's ``--device 
> > vhost-user-fs-pci``
> >  but should work with any virtual machine monitor (VMM) that supports
> >  vhost-user.  See the Examples section below.
> >  
> > -This program must be run as the root user.  Upon startup the program will
> > -switch into a new file system namespace with the shared directory tree as 
> > its
> > -root.  This prevents "file system escapes" due to symlinks and other file
> > -system objects that might lead to files outside the shared directory.  The
> > -program also sandboxes itself using seccomp(2) to prevent ptrace(2) and 
> > other
> > -vectors that could allow an attacker to compromise the system after gaining
> > -control of the virtiofsd process.
> > +This program must be run as the root user.  The program drops privileges 
> > where
> > +possible during startup although it must be able to create and access files
> > +with any uid/gid:
> > +
> > +* The ability to invoke syscalls is limited using seccomp(2).
> > +* Linux capabilities(7) are dropped.
> > +
> > +In "namespace" sandbox mode the program switches into a new file system
> > +namespace and invokes pivot_root(2) to make the shared directory tree its 
> > root.
> > +A new pid and net namespace is also created to isolate the process.
> > +
> > +In "chroot" sandbox mode the program invokes chroot(2) to make the shared
> > +directory tree its root. This mode is intended for container environments 
> > where
> > +the container runtime has already set up the namespaces and the program 
> > does
> > +not have permission to create namespaces itself.
> > +
> > +Both sandbox modes prevent "file system escapes" due to symlinks and other 
> > file
> > +system objects that might lead to files outside the shared directory.
> >  
> >  Options
> >  -------
> > @@ -69,6 +80,13 @@ Options
> >    * readdirplus|no_readdirplus -
> >      Enable/disable readdirplus.  The default is ``readdirplus``.
> >  
> > +  * sandbox=namespace|chroot -
> > +    Sandbox mode:
> > +    - namespace: Create mount, pid, and net namespaces and pivot_root(2) 
> > into
> > +    the shared directory.
> > +    - chroot: chroot(2) into shared directory (use in containers).
> > +    The default is "namespace".
> > +
> >    * source=PATH -
> >      Share host directory tree located at PATH.  This option is required.
> >  
> > -- 
> > 2.26.2
> > 
> -- 
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




reply via email to

[Prev in Thread] Current Thread [Next in Thread]