
executes `unshare` and `newuidmap/newgidmap` in a single command, plus slirp


AkihiroSuda/rootlesskit

RootlessKit: the gate to the rootless world

RootlessKit is a kind of Linux-native "fake root" utility, made mainly for running Docker and Kubernetes as an unprivileged user, so as to protect the real root on the host from potential container-breakout attacks.

What it actually does

RootlessKit creates user_namespaces(7) and mount_namespaces(7), and executes newuidmap(1)/newgidmap(1) along with subuid(5) and subgid(5).

RootlessKit also supports isolating network_namespaces(7) with userspace NAT using "slirp". Kernel NAT using SUID-enabled lxc-user-nic(1) is also experimentally supported.

Projects using RootlessKit

  • Docker/Moby
  • Usernetes: Docker & Kubernetes, installable under a non-root user's $HOME.
  • k3s: Lightweight Kubernetes
  • BuildKit: Next-generation docker build backend

Setup

$ go get github.com/rootless-containers/rootlesskit/cmd/rootlesskit
$ go get github.com/rootless-containers/rootlesskit/cmd/rootlessctl

Alternatively, run make to build the binaries under the ./bin directory.

Requirements

  • newuidmap and newgidmap need to be installed on the host. These commands are provided by the uidmap package on most distributions.

  • /etc/subuid and /etc/subgid should contain at least 65536 sub-IDs, e.g. penguin:231072:65536. These files are automatically configured on most distributions.

$ id -u
1001
$ whoami
penguin
$ grep "^$(whoami):" /etc/subuid
penguin:231072:65536
$ grep "^$(whoami):" /etc/subgid
penguin:231072:65536
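
The check above can be scripted by summing the third field of every entry that belongs to a user. The following is only a sketch: it assumes the name:start:count format shown above and entries keyed by user name, and the function name subid_count is a hypothetical helper, not part of RootlessKit.

```shell
# subid_count USER FILE
# Sum the sub-ID counts (third colon-separated field) of every entry for USER.
# Hypothetical helper for illustration; not part of RootlessKit.
subid_count() {
    awk -F: -v user="$1" '$1 == user { total += $3 } END { print total + 0 }' "$2"
}

# Example usage:
#   subid_count "$(whoami)" /etc/subuid
#   subid_count "$(whoami)" /etc/subgid
```

A result below 65536 for either file suggests that the sub-ID ranges need to be extended before running RootlessKit.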

Distribution-specific hints

Debian (excluding Ubuntu) and Arch Linux:

  • sudo sh -c "echo 1 > /proc/sys/kernel/unprivileged_userns_clone" is required

RHEL/CentOS 7 (excluding RHEL/CentOS 8):

  • sudo sh -c "echo 28633 > /proc/sys/user/max_user_namespaces" is required

To persist sysctl configurations, edit /etc/sysctl.conf or add a file under /etc/sysctl.d.
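
As a rough illustration of persisting these settings, a /proc/sys path can be turned into the key = value form that sysctl.conf(5) expects. The helper name sysctl_line and the file name 99-rootless.conf are arbitrary choices made for this sketch, and it assumes that no path component itself contains a dot.

```shell
# sysctl_line PROC_PATH VALUE
# Convert a /proc/sys path into the "key = value" form used by sysctl.conf(5).
# Hypothetical helper; assumes no path component itself contains a dot.
sysctl_line() {
    key=$(printf '%s' "${1#/proc/sys/}" | tr '/' '.')
    printf '%s = %s\n' "$key" "$2"
}

# Example usage (the file name 99-rootless.conf is an arbitrary choice):
#   sysctl_line /proc/sys/user/max_user_namespaces 28633 \
#       | sudo tee /etc/sysctl.d/99-rootless.conf
#   sudo sysctl --system   # reload all sysctl configuration files
```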

Usage

Inside rootlesskit, your UID is mapped to 0 but it is not the real root:

$ rootlesskit bash
rootlesskit$ id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
rootlesskit$ ls -l /etc/shadow
-rw-r----- 1 nobody nogroup 1050 Aug 21 19:02 /etc/shadow
rootlesskit$ cat /etc/shadow
cat: /etc/shadow: Permission denied

Environment variables are kept untouched:

$ rootlesskit bash
rootlesskit$ echo $USER
penguin
rootlesskit$ echo $HOME
/home/penguin
rootlesskit$ echo $XDG_RUNTIME_DIR
/run/user/1001

Filesystems can be isolated from the host with --copy-up:

$ rootlesskit --copy-up=/etc bash
rootlesskit$ rm /etc/resolv.conf
rootlesskit$ vi /etc/resolv.conf

You can even create network namespaces with Slirp:

$ rootlesskit --copy-up=/etc --copy-up=/run --net=slirp4netns --disable-host-loopback bash
rootlesskit$ ip netns add foo
...

Proc filesystem view:

$ rootlesskit bash
rootlesskit$ cat /proc/self/uid_map
         0       1001          1
         1     231072      65536
rootlesskit$ cat /proc/self/gid_map
         0       1001          1
         1     231072      65536
rootlesskit$ cat /proc/self/setgroups
allow
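
Each line of uid_map and gid_map reads as "first ID inside the namespace, first ID on the host, length of the range". To make the mapping concrete, the sketch below translates an ID inside the namespace into the corresponding host ID. The function name map_to_host is hypothetical and only meant for illustration.

```shell
# map_to_host ID MAP_FILE
# Translate an ID inside the namespace to the host ID, using uid_map/gid_map
# lines of the form "<inside-start> <outside-start> <length>".
# Prints nothing and returns 1 if the ID is not mapped.
map_to_host() {
    awk -v id="$1" '
        id >= $1 && id < $1 + $3 { print $2 + (id - $1); found = 1; exit }
        END { exit !found }
    ' "$2"
}

# Example usage inside RootlessKit:
#   map_to_host 0 /proc/self/uid_map
#   map_to_host 1000 /proc/self/uid_map
```

With the map shown above, ID 0 resolves to the invoking user (1001), and IDs 1 through 65536 resolve into the subordinate range starting at 231072.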

Full CLI options:

NAME:
   rootlesskit - the gate to the rootless world

USAGE:
   rootlesskit [global options] command [command options] [arguments...]

VERSION:
   0.7.0+dev

COMMANDS:
     help, h  Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --debug                      debug mode
   --state-dir value            state directory
   --net value                  network driver [host, slirp4netns, vpnkit, lxc-user-nic(experimental), vdeplug_slirp(deprecated)] (default: "host")
   --slirp4netns-binary value   path of slirp4netns binary for --net=slirp4netns (default: "slirp4netns")
   --slirp4netns-sandbox value  enable slirp4netns sandbox (experimental) [auto, true, false] (the default is planned to be "auto" in future) (default: "false")
   --slirp4netns-seccomp value  enable slirp4netns seccomp (experimental) [auto, true, false] (the default is planned to be "auto" in future) (default: "false")
   --vpnkit-binary value        path of VPNKit binary for --net=vpnkit (default: "vpnkit")
   --lxc-user-nic-binary value  path of lxc-user-nic binary for --net=lxc-user-nic (default: "/usr/lib/x86_64-linux-gnu/lxc/lxc-user-nic")
   --lxc-user-nic-bridge value  lxc-user-nic bridge name (default: "lxcbr0")
   --mtu value                  MTU for non-host network (default: 65520 for slirp4netns, 1500 for others) (default: 0)
   --cidr value                 CIDR for slirp4netns network (default: 10.0.2.0/24, requires slirp4netns v0.3.0+ for custom CIDR)
   --disable-host-loopback      prohibit connecting to 127.0.0.1:* on the host namespace
   --copy-up value              mount a filesystem and copy-up the contents. e.g. "--copy-up=/etc" (typically required for non-host network)
   --copy-up-mode value         copy-up mode [tmpfs+symlink] (default: "tmpfs+symlink")
   --port-driver value          port driver for non-host network. [none, builtin, socat(deprecated), slirp4netns(deprecated)] (default: "none")
   --publish value, -p value    publish ports. e.g. "127.0.0.1:8080:80/tcp"
   --pidns                      create a PID namespace
   --help, -h                   show help
   --version, -v                print the version

State directory

The following files will be created in the state directory, which can be specified with --state-dir:

  • lock: lock file
  • child_pid: decimal PID text that can be used for nsenter(1).
  • api.sock: REST API socket for rootlessctl. See Port Drivers section.

If --state-dir is not specified, RootlessKit creates a temporary state directory on /tmp and removes it on exit.

Undocumented files are subject to change.
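
A script that wants to join the child's namespaces could read child_pid along these lines. This is a minimal sketch, assuming the file holds a single decimal PID as described above; the function name child_pid is a hypothetical helper.

```shell
# child_pid STATE_DIR
# Print the PID stored in STATE_DIR/child_pid after checking that it is a
# decimal number. Hypothetical helper for illustration.
child_pid() {
    pid=$(cat "$1/child_pid") || return 1
    case "$pid" in
        ''|*[!0-9]*) echo "unexpected child_pid content: $pid" >&2; return 1 ;;
    esac
    printf '%s\n' "$pid"
}

# Example usage: enter the user and network namespaces of the child with the
# same nsenter flags that appear in the Port Drivers section:
#   nsenter -U -n -t "$(child_pid /run/user/1001/rootlesskit/foo)" ip a
```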

Environment variables

The following environment variables will be set for the child process:

  • ROOTLESSKIT_STATE_DIR (since v0.3.0): absolute path to the state dir

Undocumented environment variables are subject to change.

PID Namespace

When --pidns (since v0.5.0) is specified, RootlessKit executes the child process in a new PID namespace. The RootlessKit child process becomes the init (PID=1). When RootlessKit terminates, all the processes in the namespace are killed with SIGKILL.

See also pid_namespaces(7).

Network Drivers

RootlessKit provides several drivers for providing network connectivity:

  • --net=host: use host network namespace (default)
  • --net=slirp4netns: use slirp4netns (recommended)
  • --net=vpnkit: use VPNKit
  • --net=lxc-user-nic: use lxc-user-nic (experimental)
  • --net=vdeplug_slirp: use vdeplug_slirp (deprecated)

Benchmark (Aug 28, 2018):

Implementation                    MTU=1500       MTU=4000       MTU=16384      MTU=65520
(rootful veth)                    (52.1 Gbps)    (45.4 Gbps)    (43.6 Gbps)    (51.5 Gbps)
rootlesskit --net=slirp4netns     1.07 Gbps      2.78 Gbps      4.55 Gbps      9.21 Gbps
rootlesskit --net=vpnkit          514 Mbps       526 Mbps       540 Mbps       (Unsupported)
rootlesskit --net=vdeplug_slirp   763 Mbps       (Unsupported)  (Unsupported)  (Unsupported)

--net=lxc-user-nic is as fast as rootful veth.

--net=host (default)

--net=host does not isolate the network namespace from the host.

Pros:

  • No performance overhead
  • Supports ICMP Echo (ping) when /proc/sys/net/ipv4/ping_group_range is configured

Cons:

  • No permission for network-namespaced operations, e.g. creating iptables rules, running tcpdump

To route ICMP Echo packets (ping), you need to write the range of GIDs to net.ipv4.ping_group_range.

$ sudo sh -c "echo 0   2147483647  > /proc/sys/net/ipv4/ping_group_range"
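
To see whether the current setting already covers a given GID, the two numbers in ping_group_range can be compared directly. The sketch below treats the range as inclusive and uses a hypothetical helper name, gid_in_ping_range.

```shell
# gid_in_ping_range GID "LOW HIGH"
# Succeed if GID falls inside the inclusive range, given in the format read
# from /proc/sys/net/ipv4/ping_group_range. Hypothetical helper.
gid_in_ping_range() {
    # Intentionally unquoted $2: split "LOW HIGH" into $2 and $3.
    set -- "$1" $2
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Example usage:
#   gid_in_ping_range "$(id -g)" "$(cat /proc/sys/net/ipv4/ping_group_range)" \
#       && echo "ping is allowed for this group"
```

A range whose lower bound exceeds its upper bound, such as "1 0", matches no group at all.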

--net=slirp4netns (recommended)

--net=slirp4netns isolates the network namespace from the host and launches slirp4netns to provide usermode networking.

Pros:

  • Possible to perform network-namespaced operations, e.g. creating iptables rules, running tcpdump
  • Supports ICMP Echo (ping) when /proc/sys/net/ipv4/ping_group_range is configured
  • Supports hardening using mount namespace and seccomp (--slirp4netns-sandbox=auto, --slirp4netns-seccomp=auto, since RootlessKit v0.7.0, slirp4netns v0.4.0)

Cons:

  • Extra performance overhead (but still faster than --net=vpnkit)
  • Supports only TCP, UDP, and ICMP Echo packets

To use --net=slirp4netns, you need to install slirp4netns. v0.3.0 or later is recommended.

$ sudo dnf install slirp4netns

or

$ sudo apt-get install slirp4netns

If a binary package is not available for your distribution, install from source:

$ git clone https://github.com/rootless-containers/slirp4netns
$ cd slirp4netns
$ ./autogen.sh && ./configure && make
$ cp slirp4netns ~/bin

The network is configured as follows by default:

  • IP: 10.0.2.100/24
  • Gateway: 10.0.2.2
  • DNS: 10.0.2.3

The network configuration can be changed by specifying custom CIDR, e.g. --cidr=10.0.3.0/24 (requires slirp4netns v0.3.0+).

Specifying --copy-up=/etc is highly recommended unless /etc/resolv.conf on the host is statically configured. Otherwise /etc/resolv.conf in RootlessKit's mount namespace will be unmounted when /etc/resolv.conf on the host is recreated, typically by NetworkManager or systemd-resolved.

It is also highly recommended to specify --disable-host-loopback. Otherwise ports listening on 127.0.0.1 on the host are accessible as 10.0.2.2 in RootlessKit's network namespace.

Example session:

$ rootlesskit --net=slirp4netns --copy-up=/etc --disable-host-loopback bash
rootlesskit$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 1000
    link/ether 46:dc:8d:09:fd:f2 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.100/24 scope global tap0
       valid_lft forever preferred_lft forever
    inet6 fe80::44dc:8dff:fe09:fdf2/64 scope link
       valid_lft forever preferred_lft forever
rootlesskit$ ip r
default via 10.0.2.2 dev tap0
10.0.2.0/24 dev tap0 proto kernel scope link src 10.0.2.100
rootlesskit$ cat /etc/resolv.conf 
nameserver 10.0.2.3
rootlesskit$ curl https://www.google.com
<!doctype html><html ...>...</html>

Starting with RootlessKit v0.7.0 + slirp4netns v0.4.0, --slirp4netns-sandbox=auto/true/false (enables mount namespace) and --slirp4netns-seccomp=auto/true/false (enables seccomp rules) can be used to harden the slirp4netns process.

--net=vpnkit

--net=vpnkit isolates the network namespace from the host and launches VPNKit to provide usermode networking.

Pros:

  • Possible to perform network-namespaced operations, e.g. creating iptables rules, running tcpdump

Cons:

  • Extra performance overhead
  • Supports only TCP and UDP packets. No support for ICMP Echo (ping) unlike --net=slirp4netns, even if /proc/sys/net/ipv4/ping_group_range is configured.

To use --net=vpnkit, you need to install VPNKit.

$ git clone https://github.com/moby/vpnkit.git
$ cd vpnkit
$ make
$ cp vpnkit.exe ~/bin/vpnkit

The network is configured as follows by default:

  • IP: 192.168.65.3/24
  • Gateway: 192.168.65.1
  • DNS: 192.168.65.1

As with --net=slirp4netns, specifying --copy-up=/etc and --disable-host-loopback is highly recommended. If --disable-host-loopback is not specified, ports listening on 127.0.0.1 on the host are accessible as 192.168.65.2 in RootlessKit's network namespace.

--net=lxc-user-nic (experimental)

--net=lxc-user-nic isolates the network namespace from the host and launches the lxc-user-nic(1) SUID binary to provide kernel-mode NAT.

Pros:

  • No performance overhead
  • Possible to perform network-namespaced operations, e.g. creating iptables rules, running tcpdump
  • Supports ICMP Echo (ping) without /proc/sys/net/ipv4/ping_group_range configuration

Cons:

  • Less secure
  • Needs /etc/lxc/lxc-usernet configuration

To use lxc-user-nic, you need to install the liblxc-common package:

$ sudo apt-get install liblxc-common

You also need to set up /etc/lxc/lxc-usernet:

# USERNAME TYPE BRIDGE COUNT
penguin    veth lxcbr0 1

The COUNT value needs to be increased to run multiple RootlessKit instances with --net=lxc-user-nic simultaneously.

It may take a few seconds to configure the interface using DHCP.

If you start and stop RootlessKit too frequently, you might use up all available DHCP addresses. You might need to reset /var/lib/misc/dnsmasq.lxcbr0.leases and restart the lxc-net service.

Currently, the MAC address is always set to a random address.

Port Drivers

To expose the ports in the network namespace to the host network namespace, --port-driver needs to be specified.

  • --port-driver=none: do not expose ports (default)
  • --port-driver=builtin: use built-in port driver (recommended)
  • --port-driver=socat: use socat binary (deprecated)
  • --port-driver=slirp4netns: use slirp4netns API (deprecated)

Benchmark (October 13, 2019):

--port-driver   Throughput
builtin         27.3 Gbps
slirp4netns     8.3 Gbps
socat           5.2 Gbps

For example, to expose 80 in the child as 8080 in the parent:

$ rootlesskit --state-dir=/run/user/1001/rootlesskit/foo --net=slirp4netns --disable-host-loopback --copy-up=/etc --port-driver=builtin bash
rootlesskit$ rootlessctl --socket=/run/user/1001/rootlesskit/foo/api.sock add-ports 0.0.0.0:8080:80/tcp
1
rootlesskit$ rootlessctl --socket=/run/user/1001/rootlesskit/foo/api.sock list-ports
ID    PROTO    PARENTIP   PARENTPORT    CHILDPORT    
1     tcp      0.0.0.0    8080          80
rootlesskit$ rootlessctl --socket=/run/user/1001/rootlesskit/foo/api.sock remove-ports 1
1
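
The port specs passed to add-ports and --publish in this document follow the pattern PARENTIP:PARENTPORT:CHILDPORT/PROTO. When generating specs from a script, it can help to split one into its parts first. The sketch below does so with plain parameter expansion; it assumes an IPv4 parent address as in the examples above, and parse_port_spec is a hypothetical name.

```shell
# parse_port_spec SPEC
# Split "PARENTIP:PARENTPORT:CHILDPORT/PROTO" into four space-separated
# fields, in the same order as the list-ports columns:
#   PROTO PARENTIP PARENTPORT CHILDPORT
# Assumes an IPv4 parent address. Hypothetical helper for illustration.
parse_port_spec() {
    spec=$1
    proto=${spec##*/}          # text after the last "/"
    rest=${spec%/*}            # text before the last "/"
    child_port=${rest##*:}     # last colon-separated field
    rest=${rest%:*}
    parent_port=${rest##*:}
    parent_ip=${rest%:*}
    printf '%s %s %s %s\n' "$proto" "$parent_ip" "$parent_port" "$child_port"
}

# Example usage:
#   parse_port_spec 0.0.0.0:8080:80/tcp
```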

You can also expose ports using socat and nsenter instead of RootlessKit's port drivers.

$ pid=$(cat /run/user/1001/rootlesskit/foo/child_pid)
$ socat -t -- TCP-LISTEN:8080,reuseaddr,fork EXEC:"nsenter -U -n -t $pid socat -t -- STDIN TCP4\:127.0.0.1\:80"
