
[1.3] Implement user namespaces #1

Draft
wants to merge 14 commits into base: release/1.3

Conversation

@alban (Member) commented Jun 18, 2020

This is the containerd/cri implementation for the Kubernetes Node-Level User Namespaces Design Proposal.

The patches are based on the release/1.3 branch. It is tested on Kubernetes 1.17 with patches adapted from PR 64005 (kinvolk/kubernetes#3).

The main changes are:

  • Import the new CRI API from Kubernetes PR 64005.
  • Implement GetRuntimeConfigInfo, returning a hard-coded UID mapping starting at host ID 100000
  • Use the WithRemappedSnapshot snapshotter
  • Fix the sysfs mount by giving the netns the correct ownership (see the commit message for details)
  • Add mount restrictions on /dev/shm (nosuid, noexec, nodev)
  • Chown /dev/shm appropriately
  • Fix etc-hosts mounts with supplementary groups (see the commit message for details)
  • Use the custom options WithoutNamespace, WithUserNamespace, and WithLinuxNamespace in the right places

At the OCI level (config.json), we have the following changes:

  • the sandbox container gets a new "user" namespace and UidMappings
  • normal containers get a "user" namespace from a path, plus UidMappings
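Concretely, the config.json fragments should look roughly like this (a sketch following the OCI runtime spec field names; the mapping values match the hard-coded 100000 mapping shown in the demo below):

```json
{
  "linux": {
    "namespaces": [
      { "type": "user" }
    ],
    "uidMappings": [
      { "containerID": 0, "hostID": 100000, "size": 65535 }
    ],
    "gidMappings": [
      { "containerID": 0, "hostID": 100000, "size": 65535 }
    ]
  }
}
```

For a normal container, the namespace entry would instead reference the sandbox's namespace by path, e.g. `{ "type": "user", "path": "/proc/<sandbox-pid>/ns/user" }` (path shown for illustration).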

Demo:

```
$ kubectl apply -f userns-tests/pod-simple.yaml
pod/pod-simple created
$ kubectl apply -f userns-tests/pod-userns.yaml
pod/pod-userns created
$ kubectl exec -ti pod-simple -- /bin/sh -c 'cat /proc/self/uid_map'
         0          0 4294967295
$ kubectl exec -ti pod-userns -- /bin/sh -c 'cat /proc/self/uid_map'
         0     100000      65535
```
  • userns-tests/pod-simple.yaml:
```
apiVersion: v1
kind: Pod
metadata:
  name: pod-simple
  namespace: default
  annotations:
    alpha.kinvolk.io/userns: "disabled"
spec:
  containers:
  - name: pod-simple
    image: busybox
    command: ["sh"]
    args: ["-c", "sleep infinity"]
```
  • userns-tests/pod-userns.yaml:
```
apiVersion: v1
kind: Pod
metadata:
  name: pod-userns
  namespace: default
  annotations:
    alpha.kinvolk.io/userns: "enabled"
spec:
  containers:
  - name: pod-userns
    image: busybox
    command: ["sh"]
    args: ["-c", "sleep infinity"]
```

TODO:

  • Make the uid mapping configurable instead of hard coding 100000.
  • Try on the master branch (presumably upstream wants an implementation on master first)
  • Tests.
  • Implement NamespaceMode=NODE_WIDE_REMAPPED correctly. At the moment, NamespaceMode=POD and NamespaceMode=NODE are implemented correctly.
  • Check the status of volumes.

/cc @mauriciovasquezbernal

alban added 9 commits June 15, 2020 14:16
…n, User

Initial work by vikaschoudhary16 <vichoudh@redhat.com> on Kubernetes PR
64005.
With this patch, GetRuntimeConfigInfo() returns a config without user
namespace: the only uid/gid mapping is the one from the root user
namespace.
Before this patch, containerd created a netns, configured it with CNI, and then created the sandbox container, passing it the previously set-up netns path. This meant the netns was owned by the host userns, and mounting sysfs in the container is restricted in that setup.

This patch sets up the netns the other way around: it creates the sandbox container first, letting runc create a new netns. Then it picks up the new netns from /proc/$pid/ns/net, bind-mounts it at the usual CNI path, and hands it to CNI to configure. This way the netns is owned by the userns of the sandbox container, and mounting sysfs is possible.

For more information about namespace ownership, see
- man ioctl_ns
- man user_namespaces, section "Interaction of user namespaces and other types of namespaces"
- Linux commit 7dc5dbc879bd ("sysfs: Restrict mounting sysfs")
  torvalds/linux@7dc5dbc#diff-4839664cd0c8eab716e064323c7cd71fR1164
- net_current_may_mount() used for mounting sysfs:
  ns_capable(net->user_ns, CAP_SYS_ADMIN);
  https://github.com/torvalds/linux/blob/v5.7/net/core/net-sysfs.c#L1679
The sandbox container (aka "pause" container) has a tmpfs mount on
/dev/shm. Bind mount it with nosuid, noexec, nodev because the mount
would not be allowed in user namespaces otherwise.
runc needs to bind mount files from /var/lib/kubelet/pods/... (such as
etc-hosts) into the container. When using user namespaces, the bind
mount no longer worked when containerd was started from a systemd unit.

This patch fixes that by adding SupplementaryGroups=0 to the unit.

runc needs to have permission on the directory to stat() the source of
the bind mount. Without user namespaces, this is not a problem since
runc is running as root, so it has 'rwx' permissions over the directory:

drwxr-x---. 8 root   root   4096 May 28 18:05 /var/lib/kubelet

Moreover, runc has CAP_DAC_OVERRIDE at this point because the mount
phase happens before giving up the additional permissions.

However, when using user namespaces, the runc process belongs to a
different user than root (depending on the mapping). /var/lib/kubelet is
seen as belonging to the special unmapped user (65534, nobody), so runc
no longer has 'rwx' permissions, only the empty '---' permissions of
'other'.

CAP_DAC_OVERRIDE is also not effective, because the kernel performs the
capability check with capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE).
This checks that the owner of /var/lib/kubelet is mapped in the
current user namespace, which is not the case.

Despite that, bind mounting /var/lib/kubelet/pods/...etc-hosts worked
when containerd was started manually with 'sudo', but not when it was
started from a systemd unit. The difference is in how supplementary
groups are handled: systemd does not set supplementary groups by
default, while sudo does.

$ sudo grep -E 'Groups:|Uid:|Gid:' /proc/self/status
Uid:	0	0	0	0
Gid:	0	0	0	0
Groups:	0

$ sudo systemd-run -t grep -E 'Groups:|Uid:|Gid:' /proc/self/status
Running as unit: run-u296886.service
Press ^] three times within 1s to disconnect TTY.
Uid:	0	0	0	0
Gid:	0	0	0	0
Groups:

When runc has the supplementary group 0 configured, it is retained
during the bind-mount phase, even though it is an unmapped group (runc
temporarily sees 'Groups: 65534' in its own /proc/self/status), so runc
effectively has the 'r-x' permissions over /var/lib/kubelet. This makes
the bind mount of etc-hosts work.

After the mount phase, runc will set the credential correctly (following
OCI's config.json specification), so the container will not retain this
unmapped supplementary group.

I rely on the systemd unit file being configured correctly with
SupplementaryGroups=0 and do not attempt to set it up automatically
with syscall.Setgroups(), because "at the kernel level, user IDs and
group IDs are a per-thread attribute" (man setgroups) and the way Golang
uses threads makes it difficult to predict which thread will be used to
execute runc. glibc's setgroups() is a wrapper that changes the
credentials for all threads, but Golang does not use the glibc
implementation.
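Since the fix relies on unit configuration rather than setgroups(), the deployment needs the directive in containerd's systemd unit, for example via a drop-in along these lines (the drop-in path is illustrative; adjust it to your containerd unit name):

```
# /etc/systemd/system/containerd.service.d/10-userns.conf (example drop-in)
[Service]
SupplementaryGroups=0
```

After `systemctl daemon-reload` and a restart of containerd, runc inherits supplementary group 0 and the etc-hosts bind mount works as described above.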
alban added 4 commits June 24, 2020 14:31
Example of possible configuration:
```
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".node_wide_uid_mapping]
      container_id = 0
      host_id = 300000
      size = 65536
    [plugins."io.containerd.grpc.v1.cri".node_wide_gid_mapping]
      container_id = 0
      host_id = 300000
      size = 65536
```
Return an error if NODE_WIDE_REMAPPED is requested
@mauriciovasquezbernal (Member) left a comment

Some minor comments. I think we could remove NODE_WIDE_REMAPPED everywhere on this PR as we removed that on the kubelet one.
