forked from containerd/cri
-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[1.3] Implement user namespaces #1
Draft
alban
wants to merge
14
commits into
release/1.3
Choose a base branch
from
alban/userns
base: release/1.3
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…n, User Initial work by vikaschoudhary16 <vichoudh@redhat.com> on Kubernetes PR 64005.
With this patch, GetRuntimeConfigInfo() returns a config without user namespace: the only uid/gid mapping is the one from the root user namespace.
Before this patch, containerd created a netns, configured it with CNI, and then creates the sandbox container by giving the netns path previously setup. This means that the netns was owned by the host userns. Mounting sysfs in the container is restricted in this setup. This patch sets up the netns in the other way around instead: it creates the sandbox container, letting runc create a new netns. Then, it picks the new netns from /proc/$pid/ns/net, binds mount it in the usual CNI path and then gives it to CNI to configure. This means that the netns is owned by the userns of the sandbox container. In this way, mounting sysfs is possible. For more information about namespace ownership, see - man ioctl_ns - man user_namespaces, section "Interaction of user namespaces and other types of namespaces" - Linux commit 7dc5dbc879bd ("sysfs: Restrict mounting sysfs") torvalds/linux@7dc5dbc#diff-4839664cd0c8eab716e064323c7cd71fR1164 - net_current_may_mount() used for mounting sysfs: ns_capable(net->user_ns, CAP_SYS_ADMIN); https://github.com/torvalds/linux/blob/v5.7/net/core/net-sysfs.c#L1679
The sandbox container (aka "pause" container) has a tmpfs mount on /dev/shm. Bind mount it with nosuid, noexec, nodev because the mount would not be allowed in user namespaces otherwise.
rata
reviewed
Jun 18, 2020
rata
reviewed
Jun 18, 2020
runc needs to bind mount files from /var/lib/kubelet/pods/... (such as etc-hosts) into the container. When using user namespaces, the bind mount didn't work anymore when containerd is started from a systemd unit. This patch fixes that by adding SupplementaryGroups=0 runc needs to have permission on the directory to stat() the source of the bind mount. Without user namespaces, this is not a problem since runc is running as root, so it has 'rwx' permissions over the directory: drwxr-x---. 8 root root 4096 May 28 18:05 /var/lib/kubelet Moreover, runc has CAP_DAC_OVERRIDE at this point because the mount phase happens before giving up the additional permissions. However, when using user namespaces, the runc process is belonging to a different user than root (depending on the mapping). /var/lib/kubelet is seen as belonging to the special unmapped user (65534, nobody). runc does not have 'rwx' permissions anymore but the empty '---' permission for 'other'. CAP_DAC_OVERRIDE is also no effective because the kernel performs the capability check with capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE). This checks that the owner of the /var/lib/kubelet is mapped in the current user namespace, which is not the case. Despite that, bind mounting /var/lib/kubelet/pods/...etc-hosts was working when containerd was started manually with 'sudo' but not working when started from a systemd unit. The difference is how supplementary groups are handled between sudo and systemd units: systemd does not set supplementary groups by default. $ sudo grep -E 'Groups:|Uid:|Gid:' /proc/self/status Uid: 0 0 0 0 Gid: 0 0 0 0 Groups: 0 $ sudo systemd-run -t grep -E 'Groups:|Uid:|Gid:' /proc/self/status Running as unit: run-u296886.service Press ^] three times within 1s to disconnect TTY. Uid: 0 0 0 0 Gid: 0 0 0 0 Groups: When runc has the supplementary group 0 configured, it is retained during the bind-mount phase, even though it is an unmapped group (runc temporarily sees 'Groups: 65534' in its own /proc/self/status), so runc effectively has the 'r-x' permissions over /var/lib/kubelet. This makes the bind mount of etc-hosts work. After the mount phase, runc will set the credential correctly (following OCI's config.json specification), so the container will not retain this unmapped supplementary group. I rely on the systemd unit file being configured correctly with SupplementaryGroups=0 and I don't attempt to set it up automatically with syscall.Setgroups() because "at the kernel level, user IDs and group IDs are a per-thread attribute" (man setgroups) and the way Golang uses threads make it difficult to predict which thread is going to be used to execute runc. glibc's setgroup() is a wrapper that changes the credentials for all threads but Golang does not use the glibc implementation.
Example of possible configuration: ``` [plugins] [plugins."io.containerd.grpc.v1.cri"] [plugins."io.containerd.grpc.v1.cri".node_wide_uid_mapping] container_id = 0 host_id = 300000 size = 65536 [plugins."io.containerd.grpc.v1.cri".node_wide_gid_mapping] container_id = 0 host_id = 300000 size = 65536 ```
Return an error if NODE_WIDE_REMAPPED is requested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments. I think we could remove NODE_WIDE_REMAPPED everywhere on this PR as we removed that on the kubelet one.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is the containerd/cri implementation for the Kubernetes Node-Level User Namespaces Design Proposal.
The patches are based on the
release/1.3
branch. It is tested on Kubernetes 1.17 with patches adapted from PR 64005 (kinvolk/kubernetes#3).The main changes are:
GetRuntimeConfigInfo
returning the hard coded uid mapping with 100000WithRemappedSnapshot
snapshotterWithoutNamespace
,WithUserNamespace
,WithLinuxNamespace
at the right placesAt the OCI level (config.json), we have the following changes:
Demo:
TODO:
NamespaceMode=NODE_WIDE_REMAPPED
correctly. At the moment,NamespaceMode=POD
andNamespaceMode=NODE
are implemented correctly./cc @mauriciovasquezbernal