User namespaces doc changes for 1.30 #45178

Merged
merged 5 commits into from
Mar 21, 2024
90 changes: 78 additions & 12 deletions content/en/docs/concepts/workloads/pods/user-namespaces.md
@@ -7,7 +7,7 @@ min-kubernetes-server-version: v1.25
---

<!-- overview -->
{{< feature-state for_k8s_version="v1.25" state="alpha" >}}
{{< feature-state for_k8s_version="v1.30" state="beta" >}}

This page explains how user namespaces are used in Kubernetes pods. A user
namespace isolates the user running inside the container from the one
@@ -46,7 +46,26 @@ tmpfs, Secrets use a tmpfs, etc.)
Some popular filesystems that support idmap mounts in Linux 6.3 are: btrfs,
@drewhagen (Member), Mar 15, 2024:

I'm curious, what do we think about presenting the concept of idmap mounts for newcomers that aren't familiar with this from Linux?
https://lwn.net/Articles/896255/

Contributor:

How I'd cover that:

  • have a blog article that explains Kubernetes and idmapped mounts
  • publish that article a week or so before the release
  • mark the article as evergreen (ie: we'll maintain it; usually we don't maintain blog articles once they reach a year old)
  • link from these user namespaces docs to that evergreen blog article

An example of an evergreen article: https://kubernetes.io/blog/2020/09/03/warnings/

@rata (Member Author), Mar 15, 2024:

IMHO that seems like too much overhead. It's also not clear what value would justify the ongoing investment of keeping it up to date, as suggested. Other places already document idmap mounts, and I'm not sure I see the point of k8s-specific documentation (there isn't anything k8s-specific that I can think of).

Furthermore, Kubernetes users just need to know if a filesystem supports idmap mounts or not. Nothing else is really relevant.

ext4, xfs, fat, tmpfs, overlayfs.

In addition, support is needed in the
In addition, the container runtime and its underlying OCI runtime must support
user namespaces. The following OCI runtimes offer support:

* [crun](https://github.com/containers/crun) version 1.9 or greater (version 1.13+ is recommended).

<!-- ideally, update this if a newer minor release of runc comes out, whether or not it includes the idmap support -->
{{< note >}}
Many OCI runtimes do not include the support needed for using user namespaces in
Linux pods. If you use a managed Kubernetes, or have downloaded it from packages
and set it up, it's likely that nodes in your cluster use a runtime that doesn't
include this support. For example, the most widely used OCI runtime is `runc`,
and version `1.1.z` of runc doesn't support all the features needed by the
Kubernetes implementation of user namespaces.

If there is a newer release of runc than 1.1 available for use, check its
documentation and release notes for compatibility (look for idmap mounts support
in particular, because that is the missing feature).
{{< /note >}}

To use user namespaces with Kubernetes pods, you also need a CRI
{{< glossary_tooltip text="container runtime" term_id="container-runtime" >}}
that supports this feature:

@@ -137,20 +156,67 @@ use, see `man 7 user_namespaces`.

## Set up a node to support user namespaces

It is recommended that the host's files and host's processes use UIDs/GIDs in
the range of 0-65535.
By default, the kubelet assigns pods UIDs/GIDs above the range 0-65535, based on
the assumption that the host's files and processes use UIDs/GIDs within this
range, which is standard for most Linux distributions. This approach prevents
any overlap between the UIDs/GIDs of the host and those of the pods.

Avoiding the overlap is important to mitigate the impact of vulnerabilities such
as [CVE-2021-25741][CVE-2021-25741], where a pod can potentially read arbitrary
files in the host. If the UIDs/GIDs of the pod and the host don't overlap, what
a pod can do is limited: the pod UID/GID won't match the host's file
owner/group.

The kubelet can use a custom range for user IDs and group IDs for pods. To
configure a custom range, the node needs to have:

* A user `kubelet` in the system (you cannot use any other username here)
* The binary `getsubids` installed (part of [shadow-utils][shadow-utils]) and
in the `PATH` for the kubelet binary.
* A configuration of subordinate UIDs/GIDs for the `kubelet` user (see
[`man 5 subuid`](https://man7.org/linux/man-pages/man5/subuid.5.html) and
[`man 5 subgid`](https://man7.org/linux/man-pages/man5/subgid.5.html)).

This setting only gathers the UID/GID range configuration and does not change
the user executing the `kubelet`.

You must follow some constraints for the subordinate ID range that you assign
to the `kubelet` user:

* The subordinate user ID, that starts the UID range for Pods, **must** be a
multiple of 65536 and must also be greater than or equal to 65536. In other
words, you cannot use any ID from the range 0-65535 for Pods; the kubelet
imposes this restriction to make it difficult to create an accidentally insecure
configuration.

* The subordinate ID count must be a multiple of 65536.

* The subordinate ID count must be at least `65536 x <maxPods>` where `<maxPods>`
is the maximum number of pods that can run on the node.

* You must assign the same range for both user IDs and for group IDs. It doesn't
matter if other users have user ID ranges that don't align with the group ID
ranges.

* None of the assigned ranges should overlap with any other assignment.

* The subordinate configuration must be only one line. In other words, you can't
have multiple ranges.

The kubelet will assign UIDs/GIDs higher than that to pods. Therefore, to
guarantee as much isolation as possible, the UIDs/GIDs used by the host's files
and host's processes should be in the range 0-65535.
For example, you could define `/etc/subuid` and `/etc/subgid` to both have
these entries for the `kubelet` user:

Note that this recommendation is important to mitigate the impact of CVEs like
[CVE-2021-25741][CVE-2021-25741], where a pod can potentially read arbitrary
files in the hosts. If the UIDs/GIDs of the pod and the host don't overlap, it
is limited what a pod would be able to do: the pod UID/GID won't match the
host's file owner/group.
```
# The format is
# name:firstID:count of IDs
# where
# - firstID is 65536 (the minimum value possible)
# - count of IDs is 110 * 65536 (110 is the default limit for the number of pods)
kubelet:65536:7208960
```
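
The constraints above can be checked mechanically before restarting the kubelet. The following is a minimal sketch, not the kubelet's own validation logic: the function name `check_kubelet_subid_line` and its messages are hypothetical, and it assumes the line has already been extracted from `/etc/subuid` or `/etc/subgid` and that the IDs are plain decimal numbers without leading zeros.

```shell
# Hypothetical helper: checks one "name:firstID:count" line against the
# constraints listed above. This is NOT the kubelet's own validation code.
# Arguments: $1 = the subuid/subgid line, $2 = maxPods for the node.
check_kubelet_subid_line() {
  line=$1
  max_pods=$2

  # Split "name:firstID:count" on the colons.
  name=${line%%:*}
  rest=${line#*:}
  first=${rest%%:*}
  count=${rest#*:}

  if [ "$name" != "kubelet" ]; then
    echo "user must be kubelet"; return 1
  fi
  # Reject empty or non-numeric fields before doing arithmetic.
  case "$first" in ''|*[!0-9]*) echo "firstID must be a number"; return 1 ;; esac
  case "$count" in ''|*[!0-9]*) echo "count must be a number"; return 1 ;; esac

  if [ "$first" -lt 65536 ] || [ $((first % 65536)) -ne 0 ]; then
    echo "firstID must be a multiple of 65536 and at least 65536"; return 1
  fi
  if [ $((count % 65536)) -ne 0 ]; then
    echo "count must be a multiple of 65536"; return 1
  fi
  if [ "$count" -lt $((65536 * max_pods)) ]; then
    echo "count must be at least 65536 * maxPods"; return 1
  fi
  echo "ok"
}

# The example entry from above, with the default of 110 pods per node.
check_kubelet_subid_line "kubelet:65536:7208960" 110   # prints "ok"
```

On a node, you might feed it the real entry with something like `check_kubelet_subid_line "$(grep '^kubelet:' /etc/subuid)" 110`. If `grep` returns more than one line, the configuration already violates the single-range rule, and this sketch does not detect overlaps with other users' ranges.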

[CVE-2021-25741]: https://github.com/kubernetes/kubernetes/issues/104980
[shadow-utils]: https://github.com/shadow-maint/shadow

## Integration with Pod security admission checks

@@ -6,8 +6,12 @@ _build:
render: false

stages:
- stage: alpha
- stage: alpha
defaultValue: false
fromVersion: "1.28"
toVersion: "1.29"
- stage: beta
defaultValue: false
fromVersion: "1.30"
---
Enable user namespace support for Pods.
39 changes: 25 additions & 14 deletions content/en/docs/tasks/configure-pod-container/user-namespaces.md
@@ -7,7 +7,7 @@ min-kubernetes-server-version: v1.25
---

<!-- overview -->
{{< feature-state for_k8s_version="v1.25" state="alpha" >}}
{{< feature-state for_k8s_version="v1.30" state="beta" >}}
sftim marked this conversation as resolved.

This page shows how to configure a user namespace for pods. This allows you to
isolate the user running inside the container from the one in the host.
@@ -57,10 +57,6 @@ If you have a mixture of nodes and only some of the nodes provide user namespace
Pods, you also need to ensure that the user namespace Pods are
[scheduled](/docs/concepts/scheduling-eviction/assign-pod-node/) to suitable nodes.

Please note that **if your container runtime doesn't support user namespaces, the
`hostUsers` field in the pod spec will be silently ignored and the pod will be
created without user namespaces.**

<!-- steps -->

## Run a Pod that uses a user namespace {#create-pod}
@@ -82,27 +78,42 @@ to `false`. For example:
kubectl attach -it userns bash
```

And run the command. The output is similar to this:
Run this command:

```none
```shell
readlink /proc/self/ns/user
```

The output is similar to:

```shell
user:[4026531837]
```

Also run:

```shell
cat /proc/self/uid_map
0 0 4294967295
```

Then, open a shell in the host and run the same command.
The output is similar to:
```shell
0 833617920 65536
```

Then, open a shell in the host and run the same commands.

The `readlink` command shows the user namespace the process is running in. It
should be different when it is run on the host and inside the container.

The output must be different. This means the host and the pod are using a
different user namespace. When user namespaces are not enabled, the host and the
pod use the same user namespace.
The last number in the `uid_map` file inside the container must be 65536; on the
host, it must be a larger number.
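
As a rough illustration of that check, the sketch below reads a `uid_map`-formatted mapping from standard input and tests whether its length field (the third column) is exactly 65536. The helper names `uid_map_length` and `looks_like_pod_userns` are hypothetical, and the sketch only inspects the first mapping line.

```shell
# Hypothetical helper: prints the length (third field) of the first mapping
# line in uid_map-formatted input read from stdin.
uid_map_length() {
  awk 'NR == 1 { print $3 }'
}

# Hypothetical helper: succeeds if the mapping has a length of exactly 65536,
# which is what the text above expects inside a pod with a user namespace.
looks_like_pod_userns() {
  [ "$(uid_map_length)" = "65536" ]
}

# The sample outputs shown above, fed in as literal text.
echo "0 833617920 65536" | looks_like_pod_userns && echo "pod-style mapping"
echo "0 0 4294967295" | looks_like_pod_userns || echo "host-style mapping"
```

Inside the container you could run `looks_like_pod_userns < /proc/self/uid_map` instead of piping in the sample text.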

If you are running the kubelet inside a user namespace, you need to compare the
output from running the command in the pod to the output of running the
following command on the host:

```none
```shell
readlink /proc/$pid/ns/user
user:[4026534732]
```

replacing `$pid` with the kubelet PID.
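
A small sketch of that comparison, run on the host, might look like the following. The helper names `userns_of` and `same_userns` are hypothetical. Reading another process's `/proc/<pid>/ns/user` link usually requires root, and the commented example assumes that `pgrep` is installed and that the kubelet process is named exactly `kubelet`.

```shell
# Hypothetical helper: prints the user namespace link of a PID,
# in the form "user:[<inode number>]".
userns_of() {
  readlink "/proc/$1/ns/user"
}

# Hypothetical helper: succeeds if two PIDs are in the same user namespace.
same_userns() {
  [ "$(userns_of "$1")" = "$(userns_of "$2")" ]
}

# Example usage (assumptions: pgrep is available, the process is named
# "kubelet", and container_pid holds the host-side PID of a pod process):
#   kubelet_pid="$(pgrep -o -x kubelet)"
#   userns_of "$kubelet_pid"
#   same_userns "$kubelet_pid" "$container_pid" || echo "different user namespaces"
```

When the pod runs with a user namespace, the two links are expected to differ, so `same_userns` should fail for the kubelet PID and a PID that belongs to the pod.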