Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

keps/127: Support User Namespaces #2101

Conversation

mauriciovasquezbernal
Copy link
Contributor

This PR adds a KEP proposing to support user namespaces. It is not a new topic in Kubernetes as it has been discussed multiple times in the past. #127 was tracking the Support Node-Level User Namespaces Remapping design proposal. There was also a PR implementing that but it was never merged.

This new PR replaces #1903 that only contained the summary and motivations, this new PR now contains a complete proposal.

@rata @alban @sargun, @solarkennedy, @jigish, @giuseppe, @rhatdan

/cc @mrunalp
/sig node

@k8s-ci-robot k8s-ci-robot requested a review from mrunalp October 16, 2020 14:35
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 16, 2020
@k8s-ci-robot
Copy link
Contributor

Hi @mauriciovasquezbernal. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 16, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mauriciovasquezbernal
To complete the pull request process, please assign derekwaynecarr after the PR has been reviewed.
You can assign the PR to them by writing /assign @derekwaynecarr in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Oct 16, 2020
Copy link
Contributor

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are UID ranges for Pods a resource that the scheduler should account for?

Pathologically: If a Node decides to assign 2^29 UIDs to each pod than the scheduler could place more Pods there than actually fit.

keps/sig-node/127-usernamespaces-support/README.md Outdated Show resolved Hide resolved
keps/sig-node/127-usernamespaces-support/README.md Outdated Show resolved Hide resolved
keps/sig-node/127-usernamespaces-support/README.md Outdated Show resolved Hide resolved
@mauriciovasquezbernal
Copy link
Contributor Author

Are UID ranges for Pods a resource that the scheduler should account for?

Pathologically: If a Node decides to assign 2^29 UIDs to each pod than the scheduler could place more Pods there than actually fit.

It's something we have thought about but we haven't gone into details as it's not part of the phase 1. I'm not sure it's needed to involve the scheduler here as the IDs is a not a very constrained resource, a Linux node has 2^32 IDs available, if the user gives 25% of those IDs to the kubelet and each pod has 65536 IDs, it'll be possible to run 16384 pods there. For sure it's just a raw calculation I'm doing here and it's something we should discuss in detail when defining phase 2.

@mrunalp
Copy link
Contributor

mrunalp commented Oct 20, 2020

@mauriciovasquezbernal it makes sense to start with 65536 sized ranges in the first phase and then adjust that if needed based on feedback. We should call that out in the kep.

@mauriciovasquezbernal
Copy link
Contributor Author

mauriciovasquezbernal commented Oct 22, 2020

@mauriciovasquezbernal it makes sense to start with 65536 sized ranges in the first phase and then adjust that if needed based on feedback. We should call that out in the kep.

Makes sense. Added a note about it.

and group IDs. Kubelet creates the files for those volumes and it can easily set
the file ownership too.

Non-ephemeral volumes are more difficult to support since they can be persistent
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For PVC, as it shares inside namespace, would it be good to have a new Namespaced mode, which every pod in a namespace shares the same uidmap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's something that was proposed a while ago too https://github.com/kubernetes/community/pull/2595/files#diff-473657a51a9a6deb70241820b28201d0653f398f9f0736a47f0d23ca57864fa6R56.

We don't see big benefits on this approach as the isolation is partially and pods inside the same namespace are still not isolated among them. It also complicates the design as there is a new mode where ranges have to be allocated to.

We'd prefer to avoid this mode and see if we are able to cover most user needs with the Cluster mode. If we realize it's really needed we could do a further iteration on it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think namespaced mode is way more valuable than "cluster" mode. I'm not very keen on cluster mdoe at all.

I'd go one further - we don't really want namespace mode, we want ServiceAccount mode. What if we can allocate one cluster-unique userID per ServiceAccount? Any pod running as that SA gets that host-UID when running with userNS turned on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that "namespace" and "service-account" modes are interesting ideas.

The main goal we're trying to accomplish with this KEP is to increase pod-to-host isolation. We know that increasing pod-to-pod isolation would be also great but by doing the former we'll protect the host at least.

The most difficult part of supporting user namespaces are the volumes, the lack of an ID shift (or similar) mechanism in the Linux kernel makes it so difficult to share a volume among pods with different mappings. If such mechanism were implemented probably we're not having this conversation.

ClusterMode is an easy way to solve these volumes issues, yes, it's not a perfect solution but enables us to use user namespaces in some cases. With the new proposal we're doing in the other comment we're trying to keep this as simple as possible by enabling user namespaces in some basic cases and letting the possibility to cover more cases in future iterations.

I think trying to cover too many cases from the beginning is very difficult, maybe we can start by the basic ones and in the meanwhile keep and eye on new kernel proposals like id-mapped mounts that would definitely help with the more complicated ones.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with starting with basic, broader protections given the difficulties and open questions of finer-grained approaches, and that it's a very positive step to improve Pod-Host isolation. This proposal seems like a good start on that.

@mauriciovasquezbernal One question I have (curiosity more than anything) is what risks are inherent to the Cluster mode? Part of the reason for protecting the host is to protect Pods from each other, how much can I do to other Pods if I'm in the same user ns? I assume (maybe naïvely) the same things as if I broke out of a container and was in host user ns today? There are ofc many other reasons to protect the host so I think Cluster mode is still valuable as a first step, but I am curious.

@thockin Can we imagine any scenarios where two Pods running as distinct service accounts would want to write to the same PVC? Maybe that's just a bad architectural choice and we should discourage it anyway...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mtaufen before answering, I want to clarify that the Cluster mode intends to use the same ID mapping but pods are in different user namespaces. Said that, if you are able to break out into the host user namespace you'll be able to read/write files of other pods, send signals, change scheduling parameters and other things described in the "User and group identifiers" section of https://man7.org/linux/man-pages/man7/credentials.7.html. It's a similar situation as today if you run pods as non-root and you break into the host: you'll have a process running as non-root on the host. On the other hand, if you run pods as root the situation would be very different, you'll have a root process in the host and will be able to compromise all pods too, user namespaces mitigate it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, makes sense.

keps/sig-node/127-usernamespaces-support/README.md Outdated Show resolved Hide resolved

- **Pod**:
Each pod is placed in a different user namespace and has a different and
non-overlapping ID mapping across the node. This mode is intended for stateless pods, i.e.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...or stateful pods that don't have multiple-access volumes (which seems to be most volumes, anecdotally) right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to worry about reclaimPolicy: Retain in addition to multi-access.

How many stateful scenarios do we enable by using service account as the atom instead of pod (i.e. all pods running as a service account have the same mapping, the mapping is unique to the service account)? There are some other benefits to mapping based on service account too, like better uid information in the audit subsystem logs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saw the discussion below about a per-(k8s-)namespace mode and IIUC similar benefits of per-service-account would apply to that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure Retain matters so much - the volume is "done". It's not saved for later use, it's pending some admin action to destroy it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue with stateful pods, even if they access non-shared volume, is that the pod would have to get the same mappings each time it's scheduled, otherwise it'll be running with a different IDs on the host each time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is why Namespace or ServiceAccount modes are so attractive :)

use the same ID mappings. This mode doesn't provide full pod-to-pod isolation
as all the pods with `Cluster` mode have the same effective IDs on the host.
It provides pod-to-host isolation as the IDs are different inside the
container and in the host. This mode is intended for pods sharing volumes as
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specifically this is for multi-reader/multi-writer volumes where 2 pods SIMULTANEOUSLY access a volume, yes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and even then, we have fsGroup which should be a better option

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOW - do we REALLY need this half-baked mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn’t have to be simultaneously. If a pod writes a file to a persistent volume with mode 0600, then a new pod would have issues accessing that file if the ID mappings of both pods are different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, fsGroup can definitely help, but not for some cases. fsGroup has some limitations too. It is based on the assumption that all files written to the volume have read / write access by that supplemental group. If a file is written with a 0600 mode it won't work. It also doesn’t work for all volume types, for instance NFS doesn’t allow to change the ownership of the files when the no_root_squash option is not specified in the mount.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, to clarify, moving the mode (cluster, pod) away from the pod spec seems like a good improvement to the KEP, right? Did I follow that correctly?

I am not sure - I think we need to play out the rest to know the answer here.

We proposed one strategy that is a “fixed” mapping, that is: each pod has its own user namespace but all pods share the same mapping.

This is utterly pragmatic, but it only addresses 1 requirement - protect the
node from pods. Any attack that escapes one pod automatically compromises all
other pods on that node. I feel like we can do better.

However, Red Hat for example (cc @mrunalp ) wants mappings where non-overlapping ranges are given to pods. That is, no two pods are running in the same node with the same mappings

IMO this is a step too far unless the FS-related issues are addressed. As a
final goal, it sounds great. But it suffers the same problems described
elsewhere - the inside space and the outside space are the same size, which
means you can't GUARANTEE that it works for any arbitrary 2^32 UID.

What I like about the fixed mapping is that it is super simple, works for MOST use cases (even sharing volumes with different namespaces) but gives you WAY more security that what we currently have without user namespaces.

I can't agree with "WAY more". A bit more, yes.

I’m not convinced about per-service-account mapping: if you want to share a volume, you also need to share a service account and if the pods are using the service account for something, now it needs to have the union of all the permissions? Seems too hacky and a trade-off in security (you use user ns, that is good thing security wise, but now your service accounts have more permissions than needed for some pods)

I have a STRONG feeling that pods which are collaborating on a volume will tend
to have the same ServiceAccount. I don't have hard data for that. That said,
it feels like a relatively small increase in complexity over fixed mapping with
a low (IMO) risk of failure) and much better security. The good news is that
there's a clean "out" in the form of hostUserNamespace which we could advise
while we consider other options.

I’m sure these limits will be a problem in the future and will come back to bite us_

The more "niche" the use-case, the less bad I feel about telling people to use
the host namespace :)

and I’m also not quite sure how the implementation of this would be (what a “cluster-unique” allocation for the mapping per ns looks like?).

The simple answer is to take the UID of the SA (or the hash of the UID) %
(2^32 - 2) (because -1 is reserved) & ~(0xffff). That gives us approximately
64K ID-spaces of 64K IDs each. If we don't need that many IDs, We can shift it
to something like 4 million ID-spaces of 1024 Ids each. Yes there's a chance
of collision, but low. Yes, it still ignores the fact that any given pod could
use IDs outside the range we assign, regardless of what range we assign.

The complete solution involves new API resources and allocation and things
like that. I am disinclined to devise the solution until we think we really
need it.

I’d like to know what he thinks, as he really wanted non-overlapping mappings.

If we thought that a solution to the volumes problem was less than a year away,
we could take any one of these as a stepping-stone. At that point, the less
surface area, the better. Especially since this is opt-in and any solution
is better than what we have today.

It probably makes sense to configure which ranges to use for any strategy (per-ns, fixed, non-overlapping or whatever), so one configuration in the Kubelet probably can be added. Or do you mean to have sensible default values for the ranges? Like use UID 10 millon or $SOMETHING_USUALLY_UNUSED on the host, as that is a good default for most systems and you can change it if you want? That SGTM too.

If we need a value that is consistent across machines, never changes, and is
not something a user would need to tweak - sure. E.g. the overflow UID. As
soon as we tell users "go tweak this to change how your users experience
kubernetes" or "set this to improve security", flags are just awful.

I think we should aspire to either a) totally automatic or b) exposed via the
API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this is a step too far unless the FS-related issues are addressed. As a
final goal, it sounds great. But it suffers the same problems described
elsewhere - the inside space and the outside space are the same size, which
means you can't GUARANTEE that it works for any arbitrary 2^32 UID.

in CRI-O we punch holes for runAsUser/runAsGroup/fsGroup and map them back to their value on the host, so anything that relies on these values will still keep working.

We don't handle yet arbitrary IDs in the image files set to higher values than what is available in the inner user namespace, e.g. given an allocation of 65k IDs, a file owned by 200000:200000 will show up as owned by the overflow ID.

I think this issue could also be addressed in the runtime/kubelet. The available IDs in the userns can be allocated in a non contiguous way to please the files present in the image. What is preventing to address this issue right now in CRI-O is that the IDs must be allocated when the sandbox is created and the containers/images running in that pod are not known in this phase.

Another possibility would be to create a per-pod user namespace and then each container runs in their own sub user namespace, child of the pod user namespace; but that would probably require changes in the OCI runtime as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in CRI-O we punch holes for runAsUser/runAsGroup/fsGroup and map them back to their value on the host, so anything that relies on these values will still keep working.

I'm curious, how does it work when the container run as root?

I think this issue could also be addressed in the runtime/kubelet. The available IDs in the userns can be allocated in a non contiguous way to please the files present in the image. What is preventing to address this issue right now in CRI-O is that the IDs must be allocated when the sandbox is created and the containers/images running in that pod are not known in this phase.

I think there are two different scenarios. One is to guarantee that a mapping for the files on the container image exist, that's something that could be implemented. The other case are IDs on the fly that a container could try to use, that seems very complicated to cover IMO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in CRI-O we punch holes for runAsUser/runAsGroup/fsGroup and map them back to their value on the host, so anything that relies on these values will still keep working.

I'm curious, how does it work when the container run as root?

with the current annotations mechanism, root is treated as any other user so a container that specifies runAsUser=0 will be mapped to root on the host, since that is the assumption from the outside. It is possible though to use runAsUser= with a nonzero UID and specify to map it to root in the user namespace.

I think this issue could also be addressed in the runtime/kubelet. The available IDs in the userns can be allocated in a non contiguous way to please the files present in the image. What is preventing to address this issue right now in CRI-O is that the IDs must be allocated when the sandbox is created and the containers/images running in that pod are not known in this phase.

I think there are two different scenarios. One is to guarantee that a mapping for the files on the container image exist, that's something that could be implemented. The other case are IDs on the fly that a container could try to use, that seems very complicated to cover IMO.

I don't think the second case can be solved considering how user namespaces work now. The assumption is that the container, even if running with CAP_SETUID/CAP_SETGID, will still be limited to use the IDs that are present in the image, either through files ownership or users defined in /etc/passwd.

If a container needs access to any of the 2^32 UIDs, then IMO it should not run in a user namespace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thockin thanks! responses inline.

So, to clarify, moving the mode (cluster, pod) away from the pod spec seems like a good improvement to the KEP, right? Did I follow that correctly?

I am not sure - I think we need to play out the rest to know the answer here.

Heh, okay, let’s wait. But if we don’t want to create a new API object to specify the mapping (as you mentioned and I agree not doing that now) and we want to reduce the surface we expose, I think this bool to use host userns or not should be there for sure, and probably not much more (maybe some other thing, but hopefully we can just enable userns and that is all you need to do).

But let’s see what we arrive at :)

This is utterly pragmatic, but it only addresses 1 requirement - protect the node from pods. Any attack that escapes one pod automatically compromises all other pods on that node. I feel like we can do better.

Heh, yes, this is pragmatic indeed. There is probably some more isolation between pods as they run in different user namespaces (like can’t read the mem of a pod in a different userns even if /proc is exposed). But yeah, if a breakout happens, all pods in the node can be compromised.

The other upside non-mentioned of a fixed mapping is that it works with containers that need big/long mappings too, as we can allow almost all the 2^32 UID range inside the container. In contrast, in the case of per-ns/per-sa we need to guess which mappings will be used and hope for the best or ask to use the host user namespace in those cases (or expose an API object in k8s with the ranges needed or have some metadata about the containers with the used UIDs, etc.).

Those images with high UIDs that can’t be easily guessed happen to be our use case too. Although I see that maybe it is not an issue for several users and the appeal of per-sa or per-ns.

However, Red Hat for example (cc @mrunalp ) wants mappings where non-overlapping ranges are given to pods. That is, no two pods are running in the same node with the same mappings

IMO this is a step too far unless the FS-related issues are addressed. As a final goal, it sounds great. But it suffers the same problems described elsewhere - the inside space and the outside space are the same size, which means you can't GUARANTEE that it works for any arbitrary 2^32 UID.

Good point, fully agree. Until the kernel improves in some way (supports 64 bits kuids or something) and we can properly emulate userns of 2^32 UIDs, but not for now.

I have a STRONG feeling that pods which are collaborating on a volume will tend to have the same ServiceAccount. I don't have hard data for that. That said, it feels like a relatively small increase in complexity over fixed mapping with a low (IMO) risk of failure) and much better security. The good news is that there's a clean "out" in the form of hostUserNamespace which we could advise while we consider other options.

Good point, yes, it seems a reasonable trade off regarding complexity and security. I’m unsure if they will share a SA, if you consider volumes that are typically written by many (like NFS), but I’m willing to find out :-D. I have a soft preference for per-ns (maybe being utterly pragmatic again?), but both very good ideas, thanks!

However, for our use case we have containers with high IDs that can’t be easily guessed. For that reason some way to have a fixed mapping will be very useful for us, so we can specify one long mapping to be used by all.

If you think that any configuration knob, to support containers with a wide range of UIDs is a complete no-go, we would probably continue with the other options (as the patch we need to do for Kubernetes with a fixed mapping will be simpler than today). But if you think it is okay to allow that in some way in early revisions, I’d really like to know which ways you think might be okay for that :)

and I’m also not quite sure how the implementation of this would be (what a “cluster-unique” allocation for the mapping per ns looks like?).

The simple answer is to take the UID of the SA (or the hash of the UID) % (2^32 - 2) (because -1 is reserved) & ~(0xffff). That gives us approximately 64K ID-spaces of 64K IDs each. If we don't need that many IDs, We can shift it to something like 4 million ID-spaces of 1024 Ids each. Yes there's a chance of collision, but low. Yes, it still ignores the fact that any given pod could use IDs outside the range we assign, regardless of what range we assign

Oh, good point! Yes, I like a simple solution like that: doesn’t guarantee, but is quite low chances of collision (at least for me, that I’m not an expert on that matter). Thanks!

The complete solution involves new API resources and allocation and things like that. I am disinclined to devise the solution until we think we really need it.

I fully agree.

The only “downside” that I can find is not a big one: you can’t easily know the mapping in advance, so you can’t just add to your yaml “useHostUserNs: false” or you can’t easily migrate volumes to a new cluster (the UID will change), you will need to adjust (one time) your files permissions when switching to userns. Maybe this is even trivial as maybe most cases can use fsGroupChangePolicy when doing so and others can adjust as they need?

In any case, I don’t think that is a big problem and applies to any way of using userns with volumes without exposing the mapping. I agree we don’t want to expose the mapping, at least for now, so a downside we will have to live with (and not so bad, IMHO).

If we thought that a solution to the volumes problem was less than a year away, we could take any one of these as a stepping-stone. At that point, the less surface area, the better. Especially since this is opt-in and any solution is better than what we have today.

I think it is really hard to know when the kernel will improve on that. But definitely, less surface makes sense and anything is better than what we currently have, fully agree on that :). There is ongoing work on the kernel front, hopefully[1].

If we need a value that is consistent across machines, never changes, and is not something a user would need to tweak - sure. E.g. the overflow UID. As soon as we tell users "go tweak this to change how your users experience kubernetes" or "set this to improve security", flags are just awful.
I think we should aspire to either a) totally automatic or b) exposed via the API.

Heh, I hear the point of hosted solutions, I wasn’t really aware of this (makes sense, but haven’t heard it before). I think what is coming out of these conversations is quite nice and achieves that. Thanks a lot for stepping in @thockin! :)

[1]: lwn article: https://lwn.net/Articles/837566/. V4 is out now, some weeks ago.

and group IDs. Kubelet creates the files for those volumes and it can easily set
the file ownership too.

Non-ephemeral volumes are more difficult to support since they can be persistent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think namespaced mode is way more valuable than "cluster" mode. I'm not very keen on cluster mdoe at all.

I'd go one further - we don't really want namespace mode, we want ServiceAccount mode. What if we can allocate one cluster-unique userID per ServiceAccount? Any pod running as that SA gets that host-UID when running with userNS turned on.

Non-ephemeral volumes are more difficult to support since they can be persistent
and shared by multiple pods. This proposal supports volumes with two different
strategies:
- The `Cluster` mode makes it easier for pods to share files using volumes when
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any idea whether a namespace mode or SA mode would handle the same problem? Do users share volumes across SAs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is SA the right semantics though? Easy to imagine shared access to one resource by multiple principals, generally speaking. Might still be the convenient choice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Problem space feels something like same-origin on the Web, K8s namespace might be the natural analogue...

Also worth thinking about cases where users have two clusters with the same set of namespaces and expect to move workloads between them - could we ensure that the mapping was consistently configured for both clusters, even across concurrent, potentially out-of-order additions/removals of namespaces?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could derive the mapping from the namespace name, to guarantee it's the same in different clusters. However I don't know the security implications of allowing users to indirectly control the ranges by the namespace name.

Cases like this are another reason why we're want to have the same mapping for all pods with volumes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we started with something like Cluster mode (which I think is a great first step), it theoretically blocks us from transparently migrating to Namespace mode in the future, since mappings would likely change, right? At that point do we basically have to wait for shiftfs to make things more secure without additional user intervention?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the security implications of allowing users to indirectly control the ranges by the namespace name.

It might not be a huge problem, given namespaces are probably created by cluster admins, controlling ranges might fit their privilege level anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we started with something like Cluster mode (which I think is a great first step), it theoretically blocks us from transparently migrating to Namespace mode in the future, since mappings would likely change, right?

Yes, users will have to update the volumes ownership. Btw, one of our goals with this proposal is to have something that works good for now and that allows to easily use a shiftfs-like solution when it'll be available. In one of our comments (#2101 (comment)) we're proposing to have a "Cluster" mode but implicit, to avoid changes to the pod spec.

At that point do we basically have to wait for shiftfs to make things more secure without additional user intervention?

Yes, I think shiftfs (or a similar solution) is the only way to avoid user intervention at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks for the clarification.

user ID 100000 in the host, so it appears to belong to root in the container.
The current implementation in container runtimes is to recursively perform a
`chown` operation over the image snapshot when it's pulled. This presents a risk
as it potentially increases the time and the storage needed to handle the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and argues for a more stable mapping

possible that the IDs range assigned to the pod is not big enough to accommodate
these IDs, in this case they will be mapped to the `nobody` user in the host.

It's not a big problem in the `Cluster` case, the users have to be sure that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused about ranges - what are they used for and why are they needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Range refers to the ID mapping the Pod gets. It's possible that a pod gets a small range and the UID that it runs with is not covered by that range.

For instance, if a pod gets the 0 100000 65536 mappings and a container inside the pod has USER: 200000, it'll be running as nobody on the host because the range of length 65536 is not long enough to map this ID to the host.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need ranges for pods? Why not a single "outside" UID?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible that a pod uses multiple IDs inside. It's not possible to map all of those to the same "outside" ID but we need a range. If we only map one ID, then the other IDs will be seen as nobody - 65534 from the host point of view witch creates issues to access files that don't have "others" permissions.

Copy link
Member

@rata rata Dec 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thockin It is implicit on what @mauriciovasquezbernal is saying but the Linux kernel forces you to have a range (you need a 1-1 mapping, a bijection).

I'd like the same you mention (map all UIDs/GIDs in the container to just one ID X on the host), but it is not possible to do that in Linux.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's simply impossible to do with any guarantee. Each pod might use any UID from the entire space. More concretely, they could use UID 0 and UID 2^32-1 - requiring the whole range to map. The only way to guarantee non-collision on the outside is to limit which UIDs can be used on the inside to a "reasonable" range.

This seems like a pretty fundamental thing we need to design for, no?

Sorry if I am rehashing old discussion - I don't see this really covered here. Are we assuming some "standard" range size for every pod? Do pods get to express "I need more UIDs"?

One of the very fundamental principles of kubernetes is that users should not be subject to collisions that they did not cause. For example, sharing host ports.

Hypothetically: Assume we had to pre-allocate "outside" UIDs and the only UIDs you could use were those - would it be better? What would the unit of allocation-assignment? I think Namespace or ServiceAccount seems right. So if we carry that to the hypothetical end, what if we end up at a model like this:

  • ServiceAccount gets new "users[]" and "groups[]" fields. These describe the "inside" and are optional.
  • An IDAllocation resource. This is created in response to the above fields and assigns "outside" IDs to them.
  • When a pod runs as that SA it gets that UID mapping.
  • When the cluster runs out of IDs, no new IDAllocations can be made.
  • Allocation must be pluggable.

That's really the only way to ensure no outside collision. Is that a worthwhile goal? Or is the better goal to buy time until node-local ephemeral mappings are viable wrt filesystems? We still have the questions above - standard range size or configurable, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I am rehashing old discussion - I don't see this really covered here. Are we assuming some "standard" range size for every pod? Do pods get to express "I need more UIDs"?

Yes, 65k (UID 0 -> UID 2^16-1) seems to be that reasonable default range size (https://github.com/kubernetes/enhancements/pull/2101/files#diff-2d23f433c4d58d94c34afbadc193009e2f7b4b8eab1c6549f7bd96d81abf142bR399-R407). It should be enough for a pod and also allows to schedule up to 65k pods in a node (far away of what k8s handles these days).

We haven't considered a case where a pod can ask for more UIDs. We'd prefer to see a use case that needs it before proposing something similar.

One of the very fundamental principles of kubernetes is that users should not be subject to collisions that they did not cause. For example, sharing host ports.

I'm not sure what it means for k8s, but with user namespaces the quantity of IDs available for each pod will be reduced because it's a node-wide shared resource. We'll need to limit the inside range of UIDs a pod can use to something like 0 -> 65k, at least for rev 1. If a pod needs to use IDs outside that range it could share the host user namespace anyway.

Hypothetically: Assume we had to pre-allocate "outside" UIDs and the only UIDs you could use were those - would it be better? What would the unit of allocation-assignment? I think Namespace or ServiceAccount seems right. So if we carry that to the hypothetical end, what if we end up at a model like this:

I think you described the allocation algorithm in a high level detail. In general this algorithm should have a pool of host (outside) IDs it can use (maybe configurable by the operator?), each time an "allocation unit resource" (pod / serviceaccount / namespace depending on the outcome of our discussion) is created, this algorithm assigns a portion of that range to that resource.

ServiceAccount gets new "users[]" and "groups[]" fields. These describe the "inside" and are optional.

Do you mean to let the user control the "inside" part of the IDs mapping?

That's really the only way to ensure no outside collision. Is that a worthwhile goal? Or is the better goal to buy time until node-local ephemeral mappings are viable wrt filesystems? We still have the questions above - standard range size or configurable, etc.

That's a good question. What do you mean with "node-local ephemeral mapping"?
I think we should guarantee that each allocation unit gets a different and non-overlapping range. If we allocate per pod, each kubelet can make this decision, if we allocate per-namespace or per-serviceaccount we'll have to make this decision to an upper layer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I am rehashing old discussion - I don't see this really covered here. Are we assuming some "standard" range size for every pod? Do pods get to express "I need more UIDs"?

Yes, 65k (UID 0 -> UID 2^16-1) seems to be that reasonable default range size

Is that really necessary, given that most pods use exactly 1 UID?

allows to schedule up to 65k pods in a node

Only if you assume every range is localized to a node. If this was per-SA or
per-NS then it limits you to 64K SAs or NSes, which is actually something
people hit.

What if we gave everyone a smaller set of IDs by default, e.g.
0-1023 + 65520-65535. Then, anyone who really needed more that that would
have to ask for it, and those special cases could get 64K allocations.

Just brainstorming.

We haven't considered a case where a pod can ask for more UIDs. We'd prefer to see a use case that needs it before proposing something similar.

I know some apps do run as ID > 64K. It's not that they need MORE IDs,
they'd need different ones.

We'll need to limit the inside range of UIDs a pod can use to something like 0 -> 65k

That's a breaking change for some users. They need a clean opt-out at least.
That's what the host space is for. I agree.

ServiceAccount gets new "users[]" and "groups[]" fields. These describe the "inside" and are optional.

Do you mean to let the user control the "inside" part of the IDs mapping?

If we end up going as far as an allocator, sure why not?

That's really the only way to ensure no outside collision. Is that a worthwhile goal? Or is the better goal to buy time until node-local ephemeral mappings are viable wrt filesystems? We still have the questions above - standard range size or configurable, etc.

That's a good question. What do you mean with "node-local ephemeral mapping"?

I meant shiftfs or whatever will make FS permissions compatible with user
namespaces. It sounds like linux is as brain-damaged now as it was when I last
looked at userns - years ago. :(

I think we should guarantee that each allocation unit gets a different and non-overlapping range. If we allocate per pod, each kubelet can make this decision, if we allocate per-namespace or per-serviceaccount we'll have to make this decision to an upper layer.

I agree with your conclusion, I am less convinced on your predicate. I am not
sure we NEED to guarantee that each allocation unit gets a different and
non-overlapping range. If we simply ensure a low probability of collision, we
can buy enough time for a "real" solution.

I don't know which solution will be "real" enough or how we would migrate
users, but I am sure we can figure that out...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For instance, if a pod gets the 0 100000 65536 mappings and a container inside the pod has USER: 200000, it'll be running as nobody on the host because the range of length 65536 is not long enough to map this ID to the host.

Ops, I was wrong on this statement. If the user the container should run as is not part of the mapping it'll fail to start. Also calls to setuid/setgid within the container trying to set an ID not present on the mapping will fail as well.

In other words, any container trying to run with an UID not present on the mapping will have a hard failure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that really necessary, given that most pods use exactly 1 UID?

Well, with 64K we’ll cover even more cases. If we go for a per namespace / svc account approach we’ll need to reduce that range as you mentioned above. Finding the right size is important, the bigger the range the less allocations we can do and the higher the probability of a collision, on the other hand, the smaller the range the more applications won’t work.

Then, anyone who really needed more that that would
have to ask for it, and those special cases could get 64K allocations.

I think it’s related to my question about letting the user configure the inside part of the mapping. Probably it’s something we don’t want to do for rev1, asking the user to share the host namespace could be enough for now. For future revisions we could let the user choose the inside part of the mapping, so the IDs they’re using are mapped.

I know some apps do run as ID > 64K. It's not that they need MORE IDs,
they'd need different ones.

Oh yeah, exactly.

I meant shiftfs or whatever will make FS permissions compatible with user
namespaces. It sounds like linux is as brain-damaged now as it was when I last
looked at userns - years ago. :(

Yes, we don’t have more solutions now. shiftfs will probably never be merged. There is some ongoing work by Christian Brauner[1] but unsure what will happen and, assuming the best case that it is merged and when that kernel will be available in distros

[1]: lwn article: https://lwn.net/Articles/837566/. V4 is out now, some days ago (lwn links to v1 or v2)

I agree with your conclusion, I am less convinced on your predicate. I am not
sure we NEED to guarantee that each allocation unit gets a different and
non-overlapping range. If we simply ensure a low probability of collision, we
can buy enough time for a "real" solution.

I’m not sure either. Having non-overlapping ranges now that we can’t guarantee that each pod runs in a different mapping could be an overkill. I agree we can buy time by having a simpler approach, however I’d like to know the opinion of the Red Hat folks as we know they want to have non-overlapping mappings @mrunalp @giuseppe @rhatdan

the `Cluster` mode. This is for discussion with the community and only one will
be considered.

<<[UNRESOLVED where to configure the cluster wide ID mappings ]>>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to think through how this can change over time. I know you want to say "don't change it" but experience says it will eventually need to be changed.,

- It's difficult to expose this parameter to the kubelet.
- The parameter could not be available for the kubelet if the kube-apiserver is
down.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option 3 - stick it in Node
Option 4 - make a new resource which is 1:1 with Nodes for this sort of info

@mauriciovasquezbernal mauriciovasquezbernal force-pushed the mauricio/userns_proposal_upstream branch from 3a5c1f7 to 7149f4d Compare December 7, 2020 13:56
@sargun
Copy link

sargun commented Dec 9, 2020

I just ran a casual survey of our nodes. We regularly see UIDs in the 65k+ range. The reason behind this (as far as I can tell), is that people are building stuff on machines that are hooked up to LDAP / active directory. In these systems, UIDs in the 1M+ range are totally normal.

In addition to that, shared CI systems (see: Jenkins) will regularly dynamically allocate high UIDs.

@rata
Copy link
Member

rata commented Dec 9, 2020

@sargun thanks for the examples! In your setups, then, a fixed mapping for pods (configured in the kubelet) will work fine? I expect it will, but some confirmation would be great :)

@thockin
Copy link
Member

thockin commented Dec 10, 2020

@rata:

a fixed mapping for pods (configured in the kubelet) will work fine? I expect it will, but some confirmation would be great :)

Wouldn't that end up mapping all such pods to the overflow UID, thus making them almost as insecure as before? Compromise one, get the rest for free...

@rhatdan
Copy link

rhatdan commented Dec 10, 2020

Yes and no, Just by running inside of a User Namespace means that you loose the system capabilities. So you won't be able to take advantage of DAC_OVERRIDE and DAC_READSEARCH, SETUID and SETGID, to use to reach other containers. Slightly better, but no where as good as running every container in a separate User Namespace, which is what we want to do with CRI-O.

if they need a feature only available in such user namespace, such as loading a
kernel module with `CAP_SYS_MODULE`.

### Notes/Constraints/Caveats
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you thought about devices' permissions (device mounts are set by device plugins in kubernetes)? AFAICS the problem space is similar to volumes but devices don't have the luxury of fsGroups we could use for GID mapping. The additional challenge with devices is that different devices have their own GIDs and we've also seen that certain groups (e.g., video) have different GID value depending on the node OS.

kubernetes/issues/92211 talks about the challenges with non-root users we've tried to solve. The primary goal is to allow containers with non-root user processes access devices. It looks at least Host mode is compatible with that. User namespaces was recently brought up and thus my questions here.

@mtaufen
Copy link
Contributor

mtaufen commented May 20, 2021

Maybe I miss your point? Consider the very first time I create a volume and use it from a pod. My pod sets USER to 1234. Kubelet maps that to "real" UID $BASE+210 (1234 % 1024). I write a file, it gets stored on disk as UID $BASE+210. Then next time I run that pod and access that volume, I must run as the same "real" UID. So we still need to mandate the UID mapping a priori, or we need to take note of it when the first kubelet assigns it the first time (kubelt writing to PV or PVC? Yuck).

I admit, I haven't dug deeply enough to know if we could do something simpler like map the whole pod to one UID for the whole volume, and put that on the volume spec. Maybe I'm just imagining something that's not possible.

@fuweid
Copy link

fuweid commented Jul 3, 2021

/cc

@k8s-ci-robot
Copy link
Contributor

@fuweid: GitHub didn't allow me to request PR reviews from the following users: fuweid.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@thockin
Copy link
Member

thockin commented Sep 3, 2021

Nobody picking this up for 1.23, right?

@rata
Copy link
Member

rata commented Sep 15, 2021

@thockin yes, we would like to pick this up. But the 1.23 KEP freeze is already past us, so we will try to get this into 1.24.

Will keep you posted ;)

@rata
Copy link
Member

rata commented Oct 29, 2021

@thockin @mtaufen

I think the main thing we get from it is that the run-as ranges don't need to be consistent across nodes/clusters anymore, because that can be explicitly specified for storage as needed regardless of run-as IDs. Which means Kubelet can figure out the host mapping locally instead of us needing to mandate it top-down, which means we can make run-as allocation more flexible if we want without worrying about storage.

Maybe I miss your point? Consider the very first time I create a volume and use it from a pod. My pod sets USER to 1234. Kubelet maps that to "real" UID $BASE+210 (1234 % 1024). I write a file, it gets stored on disk as UID $BASE+210. Then next time I run that pod and access that volume, I must run as the same "real" UID. So we still need to mandate the UID mapping a priori, or we need to take note of it when the first kubelet assigns it the first time (kubelt writing to PV or PVC? Yuck).

No, that is not how id mapped mounts work. The UID of files created inside an id mapped mount will be the UID of the user running inside the container. It doesn't matter what that UID the container is mapped to some other UID in the host. The id mapped mount gives us the flexibility to basically use any mapping (non overlapping in particular) and share files between containers just fine, as long as they run as the same user inside the container.

With ID mapped mounts the IDs are mapped using a function, a bijection in particular, that maps from A to B on write and the reverse on read (that is the big trick, it does the reverse for some other operations!). Let's say you use the UID running inside the container is 0. In other words, the pod runAsUser is 0. And this is running in a userns that maps 0 in the container to 1000 in the host.

Then, when UID in the container writes a file in an id mapped mount (and if you use the userns mapping for the id mapped mount mapping param) the file will be created with UID 0. You can technically do other things if you use different mappings, but that is the feature IMHO. We basically can share files, no matter if inside a userns or not.

Game over, no need to worry about mappings when sharing files in userns anymore! There is a talk by Christian Brauner on this years Linux Plumbers conf that explains this in more detail.

However, there are several problems to use this for the containers mounts. One is that this is fs specific, as of Linux 5.15 only ext4, fat, btrfs and xfs support it. Another limitation is that we need to support this in the OCI runtime spec (the bind mounts are done by the low level container runtime) and then on the low level runtimes (like runc). We've been through this process recently for another feature, and it took ~1 year to get a change there and then in runc (still no released version with that feature, though).

So, using id mapped mounts for the container mounts doesn't seem realistic in the short term. And we will always have to deal with the case of a fs that doesn't support id mapped mounts for a long while (well, probably forever, tbh).

I'll post a proposal shortly that should allow us to adopt id mapped mounts in the future (we basically don't need much, we've been thinking about it quite a lot and realize that is easy) and also incorporates the feedback of a per-namespace/per-sa mapping.

@rata
Copy link
Member

rata commented Oct 29, 2021

Hi!

After giving quite some thought to this, I have an idea that should make everyone happy. It should tackle all the requirements/use cases mentioned by @mtaufen, allows for @thockin ideas of per-ns/per-sa mappings and my concerns about not working for several users.

My idea is basically to disentangle enabling user namespaces with having non-overlapping mappings in pods with volumes. The latter is complicated and, as any heuristic, is designed to not work in all use cases.

However, we don’t want users that for some reason our heuristic doesn’t work for them, to not be able to use userns. From a cluster admin POV, not using userns is NOT a minor risk. To name one example this vulnerability, the first cross-account container takeover in the public cloud, would have never happened as described if userns were used in any way (doesn't matter if mappings overlap with other pods or not). The runc breakout (even with vulnerable versions of runc) is not possible with userns, because the user in the container is not root on the host, so it can’t overwrite the /proc/self/exe. This is mentioned in the runc security announcement for this vuln too. Having a way to make userns widely available is something we really want.

So, I think the per-sa/per-ns mapping is a great idea, makes a lot of sense and we want it, but as an iteration to further improve, as a phase 3 (more details below). Not as a requirement to use userns.

More concretely, I propose the following. Of course, names are not final and open for discussion. For simplicity, I’m assuming some names here, but they are not final in any way:

  • Pod spec changes:
    • pod.spec.useHostUsers: bool. If true or not present, uses the host user namespace (as today). If true, a new userns is created for the pod. This field will be used for phase 1 and 2 (more below)
    • pod.spec.securityContex.userns.pod2podIsolation: bool. If enabled, we will make the userns mappings be non-overlapping. This field will be used in phase 3.

The KEP work is divided into phases. Each phase works with more workloads, if a pod with certain characteristics is not yet supported, an error will be shown (i.e. if a pod with volumes activates userns and no volume support is implemented yet, the user will see an error).

Phases:

  • Phase 1 - pods “without” volumes

    • Used on pods with userns bool enabled in pod.spec and using any of these volume types: configmaps, secrets, downward API, projected, emptyDir. If any other type is used, this mode won't be used.
    • Pods use non overlapping mappings, chosen by the Kubelet
    • 64K length, so no guessing needed.
  • Phase 2 - pods with volume

    • Used for pods with userns bool enabled in pod spec and if the pod has any volume that is not the ones in phase 1 (configmaps, secrets, downward API, projected, emptyDir), then this mode is automatically used instead.
    • 64K length mapping (no guessing needed), chosen by the kubelet
    • The mapping will be the same for all pods in this category.
  • Phase 3 - pods with volume and inter-pod isolation

    • Activated when the “pod2podIsolation” field in the pod.spec is set to true and userns is activated too
    • Per ns or per sa mappings with way less than 64k uids exposed. Heuristics TBD

This way, we allow for all to happen. Almost all pods can enable userns, if pods that have volumes want more isolation, they can have it too. If the restrictions for such isolations don’t work, the cluster admin doesn’t sacrifice using userns for pods.

There are no “modes” the user needs to select. All added fields are bools and quite simple. And all can be controlled with admission webhooks and such, if the cluster admin wants to enforce some.

I’m sharing the high-level idea here, but this is definitely not simple as it sounds. Just to make sure we are all in the same page about the complexity of the change, here are some issues that might not be obvious we also need to solve:

  • When we create a container with a different userns mapping, the container runtimes needs to chown the files of the base image so the user running the container can access the rootfs. This creates a performance and storage overhead for every launched container. In fact, we did some internal tests and the overhead is so big without overlaysfs metacopy feature that it seems like a showstopper not having that feature. And containerd doesn’t support the metacopy overlayfs param, so we will need to add support in containerd as well. CRI-O, IIRC when talking with @giuseppe, supports metacopy, so that is done. And no, we can’t use id mapped mounts yet (it is not supported to use an id mapped as lowerdir in an overlayfs. We plan to use it to reduce the overhead even more in the future, as this feature progresses and the kernel adds support, though).

  • We need to make the CRI changes and coordinate with container runtimes when userns is introduced.

  • We also created some fixes for runc, as bind mounts can’t work when running in a userns and the source is in a directory like /va/lib/kubelet that doesn’t have permissions for others. This was just merged, though. Here is the changelog entry that will probably be present in the runc 1.1, in case you want more details: Open bind mount sources from the host userns opencontainers/runc#2576 (comment)

  • We of course need to check with other kinds of runtimes, like VM-runtimes, too. Although the CRI changes to introduce userns can probably be interpreted as an offset for VM runtimes (like runAsUser: 10 and mapping: 0 in the container to 1000 in the host, means that the effective UID should be 1010), we need to check with them.

  • These fields should probably integrate nicely with the PSP replacement and we can teach fsGroupChangePolicy to adjust the permissions taking into account not just the RunAsUser, but the effective UID it is used with volumes. So, users that can afford to chown their volumes, can migrate easily to userns.

There are several more details that I have in mind, but wanted to share these with you just to make sure we are all on the same page that this is not a trivial change. In fact, as the KEP links, we have a currently working PoC of an earlier version of this KEP. And even with that, we estimate about 6 months working full time to get the first 2 phases of this proposal.

For those reasons, I also think that disentangling userns enablement and those complicated heuristics is a win. In fact, I’m not sure if phase 3 is a KEP on its own, due to the complexity. But I’m fine keeping it in the same KEP until we do know for certain how much work it adds.

@thockin @mtaufen and others reading this: What do you think of the high level idea of the 3 phases? I think it is a nice way to incorporate the feedback in these discussions, but let me know what you think :)

I’ve added the topic to the next sig-node meeting (on Nov 2). I will probably prepare some slides to present this same high level idea there, to gather more input from the community.

Thanks again for the very valuable feedback and ideas, really. It helps A LOT! :)

@giuseppe
Copy link
Member

I still believe that we could entirely implement the user namespace support in the CRI runtime with just minor changes to the Kubelet.

We've been using user namespaces in CRI-O through an annotation for more than a year now, and I've not heard many complaints about it.

As long as the IDs specified for RunAsUser UID/GID and fsGroup are injected into the user namespace, the Kubelet should not notice any difference whether they are used inside of a user namespace or directly on the host.

The CRI runtime has more information available to choose a better range of IDs to allocate to the pod. CRI-O, for example, can inspect the image to see what IDs are used (either through image files ownership or looking into the /etc/passwd file).

Image inspection has some limitations right now that could be solved by changing the CRI to create all the containers with a single request. In this way, the CRI runtime can inspect all the images before deciding the user namespace size, which is not possible right now since the user namespace must be created before all the containers are created or even known to the CRI runtime.

For backward compatibility, RunAsUser might mean both the user on the host and in the container by setting an identity mapping. Similarly, we can do something similar for fsGroup.

If an explicit RunAsHostUser is specified, RunAsUser means the UID inside the container only.

Do you foresee any issue with this approach?

@rata
Copy link
Member

rata commented Nov 2, 2021

@giuseppe thanks for chiming in! Let me emphasize that it will be great if you want to get more involved with this. Can you? Do you want to elaborate more on that idea and post a more complete proposal? I leave some observations here, but lots are guessing as details are scarce for me to understand exactly what you propose.

As long as the IDs specified for RunAsUser UID/GID and fsGroup are injected into the user namespace, the Kubelet should not notice any difference whether they are used inside of a user namespace or directly on the host.

I understand that with this you mean: if pod has RunAsUser X, you map the user X in the container to the user X in the host. Is this correct?

In that case, I understand that you say this will work as the effective UID (EUID) used by the container is the same with/without userns (only when all the processes in the container are using that user, though). Therefore, there are no chowns required, etc. in the simple case.

I like the simplicity of the approach, but it has several downsides that @thockin, @mtaufen and others highlighted that we want to avoid as much as possible:

  • pods mappings overlap
  • The user decides which UID is used on the host, so they can easily attack others

These two are not trivial, at all. One consequence is, for example, that all the subpaths k8s vulns are not really mitigated with that approach, as the user can choose with host UID to use and therefore, to read the host file they want. Like CVE-2021-25741 and CVE-2017-1002101.

Am I missing something? I guess I am, but I can’t see it.

However, if the user uses “RunAsUser 0” do you also propose to map it to host UID 0? In that case, there are more problems:

  • The pod is now privileged on the host
  • It can exploit all the subpath vulns as well
  • It can exploit vulns such as the runc CVE-2019-5736, etc.

A HUGE number of docker images run as root, so I expect this case to be quite common.

There are other potential downsides or things that are not really clear to me on what you propose:

  • Will it work with secrets/configmaps mounted as volumes? The kubelet creates those files… Does the kubelet honor fsGroup on these files and that is why it works? Or how does it work exactly?
  • If the container creates a file with user X+1, has RunAsUser X (say containers running a legacy workload, or using systemd as entrypoint, etc.), how is it guaranteed that it will be able to read that file if the pod is re-scheduled or share it with other pods? How is the mapping guaranteed to be stable for that pod outside of the RunAsUser/RunAsGroup/friends fields? Not sure that we can do that if the mapping is chosen in the CRI runtime, we just don’t know about pods nor mapping stability at this layer.
  • Pod behavior should be consistent across CRI runtimes. How is that ensured? CRI runtime will scan the image and decide a mapping and that will be consistent across all CRI runtimes (VM runtimes, etc.)?
  • It will be impossible to use things like fsGroupChangePolicy for volumes in the future, if we happen to need it, as mappings are not controlled by the kubelet. Nor expose the mapping as an object, if that becomes needed

Taking into account that, under this proposal, basically the user can decide which UID runs on the host and mappings to overlap between pods... What is the advantage of doing this instead of a “node wide remap”? With node wide remap, I mean each pod has its own userns but the mapping for all pods is the same.

That has almost the same benefits, right? Except that with a node wide remap we can make pods unprivileged (mitigate the runc breakout and similar future vulns) and mitigate the subpath and others, by using a range that doesn’t overlap with the host fs-UIDs. At the expense of a one-time chown on the volumes, of course.

Am I missing something?

On a completely unrelated note, if my understanding is correct, what you are saying is also encouraging for phase 3! Let me see if I got that right: basically users of CRIO using userns with an annotation probably rely on the RunAsUser/RunAsGroup and such for sharing files in volumes (IIUC those are the UIDs that are not changed in CRIO when running with userns). In that case, maybe it is ok to expose a shorter mapping (like 4k or less) like we want in phase 3 and several workloads will work just fine. It seems quite a lot of users will benefit from that, which is great :)

The CRI runtime has more information available to choose a better range of IDs to allocate to the pod. CRI-O, for example, can inspect the image to see what IDs are used (either through image files ownership or looking into the /etc/passwd file).

Yeap, but we need the userns mapping at the pod sandbox time, right? When the runtime hasn’t downloaded the container images yet.

Image inspection has some limitations right now that could be solved by changing the CRI to create all the containers with a single request. In this way, the CRI runtime can inspect all the images before deciding the user namespace size, which is not possible right now since the user namespace must be created before all the containers are created or even known to the CRI runtime.

Right, maybe we can combine? Like, for phase 3 (under pod2podIsolation field) the mapping can be decided by the CRI runtime and notified to the Kubelet?

I would like to avoid something so complex, though. But this is why I think doing this at phase 3 under the inter-pod isolation feature is such a big win. We can polish on a step by step basis.

Can you elaborate on the CRI changes you propose? Both high level and the gRPC interface details. Like, do you expect the new method to be used only with userns? Or have both been used in parallel (one when creating pods without userns and the other with)? What would the new method look like, exactly?

For backward compatibility, RunAsUser might mean both the user on the host and in the container by setting an identity mapping. Similarly, we can do something similar for fsGroup.

But it is a huge risk allowing end users to choose which host UID their things will run, right? Sure we do that today, but we want to improve on that.

I mean, the point of not allowing mappings to overlap is totally void with that. See the comments before on actual CVEs exploitable if we do this.

If an explicit RunAsHostUser is specified, RunAsUser means the UID inside the container only.

I remember we talked about RunAsHostUser before :). If we use a per-sa or per-ns, I don’t think it will be needed.

Maybe we can add it in the future, if there is a need to share volumes across ns or service account and we need to specify to use the same effective UID? But only if it is really really needed, I’d say.

Do you foresee any issue with this approach?

I mentioned some issues I foresee, but based on assumptions of what you propose because it was not quite clear to me. Let me know if my assumptions were wrong.

Furthermore, would you like to elaborate more on what you propose and how it will work in different cases? That would be great.

If you prefer, we can meet to brainstorm some things too. It will be great if you can get way more involved. Do you want/can? :)

@rata
Copy link
Member

rata commented Nov 2, 2021

@endocrimes this is the current PR about userns we discussed in sig-node today. Thanks! (you were danielle at slack, right?)

I will probably open a new PR with all the feedback from here in a few days, not sure if we need to come some other agreement before doing that.

@endocrimes
Copy link
Member

@rata thanks - and yep!

@rata
Copy link
Member

rata commented Nov 2, 2021

The slides I presented on today's sig-node are here: https://docs.google.com/presentation/d/1z4oiZ7v4DjWpZQI2kbFbI8Q6botFaA07KJYaKA-vZpg/edit#slide=id.gc6f73a04f_0_0

rata added a commit to kinvolk/kubernetes-enhancements that referenced this pull request Nov 29, 2021
This commit provides the high level overview we discussed in the
SIG-node meeting on Nov 2.

This incorporates the valuable feedback in the discussion at PR kubernetes#2101.

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
rata added a commit to kinvolk/kubernetes-enhancements that referenced this pull request Nov 30, 2021
This commit adds the high level overview I proposed in the SIG-node
meeting on Nov 2. The work divided in phases and intial support (phase 1
and 2) is disentangled from further improvements that community members
wanted to see (phase 3).

This incorporates the valuable feedback in the discussion at PR kubernetes#2101,
making things as automatic as possible and adding a phase 3 for such
improvements, while it also leaves room for future improvements too.

Slides used in the Nov 2 SIG-node meeting are here:
	https://docs.google.com/presentation/d/1z4oiZ7v4DjWpZQI2kbFbI8Q6botFaA07KJYaKA-vZpg/edit#slide=id.gc6f73a04f_0_0

Closes: kubernetes#2101

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
rata added a commit to kinvolk/kubernetes-enhancements that referenced this pull request Nov 30, 2021
This commit adds the high level overview I proposed in the SIG-node
meeting on Nov 2. The work divided in phases and intial support (phase 1
and 2) is disentangled from further improvements that community members
wanted to see (phase 3).

This incorporates the valuable feedback in the discussion at PR kubernetes#2101,
making things as automatic as possible and adding a phase 3 for such
improvements, while it also leaves room for future improvements too.

Slides used in the Nov 2 SIG-node meeting are here:
	https://docs.google.com/presentation/d/1z4oiZ7v4DjWpZQI2kbFbI8Q6botFaA07KJYaKA-vZpg/edit#slide=id.gc6f73a04f_0_0

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
@rata
Copy link
Member

rata commented Nov 30, 2021

Opened a new PR that incorporates all the feedback from the discussion here: #3065

Please have a look
cc @thockin @mtaufen @endocrimes @mikedanese and anyone reading here :)

If that one looks good, we can probably merge it and close this one.

@mtaufen
Copy link
Contributor

mtaufen commented Dec 14, 2021

Thanks @rata for all the detailed follow up. I've been trying to get back to this one and keep getting pulled into other shorter term things... sorry about the radio silence. I will try to take a look through the new proposal soon.

I think the three-phases thing is great. The complexity clearly ramps up at each subsequent phase, and each of those milestones is a great place to re-evaluate the progress based on what we learned for the previous. I'd love for phase 1 to get started, that seems like the simplest/least-controversial piece and it would be good to get the implementation moving so we can learn.

@rata
Copy link
Member

rata commented Dec 20, 2021

@mtaufen Thank you!

@giuseppe
Copy link
Member

@mrunalp @derekwaynecarr PTAL

@rata
Copy link
Member

rata commented Jan 10, 2022

Please note this is the old kep, the new one that obsoletes this old proposal is: #3065

@rata rata mentioned this pull request May 23, 2022
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Development

Successfully merging this pull request may close these issues.