keps/127: Support User Namespaces #2101
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: mauriciovasquezbernal.
Are UID ranges for Pods a resource that the scheduler should account for?
Pathologically: if a Node decides to assign 2^29 UIDs to each pod, then the scheduler could place more Pods there than actually fit.

It's something we have thought about, but we haven't gone into details as it's not part of phase 1. I'm not sure the scheduler needs to be involved here, as IDs are not a very constrained resource on a Linux node.
@mauriciovasquezbernal it makes sense to start with 65536-sized ranges in the first phase and then adjust that if needed based on feedback. We should call that out in the KEP.
Makes sense. Added a note about it.
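As a quick sanity check on the arithmetic in this thread (a sketch for illustration, not part of the KEP): the number of pods a node can host with non-overlapping mappings is just the 32-bit ID space divided by the per-pod range size.

```python
ID_SPACE = 2**32  # size of the Linux UID/GID space on a node

def max_nonoverlapping_ranges(range_size: int) -> int:
    # How many non-overlapping per-pod ID ranges of this size fit on one node.
    return ID_SPACE // range_size

# The pathological case above: 2^29 IDs per pod leaves room for only 8 pods.
assert max_nonoverlapping_ranges(2**29) == 8
# The 65536-sized ranges proposed for phase 1 allow 65536 ranges per node.
assert max_nonoverlapping_ranges(65536) == 65536
```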
69b4f9f to 3a5c1f7
and group IDs. Kubelet creates the files for those volumes and it can easily set
the file ownership too.

Non-ephemeral volumes are more difficult to support since they can be persistent
For PVC, as it shares inside namespace, would it be good to have a new Namespaced mode, which every pod in a namespace shares the same uidmap.
It's something that was proposed a while ago too https://github.com/kubernetes/community/pull/2595/files#diff-473657a51a9a6deb70241820b28201d0653f398f9f0736a47f0d23ca57864fa6R56.
We don't see big benefits in this approach, as the isolation is only partial and pods inside the same namespace are still not isolated from each other. It also complicates the design, as there is a new mode that ranges have to be allocated for.
We'd prefer to avoid this mode and see if we are able to cover most user needs with the Cluster mode. If we realize it's really needed we could do a further iteration on it.
I think namespaced mode is way more valuable than "cluster" mode. I'm not very keen on cluster mode at all.
I'd go one further - we don't really want namespace mode, we want ServiceAccount mode. What if we can allocate one cluster-unique userID per ServiceAccount? Any pod running as that SA gets that host-UID when running with userNS turned on.
I agree that "namespace" and "service-account" modes are interesting ideas.
The main goal we're trying to accomplish with this KEP is to increase pod-to-host isolation. We know that increasing pod-to-pod isolation would also be great, but by doing the former we'll protect the host at least.
The most difficult part of supporting user namespaces is volumes: the lack of an ID-shifting (or similar) mechanism in the Linux kernel makes it very difficult to share a volume among pods with different mappings. If such a mechanism existed, we probably wouldn't be having this conversation.
The `Cluster` mode is an easy way to solve these volume issues. It's not a perfect solution, but it enables us to use user namespaces in some cases. With the new proposal we're making in the other comment, we're trying to keep this as simple as possible by enabling user namespaces in some basic cases and leaving open the possibility to cover more cases in future iterations.
I think trying to cover too many cases from the beginning is very difficult; maybe we can start with the basic ones and in the meanwhile keep an eye on new kernel proposals like id-mapped mounts that would definitely help with the more complicated ones.
I agree with starting with basic, broader protections given the difficulties and open questions of finer-grained approaches, and that it's a very positive step to improve Pod-Host isolation. This proposal seems like a good start on that.
@mauriciovasquezbernal One question I have (curiosity more than anything) is what risks are inherent to the Cluster mode? Part of the reason for protecting the host is to protect Pods from each other, how much can I do to other Pods if I'm in the same user ns? I assume (maybe naïvely) the same things as if I broke out of a container and was in host user ns today? There are ofc many other reasons to protect the host so I think Cluster mode is still valuable as a first step, but I am curious.
@thockin Can we imagine any scenarios where two Pods running as distinct service accounts would want to write to the same PVC? Maybe that's just a bad architectural choice and we should discourage it anyway...
@mtaufen before answering, I want to clarify that the Cluster mode intends to use the same ID mapping, but pods are in different user namespaces. That said, if you are able to break out into the host user namespace, you'll be able to read/write files of other pods, send signals, change scheduling parameters, and do other things described in the "User and group identifiers" section of https://man7.org/linux/man-pages/man7/credentials.7.html. It's a similar situation to today if you run pods as non-root and break into the host: you'll have a process running as non-root on the host. On the other hand, if you run pods as root the situation is very different: you'll have a root process on the host that can compromise all pods too; user namespaces mitigate that.
Thanks, makes sense.
- **Pod**:
  Each pod is placed in a different user namespace and has a different and
  non-overlapping ID mapping across the node. This mode is intended for stateless pods, i.e.
...or stateful pods that don't have multiple-access volumes (which seems to be most volumes, anecdotally) right?
We need to worry about reclaimPolicy: Retain
in addition to multi-access.
How many stateful scenarios do we enable by using service account as the atom instead of pod (i.e. all pods running as a service account have the same mapping, the mapping is unique to the service account)? There are some other benefits to mapping based on service account too, like better uid information in the audit subsystem logs.
Saw the discussion below about a per-(k8s-)namespace mode and IIUC similar benefits of per-service-account would apply to that.
I'm not sure Retain matters so much - the volume is "done". It's not saved for later use, it's pending some admin action to destroy it.
One issue with stateful pods, even if they access a non-shared volume, is that the pod would have to get the same mappings each time it's scheduled; otherwise it'll be running with different IDs on the host each time.
Yes, this is why Namespace or ServiceAccount modes are so attractive :)
use the same ID mappings. This mode doesn't provide full pod-to-pod isolation
as all the pods with `Cluster` mode have the same effective IDs on the host.
It provides pod-to-host isolation as the IDs are different inside the
container and in the host. This mode is intended for pods sharing volumes as
specifically this is for multi-reader/multi-writer volumes where 2 pods SIMULTANEOUSLY access a volume, yes?
and even then, we have fsGroup which should be a better option
IOW - do we REALLY need this half-baked mode?
It doesn’t have to be simultaneously. If a pod writes a file to a persistent volume with mode 0600, then a new pod would have issues accessing that file if the ID mappings of both pods are different.
Yes, fsGroup can definitely help, but not in some cases. fsGroup has some limitations too. It is based on the assumption that all files written to the volume have read/write access by that supplemental group. If a file is written with a 0600 mode it won't work. It also doesn't work for all volume types; for instance, NFS doesn't allow changing the ownership of files when the no_root_squash option is not specified in the mount.
So, to clarify, moving the mode (cluster, pod) away from the pod spec seems like a good improvement to the KEP, right? Did I follow that correctly?
I am not sure - I think we need to play out the rest to know the answer here.
We proposed one strategy that is a “fixed” mapping, that is: each pod has its own user namespace but all pods share the same mapping.
This is utterly pragmatic, but it only addresses 1 requirement - protect the
node from pods. Any attack that escapes one pod automatically compromises all
other pods on that node. I feel like we can do better.
However, Red Hat for example (cc @mrunalp) wants mappings where non-overlapping ranges are given to pods. That is, no two pods running on the same node have the same mappings.
IMO this is a step too far unless the FS-related issues are addressed. As a
final goal, it sounds great. But it suffers the same problems described
elsewhere - the inside space and the outside space are the same size, which
means you can't GUARANTEE that it works for any arbitrary 2^32 UID.
What I like about the fixed mapping is that it is super simple, works for MOST use cases (even sharing volumes with different namespaces), but gives you WAY more security than what we currently have without user namespaces.
I can't agree with "WAY more". A bit more, yes.
I’m not convinced about per-service-account mapping: if you want to share a volume, you also need to share a service account, and if the pods are using the service account for something, now it needs to have the union of all the permissions? Seems too hacky and a trade-off in security (you use user ns, which is a good thing security-wise, but now your service accounts have more permissions than needed for some pods).
I have a STRONG feeling that pods which are collaborating on a volume will tend
to have the same ServiceAccount. I don't have hard data for that. That said,
it feels like a relatively small increase in complexity over fixed mapping with
a low (IMO) risk of failure and much better security. The good news is that
there's a clean "out" in the form of hostUserNamespace, which we could advise
while we consider other options.
I’m sure these limits will be a problem in the future and will come back to bite us
The more "niche" the use-case, the less bad I feel about telling people to use
the host namespace :)
and I’m also not quite sure how the implementation of this would be (what a “cluster-unique” allocation for the mapping per ns looks like?).
The simple answer is to take the UID of the SA (or the hash of the UID) %
(2^32 - 2) (because -1 is reserved) & ~(0xffff). That gives us approximately
64K ID-spaces of 64K IDs each. If we don't need that many IDs, we can shift it
to something like 4 million ID-spaces of 1024 IDs each. Yes, there's a chance
of collision, but it's low. Yes, it still ignores the fact that any given pod could
use IDs outside the range we assign, regardless of what range we assign.
The complete solution involves new API resources and allocation and things
like that. I am disinclined to devise the solution until we think we really
need it.
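The formula above could be sketched roughly like this (a hypothetical helper, not part of the KEP; it assumes the SA UID is its UUID string and picks SHA-256 as the hash):

```python
import hashlib

RANGE_SIZE = 65536  # 64K IDs per ID-space, as in the comment above

def host_range_for_sa(sa_uid: str, range_size: int = RANGE_SIZE) -> int:
    """Derive a host-side range start from a ServiceAccount UID:
    hash it, reduce modulo (2^32 - 2) since -1 is reserved, then
    align down to a multiple of the range size with & ~(range_size - 1)."""
    digest = hashlib.sha256(sa_uid.encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    return (h % (2**32 - 2)) & ~(range_size - 1)
```

The result is deterministic per SA, so two clusters seeing the same SA UID derive the same range; collisions between different SAs are possible but unlikely, as noted above.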
I’d like to know what he thinks, as he really wanted non-overlapping mappings.
If we thought that a solution to the volumes problem was less than a year away,
we could take any one of these as a stepping-stone. At that point, the less
surface area, the better. Especially since this is opt-in and any solution
is better than what we have today.
It probably makes sense to configure which ranges to use for any strategy (per-ns, fixed, non-overlapping or whatever), so one configuration option in the Kubelet can probably be added. Or do you mean to have sensible default values for the ranges? Like use UID 10 million or $SOMETHING_USUALLY_UNUSED on the host, as that is a good default for most systems and you can change it if you want? That SGTM too.
If we need a value that is consistent across machines, never changes, and is
not something a user would need to tweak - sure. E.g. the overflow UID. As
soon as we tell users "go tweak this to change how your users experience
kubernetes" or "set this to improve security", flags are just awful.
I think we should aspire to either a) totally automatic or b) exposed via the
API.
IMO this is a step too far unless the FS-related issues are addressed. As a
final goal, it sounds great. But it suffers the same problems described
elsewhere - the inside space and the outside space are the same size, which
means you can't GUARANTEE that it works for any arbitrary 2^32 UID.
in CRI-O we punch holes for runAsUser/runAsGroup/fsGroup and map them back to their value on the host, so anything that relies on these values will still keep working.
We don't yet handle arbitrary IDs in the image files set to higher values than what is available in the inner user namespace, e.g. given an allocation of 65k IDs, a file owned by 200000:200000 will show up as owned by the overflow ID.
I think this issue could also be addressed in the runtime/kubelet. The available IDs in the userns can be allocated in a non-contiguous way to accommodate the files present in the image. What is preventing us from addressing this right now in CRI-O is that the IDs must be allocated when the sandbox is created, and the containers/images running in that pod are not known at this phase.
Another possibility would be to create a per-pod user namespace and then each container runs in their own sub user namespace, child of the pod user namespace; but that would probably require changes in the OCI runtime as well.
in CRI-O we punch holes for runAsUser/runAsGroup/fsGroup and map them back to their value on the host, so anything that relies on these values will still keep working.
I'm curious, how does it work when the container runs as root?
I think this issue could also be addressed in the runtime/kubelet. The available IDs in the userns can be allocated in a non-contiguous way to accommodate the files present in the image. What is preventing us from addressing this right now in CRI-O is that the IDs must be allocated when the sandbox is created, and the containers/images running in that pod are not known at this phase.
I think there are two different scenarios. One is to guarantee that a mapping for the files on the container image exist, that's something that could be implemented. The other case are IDs on the fly that a container could try to use, that seems very complicated to cover IMO.
in CRI-O we punch holes for runAsUser/runAsGroup/fsGroup and map them back to their value on the host, so anything that relies on these values will still keep working.
I'm curious, how does it work when the container runs as root?
with the current annotations mechanism, root is treated as any other user, so a container that specifies runAsUser=0 will be mapped to root on the host, since that is the assumption from the outside. It is possible though to use runAsUser= with a nonzero UID and specify to map it to root in the user namespace.
I think this issue could also be addressed in the runtime/kubelet. The available IDs in the userns can be allocated in a non-contiguous way to accommodate the files present in the image. What is preventing us from addressing this right now in CRI-O is that the IDs must be allocated when the sandbox is created, and the containers/images running in that pod are not known at this phase.
I think there are two different scenarios. One is to guarantee that a mapping for the files on the container image exist, that's something that could be implemented. The other case are IDs on the fly that a container could try to use, that seems very complicated to cover IMO.
I don't think the second case can be solved considering how user namespaces work now. The assumption is that the container, even if running with CAP_SETUID/CAP_SETGID, will still be limited to using the IDs that are present in the image, either through file ownership or users defined in /etc/passwd.
If a container needs access to any of the 2^32 UIDs, then IMO it should not run in a user namespace.
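To illustrate the "hole punching" idea described above (a sketch of the concept only, not CRI-O's actual implementation), a hole can be expressed as /proc/&lt;pid&gt;/uid_map-style (inner_start, host_start, length) entries in which one inner ID is identity-mapped to the host:

```python
def mapping_with_hole(base_host: int, size: int, punched: int):
    """Build (inner_start, host_start, length) entries that map inner IDs
    [0, size) into [base_host, base_host + size), except `punched`, which
    is mapped to the same numeric ID on the host (the "hole").
    Assumes 0 <= punched < size."""
    entries = []
    if punched > 0:
        entries.append((0, base_host, punched))
    entries.append((punched, punched, 1))  # identity-map the punched ID
    if punched + 1 < size:
        entries.append((punched + 1, base_host + punched + 1, size - punched - 1))
    return entries

# runAsUser=0: root inside maps to root on the host, as described above.
assert mapping_with_hole(100000, 65536, 0)[0] == (0, 0, 1)
```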
@thockin thanks! responses inline.
So, to clarify, moving the mode (cluster, pod) away from the pod spec seems like a good improvement to the KEP, right? Did I follow that correctly?
I am not sure - I think we need to play out the rest to know the answer here.
Heh, okay, let’s wait. But if we don’t want to create a new API object to specify the mapping (as you mentioned and I agree not doing that now) and we want to reduce the surface we expose, I think this bool to use host userns or not should be there for sure, and probably not much more (maybe some other thing, but hopefully we can just enable userns and that is all you need to do).
But let’s see what we arrive at :)
This is utterly pragmatic, but it only addresses 1 requirement - protect the node from pods. Any attack that escapes one pod automatically compromises all other pods on that node. I feel like we can do better.
Heh, yes, this is pragmatic indeed. There is probably some more isolation between pods as they run in different user namespaces (like can’t read the mem of a pod in a different userns even if /proc is exposed). But yeah, if a breakout happens, all pods in the node can be compromised.
The other upside non-mentioned of a fixed mapping is that it works with containers that need big/long mappings too, as we can allow almost all the 2^32 UID range inside the container. In contrast, in the case of per-ns/per-sa we need to guess which mappings will be used and hope for the best or ask to use the host user namespace in those cases (or expose an API object in k8s with the ranges needed or have some metadata about the containers with the used UIDs, etc.).
Those images with high UIDs that can’t be easily guessed happen to be our use case too. Although I see that maybe it is not an issue for several users and the appeal of per-sa or per-ns.
However, Red Hat for example (cc @mrunalp ) wants mappings where non-overlapping ranges are given to pods. That is, no two pods are running in the same node with the same mappings
IMO this is a step too far unless the FS-related issues are addressed. As a final goal, it sounds great. But it suffers the same problems described elsewhere - the inside space and the outside space are the same size, which means you can't GUARANTEE that it works for any arbitrary 2^32 UID.
Good point, fully agree. Until the kernel improves in some way (supports 64 bits kuids or something) and we can properly emulate userns of 2^32 UIDs, but not for now.
I have a STRONG feeling that pods which are collaborating on a volume will tend to have the same ServiceAccount. I don't have hard data for that. That said, it feels like a relatively small increase in complexity over fixed mapping with a low (IMO) risk of failure and much better security. The good news is that there's a clean "out" in the form of hostUserNamespace which we could advise while we consider other options.
Good point, yes, it seems a reasonable trade off regarding complexity and security. I’m unsure if they will share a SA, if you consider volumes that are typically written by many (like NFS), but I’m willing to find out :-D. I have a soft preference for per-ns (maybe being utterly pragmatic again?), but both very good ideas, thanks!
However, for our use case we have containers with high IDs that can’t be easily guessed. For that reason some way to have a fixed mapping will be very useful for us, so we can specify one long mapping to be used by all.
If you think that any configuration knob, to support containers with a wide range of UIDs is a complete no-go, we would probably continue with the other options (as the patch we need to do for Kubernetes with a fixed mapping will be simpler than today). But if you think it is okay to allow that in some way in early revisions, I’d really like to know which ways you think might be okay for that :)
and I’m also not quite sure how the implementation of this would be (what a “cluster-unique” allocation for the mapping per ns looks like?).
The simple answer is to take the UID of the SA (or the hash of the UID) % (2^32 - 2) (because -1 is reserved) & ~(0xffff). That gives us approximately 64K ID-spaces of 64K IDs each. If we don't need that many IDs, we can shift it to something like 4 million ID-spaces of 1024 IDs each. Yes, there's a chance of collision, but it's low. Yes, it still ignores the fact that any given pod could use IDs outside the range we assign, regardless of what range we assign.
Oh, good point! Yes, I like a simple solution like that: doesn’t guarantee, but is quite low chances of collision (at least for me, that I’m not an expert on that matter). Thanks!
The complete solution involves new API resources and allocation and things like that. I am disinclined to devise the solution until we think we really need it.
I fully agree.
The only “downside” that I can find is not a big one: you can’t easily know the mapping in advance, so you can’t just add “useHostUserNs: false” to your yaml, and you can’t easily migrate volumes to a new cluster (the UID will change); you will need to adjust your file permissions (one time) when switching to userns. Maybe this is even trivial, as maybe most cases can use fsGroupChangePolicy when doing so and others can adjust as they need?
In any case, I don’t think that is a big problem and applies to any way of using userns with volumes without exposing the mapping. I agree we don’t want to expose the mapping, at least for now, so a downside we will have to live with (and not so bad, IMHO).
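For reference, fsGroupChangePolicy is set in the pod-level securityContext; a minimal illustrative fragment (all names here are placeholders, not from the KEP):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example              # placeholder name
spec:
  securityContext:
    fsGroup: 2000
    # Only re-chown the volume when its root has the wrong owner/permissions,
    # instead of a full recursive chown on every start.
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
  - name: app
    image: example.com/app   # placeholder image
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: example-pvc # placeholder claim
```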
If we thought that a solution to the volumes problem was less than a year away, we could take any one of these as a stepping-stone. At that point, the less surface area, the better. Especially since this is opt-in and any solution is better than what we have today.
I think it is really hard to know when the kernel will improve on that. But definitely, less surface makes sense and anything is better than what we currently have, fully agree on that :). There is ongoing work on the kernel front, hopefully[1].
If we need a value that is consistent across machines, never changes, and is not something a user would need to tweak - sure. E.g. the overflow UID. As soon as we tell users "go tweak this to change how your users experience kubernetes" or "set this to improve security", flags are just awful.
I think we should aspire to either a) totally automatic or b) exposed via the API.
Heh, I hear the point of hosted solutions, I wasn’t really aware of this (makes sense, but haven’t heard it before). I think what is coming out of these conversations is quite nice and achieves that. Thanks a lot for stepping in @thockin! :)
[1]: LWN article: https://lwn.net/Articles/837566/. V4 came out a few weeks ago.
Non-ephemeral volumes are more difficult to support since they can be persistent
and shared by multiple pods. This proposal supports volumes with two different
strategies:
- The `Cluster` mode makes it easier for pods to share files using volumes when
Do we have any idea whether a namespace mode or SA mode would handle the same problem? Do users share volumes across SAs?
Is SA the right semantics though? Easy to imagine shared access to one resource by multiple principals, generally speaking. Might still be the convenient choice.
Problem space feels something like same-origin on the Web, K8s namespace might be the natural analogue...
Also worth thinking about cases where users have two clusters with the same set of namespaces and expect to move workloads between them - could we ensure that the mapping was consistently configured for both clusters, even across concurrent, potentially out-of-order additions/removals of namespaces?
Perhaps we could derive the mapping from the namespace name, to guarantee it's the same in different clusters. However, I don't know the security implications of allowing users to indirectly control the ranges via the namespace name.
Cases like this are another reason why we want to have the same mapping for all pods with volumes.
If we started with something like Cluster mode (which I think is a great first step), it theoretically blocks us from transparently migrating to Namespace mode in the future, since mappings would likely change, right? At that point do we basically have to wait for shiftfs to make things more secure without additional user intervention?
I don't know the security implications of allowing users to indirectly control the ranges by the namespace name.
It might not be a huge problem, given namespaces are probably created by cluster admins, controlling ranges might fit their privilege level anyway.
If we started with something like Cluster mode (which I think is a great first step), it theoretically blocks us from transparently migrating to Namespace mode in the future, since mappings would likely change, right?
Yes, users will have to update the volumes' ownership. Btw, one of our goals with this proposal is to have something that works well for now and that allows easily moving to a shiftfs-like solution when it becomes available. In one of our comments (#2101 (comment)) we're proposing to have a "Cluster" mode but implicit, to avoid changes to the pod spec.
At that point do we basically have to wait for shiftfs to make things more secure without additional user intervention?
Yes, I think shiftfs (or a similar solution) is the only way to avoid user intervention at all.
Ok, thanks for the clarification.
user ID 100000 in the host, so it appears to belong to root in the container.
The current implementation in container runtimes is to recursively perform a
`chown` operation over the image snapshot when it's pulled. This presents a risk
as it potentially increases the time and the storage needed to handle the
and argues for a more stable mapping
possible that the IDs range assigned to the pod is not big enough to accommodate
these IDs, in this case they will be mapped to the `nobody` user in the host.

It's not a big problem in the `Cluster` case, the users have to be sure that
I'm a little confused about ranges - what are they used for and why are they needed?
Range refers to the ID mapping the Pod gets. It's possible that a pod gets a small range and the UID that it runs with is not covered by that range.
For instance, if a pod gets the `0 100000 65536` mapping and a container inside the pod has `USER: 200000`, it'll be running as `nobody` on the host because the range of length 65536 is not long enough to map this ID to the host.
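The lookup behind that example can be sketched as a small helper mirroring the `/proc/[pid]/uid_map` format (`inside outside length`). This is only an illustrative model (names are made up, not kernel or k8s API); note that, as corrected later in the thread, a process cannot actually *start* as an unmapped UID — the overflow value below is how unmapped file owners are displayed:

```go
package main

import "fmt"

// idMapEntry mirrors one line of /proc/[pid]/uid_map: a contiguous range
// [insideStart, insideStart+length) in the user namespace maps onto
// [outsideStart, outsideStart+length) on the host.
type idMapEntry struct {
	insideStart, outsideStart, length uint32
}

// hostUID translates an in-namespace UID to the host UID. For file
// ownership, an unmapped UID is displayed as the overflow UID (65534,
// "nobody").
func hostUID(uid uint32, mappings []idMapEntry) uint32 {
	const overflowUID = 65534
	for _, m := range mappings {
		if uid >= m.insideStart && uid-m.insideStart < m.length {
			return m.outsideStart + (uid - m.insideStart)
		}
	}
	return overflowUID
}

func main() {
	// The single-entry mapping "0 100000 65536" from the discussion above.
	mapping := []idMapEntry{{0, 100000, 65536}}
	fmt.Println(hostUID(0, mapping))      // 100000
	fmt.Println(hostUID(1000, mapping))   // 101000
	fmt.Println(hostUID(200000, mapping)) // 65534: outside the range
}
```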
why do we need ranges for pods? Why not a single "outside" UID?
It's possible that a pod uses multiple IDs inside. It's not possible to map all of those to the same "outside" ID; we need a range. If we only map one ID, then the other IDs will be seen as `nobody` (65534) from the host point of view, which creates issues accessing files that don't have "others" permissions.
@thockin It is implicit in what @mauriciovasquezbernal is saying, but the Linux kernel forces you to have a range (you need a 1-1 mapping, a bijection).
I'd like the same thing you mention (map all UIDs/GIDs in the container to just one ID X on the host), but it is not possible to do that in Linux.
That's simply impossible to do with any guarantee. Each pod might use any UID from the entire space. More concretely, they could use UID 0 and UID 2^32-1 - requiring the whole range to map. The only way to guarantee non-collision on the outside is to limit which UIDs can be used on the inside to a "reasonable" range.
This seems like a pretty fundamental thing we need to design for, no?
Sorry if I am rehashing old discussion - I don't see this really covered here. Are we assuming some "standard" range size for every pod? Do pods get to express "I need more UIDs"?
One of the very fundamental principles of kubernetes is that users should not be subject to collisions that they did not cause. For example, sharing host ports.
Hypothetically: Assume we had to pre-allocate "outside" UIDs and the only UIDs you could use were those - would it be better? What would the unit of allocation-assignment? I think Namespace or ServiceAccount seems right. So if we carry that to the hypothetical end, what if we end up at a model like this:
- ServiceAccount gets new "users[]" and "groups[]" fields. These describe the "inside" and are optional.
- An IDAllocation resource. This is created in response to the above fields and assigns "outside" IDs to them.
- When a pod runs as that SA it gets that UID mapping.
- When the cluster runs out of IDs, no new IDAllocations can be made.
- Allocation must be pluggable.
That's really the only way to ensure no outside collision. Is that a worthwhile goal? Or is the better goal to buy time until node-local ephemeral mappings are viable wrt filesystems? We still have the questions above - standard range size or configurable, etc.
Sorry if I am rehashing old discussion - I don't see this really covered here. Are we assuming some "standard" range size for every pod? Do pods get to express "I need more UIDs"?
Yes, 65k (UID 0 -> UID 2^16-1) seems to be a reasonable default range size (https://github.com/kubernetes/enhancements/pull/2101/files#diff-2d23f433c4d58d94c34afbadc193009e2f7b4b8eab1c6549f7bd96d81abf142bR399-R407). It should be enough for a pod and also allows scheduling up to 65k pods on a node (far beyond what k8s handles these days).
We haven't considered a case where a pod can ask for more UIDs. We'd prefer to see a use case that needs it before proposing something similar.
One of the very fundamental principles of kubernetes is that users should not be subject to collisions that they did not cause. For example, sharing host ports.
I'm not sure what it means for k8s, but with user namespaces the quantity of IDs available for each pod will be reduced because it's a node-wide shared resource. We'll need to limit the inside range of UIDs a pod can use to something like 0 -> 65k, at least for rev 1. If a pod needs to use IDs outside that range it could share the host user namespace anyway.
Hypothetically: Assume we had to pre-allocate "outside" UIDs and the only UIDs you could use were those - would it be better? What would the unit of allocation-assignment? I think Namespace or ServiceAccount seems right. So if we carry that to the hypothetical end, what if we end up at a model like this:
I think you described the allocation algorithm at a high level. In general, this algorithm should have a pool of host (outside) IDs it can use (maybe configurable by the operator?); each time an "allocation unit resource" (pod / serviceaccount / namespace, depending on the outcome of our discussion) is created, the algorithm assigns a portion of that range to that resource.
ServiceAccount gets new "users[]" and "groups[]" fields. These describe the "inside" and are optional.
Do you mean to let the user control the "inside" part of the IDs mapping?
That's really the only way to ensure no outside collision. Is that a worthwhile goal? Or is the better goal to buy time until node-local ephemeral mappings are viable wrt filesystems? We still have the questions above - standard range size or configurable, etc.
That's a good question. What do you mean with "node-local ephemeral mapping"?
I think we should guarantee that each allocation unit gets a different and non-overlapping range. If we allocate per pod, each kubelet can make this decision, if we allocate per-namespace or per-serviceaccount we'll have to make this decision to an upper layer.
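The pool-carving idea described above could look roughly like the sketch below. It is a hedged illustration, not anything from the KEP: the type and function names, the pool bounds, and keying allocations by an "owner" string are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// rangeAllocator hands out non-overlapping, fixed-size ranges of host IDs
// to allocation units (pods, namespaces, or service accounts), carving them
// from a pool that an operator could configure.
type rangeAllocator struct {
	poolStart, poolLen uint32            // host IDs reserved for user namespaces
	rangeLen           uint32            // IDs per allocation unit, e.g. 65536
	next               uint32            // start of the next free range
	byOwner            map[string]uint32 // owner -> assigned range start
}

func newRangeAllocator(poolStart, poolLen, rangeLen uint32) *rangeAllocator {
	return &rangeAllocator{poolStart: poolStart, poolLen: poolLen,
		rangeLen: rangeLen, next: poolStart, byOwner: map[string]uint32{}}
}

// allocate returns the start of a host ID range for owner. It is idempotent
// per owner and errors once the pool is exhausted, at which point no new
// allocation units can be admitted.
func (a *rangeAllocator) allocate(owner string) (uint32, error) {
	if start, ok := a.byOwner[owner]; ok {
		return start, nil
	}
	if a.next+a.rangeLen > a.poolStart+a.poolLen {
		return 0, errors.New("host ID pool exhausted")
	}
	start := a.next
	a.next += a.rangeLen
	a.byOwner[owner] = start
	return start, nil
}

func main() {
	a := newRangeAllocator(100000, 1<<18, 65536) // pool fits exactly 4 ranges
	s1, _ := a.allocate("ns-a/pod-1")
	s2, _ := a.allocate("ns-a/pod-2")
	fmt.Println(s1, s2) // 100000 165536
}
```

If the allocation unit were a Namespace or ServiceAccount, this state would have to live above the kubelet (e.g. behind an IDAllocation-style resource); per-pod allocation could stay node-local.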
Sorry if I am rehashing old discussion - I don't see this really covered here. Are we assuming some "standard" range size for every pod? Do pods get to express "I need more UIDs"?
Yes, 65k (UID 0 -> UID 2^16-1) seems to be that reasonable default range size
Is that really necessary, given that most pods use exactly 1 UID?
allows to schedule up to 65k pods in a node
Only if you assume every range is localized to a node. If this was per-SA or
per-NS then it limits you to 64K SAs or NSes, which is actually something
people hit.
What if we gave everyone a smaller set of IDs by default, e.g.
0-1023 + 65520-65535. Then, anyone who really needed more than that would
have to ask for it, and those special cases could get 64K allocations.
Just brainstorming.
We haven't considered a case where a pod can ask for more UIDs. We'd prefer to see a use case that needs it before proposing something similar.
I know some apps do run as ID > 64K. It's not that they need MORE IDs,
they'd need different ones.
We'll need to limit the inside range of UIDs a pod can use to something like 0 -> 65k
That's a breaking change for some users. They need a clean opt-out at least.
That's what the host space is for. I agree.
ServiceAccount gets new "users[]" and "groups[]" fields. These describe the "inside" and are optional.
Do you mean to let the user control the "inside" part of the IDs mapping?
If we end up going as far as an allocator, sure why not?
That's really the only way to ensure no outside collision. Is that a worthwhile goal? Or is the better goal to buy time until node-local ephemeral mappings are viable wrt filesystems? We still have the questions above - standard range size or configurable, etc.
That's a good question. What do you mean with "node-local ephemeral mapping"?
I meant shiftfs or whatever will make FS permissions compatible with user
namespaces. It sounds like linux is as brain-damaged now as it was when I last
looked at userns - years ago. :(
I think we should guarantee that each allocation unit gets a different and non-overlapping range. If we allocate per pod, each kubelet can make this decision, if we allocate per-namespace or per-serviceaccount we'll have to make this decision to an upper layer.
I agree with your conclusion, I am less convinced on your predicate. I am not
sure we NEED to guarantee that each allocation unit gets a different and
non-overlapping range. If we simply ensure a low probability of collision, we
can buy enough time for a "real" solution.
I don't know which solution will be "real" enough or how we would migrate
users, but I am sure we can figure that out...
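The "low probability of collision" alternative can be sized with a quick birthday-problem estimate: if each allocation unit picked one of n available ranges uniformly at random, how fast do collisions appear? This is back-of-the-envelope math, not anything proposed in the KEP:

```go
package main

import "fmt"

// pNoCollision returns the probability that k units, each independently
// picking one of n ranges uniformly at random, all end up with distinct
// ranges (the classic birthday-problem product).
func pNoCollision(n, k int) float64 {
	p := 1.0
	for i := 0; i < k; i++ {
		p *= float64(n-i) / float64(n)
	}
	return p
}

func main() {
	// With 32-bit IDs carved into 64K-sized ranges there are 2^16 ranges;
	// reserving only part of the ID space for pods shrinks n accordingly.
	n := 1 << 16
	for _, k := range []int{10, 100, 1000} {
		fmt.Printf("k=%4d  P(collision) ~= %.4f\n", k, 1-pNoCollision(n, k))
	}
}
```

With 2^16 ranges, collisions stay below ~8% up to about a hundred random allocations but become near-certain around a thousand, which is roughly the scale where "buying time" stops working without coordination.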
For instance, if a pod gets the `0 100000 65536` mapping and a container inside the pod has `USER: 200000`, it'll be running as `nobody` on the host because the range of length 65536 is not long enough to map this ID to the host.
Oops, I was wrong on this statement. If the user a container should run as is not part of the mapping, the container will fail to start. Calls to `setuid`/`setgid` within the container trying to set an ID not present in the mapping will fail as well.
In other words, any container trying to run with a UID not present in the mapping will have a hard failure.
Is that really necessary, given that most pods use exactly 1 UID?
Well, with 64K we’ll cover even more cases. If we go for a per-namespace / svc-account approach we’ll need to reduce that range as you mentioned above. Finding the right size is important: the bigger the range, the fewer allocations we can do and the higher the probability of a collision; on the other hand, the smaller the range, the more applications won’t work.
Then, anyone who really needed more than that would
have to ask for it, and those special cases could get 64K allocations.
I think it’s related to my question about letting the user configure the inside part of the mapping. Probably it’s something we don’t want to do for rev1, asking the user to share the host namespace could be enough for now. For future revisions we could let the user choose the inside part of the mapping, so the IDs they’re using are mapped.
I know some apps do run as ID > 64K. It's not that they need MORE IDs,
they'd need different ones.
Oh yeah, exactly.
I meant shiftfs or whatever will make FS permissions compatible with user
namespaces. It sounds like linux is as brain-damaged now as it was when I last
looked at userns - years ago. :(
Yes, we don’t have other solutions now. shiftfs will probably never be merged. There is some ongoing work by Christian Brauner [1], but it's unclear what will happen and, even assuming the best case that it is merged, when that kernel will be available in distros.
[1]: LWN article: https://lwn.net/Articles/837566/. V4 came out a few days ago (LWN links to v1 or v2).
I agree with your conclusion, I am less convinced on your predicate. I am not
sure we NEED to guarantee that each allocation unit gets a different and
non-overlapping range. If we simply ensure a low probability of collision, we
can buy enough time for a "real" solution.
I’m not sure either. Having non-overlapping ranges, given that we can’t guarantee that each pod runs with a different mapping, could be overkill. I agree we can buy time with a simpler approach; however, I’d like to know the opinion of the Red Hat folks, as we know they want to have non-overlapping mappings @mrunalp @giuseppe @rhatdan
the `Cluster` mode. This is for discussion with the community and only one will
be considered.

<<[UNRESOLVED where to configure the cluster wide ID mappings ]>>
We'll need to think through how this can change over time. I know you want to say "don't change it", but experience says it will eventually need to be changed.
- It's difficult to expose this parameter to the kubelet.
- The parameter might not be available to the kubelet if the kube-apiserver is
down.
Option 3 - stick it in Node
Option 4 - make a new resource which is 1:1 with Nodes for this sort of info
I just ran a casual survey of our nodes. We regularly see UIDs in the 65k+ range. The reason behind this (as far as I can tell) is that people are building stuff on machines that are hooked up to LDAP / Active Directory. In these systems, UIDs in the 1M+ range are totally normal. In addition, shared CI systems (see: Jenkins) will regularly dynamically allocate high UIDs. |
@sargun thanks for the examples! In your setups, then, would a fixed mapping for pods (configured in the kubelet) work fine? I expect it will, but some confirmation would be great :) |
Wouldn't that end up mapping all such pods to the overflow UID, thus making them almost as insecure as before? Compromise one, get the rest for free... |
Yes and no. Just running inside a user namespace means that you lose the system capabilities, so you won't be able to take advantage of DAC_OVERRIDE, DAC_READ_SEARCH, SETUID, and SETGID to reach other containers. Slightly better, but nowhere near as good as running every container in a separate user namespace, which is what we want to do with CRI-O. |
if they need a feature only available in such user namespace, such as loading a
kernel module with `CAP_SYS_MODULE`.

### Notes/Constraints/Caveats
Have you thought about devices' permissions (device mounts are set by device plugins in Kubernetes)? AFAICS the problem space is similar to volumes, but devices don't have the luxury of fsGroups we could use for GID mapping. The additional challenge with devices is that different devices have their own GIDs, and we've also seen that certain groups (e.g., `video`) have a different GID value depending on the node OS.
kubernetes/issues/92211 talks about the challenges with non-root users we've tried to solve. The primary goal is to allow containers with non-root user processes to access devices. It looks like at least `Host` mode is compatible with that. User namespaces were recently brought up, and thus my questions here.
I admit, I haven't dug deeply enough to know if we could do something simpler like map the whole pod to one UID for the whole volume, and put that on the volume spec. Maybe I'm just imagining something that's not possible. |
/cc |
@fuweid: GitHub didn't allow me to request PR reviews from the following users: fuweid. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Nobody picking this up for 1.23, right? |
@thockin yes, we would like to pick this up. But the 1.23 KEP freeze is already past us, so we will try to get this into 1.24. Will keep you posted ;) |
No, that is not how id mapped mounts work. The UID of files created inside an id mapped mount will be the UID of the user running inside the container. It doesn't matter that the UID in the container is mapped to some other UID on the host. The id mapped mount gives us the flexibility to basically use any mapping (non-overlapping in particular) and share files between containers just fine, as long as they run as the same user inside the container.

With id mapped mounts the IDs are mapped using a function, a bijection in particular, that maps from A to B on write and the reverse on read (that is the big trick, it does the reverse for some other operations!). Let's say the UID running inside the container is 0. In other words, the pod runAsUser is 0. And this is running in a userns that maps 0 in the container to 1000 in the host. Then, when UID 0 in the container writes a file in an id mapped mount (and if you use the userns mapping for the id mapped mount mapping param), the file will be created with UID 0. You can technically do other things if you use different mappings, but that is the feature IMHO. We basically can share files, no matter if inside a userns or not. Game over, no need to worry about mappings when sharing files in userns anymore! There is a talk by Christian Brauner at this year's Linux Plumbers conf that explains this in more detail.

However, there are several problems to use this for the containers' mounts. One is that this is fs specific: as of Linux 5.15 only ext4, fat, btrfs and xfs support it. Another limitation is that we need to support this in the OCI runtime spec (the bind mounts are done by the low level container runtime) and then in the low level runtimes (like runc). We've been through this process recently for another feature, and it took ~1 year to get a change there and then in runc (still no released version with that feature, though). So, using id mapped mounts for the container mounts doesn't seem realistic in the short term.
And we will always have to deal with the case of a fs that doesn't support id mapped mounts for a long while (well, probably forever, tbh). I'll post a proposal shortly that should allow us to adopt id mapped mounts in the future (we basically don't need much; we've been thinking about it quite a lot and realized it is easy) and also incorporates the feedback of a per-namespace/per-sa mapping.
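The round-trip described above (map one way on write, the reverse on read) can be modeled as a pure function pair. This is only a toy model of the bijection; the real mechanism is the kernel's id-mapped mounts applied by the filesystem layer, not userspace code, and all names here are illustrative:

```go
package main

import "fmt"

// idMapping is one bijective range mapping, as used both by the user
// namespace (container -> host) and, in reverse, by an id-mapped mount.
type idMapping struct{ inside, outside, length uint32 }

// toHost maps a container UID to its host UID, if covered by the range.
func (m idMapping) toHost(uid uint32) (uint32, bool) {
	if uid >= m.inside && uid-m.inside < m.length {
		return m.outside + (uid - m.inside), true
	}
	return 0, false
}

// toContainer is the inverse: it maps a host UID back to the container UID.
func (m idMapping) toContainer(uid uint32) (uint32, bool) {
	if uid >= m.outside && uid-m.outside < m.length {
		return m.inside + (uid - m.outside), true
	}
	return 0, false
}

func main() {
	// Pod userns maps container UID 0 to host UID 100000; an id-mapped
	// mount using the same mapping applies the reverse on the way to disk,
	// so files written by container root land on disk owned by UID 0 again.
	m := idMapping{inside: 0, outside: 100000, length: 65536}
	processHost, _ := m.toHost(0)           // the writing process runs as host UID 100000
	onDisk, _ := m.toContainer(processHost) // the mount maps it back: on-disk owner is 0
	fmt.Println(processHost, onDisk)        // 100000 0
}
```

This is why containers sharing a volume keep working regardless of which (possibly non-overlapping) host ranges their user namespaces got: the composition of the two mappings is the identity for covered UIDs.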
Hi! After giving quite some thought to this, I have an idea that should make everyone happy. It should tackle all the requirements/use cases mentioned by @mtaufen, allow for @thockin's ideas of per-ns/per-sa mappings, and address my concerns about it not working for several users. My idea is basically to disentangle enabling user namespaces from having non-overlapping mappings in pods with volumes. The latter is complicated and, as any heuristic, is designed to not work in all use cases. However, we don’t want users for whom our heuristic doesn’t work to be unable to use userns. From a cluster admin POV, not using userns is NOT a minor risk. To name one example, this vulnerability, the first cross-account container takeover in the public cloud, would never have happened as described if userns had been used in any way (it doesn't matter if mappings overlap with other pods or not). The runc breakout (even with vulnerable versions of runc) is not possible with userns, because the user in the container is not root on the host, so it can’t overwrite /proc/self/exe. This is mentioned in the runc security announcement for this vuln too. Having a way to make userns widely available is something we really want. So, I think the per-sa/per-ns mapping is a great idea, makes a lot of sense and we want it, but as an iteration to further improve, as a phase 3 (more details below), not as a requirement to use userns. More concretely, I propose the following. Names are not final and open for discussion; for simplicity, I’m assuming some names here:
The KEP work is divided into phases. Each phase works with more workloads; if a pod with certain characteristics is not yet supported, an error will be shown (e.g. if a pod with volumes activates userns and no volume support is implemented yet, the user will see an error). Phases:
This way, we allow for everything to happen. Almost all pods can enable userns; if pods that have volumes want more isolation, they can have it too. If the restrictions for such isolation don’t work, the cluster admin doesn’t sacrifice using userns for pods. There are no “modes” the user needs to select. All added fields are bools and quite simple, and all can be controlled with admission webhooks and such, if the cluster admin wants to enforce some. I’m sharing the high-level idea here, but this is definitely not as simple as it sounds. Just to make sure we are all on the same page about the complexity of the change, here are some issues that might not be obvious but that we also need to solve:
There are several more details that I have in mind, but I wanted to share these with you just to make sure we are all on the same page that this is not a trivial change. In fact, as the KEP links, we have a currently working PoC of an earlier version of this KEP. And even with that, we estimate about 6 months working full time to get the first 2 phases of this proposal done. For those reasons, I also think that disentangling userns enablement from those complicated heuristics is a win. In fact, I’m not sure if phase 3 is a KEP on its own, due to the complexity, but I’m fine keeping it in the same KEP until we know for certain how much work it adds. @thockin @mtaufen and others reading this: what do you think of the high-level idea of the 3 phases? I think it is a nice way to incorporate the feedback in these discussions, but let me know what you think :) I’ve added the topic to the next sig-node meeting (on Nov 2). I will probably prepare some slides to present this same high-level idea there, to gather more input from the community. Thanks again for the very valuable feedback and ideas, really. It helps A LOT! :)
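As a purely hypothetical illustration of the direction (the comment above stresses that names are not final), the opt-in could eventually look something like this in a pod spec. Every field name below is an assumption for illustration, not a proposed API:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  # Hypothetical fields, illustration only; the discussion explicitly says
  # the names are not final.
  userNamespace:
    enabled: true          # phases 1-2: run the pod in its own userns
    pod2podIsolation: true # phase 3: also request a non-overlapping host ID range
  containers:
  - name: app
    image: registry.example.com/app:latest
```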
I still believe that we could entirely implement the user namespace support in the CRI runtime with just minor changes to the Kubelet. We've been using user namespaces in CRI-O through an annotation for more than a year now, and I've not heard many complaints about it. As long as the IDs specified for The CRI runtime has more information available to choose a better range of IDs to allocate to the pod. CRI-O, for example, can inspect the image to see what IDs are used (either through image files ownership or looking into the Image inspection has some limitations right now that could be solved by changing the CRI to create all the containers with a single request. In this way, the CRI runtime can inspect all the images before deciding the user namespace size, which is not possible right now since the user namespace must be created before all the containers are created or even known to the CRI runtime. For backward compatibility, If an explicit Do you foresee any issue with this approach? |
@giuseppe thanks for chiming in! Let me emphasize that it will be great if you want to get more involved with this. Can you? Do you want to elaborate more on that idea and post a more complete proposal? I leave some observations here, but lots are guessing as details are scarce for me to understand exactly what you propose.
I understand that with this you mean: if pod has RunAsUser X, you map the user X in the container to the user X in the host. Is this correct? In that case, I understand that you say this will work as the effective UID (EUID) used by the container is the same with/without userns (only when all the processes in the container are using that user, though). Therefore, there are no chowns required, etc. in the simple case. I like the simplicity of the approach, but it has several downsides that @thockin, @mtaufen and others highlighted that we want to avoid as much as possible:
These two are not trivial, at all. One consequence is, for example, that all the subpath k8s vulns are not really mitigated with that approach, as the user can choose which host UID to use and therefore read the host file they want. Like CVE-2021-25741 and CVE-2017-1002101. Am I missing something? I guess I am, but I can’t see it. However, if the user uses “RunAsUser 0”, do you also propose to map it to host UID 0? In that case, there are more problems:
A HUGE number of docker images run as root, so I expect this case to be quite common. There are other potential downsides or things that are not really clear to me on what you propose:
Taking into account that, under this proposal, basically the user can decide which UID runs on the host and mappings can overlap between pods... What is the advantage of doing this instead of a “node wide remap”? With node wide remap, I mean each pod has its own userns but the mapping for all pods is the same. That has almost the same benefits, right? Except that with a node wide remap we can make pods unprivileged (mitigate the runc breakout and similar future vulns) and mitigate the subpath vulns and others, by using a range that doesn’t overlap with the host fs-UIDs. At the expense of a one-time chown on the volumes, of course. Am I missing something? On a completely unrelated note, if my understanding is correct, what you are saying is also encouraging for phase 3! Let me see if I got that right: basically, users of CRI-O using userns with an annotation probably rely on RunAsUser/RunAsGroup and such for sharing files in volumes (IIUC those are the UIDs that are not changed in CRI-O when running with userns). In that case, maybe it is ok to expose a shorter mapping (like 4k or less), like we want in phase 3, and several workloads will work just fine. It seems quite a lot of users will benefit from that, which is great :)
Yeap, but we need the userns mapping at the pod sandbox time, right? When the runtime hasn’t downloaded the container images yet.
Right, maybe we can combine? Like, for phase 3 (under the pod2podIsolation field) the mapping can be decided by the CRI runtime and notified to the Kubelet? I would like to avoid something so complex, though. But this is why I think doing this at phase 3 under the inter-pod isolation feature is such a big win: we can polish on a step-by-step basis. Can you elaborate on the CRI changes you propose? Both the high level and the gRPC interface details. Like, do you expect the new method to be used only with userns? Or would both be used in parallel (one when creating pods without userns and the other with)? What would the new method look like, exactly?
But it is a huge risk allowing end users to choose which host UID their things will run as, right? Sure, we do that today, but we want to improve on that. I mean, the point of not allowing mappings to overlap is totally defeated with that. See the comments before on actual CVEs exploitable if we do this.
I remember we talked about RunAsHostUser before :). If we use a per-sa or per-ns, I don’t think it will be needed. Maybe we can add it in the future, if there is a need to share volumes across ns or service account and we need to specify to use the same effective UID? But only if it is really really needed, I’d say.
I mentioned some issues I foresee, but based on assumptions of what you propose because it was not quite clear to me. Let me know if my assumptions were wrong. Furthermore, would you like to elaborate more on what you propose and how it will work in different cases? That would be great. If you prefer, we can meet to brainstorm some things too. It will be great if you can get way more involved. Do you want/can? :) |
@endocrimes this is the current PR about userns we discussed in sig-node today. Thanks! (you were danielle at slack, right?) I will probably open a new PR with all the feedback from here in a few days, not sure if we need to come some other agreement before doing that. |
@rata thanks - and yep! |
The slides I presented on today's sig-node are here: https://docs.google.com/presentation/d/1z4oiZ7v4DjWpZQI2kbFbI8Q6botFaA07KJYaKA-vZpg/edit#slide=id.gc6f73a04f_0_0 |
This commit provides the high level overview we discussed in the SIG-node meeting on Nov 2. This incorporates the valuable feedback in the discussion at PR kubernetes#2101. Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
This commit adds the high level overview I proposed in the SIG-node meeting on Nov 2. The work is divided into phases, and initial support (phases 1 and 2) is disentangled from further improvements that community members wanted to see (phase 3). This incorporates the valuable feedback in the discussion at PR kubernetes#2101, making things as automatic as possible and adding a phase 3 for such improvements, while also leaving room for future improvements. Slides used in the Nov 2 SIG-node meeting are here: https://docs.google.com/presentation/d/1z4oiZ7v4DjWpZQI2kbFbI8Q6botFaA07KJYaKA-vZpg/edit#slide=id.gc6f73a04f_0_0 Closes: kubernetes#2101 Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
This commit adds the high level overview I proposed in the SIG-node meeting on Nov 2. The work is divided into phases, and initial support (phases 1 and 2) is disentangled from further improvements that community members wanted to see (phase 3). This incorporates the valuable feedback in the discussion at PR kubernetes#2101, making things as automatic as possible and adding a phase 3 for such improvements, while also leaving room for future improvements. Slides used in the Nov 2 SIG-node meeting are here: https://docs.google.com/presentation/d/1z4oiZ7v4DjWpZQI2kbFbI8Q6botFaA07KJYaKA-vZpg/edit#slide=id.gc6f73a04f_0_0 Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Opened a new PR that incorporates all the feedback from the discussion here: #3065 Please have a look If that one looks good, we can probably merge it and close this one. |
Thanks @rata for all the detailed follow up. I've been trying to get back to this one and keep getting pulled into other shorter-term things... sorry about the radio silence. I will try to take a look through the new proposal soon. I think the three-phases approach is great. The complexity clearly ramps up at each subsequent phase, and each of those milestones is a good place to re-evaluate progress based on what we learned from the previous one. I'd love for phase 1 to get started; that seems like the simplest/least-controversial piece, and it would be good to get the implementation moving so we can learn. |
@mtaufen Thank you! |
@mrunalp @derekwaynecarr PTAL |
Please note this is the old kep, the new one that obsoletes this old proposal is: #3065 |
This PR adds a KEP proposing to support user namespaces. It is not a new topic in Kubernetes as it has been discussed multiple times in the past. #127 was tracking the Support Node-Level User Namespaces Remapping design proposal. There was also a PR implementing that but it was never merged.
This new PR replaces #1903 that only contained the summary and motivations, this new PR now contains a complete proposal.
@rata @alban @sargun, @solarkennedy, @jigish, @giuseppe, @rhatdan
/cc @mrunalp
/sig node