Propose a feature to troubleshoot running pods #649
Conversation
/assign @dashpole
/lgtm
I think this proposal needs wider review since it comes with API changes.
Related to feature kubernetes/enhancements#277
```
type PodStatus struct {
    ...
    DebugContainerStatuses []ContainerStatus
```
since this feature is alpha, doesn't the field name need to reflect that?
That would make sense, I just didn't know that was how to do it. I know there's been a lot of discussion in kubernetes/kubernetes#30819, but I haven't followed it. It doesn't look like there's consensus, but I can try to cherry-pick something that might work.
I don't think we need to update the field name, but how about using annotations like init containers did? For example, adding an annotation such as `pod.alpha.kubernetes.io/debug-containers`? You can get more detail from https://github.com/kubernetes/community/blob/master/contributors/design-proposals/container-init.md
We should definitely not store something as important as debug container info in an annotation. Those are also compatibility nightmares when they transition to supported fields.
@kubernetes/sig-api-machinery-misc for guidance on alpha field names and gating
@liggitt can you please give more detail on this? I found that init containers used this approach to go from alpha to beta to graduation; why can't debug containers do the same?
It did, and it has been a very painful transition, with very poor user experience (c.f. kubernetes/kubernetes#45627 (comment))
Got it, thanks @liggitt
I think a section that covers the security concerns associated with this feature would be appreciated for future readers.
This creates an interactive shell in a pod which can examine and signal all
processes in the pod. It has access to the same network and IPC as processes in
the pod. It can access the filesystem of other processes by `/proc/$PID/root`,
and enter arbitrary namespaces of another container via `nsenter` when
is it possible to hop from one container to another within a pod with nsenter today? is the proposed debug container more powerful than a bash prompt inside an existing container obtained via exec today?
nsenter is possible today with CAP_SYS_ADMIN. Debug containers aren't different from containers in Kubernetes today except that they aren't in a pod spec. If you were to add the following container to a pod spec:

```
- name: shell
  image: debian
  stdin: true
  tty: true
  securityContext:
    capabilities:
      add:
      - SYS_ADMIN
```

Then, as long as there was a shared pid namespace, you could attach and nsenter other processes.

SYS_ADMIN wouldn't be granted to Debug Containers by default, but it should be an option.
Kubernetes master with a new enough docker version uses a shared pid namespace by default now.
The process for creating a Debug Container is:

1. `kubectl` constructs a `v1.Container` based on command line flags and
`v1.Container` is not a top-level object (it has no ObjectMeta or TypeMeta)... can you describe the wrapper object that would be posted?
Added description of a new object PodDebugContainer
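For illustration, one possible shape of that wrapper object; the field names here are assumptions based on the discussion, not the proposal's final API:

```
package debugapi

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PodDebugContainer is a sketch of the wrapper object mentioned above: a
// top-level object carrying the v1.Container to run. The exact fields are
// illustrative only.
type PodDebugContainer struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Spec is the container to run in the target pod's namespaces.
	Spec v1.Container `json:"spec"`
}
```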
1. `kubectl` constructs a `v1.Container` based on command line flags and
   `POST`s it to `/api/v1/namespaces/$NS/pods/$POD/debug`.
1. The API server performs admission control and proxies the connection to the
this means that all admission plugins that gate on `pods`, `pods/exec`, and `pods/attach` would need to be updated to guard a new kind. Expanding the surface area an admission plugin needs to protect will become a bigger deal when we have out-of-tree admission plugins (the mechanism for which is in progress).
I've noted this in a new security considerations section. It sounds like it will be good to get this change in prior to admission plugins becoming GA, though I would hope the plugins will deny resources they don't recognize.
> I would hope the plugins will deny resources they don't recognize.
they do not, they select the resources/subresources they guard (and admission plugins are already GA, it's the externalized ones that are being developed)
   is used because `/debug` is already used by the kubelet. `/podDebug` was
   chosen to parallel existing endpoints like `/containerLogs`.
1. The kubelet instructs the Generic Runtime Manager (this feature is only
   implemented for the CRI) to create a Debug Container.
how is availability of this feature determined, if only some CRI implementations support it?
Perhaps this was poorly worded. What I was trying to say is that this is implemented using the CRI and not the legacy runtimes. It doesn't require any changes to the CRI and will work with any runtime implementing the interface. I've updated the wording in my copy.
It is an error to attempt to create a Debug Container with the same name as a
container that exists in the pod spec. There are no limits on the number of
Debug Containers that can be created in a pod, but exceeding a pod's resource
allocation may cause it to be evicted.
clarify what would be evicted? the pod? the debug container?
The pod would be evicted. Updated the doc.
to streaming.

It is an error to attempt to create a Debug Container with the same name as a
container that exists in the pod spec. There are no limits on the number of
what component detects this error and what response is returned in that case?
The apiserver detects the error and returns a BadRequest. Will update the doc.
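As a rough sketch of the collision check described here (the helper name and error text are assumptions; the real apiserver code may differ), the apiserver only has to compare the requested name against the containers declared in the pod spec:

```
package debugapi

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// validateDebugContainerName sketches the name-collision check; the error it
// returns would be surfaced to the client as a 400 BadRequest.
func validateDebugContainerName(pod *v1.Pod, debugName string) error {
	for _, c := range pod.Spec.InitContainers {
		if c.Name == debugName {
			return fmt.Errorf("container %q already exists in the pod spec", debugName)
		}
	}
	for _, c := range pod.Spec.Containers {
		if c.Name == debugName {
			return fmt.Errorf("container %q already exists in the pod spec", debugName)
		}
	}
	return nil
}
```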
policy.
* Explicit reattaching isn't implemented. Instead a `kubectl debug` invocation
  will implicitly reattach if there is an existing, running container with the
  same name. In this case container configuration will be ignored.
ignoring specified information and implicitly reattaching seems confusing, and not like something we'd want long-term
I agree 100%
so if this moves from alpha to beta to stable, how do you maintain skew compatibility with older kubectl clients' debug implementations without propagating the implicit throw-away-info-and-reattach behavior into the stable version of the feature?
I think if we just expand `kubectl attach` to support debug containers, that would solve the problem.
either way this has to be resolved prior to moving out of alpha
`startContainer()` will be updated to write a new label
`io.kubernetes.container.type` to the runtime. Existing containers will be
started with a type of `REGULAR` or `INIT`. When added in a subsequent step,
does this mean a 1.7 kubelet managing containers without this label (that were started by a 1.6 kubelet) will be confused about whether they are debug containers?
No, it's backwards compatible (and there's a test). The `REGULAR` and `INIT` labels aren't used by anything initially and the kubelet behavior only differs when `Type == "DEBUG"`. (Type will be an empty string for a container that existed prior to the feature being enabled.)
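A minimal sketch of why the empty label is safe (constant and function names are illustrative, not the kubelet's actual code):

```
package kubeletsketch

// Illustrative stand-ins for the label described above.
const (
	containerTypeLabelKey = "io.kubernetes.container.type"
	containerTypeDebug    = "DEBUG"
)

// isDebugContainer returns true only for containers explicitly labeled DEBUG.
// A container started by an older kubelet has no label, so the map lookup
// yields "" and the container keeps the regular code path.
func isDebugContainer(labels map[string]string) bool {
	return labels[containerTypeLabelKey] == containerTypeDebug
}
```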
### Additional Constraints

1. Non-interactive workloads are explicitly supported. There are no plans to
If the container run and attach steps are distinct, how is stdout/stderr coordinated so that the attach request obtains the first byte written to each? Is the first attach special? How do subsequent or additional attaches behave w.r.t. previous output from the debug container's entrypoint?
You're right, there's a race here since the runtime is buffering the output. The container starts and its initial output goes to the buffer (visible via `kubectl log`) and then the attach picks up mid-stream. This is fine for interactive troubleshooting but might be a problem for non-interactive workloads.
if @ncdc has learned anything, it is that if an output buffer race can happen, it will, and that if it might be a problem, it will.
@liggitt the typical process for starting a docker container is (outside of kubernetes):
- create container
- attach to container
- start container

If you must start the container prior to attaching (which tends to be the case for things like `kubectl run`), then your only option to make sure you see all prior output is to specify `logs=true` when attaching. This has downsides: last time I checked, you can't limit the output to e.g. the last 100 lines, and if you have a TTY, I'm not sure what happens in that case. Also note, this isn't currently available in the version of the docker api vendored into kubernetes.
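The ordering being described can be sketched as follows; `runtimeClient` is a stand-in interface used only to show the sequence, not the Docker SDK:

```
package attachsketch

import "io"

// runtimeClient is a hypothetical interface used only to illustrate ordering.
type runtimeClient interface {
	CreateContainer(name, image string) (id string, err error)
	AttachContainer(id string) (io.ReadCloser, error)
	StartContainer(id string) error
}

// runAndCapture follows the create -> attach -> start order: because the
// attach happens before the process starts, the returned stream sees the
// container's very first byte of output.
func runAndCapture(rc runtimeClient, name, image string) (io.ReadCloser, error) {
	id, err := rc.CreateContainer(name, image)
	if err != nil {
		return nil, err
	}
	out, err := rc.AttachContainer(id)
	if err != nil {
		return nil, err
	}
	if err := rc.StartContainer(id); err != nil {
		out.Close()
		return nil, err
	}
	return out, nil
}
```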
@ncdc What's the longer term direction here? It would be significantly more work, but we could theoretically add create/attach/start functionality for Debug Containers. I wouldn't want to do it for an MVP, though.
I just wrote in another comment about this, but I think given the way the kubelet sync loop works, it will be very difficult to achieve create/attach/start without blocking one of the kubelet workers. We will need to discuss with sig-node if we want to pursue this.
1. Non-interactive workloads are explicitly supported. There are no plans to
   support detached workloads, but doing so would be trivial with an
   `attach=false` flag.
1. There are no guaranteed resources for ad-hoc troubleshooting. If
this seems like it would make debugging a pod that was memory constrained pretty difficult.
I agree, but in practice that's not been a problem we've had so far and we have to start somewhere. This could be improved in the future with the planned vertical pod autoscaling feature.
how do debug containers interact with graceful termination of pods?
@liggitt Debug containers receive the same signals as other containers in the pod for lifecycle events. They differ only in that they aren't deleted when syncing pod spec while the pod is alive.
/assign @pwittrock
It would be reasonable for Kubernetes to provide a default container name and
image1, making the minimal possible debug command:
image
Corrected, thanks.
```
kubectl debug -it target-pod
```
Do you think you'd ever want to run `kubectl debug` and not attach stdin and use a TTY?
I wouldn't, no, but then again I'm not sure why the option exists in `kubectl exec`. I can imagine wanting to run `kubectl debug target-pod -- netstat -an`, but only if I'll definitely get the first byte of the output stream, and that would of course work just as well with a tty.
Updated the document to specify that -i & -t will be enabled by default.
replace a Debug Container that has exited by re-using a Debug Container name. It
is an error to attempt to replace a Debug Container that is still running.

One way in which `kubectl debug` differs from `kubectl exec` is the ability to
`kubectl attach` supports attaching to both init and normal containers. Would you want to expand it to support debug containers too? That would require the least amount of coding.
Ahh, well this is a compelling alternative. I may have misunderstood your intention, but not streaming the `/debug` subresource and instead relying solely on attach would solve several problems. It would sidestep (though not solve) output stream coordination and allow kubectl to generate the container configuration, which is more flexible. Off the top of my head:

- `kubectl debug` would do a 2-step run of the debug container followed by an optional attach. The optional attach would better support non-interactive workloads.
- The apiserver can't check that the debug container exists by examining Pod.Spec as it does for regular/init containers, but it should be able to check Pod.Status.DebugContainerStatuses. It's the same story for kubectl.

Output stream coordination would then be solved for Debug Containers when/if it's solved for attach.

Great idea, I'll prototype it.
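A sketch of the status-based lookup mentioned in the second bullet, using local stand-in types since today's core/v1 PodStatus has no DebugContainerStatuses field:

```
package statussketch

// Local stand-ins mirroring the proposed status shape.
type ContainerStateRunning struct{}

type ContainerState struct {
	Running *ContainerStateRunning
}

type ContainerStatus struct {
	Name  string
	State ContainerState
}

type PodStatus struct {
	DebugContainerStatuses []ContainerStatus
}

// debugContainerRunning shows the check: a debug container never appears in
// the pod spec, so the apiserver (and kubectl) would consult the proposed
// DebugContainerStatuses field before attaching.
func debugContainerRunning(status PodStatus, name string) bool {
	for _, s := range status.DebugContainerStatuses {
		if s.Name == name && s.State.Running != nil {
			return true
		}
	}
	return false
}
```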
The only true way to ensure you don't miss any output is create-container, attach-container, start-container, in that order. That's how `docker run` works. For something like `kubectl run`, and probably `kubectl debug` too, we can't really do that, because of the way the kubelet sync loop works (kubectl waits until it sees the pod is Running before attaching). Well, I guess we could potentially do that, but it would require pausing a sync loop iteration until the remote client (kubectl) attaches, which isn't ideal.
Added this info to the doc, thanks!
### Killing Debug Containers

Debug containers will not be killed automatically until the pod (specifically,
the pod sandbox) is destroyed. Unlike `kubectl exec`, Debug Containers will not
There is roundabout support in newer docker versions for killing exec sessions. It now records the pid of the process that was exec'd, and we could use that information to do a kill.
the pod sandbox) is destroyed. Unlike `kubectl exec`, Debug Containers will not
receive an EOF if their connection is interrupted. Instead, Debug Containers
must be reattached to exit a running process. This could be tricky if the
process does not allocate a TTY, in this case a second Debug Container could be
Same comment as above re why would you not allocate a tty?
Unless there's a reason that non-interactive workloads might shun a TTY, I have no argument for not having one. I was just following the perceived convention of `kubectl exec`.
I would view `kubectl debug` as an interactive utility for debugging a running pod, whereas I see `kubectl exec` as a tool in my toolbox that might or might not require user interaction. Although the more I think about it, typically when I do use `kubectl exec` it's to get a shell, in which case I almost always do `-it`. There's a recent issue proposing that exec defaults these to true: #46300.
@ncdc @liggitt since `kubectl debug` is mainly a tool for getting a shell with a TTY, how much do we care about output stream coordination when `kubectl` will already say "press enter to get a prompt"? I think it's reasonable to defer to a future general solution rather than trying to solve this for debug.
No objections to deferring
of Debug Containers is reported via a new field in `v1.PodStatus`, described in
a subsequent section.

#### Alternative: Extending `/exec`
@dchen1107 @lavalamp @smarterclayton @pwittrock
For pod troubleshooting we need to choose between the object-based approach described above (suggested by @smarterclayton ) and the exec-based approach described below (suggested by @lavalamp and resembling the "image exec" approach of the original proposal, interestingly).
13 ? Ss 0:00 bash
26 ? Ss+ 0:00 /neato
107 ? R+ 0:00 ps x
root@debug-image:~# cat /proc/26/root/etc/resolv.conf
nit: I don't think this works with docker.
This updates the Pod Troubleshooting Design Proposal for recent developments in the community and to reflect the consensus from the API review: using the existing /exec endpoint as a starting point for this feature.
Force-pushed from bbca2a1 to 8136e3d.
...
// DebugName is the name of the Debug Container. Its presence will cause
// exec to create a Debug Container rather than performing a runtime exec.
DebugName string `json:"debugName,omitempty" ...`
I vote for making a sub section:

```
type PodExecOptions struct {
    ...
    EphemeralContainer *PodExecEphemeralContainerSpec
}

type PodExecEphemeralContainerSpec struct {
    Name  string
    Image string
}
```
SGTM, though in my naive prototype the sub section wasn't populated from the HTTP params.
I'm not very familiar with the api machinery so I probably just missed something. I see where queryparams flattens the struct based on JSON names. Do I need to write a custom converter somewhere to get it back from params to an object? Could you point me in the right direction?
yeah, nested structs parsed from query params won't work cleanly (kubernetes/kubernetes#21476)
Earlier in the text it said this was not using query params--I thought it was using a message sent after the SPDY channel was opened.
We will extend `v1.Pod`'s `/exec` subresource to support "executing" container
images. The current `/exec` endpoint must implement `GET` to support streaming
for all clients. We don't want to encode a (potentially large) `v1.Container` as
an HTTP parameter, so we must extend `v1.PodExecOptions` with the specific
Oh, I see. I misunderstood this line.
type PodExecOptions struct {
    ...
    // Run Command in an ephemeral container which shares some namespaces with Container.
    EphemeralContainer PodExecEphemeralContainerSpec
This needs to be a pointer. Uh, but I will suggest something else since sub structs are indeed currently broken.
```
// PodExecOptions is the query options to a Pod's remote exec call
type PodExecOptions struct {
    ...
    // EphemeralContainerName is the name of an ephemeral container in which the
    // command ought to be run. Either both EphemeralContainerName and
    // EphemeralContainerImage fields must be set, or neither.
    EphemeralContainerName *string `json:"ephemeralContainerName,omitempty" ...`

    // EphemeralContainerImage is the image of an ephemeral container in which the command
    // ought to be run. Either both EphemeralContainerName and EphemeralContainerImage
    // fields must be set, or neither.
    EphemeralContainerImage *string `json:"ephemeralContainerImage,omitempty" ...`
}
```
Renamed as suggested
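For illustration, with the flat fields above the ephemeral container rides along as ordinary query parameters on the existing exec endpoint; the parameter names follow the suggestion here, not a shipped API, and the values are placeholders:

```
package execsketch

import (
	"fmt"
	"net/url"
)

// buildExecURL sketches the request a client might construct. The "debug" and
// "debian" values are placeholders.
func buildExecURL(ns, pod string) string {
	params := url.Values{}
	params.Set("command", "/bin/sh")
	params.Set("stdin", "true")
	params.Set("tty", "true")
	params.Set("ephemeralContainerName", "debug")
	params.Set("ephemeralContainerImage", "debian")
	return fmt.Sprintf("/api/v1/namespaces/%s/pods/%s/exec?%s", ns, pod, params.Encode())
}
```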
Force-pushed from 1dd396f to 473c49d.
/lgtm
Automatic merge from submit-queue.
```
type PodStatus struct {
    ...
    DebugStatuses []DebugStatus
```
should this mirror the exec option names? ephemeral statuses, etc?
We should use consistent naming
Argh, yes, I was in a rush and didn't see this section. @verb can you modify to

```
type PodStatus struct {
    EphemeralContainerStatuses []v1.ContainerStatus
}
```
Hm. Actually were you trying to represent all exec actions with this?
(note I edited my comment above. I can't see anything you need that isn't already in v1.ContainerStatus.)
@lavalamp I'd like to at least have command and args, which aren't part of ContainerStatus.
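One possible shape for such a status entry, sketched under the assumption that it embeds ContainerStatus and adds the executed command; this is illustrative, not a committed API:

```
package debugstatussketch

import v1 "k8s.io/api/core/v1"

// DebugStatus embeds everything ContainerStatus already reports and adds the
// command and args that were run, which ContainerStatus alone does not capture.
type DebugStatus struct {
	v1.ContainerStatus `json:",inline"`

	Command []string `json:"command,omitempty"`
	Args    []string `json:"args,omitempty"`
}
```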
1. `kubectl` invokes the debug API as described in the preceding section.
1. The API server checks for name collisions with existing containers, performs
   admission control and proxies the connection to the kubelet's
note that all admission plugins that do anything related to checking containers, images, etc, would need to be updated to check ephemeral images specified in exec options now
This is noted in the "Security Considerations" section.
requests and the kubelet must return an error to all but one request.

There are no limits on the number of Debug Containers that can be created in a
pod, but exceeding a pod's resource allocation may cause the pod to be evicted.
this also would bypass admission plugins that set resource limits/range on containers. please describe the container spec that would result from an ephemeral exec request
1. `KillPod()` already operates on all running containers returned by the
   runtime.
1. Containers created prior to this feature being enabled will have a
   `containerType` of `""`. Since this does not match `"DEBUG"` the special
DEBUG or EPHEMERAL?
Yes please change references in the API to "ephemeral" everywhere.
To be clear, this is a private label in the kubelet's runtime manager and not part of the API. I've updated it to EPHEMERAL for consistency, though.
I don't see a discussion of how this will be secure (users with access to CAP_SYS_ADMIN)
* Exited Debug Containers will be garbage collected as regular containers and
  may disappear from the list of Debug Container Statuses.
* Security Context for the Debug Container is not configurable. It will always
  be run with `CAP_SYS_PTRACE` and `CAP_SYS_ADMIN`.
So only a cluster admin should ever use debug containers?
Also, pods don't have security context set in many cases. Exec implicitly escalating to root is bad.
This should be configurable at debug time, but that depends on the API. Some of my proposed API changes addressed this, but SIG Node and the API reviewers deadlocked on which was best. Our compromise was to proceed to alpha with the minimum possible API change, which doesn't include a configurable security context.
Created kubernetes/kubernetes#53188 to track this.
/cc @thockin
> Our compromise was to proceed to alpha with the minimum possible API change, which doesn't include a configurable security context.
you have a pointer to where that discussion happened? no one from the auth/psp side was involved afaik
I don't understand, you've been involved since the first draft of the proposal 10 months ago, and your input has always been most welcome. It's not too late, what would you like to see changed?
@verb how bad would it be to pass a full v1.Container?
@thockin Not bad, It's what the kubelet does internally and I've had it working in a prototype.
We would use the API described in Alternative 1 to POST a v1.Container (wrapped in a new top level object) to a new /debug subresource. Then the client would perform a separate /attach.
The API reviewers had concerns about this being a novel use of the API. Nothing else POSTs to a subresource.
Both the API reviewers and SIG Node reviewers had concerns about using v1.Container being confusing or communicating the wrong intent to the user. Debug Containers are not general purpose containers and should not be used to build services or for routine operations. Some prefer to consider "extended exec" rather than "configurable container".
Most of the fields of v1.Container do not apply to Debug Containers and should be rejected if configured (lifecycle, livenessProbe, ports, readinessProbe, resources, stdin, stdinOnce, terminationMessagePath, terminationMessagePolicy, tty, volumeMounts). We can do this with a validation whitelist, though it would be simpler to pass a securityContext rather than a full v1.Container.
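A sketch of the validation-whitelist idea (the field list comes from the comment above; the function name and error messages are assumptions):

```
package debugapi

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// validateDebugContainerFields rejects a v1.Container that sets fields which
// have no meaning for Debug Containers. The checks shown are not exhaustive.
func validateDebugContainerFields(c *v1.Container) error {
	if c.Lifecycle != nil || c.LivenessProbe != nil || c.ReadinessProbe != nil {
		return fmt.Errorf("lifecycle and probes may not be set on a debug container")
	}
	if len(c.Ports) > 0 || len(c.VolumeMounts) > 0 {
		return fmt.Errorf("ports and volumeMounts may not be set on a debug container")
	}
	if len(c.Resources.Limits) > 0 || len(c.Resources.Requests) > 0 {
		return fmt.Errorf("resources may not be set on a debug container")
	}
	return nil
}
```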
> The API reviewers had concerns about this being a novel use of the API. Nothing else POSTs to a subresource.
I thought scheduler POSTs to pod/binding subresource, clients POST to pod/eviction subresource, etc. Or is there a distinction between POST and PUT here? (Which I can never keep straight.)
@davidopp Oh, that's good news then. When I prototyped this ~6 months ago I recall needing a couple of changes in the apiserver in order to make it work, but maybe those were specific to upgrading a connection to streaming after a POST. It's been a long ride.
Let's figure out a way to move this forward. Since we already had agreement among the reviewers at the time, and now we want to renegotiate that agreement based on new reviewers, I suggest that this take the form of a new PR to amend the proposal, where the new and old reviewers can work out their conflicting requirements. I'll prepare a diff.
> The API reviewers had concerns about this being a novel use of the API. Nothing else POSTs to a subresource.
> I thought scheduler POSTs to pod/binding subresource, clients POST to pod/eviction subresource, etc. Or is there a distinction between POST and PUT here? (Which I can never keep straight.)
Not sure about subresources, but there are issues with both PUT and POST as part of websocket requests (not all clients support them)
particular, they should enforce the same container image policy on the `Image`
parameter as is enforced for regular containers. During the alpha phase we will
additionally support a container image whitelist as a kubelet flag to allow
cluster administrators to easily constrain debug container images.
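A minimal sketch of how such a kubelet-side whitelist check might look; the flag name, matching rules, and empty-list behavior are all assumptions:

```
package kubeletsketch

// imageAllowed reports whether a debug container image passes the
// administrator-supplied whitelist. With no whitelist configured, the normal
// image policy is the only gate.
func imageAllowed(whitelist []string, image string) bool {
	if len(whitelist) == 0 {
		return true
	}
	for _, allowed := range whitelist {
		if allowed == image {
			return true
		}
	}
	return false
}
```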
what security context settings (uid/gid, selinux, apparmor) will the debug container have? how will admission plugins that constrain/force those (like PodSecurityPolicy) govern an ephemeral container?
edit: just saw https://github.com/kubernetes/community/pull/649/files#diff-5cfb31b40ca47511743d0545d5697aa0R394
can we determine the equivalent `v1.Container` (including securityContext) that would correspond to the ephemeral container? If so, we could see if a PodSecurityPolicy would allow the pod with a container with those settings/permissions.
I thought @verb had worked this out with @derekwaynecarr already?
@liggitt Compatibility with admission plugins is a top priority and a strict requirement. The implementation will depend a little bit on how the Kubernetes API settles, but it will be one of:
- The client may end up providing the v1.Container that creates the Ephemeral Container.
- If we stick with the imperative, exec-style API we can do exactly as you suggest and provide the v1.Container based on the PodExecOptions (see the sketch below).
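A sketch of that second option, synthesizing a v1.Container from the exec options so existing admission logic can evaluate the ephemeral container as if it were declared in the pod spec; the function is hypothetical and the defaults noted in the comments reflect the proposal's alpha behavior:

```
package admissionsketch

import v1 "k8s.io/api/core/v1"

// containerForEphemeralExec builds the container that admission plugins
// (image policy, PodSecurityPolicy-style checks) would inspect.
func containerForEphemeralExec(name, image string, command []string) v1.Container {
	return v1.Container{
		Name:            name,
		Image:           image,
		Command:         command,
		ImagePullPolicy: v1.PullAlways, // the proposal's non-configurable alpha default
	}
}
```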
* Security Context for the Debug Container is not configurable. It will always
  be run with `CAP_SYS_PTRACE` and `CAP_SYS_ADMIN`.
* Image pull policy for the Debug Container is not configurable. It will
  always be run with `PullAlways`.
this prevents offline installations with pre-pulled images
Good point. Created kubernetes/kubernetes#53189 to track this.
DebugStatuses []DebugStatus
}

type DebugStatus struct {
I don't see the necessity of this.
If you wanted to represent the way(s) in which the container is "dirty" after exec/attach/port-forward etc., I don't think this is the way to go.
Yes, that's something we want to represent along with details about what command was run by exec (whether traditional exec or in an ephemeral container). This isn't a blocker for the alpha implementation, though, so if you think this is the wrong approach then I'll remove this bit from the proposal and we can figure out the correct way later.
/cc @dchen1107 @thockin
I think we do want some way to represent the taintedness (we should also do this for exec).
@r2d4 Can you work on the best way to represent taintedness in v1.PodStatus?
@verb ack looking into it
Hi @verb , I'm glad to see your repeated effort to get this feature added to k8s. I understand the initial use-case for this effort is related to diagnostics on a running Pod, but would like to share with you another use-case which would benefit from this exact same improvement: I'm working on Jenkins integration in k8s; we run containerized builds as pods. During the build execution, a new container might be required. In many cases this can be identified before the build is scheduled and as such set as part of the Pod's spec, but in some cases the required image is dynamically selected as part of the build. Also, being able to run containers as part of the build just like developers do on their workstations helps to make the build script reproducible and portable.

With a new API to add (transient) containers to a Pod, I could provide the glue code for build scripts to control such additional containers. The current usage for most users is to run a privileged (!) DinD container to host the build, or to bind mount docker.sock from the host (!). As you can guess I'd prefer we don't rely on this :-\

Hope this will help understand potential use-cases this feature could support.
I've been thinking along lines like this - we don't want to add infinite
features to Kubernetes pods that allow them to orchestrate containers, but
we also don't want to make pods so inflexible that people build external
container orchestration. We have clearly stated that the pod abstraction
is not the Borg Job/Tasks abstraction (directed graph), but allowing people
to implement directed graph operations within a single pod has utility.
If instead, a pod could leverage a node local API to add / remove
containers within the limits the pod has defined (security boundaries,
resources, secrets, volumes) but that the orchestration of those containers
could be done by talking to the kubelet within the pod, then we could
potentially continue to leverage CRI for intra-pod isolation AND avoid
needing to add an infinite number of features to the pod api. I think it
deserves some thought as an approach.
Note that I'm specifically saying:
Once ephemeral containers are available, we have the rough shape of an api
to further subdivide a pod (subcontainer division). We could potentially
put a whole class of problems ("how do i orchestrate a set of containers
within the pod") into the same pot by having the user run a container that
spawns ephemeral containers (in whatever fashion the user wants) and using
a combination of process and container controls to manage subdivision.
I.e. instead of kube implementing a task like mechanism within a pod, make
it possible for the user to safely implement that themselves. The
ephemeral pod API then becomes an "in pod orchestrator" which exposes the
container runtime in a kube-like fashion to the user, rather than forcing
the user to implement nested containers (which are invisible to kube) or
having kube continue to add feature after feature to approximate a directed
task graph.
I have a keen interest in this feature. Does anyone have movement to report on this outside of this ticket? Also, to echo @ndeloof's comments about debugging Jenkins builds, I have to say this is indeed quite useful. I've been running Concourse at a few shops over recent years, and using Pivotal's Garden OCI runtime they achieve exactly that.
@avanier kubernetes/enhancements#277 might be better for tracking progress of this feature. The API change is under review in kubernetes/kubernetes#59416; once the API changes land, there should be quick progress.
Automatic merge from submit-queue.

Propose a feature to troubleshoot running pods

This feature allows troubleshooting of running pods by running a new "Debug Container" in the pod namespaces.

This proposal was originally opened and reviewed in kubernetes/kubernetes#35584.

This proposal needs LGTM by the following SIGs:
- [ ] SIG Node
- [ ] SIG CLI
- [ ] SIG Auth
- [x] API Reviewer

Work in Progress:
- [x] Prototype `kubectl attach` for debug containers
- [x] Talk to sig-api-machinery about `/debug` subresource semantics