KEP-3815 : Add Condition for Pending Pods that are stuck due to configuration errors #3816
Conversation
/do-not-merge
First pass
### Non-Goals
- Will not modify behavior for ImagePullBackOff
I don't know why we would exclude this?
My thought for this is that there isn't enough information in image pulling to distinguish between the different failures:
- If the registry is overloaded, I have seen cases where ImagePullBackOff occurs.
- If a docker image doesn't exist, you get ImagePullBackOff.
- If you have invalid image pull secrets, you also get ImagePullBackOff.
There could be interest in spelling these cases out more, but it isn't as important a case for us at the moment. I wanted to focus on cases where the error requires human intervention to get it working again.
In the first of these cases I could see image pulls working again once the registry is no longer overloaded. Cases 2-3 do require human intervention, but I don't know enough about how we pull images to distinguish between these different cases.
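To illustrate the point above: all three causes surface identically in the pod's container status, so the waiting reason alone does not reveal the root cause (illustrative sketch; image name made up):

```yaml
# Roughly what `kubectl get pod -o yaml` shows in all three cases above.
status:
  phase: Pending
  containerStatuses:
  - name: main
    ready: false
    state:
      waiting:
        reason: ImagePullBackOff
        message: Back-off pulling image "registry.example.com/app:v1"
```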
We can tell the difference at SOME point, right? Whether registry returns 503 or 404, for example.
But in either case, the pod is still "failing to start"
Yeah, I wasn't really coming at it from a k8s dev perspective but from what is present in the pod YAML. I'm not sure if the registry's REST API errors are exposed to a user when submitting pods.
They are not, but we know them somewhere deep down.
I agree with Tim, we should set the condition while the pod doesn't start, even if it could be a transient error that doesn't require human intervention.
I also would recommend reading the API conventions:
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties
Will circle back later - this isn't queued for 1.27, right?
```go
// ErrInvalidImageName - Unable to parse the image name.
ErrInvalidImageName = errors.New("InvalidImageName")
```

- Should we include all these errors as potential failure cases for pending pods? I can replicate ErrImageNeverPull and InvalidImageName, but I'm not sure how to reproduce the other errors.
Question for the reviewers. I'd love to hear some community input on whether any of these errors cause problems in a consistent way. If so, I would be open to adding them as part of this KEP; I'd just like to understand how to reproduce them.
```go
ErrPostStartHook = errors.New("PostStartHookError")
)
```

- If these are set, can we assume the container failed? I'm not sure how to recreate failures of states other than ErrCreateContainerConfig.
Another question for a reviewer: Are these internal errors that we should catch for a condition? Or do they occur often?
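As a sketch of how these error values could be mapped onto a condition reason, consider the following. This is a hypothetical helper, not actual kubelet code; the error values mirror the upstream declarations quoted above, and the split between "configuration" and "possibly transient" follows the intent discussed in this thread:

```go
package main

import (
	"errors"
	"fmt"
)

// A subset of the kubelet error values quoted above (values mirror the
// upstream declarations).
var (
	ErrInvalidImageName      = errors.New("InvalidImageName")
	ErrImageNeverPull        = errors.New("ErrImageNeverPull")
	ErrCreateContainerConfig = errors.New("CreateContainerConfigError")
	ErrPostStartHook         = errors.New("PostStartHookError")
)

// conditionReasonFor decides whether a container-start error should
// surface as a reason on the proposed condition (configuration errors
// that need human intervention) or be treated as possibly transient.
func conditionReasonFor(err error) (reason string, set bool) {
	switch {
	case errors.Is(err, ErrInvalidImageName),
		errors.Is(err, ErrImageNeverPull),
		errors.Is(err, ErrCreateContainerConfig):
		// Configuration error: set the condition with the error string
		// as the reason.
		return err.Error(), true
	default:
		// e.g. PostStartHookError: may be transient, leave condition unset.
		return "", false
	}
}

func main() {
	reason, set := conditionReasonFor(ErrInvalidImageName)
	fmt.Println(reason, set)
}
```

The open question above (which errors are reliably reproducible) determines which cases belong in the first branch.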
Correct. Just getting this open now. I have some questions for the community on this, so I figured it would be helpful to get some comments on it. I don't see a reason to push this into 1.27, and since I missed the KEP deadline, 1.28 it is.
@kannon92 thanks for starting up the KEP. I think it could play nicely with podFailurePolicy KEP and with the fix I’m proposing here: #3757. However, there are details to figure out.
First, IIUC the flow could work like this (I think it would be good to clarify it here):
1. Kubelet would add a `FailingToStart` condition to a Pending and scheduled Pod; the `Reason` would indicate more details about the scenario.
2. An external controller inspects the Pod and may send a DELETE request to the Pod.
3. The new code in Kubelet will transition the Pod into the Failed phase (Update for second Beta with GA criteria for "KEP-3329: Retriable and non-retriable Pod failures for Jobs" #3757).
4. The Job controller can match the failed pod against the configured `podFailurePolicy`. The most common use case I assume would be to use the "FailJob" or "Count" action for the `FailingToStart` condition.
Another thought: does Kubelet use any knowledge which isn't already surfaced here in pod status? If not, then maybe we develop an external component which adds the condition based on inspecting the Pod status. Maybe developing this idea as an external pluggable component makes sense so that we can gain more confidence in the solution over time? Then the flow would be:
1. Kubelet surfaces the information in the Pod status.
2. An external controller inspects the Pod status, adds the condition, and sends a DELETE request to the Pod.
3. The new code in Kubelet will transition the Pod into the Failed phase (Update for second Beta with GA criteria for "KEP-3329: Retriable and non-retriable Pod failures for Jobs" #3757).
4. The Job controller can match the failed pod against the configured `podFailurePolicy`. The most common use case I assume would be to use the "FailJob" or "Count" action for the `FailingToStart` condition.
Next, I'm wondering if we can really easily separate the configuration errors from transient issues. If we can, then I suggest a more concrete name for the condition, such as "ConfigurationError".

It probably can be done in the cases you enumerate, but some corner cases can be tricky to tell apart, such as ImagePullBackOff. Let's even consider "Image not found locally when ImagePullPolicy is set to Never." Not sure if such a scenario is possible, but what if the container runtime (containerd) crashed and was restarting when asked for the image? Also, potentially, someone could be loading the images onto the nodes while the pod is in the Pending phase. I think it would require a rigorous review to really tell them apart, and I'm not sure we want to go into this, but if we do, then we could have two conditions: "ConfigurationError" (for permanent ones) and "FailingToStart" (for transient ones).

If we relax the semantics of the `FailingToStart` condition, then its name is generic enough that it could also cover transient issues such as `ImagePullBackOff` (it could be a reason for the condition). Kubelet would add the condition `FailingToStart=True`, but it would be in a position to reset the condition to `FailingToStart=False` if the image is pulled eventually. It would be up to the external controller to decide if `FailingToStart` with `Reason=ImagePullBackOff` should trigger a DELETE request. It could do so, for example, after waiting 2 minutes to gain more confidence that the issue is permanent.
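The Job-controller step in the flows above could then be configured via `podFailurePolicy` (which matches on pod conditions today). This is a sketch; `FailingToStart` is the condition type proposed in this KEP, not an existing one, and the image name is made up:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
    # Fail the whole Job once a pod reports the proposed condition,
    # instead of retrying a pod that can never start.
    - action: FailJob
      onPodConditions:
      - type: FailingToStart   # hypothetical condition type from this KEP
        status: "True"
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: registry.example.com/app:v1
```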
As a batch user, it is very likely that users experience configuration errors when submitting pods to Kubernetes. We would like to propose a new condition so that users can determine if a pod has an error in startup. A new condition will provide a quick way to verify if a pod detected an unrecoverable configuration error. Many of these configuration errors are fixable via user interactions (i.e. creating a secret, changing the image name), but without any modifications the pods will stay stuck in the pending state.
- I don't think this is specific to batch users.
- "has an error" -> "is stuck"? Error seems to imply a terminal state, which is not the case here.
- Wouldn't it rather be a "recoverable" error?
- "without any modifications" -> "without user intervention"
Batch users may not have deep knowledge of Kubernetes. It is a common occurrence to find users submitting Pods with configuration errors (invalid images, wrong image names, missing volumes, missing secrets, wrong volume/secret for a config map). These cases cause the Pod to be stuck in Pending, and either code or user input is required to remove these pods.

As a batch system developer, we find that higher level controllers have to write code to handle Pending Pod cases. Since the pods are not considered failed, the higher level controller needs to transition this state to failed in order to enforce retries. For example, the job controller will run these pods and the state of the Pod will be Pending. The job controller does not transition this state to failed even if the pod will never start running.
An additional problem both for end-users and system developers is that the errors are surfaced in inconsistent ways: some are in Container statuses, some in the reason or message fields. A standard condition would provide a single place to look at.
However, this paragraph is a little confusing when talking about transitioning to Failed state. IIUC, this KEP is not proposing to add a transition mechanism.
Yeah, good callout. I'll make sure to explicitly put that in there, and will clean up this paragraph so it isn't mentioning a transition to a failed state.
For CNCF projects: in the Armada project, we implement logic that determines retryability based on events and statuses. Events are not standardized, so we expect issues when updating, and we discovered various states that don't have a representative status.
Events are also best-effort, so they can easily be missed.
### Goals
- Add a new condition FailingToStart to detect failures in configuration.
Would it make sense to have something like `Started=False` instead? Hopefully it's different enough from the "PodHasNetwork" condition that is changing names soon.
There is a Ready condition, so I could find it a bit confusing to explain the difference between Ready and Started. But naming is always hard, so I'll leave the name up for debate.
+1 that naming is hard :-) If a negative like `Failing` (and implying ready/success through a double negative, i.e. `Failing=False`) can be avoided, that would be nice.
Below are some examples to consider, in addition to the aforementioned [maturity levels][maturity-levels].

#### Alpha
ok to leave this out while the KEP is marked as "provisional", but you'll need it for alpha.
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
No
Yes, it's a new condition
As an end-user, if I create a pod with an ImagePullPolicy of Never but the image doesn't exist on the node, I would expect to see a condition stating that the pod failed to start due to a configuration error.

#### Story 3
As an end-user, if I create a pod with a missing config map, I would expect to see a condition stating that the pod failed to start due to a configuration error.
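Story 3 can be reproduced with a minimal manifest like this (illustrative names; the referenced ConfigMap intentionally does not exist):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: missing-configmap-demo
spec:
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    envFrom:
    - configMapRef:
        name: does-not-exist   # missing ConfigMap -> CreateContainerConfigError
```

Today this pod stays Pending with `CreateContainerConfigError` buried in its container status; the proposed condition would surface that at the pod level.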
### Notes/Constraints/Caveats (Optional) |
Explain the relationship/difference with PodHasNetwork
Will do.
Added a point to explain the difference.
Given kubernetes/kubernetes#115736 (comment) (@smarterclayton), we are back to square one. I think the main questions we need to answer are:
I can buy Clayton's point - that's a bit of history I either forgot or didn't know. :) The larger problem is still real, though, right? Getting useful, detailed information back is still too hard...
/cc
### Goals
Would this cover scenarios like these:
- an init container that keeps running and never returns
- an init container on a pod with `restartPolicy=Always` that keeps failing and being restarted
- a postStart hook of a first container that never returns and prevents the second container from running
- a startup probe that never returned `success`, so the container keeps being "started"
It could. I'd have to investigate each of these cases.
Sysctl Forbidden is a case that I ran into before: the kubelet did not add it to the safe sysctls, but the apiserver accepted it.
> Would this cover scenarios like these:
> - an init container that keeps running and never returns
> - an init container on a pod with `restartPolicy=Always` that keeps failing and being restarted
> - a postStart hook of a first container that never returns and prevents the second container from running
> - a startup probe that never returned `success`, so the container keeps being "started"

I don't think this would cover init containers, as I was aiming to test cases where pods are explicitly Pending. It doesn't have to support them, but this brings up the point that this lifecycle handling could be useful for init containers, ephemeral containers, and sidecars. Currently I was assuming that all init containers would have passed successfully before you move on to the main containers.
@SergeyKanzhelev Do we think this should be considered a non-goal, or should we consider support for failures in init containers?
Generally these seem to be handled a lot better, because init containers have a time limit on them, and if there are any issues pulling/running init containers the entire pod lifecycle is blocked. They are also covered by the `Initialized` condition, so it is easy to distinguish that the pod is stuck in the initialization loop (`Initialized=False`).
Any update or new review: is this targeted for v1.28 or v1.29?
Sorry I missed this. I obviously missed v1.28, but I'd like to get back to this in v1.29.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kannon92. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
To answer your first point, this is mainly a single case that we can cover with a condition. We can match in cases of

I'm coming back to this second one. My thought is that they are somewhat similar but different. I really want 3085 to be promoted to beta, as the cases that feature covers are causes of frustration in our batch platform. I am taking over that issue so we can get it promoted to beta. Generally I view these two conditions as separate pieces of information about the stages of pod creation, but they help answer the same question: how can I easily figure out if pods have failed?

In terms of this feature, I view the progression as:

Scheduled -> PodReadyToStartContainers (true) -> FailingToStart (true) -> Ready

This is at least how I think of the differences. We should add a condition to determine failures once sandbox creation is complete and the pods have progressed past it.
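The progression described above might look like this in a pod's status (sketch only: `PodScheduled`, `PodReadyToStartContainers`, and `Ready` exist today, while `FailingToStart` and its reason/message are the proposal; values are illustrative):

```yaml
status:
  phase: Pending
  conditions:
  - type: PodScheduled
    status: "True"
  - type: PodReadyToStartContainers   # the KEP-3085 condition
    status: "True"
  - type: FailingToStart              # proposed by this KEP
    status: "True"
    reason: CreateContainerConfigError
    message: 'configmap "does-not-exist" not found'
  - type: Ready
    status: "False"
```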
- Invalid image name
- Image not found locally when ImagePullPolicy is set to Never
- Existing config map but invalid key reference
- Container lifecycle hook failures
Wondering if we are going to cover the following cases:
- Incorrect node selector
- Untolerated taints
- Requested resource too high for the selected node (cpu, memory)
Those are already covered by the Scheduled=False condition.
Thanks for the clarification
Hi,
This is the last item I worked on. I aimed to promote 3085 to beta, as there is strong overlap with these cases. The main question I have been thinking about is: should we have a dedicated condition for image issues and another one for lifecycle issues?
@kannon92 what's the word on this one? Push ahead, abandon, replace?
So I think the main hope was to use some of this for another KEP that @alculquicondor was thinking of. kubernetes/kubernetes#122300 (comment) summarizes my findings from this KEP. I think the idea is useful, but I am unable to push this in my current role at the moment. I would like to focus on concrete use cases rather than a general condition. So I was hoping that maybe 122300 could provide a use case, and a condition could be implemented to support it. But I'm not sure where that feature was left off, and it's lacking an owner at the moment.
Understood. Should we close this?
Yeah. I messaged my team to see what they say about it, but I'll close this for now.
@kannon92: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Add a new condition for pods that are stuck in pending due to a configuration error.
Adding FailingToStart Condition in cases of configuration errors. #3815