-
Notifications
You must be signed in to change notification settings - Fork 14.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Troubleshooting for job pod finalizers #43773
Troubleshooting for job pod finalizers #43773
Conversation
✅ Pull request preview available for checkingBuilt without sensitive environment variables
To edit notification comments on pull requests, go to your Netlify site configuration. |
/lgtm I have also seen this. |
LGTM label has been added. Git tree hash: 40a7d39e125544dbc92a46672d376f7b87be233e
|
We have also updated the Job Tracking KEP as a common fail case. So it seems like a good idea to update the documentation as that has more eyes on it. |
/sig apps |
|
||
- **Pods stuck in Terminating state and Jobs don't complete** | ||
|
||
Check for validating or mutating webhooks for Pod updates. Some faulty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling such webhooks "faulty" is a bit strong here. Sometimes validating or mutating webhooks are configured (by users, either explicitly or implicitly) to match on UPDATE requests for entirely valid and useful purposes. In some cases, this can have undesirable consequences which is the case with just about every thing else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to "misconfigured or faulty"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It still seems like you're suggesting a webhook which is configured to react to UPDATE requests must be one of either "misconfigured" or "faulty" and that's not true. I get what you're trying to convey, but it could be stated to better reflect the variety of webhooks and their driving use cases. Here's my suggestion:
"Webhooks which are configured to match on UPDATE requests may inadvertently deny an update for a Pod that is already running or finished. This causes kube-apiserver to reject the update to remove the Pod finalizers, preventing the Job controller from making progress. Inspect and disable or modify the webhook or its managing controller's configuration to allow these updates."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might not be the case of kyverno, but a webhook that tries to add an initContainer to a running or finished Pod (kubernetes/kubernetes#121605) is a faulty webhook, because in under no circumstances that works.
If you configure a webhook to reject Pod updates for existing Pods created before the webhook was installed, that's a misconfiguration in the user side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The webhook, as an API resource, is not the issue here. It's what the service does with the AdmissionReview that it receives as a result of the webhook's configuration. A webhook can only ask for UPDATE ops; it has no logic to do anything with them. Having a webhook which matches on UPDATE to Pods is neither "faulty" nor represents a "misconfiguration." Having logic which attempts to add an initContainer, per your example, based upon the call-back it receives, is "faulty" or does represent a "misconfiguration."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I am referring to webhook as the combination the Webhook object and the service that responds to it. I don't think most users make the distinction between the two.
@sftim do you have any guidelines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think some clarification may be in order lest this paragraph inspire users to go inspect all their webhooks and, upon finding any that match on UPDATE ops, proceed to <project/product> and proclaim it's a bug and shouldn't be there. The guidance is helpful for sure, but more specifics are needed to avoid making wrong assumptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See update. I'm still using "webhook" to refer to both, but I mention the CRDs when necessary. Let me know what you think.
c0915c7
to
1f9a0b9
Compare
#### Troubleshooting | ||
|
||
- **Pods stuck in Terminating state and Jobs don't complete** | ||
|
||
Check for validating or mutating webhooks for Pod updates. Misconfigured | ||
or faulty webhooks may attempt to modify running Pods or deny an update for a | ||
Pod that is already running or finished. This causes kube-apiserver to reject | ||
the update to remove the Pod finalizers, preventing the job controller from | ||
making progress. | ||
You should disable and/or fix the Pod update webhooks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It helps readers when concept explanations are as short as we can reasonably make them.
I'd prefer to have the actual advice somewhere in https://kubernetes.io/docs/tasks/debug/debug-application/ and have this page hyperlink to the troubleshooting page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uhm... it doesn't fit in the existing articles. We would need a new "Debugging Jobs" or "Debugging control plane", which I don't have the time to write now :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could put it in "Debugging Pods" as "My pod is stuck in Terminating"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alculquicondor we can file an issue asking for another (docs) contributor to pick up the baton here. Would that work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also OK with adding a section to Debugging Pods as an interim thing, and making the Debugging Jobs task page a longer term ambition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put it in "Debugging Pods". It might be the most appropriate place long term, after all.
1f9a0b9
to
5639f52
Compare
|
||
If you are the author of the webhook: | ||
- For a mutating webhook, make sure it never changes immutable fields on | ||
`UPDATE` operations. Typically, any container changes are not allowed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
`UPDATE` operations. Typically, any container changes are not allowed. | |
`UPDATE` operations. |
In the https://kubernetes.io/blog/2023/05/12/in-place-pod-resize-alpha/, this is not true anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I can change it by "most", instead of "any".
I don't want to list everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea you can take what I say or leave it. Its an alpha feature so the amount of users modified container limts/requests will be very little. Just want to bring up that this could become less true over time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update this to say most at least?
I think that over time this statement will become incorrect.
We could update this once in place pod becomes beta/ga if you want.
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rephrased it.
Can I get a review for the last update? As clusters upgrade, they are being hit by these problems. |
LGTM label has been added. Git tree hash: f11b2d8825c2ed5c197f38fa22d86271677f9012
|
5639f52
to
33757a1
Compare
/lgtm |
LGTM label has been added. Git tree hash: dbab56cfbfdefcc7e6e8f0a3a9eb6bd8763325e7
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more fix please
issued for the Pod, but the control plane is unable to delete the Pod object. | ||
|
||
This typically happens if the Pod has a [finalizer](/docs/concepts/overview/working-with-objects/finalizers/) | ||
and there is an [admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and there is an [admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/) | |
and there is an [admission webhook](/docs/reference/access-authn-authz/extensible-admission-controllers/) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, copy-paste problem :)
33757a1
to
46df889
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM label has been added. Git tree hash: 91cb6c17707ad92f615c6e6b2d7f894478413fc8
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: sftim The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I see this problem more often than not.
Examples: