Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Troubleshooting for job pod finalizers #43773

Merged

Conversation

alculquicondor
Copy link
Member

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. language/en Issues or PRs related to English language labels Nov 1, 2023
@k8s-ci-robot k8s-ci-robot added sig/docs Categorizes an issue or PR as relevant to SIG Docs. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 1, 2023
Copy link

netlify bot commented Nov 1, 2023

Pull request preview available for checking

Built without sensitive environment variables

Name Link
🔨 Latest commit 46df889
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-io-main-staging/deploys/6552481bac37260008bef5ea
😎 Deploy Preview https://deploy-preview-43773--kubernetes-io-main-staging.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@kannon92
Copy link
Contributor

kannon92 commented Nov 1, 2023

/lgtm

I have also seen this.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 1, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 40a7d39e125544dbc92a46672d376f7b87be233e

@kannon92
Copy link
Contributor

kannon92 commented Nov 1, 2023

We have also updated the Job Tracking KEP as a common fail case. So it seems like a good idea to update the documentation as that has more eyes on it.

@alculquicondor
Copy link
Member Author

/sig apps

@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Nov 1, 2023

- **Pods stuck in Terminating state and Jobs don't complete**

Check for validating or mutating webhooks for Pod updates. Some faulty

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling such webhooks "faulty" is a bit strong here. Sometimes validating or mutating webhooks are configured (by users, either explicitly or implicitly) to match on UPDATE requests for entirely valid and useful purposes. In some cases, this can have undesirable consequences which is the case with just about every thing else.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to "misconfigured or faulty"

Copy link

@chipzoller chipzoller Nov 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still seems like you're suggesting a webhook which is configured to react to UPDATE requests must be one of either "misconfigured" or "faulty" and that's not true. I get what you're trying to convey, but it could be stated to better reflect the variety of webhooks and their driving use cases. Here's my suggestion:

"Webhooks which are configured to match on UPDATE requests may inadvertently deny an update for a Pod that is already running or finished. This causes kube-apiserver to reject the update to remove the Pod finalizers, preventing the Job controller from making progress. Inspect and disable or modify the webhook or its managing controller's configuration to allow these updates."

Copy link
Member Author

@alculquicondor alculquicondor Nov 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not be the case of kyverno, but a webhook that tries to add an initContainer to a running or finished Pod (kubernetes/kubernetes#121605) is a faulty webhook, because in under no circumstances that works.

If you configure a webhook to reject Pod updates for existing Pods created before the webhook was installed, that's a misconfiguration in the user side.

Copy link

@chipzoller chipzoller Nov 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The webhook, as an API resource, is not the issue here. It's what the service does with the AdmissionReview that it receives as a result of the webhook's configuration. A webhook can only ask for UPDATE ops; it has no logic to do anything with them. Having a webhook which matches on UPDATE to Pods is neither "faulty" nor represents a "misconfiguration." Having logic which attempts to add an initContainer, per your example, based upon the call-back it receives, is "faulty" or does represent a "misconfiguration."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I am referring to webhook as the combination the Webhook object and the service that responds to it. I don't think most users make the distinction between the two.

@sftim do you have any guidelines?

Copy link

@chipzoller chipzoller Nov 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some clarification may be in order lest this paragraph inspire users to go inspect all their webhooks and, upon finding any that match on UPDATE ops, proceed to <project/product> and proclaim it's a bug and shouldn't be there. The guidance is helpful for sure, but more specifics are needed to avoid making wrong assumptions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See update. I'm still using "webhook" to refer to both, but I mention the CRDs when necessary. Let me know what you think.

@alculquicondor alculquicondor force-pushed the job-finalizer-troubleshoot branch from c0915c7 to 1f9a0b9 Compare November 1, 2023 15:15
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 1, 2023
@k8s-ci-robot k8s-ci-robot requested a review from kannon92 November 1, 2023 15:15
Comment on lines 941 to 950
#### Troubleshooting

- **Pods stuck in Terminating state and Jobs don't complete**

Check for validating or mutating webhooks for Pod updates. Misconfigured
or faulty webhooks may attempt to modify running Pods or deny an update for a
Pod that is already running or finished. This causes kube-apiserver to reject
the update to remove the Pod finalizers, preventing the job controller from
making progress.
You should disable and/or fix the Pod update webhooks.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It helps readers when concept explanations are as short as we can reasonably make them.

I'd prefer to have the actual advice somewhere in https://kubernetes.io/docs/tasks/debug/debug-application/ and have this page hyperlink to the troubleshooting page.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uhm... it doesn't fit in the existing articles. We would need a new "Debugging Jobs" or "Debugging control plane", which I don't have the time to write now :(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could put it in "Debugging Pods" as "My pod is stuck in Terminating"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alculquicondor we can file an issue asking for another (docs) contributor to pick up the baton here. Would that work?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also OK with adding a section to Debugging Pods as an interim thing, and making the Debugging Jobs task page a longer term ambition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put it in "Debugging Pods". It might be the most appropriate place long term, after all.

@alculquicondor alculquicondor force-pushed the job-finalizer-troubleshoot branch from 1f9a0b9 to 5639f52 Compare November 3, 2023 15:23
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 3, 2023

If you are the author of the webhook:
- For a mutating webhook, make sure it never changes immutable fields on
`UPDATE` operations. Typically, any container changes are not allowed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
`UPDATE` operations. Typically, any container changes are not allowed.
`UPDATE` operations.

In the https://kubernetes.io/blog/2023/05/12/in-place-pod-resize-alpha/, this is not true anymore.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I can change it by "most", instead of "any".
I don't want to list everything.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea you can take what I say or leave it. Its an alpha feature so the amount of users modified container limts/requests will be very little. Just want to bring up that this could become less true over time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update this to say most at least?

I think that over time this statement will become incorrect.

We could update this once in place pod becomes beta/ga if you want.

/lgtm

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rephrased it.

@alculquicondor
Copy link
Member Author

Can I get a review for the last update? As clusters upgrade, they are being hit by these problems.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 13, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: f11b2d8825c2ed5c197f38fa22d86271677f9012

@alculquicondor alculquicondor force-pushed the job-finalizer-troubleshoot branch from 5639f52 to 33757a1 Compare November 13, 2023 15:46
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 13, 2023
@kannon92
Copy link
Contributor

/lgtm
from tech standpoint.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 13, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: dbab56cfbfdefcc7e6e8f0a3a9eb6bd8763325e7

Copy link
Contributor

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more fix please

issued for the Pod, but the control plane is unable to delete the Pod object.

This typically happens if the Pod has a [finalizer](/docs/concepts/overview/working-with-objects/finalizers/)
and there is an [admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
and there is an [admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/)
and there is an [admission webhook](/docs/reference/access-authn-authz/extensible-admission-controllers/)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, copy-paste problem :)

@alculquicondor alculquicondor force-pushed the job-finalizer-troubleshoot branch from 33757a1 to 46df889 Compare November 13, 2023 16:00
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 13, 2023
Copy link
Contributor

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

/lgtm
LGTM also from #43773 (comment)

/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 13, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 91cb6c17707ad92f615c6e6b2d7f894478413fc8

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sftim

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 13, 2023
@k8s-ci-robot k8s-ci-robot merged commit d028549 into kubernetes:main Nov 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. language/en Issues or PRs related to English language lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/docs Categorizes an issue or PR as relevant to SIG Docs. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

5 participants