OCPBUGS-18893: Rechecking pending Pods (conflict resolved) #196
@nicklesimba: This pull request references Jira Issue OCPBUGS-18893, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (weliang@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
As discussed offline, this band-aid will put stress on the API and thus impact workloads on other networks, including the cluster default network; the code should be refactored to rely on informers ASAP. Having said that, the issue with pending pods is real and is addressed by this PR. @dougbtv we want to fix the pending pod issue regardless, right? /lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: maiqueb, nicklesimba The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@nicklesimba: all tests passed! Full PR test history. Your PR dashboard.
@nicklesimba: Jira Issue OCPBUGS-18893: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-18893 has been moved to the MODIFIED state. In response to this:
Fix included in accepted release 4.15.0-0.nightly-2023-09-27-073353 |
Fix included in accepted release 4.15.0-0.nightly-2024-01-13-050900 |
This fix resolves the issue where, after a forceful node reboot, force deleting a pod in a stateful set causes the pod to be recreated and remain indefinitely in the Pending state.
Solution description, as written on xagent003's upstream PR:
"We shouldn't treat all pending Pods as "alive" and skip the check. The list of all Pods fetched earlier may be stale and, as observed in some scenarios, fetched several seconds before the ip-reconciler does the isPodAlive check.
Instead, can we retry a Get on an individual Pod, in the hope that it has its final IP/network annotations? So we try to refetch the pod a few times if it is in the Pending state and the initial IP check fails. After that, just do the IP matching check like before."
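For reviewers who want the gist without reading the diff, the retry logic described in the quote can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code in this PR: the `Pod` struct, the `PodGetter` hook, the `hasIP` helper, and the `retries`/`wait` parameters are all hypothetical stand-ins for the real Kubernetes client and pod types, and treating a failed Get as "not alive" is an assumption made for the sketch.

```go
package main

import (
	"fmt"
	"time"
)

// Pod is a hypothetical, trimmed-down stand-in for a Kubernetes Pod.
type Pod struct {
	Name  string
	Phase string   // e.g. "Pending" or "Running"
	IPs   []string // IPs taken from the pod's network annotations
}

// PodGetter is a hypothetical hook that refetches a single pod by name.
type PodGetter func(name string) (*Pod, error)

// hasIP reports whether ip is among the pod's known IPs.
func hasIP(pod *Pod, ip string) bool {
	for _, podIP := range pod.IPs {
		if podIP == ip {
			return true
		}
	}
	return false
}

// isPodAlive reports whether the pod still owns ip. If the pod is Pending
// and the IP check fails, the pod is refetched up to retries times, since
// the copy taken from the earlier list may be stale. Afterwards the plain
// IP matching check is applied to the most recent copy.
func isPodAlive(pod *Pod, ip string, get PodGetter, retries int, wait time.Duration) bool {
	latest := pod
	for i := 0; i < retries && latest.Phase == "Pending" && !hasIP(latest, ip); i++ {
		time.Sleep(wait)
		fresh, err := get(pod.Name)
		if err != nil {
			// Assumption: a failed Get (for example, the pod is gone) means not alive.
			return false
		}
		latest = fresh
	}
	return hasIP(latest, ip)
}

func main() {
	calls := 0
	// Simulated API: the pod gains its IP on the second refetch.
	get := func(name string) (*Pod, error) {
		calls++
		if calls < 2 {
			return &Pod{Name: name, Phase: "Pending"}, nil
		}
		return &Pod{Name: name, Phase: "Running", IPs: []string{"10.0.0.5"}}, nil
	}
	stale := &Pod{Name: "web-0", Phase: "Pending"}
	fmt.Println(isPodAlive(stale, "10.0.0.5", get, 3, 10*time.Millisecond))
}
```

Note that each retry is an extra Get against the API server, which is the load concern raised in the review comment; an informer-based approach would avoid these per-pod calls.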
Note that xagent003's upstream PR is stale and has since been rebased by dougbtv. You can find the current upstream PR here.