Fix bug where redeploy hangs forever because PDS condition is ignored #262
Conversation
From the Kubernetes community docs about writing controllers:
And from the community API conventions doc:
Not sure if I've fully wrapped my head around this bug, but would it work to use the `creationTimestamp` of the replica set instead of `@deploy_started_at`? (We'd have to verify that the current RS generation matches the current deployment generation.)
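A rough sketch of the idea floated above: time out from the ReplicaSet's `creationTimestamp` rather than `@deploy_started_at`, but only after verifying the ReplicaSet belongs to the deployment's current revision. The method name and the revision-annotation comparison here are illustrative assumptions, not the gem's actual implementation.

```ruby
require 'time'

# Hypothetical: use the latest ReplicaSet's creationTimestamp as the
# progress clock, guarded by a revision match between the ReplicaSet
# and its parent Deployment.
def progress_deadline_exceeded?(replica_set, deployment, progress_deadline)
  rs_revision  = replica_set.dig('metadata', 'annotations', 'deployment.kubernetes.io/revision')
  dep_revision = deployment.dig('metadata', 'annotations', 'deployment.kubernetes.io/revision')
  # Only trust the ReplicaSet's clock if it was produced by the current revision
  return false unless rs_revision && rs_revision == dep_revision

  created_at = Time.parse(replica_set.dig('metadata', 'creationTimestamp'))
  Time.now - created_at > progress_deadline
end
```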
Do you mean in the original condition, i.e.
@mkobetic can you weigh in on the logic here? i.e. what do you think about having the deploy make this assumption:
Chatted with Martin IRL and he agrees that the generation-based assumption is sound. I've documented it in an inline comment. Based on his feedback, I've also removed the check I had added on the reason text. The current list of possible reasons is here, and we'd want to time out on all of the negative ones.
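The generation-based assumption agreed on above can be sketched roughly like this. The class and attribute names are illustrative, not the gem's actual API; the point is simply that a `Progressing: False` condition is only trusted once `status.observedGeneration` has caught up to `metadata.generation`.

```ruby
# Hypothetical sketch: only act on a deployment's Progressing condition
# once the controller has observed the latest spec generation.
class DeploymentStatusCheck
  def initialize(current_generation:, observed_generation:, progress_condition:)
    @current_generation  = current_generation
    @observed_generation = observed_generation
    @progress_condition  = progress_condition
  end

  def failing_to_progress?
    # A stale status (written before the latest spec change was seen)
    # must not be treated as a verdict on the current rollout.
    return false unless @observed_generation == @current_generation

    !@progress_condition.nil? && @progress_condition['status'] == 'False'
  end
end
```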
```diff
@@ -31,6 +33,7 @@ def sync
   @progress_deadline = @definition['spec']['progressDeadlineSeconds']
   @desired_replicas = -1
   @max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
+  @observed_generation = -1
```
Should we set a dummy value for `@current_generation` too?
There's a logic error in `deploy_failing_to_progress?`: it assumes that the deploy currently doing the watching also initiated the latest changes to the deployment. Because of this, no-op redeploys of bad code will hang forever, as seen in this shipit stack:

This PR proposes a new assumption: if the status generation matches the resource generation, the conditions in that status must refer to the latest version. I'm not entirely sure this is a safe assumption to make. It could reintroduce something like #213 (though that case was caused by a Kubernetes bug). Let me know if you have any better ideas. Another thing we could do is always return `false` if `Time.now - deploy_started_at` is less than the PDS, without even checking the condition (like we do for <=1.7.7).
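The fallback idea mentioned at the end can be sketched as a simple time guard: never report failure before the `progressDeadlineSeconds` window has elapsed since this deploy started, regardless of what the condition says. This is a simplified standalone sketch under assumed parameter names, not the gem's actual method.

```ruby
# Hypothetical: suppress any failure verdict until the progress deadline
# window (PDS) has elapsed since our own deploy began, mirroring the
# pre-1.7.8 code path that never consults the condition early.
def deploy_failing_to_progress?(deploy_started_at:, progress_deadline:, condition_failed:)
  # Inside the PDS window, any ProgressDeadlineExceeded condition may
  # belong to a previous rollout, so ignore it.
  return false if Time.now - deploy_started_at < progress_deadline

  condition_failed
end
```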