
Fix bug where redeploy hangs forever because PDS condition is ignored #262

Merged · 2 commits · Apr 17, 2018

Conversation

KnVerey (Contributor) commented Apr 4, 2018

There's a logic error in deploy_failing_to_progress?: it assumes that the deploy currently doing the watching also initiated the latest changes to the deployment. Because of this, no-op redeploys of bad code will hang forever, as seen in this shipit stack.
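Roughly, the existing check only honors the Progressing condition when its lastUpdateTime falls after this deploy's start time, so a condition left over from an earlier rollout of the same bad spec is ignored and the redeploy never times out. A minimal sketch of that shape, reconstructed from the discussion below rather than copied from the source:

# Sketch only: the condition is trusted only if it was updated after *this*
# deploy began, so a stale ProgressDeadlineExceeded condition from a previous
# rollout of the identical spec never triggers a timeout.
def deploy_failing_to_progress?
  return false unless @progress_condition.present?
  Time.parse(@progress_condition["lastUpdateTime"]).to_i >= (@deploy_started_at - 5.seconds).to_i &&
    @progress_condition["status"] == "False"
end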

This PR proposes a new assumption: that if the status generation matches the resource generation, the conditions in that status must refer to the latest version. I'm not entirely sure this is a safe assumption to make. It could reintroduce something like #213 (though that case was caused by a Kubernetes bug). Let me know if you have any better ideas. Another thing we could do is always return false if Time.now - deploy_started_at is less than PDS, without even checking the condition (like we do for <=1.7.7).
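A sketch of what that assumption could look like in code (illustrative names, not necessarily the exact diff): the condition is only consulted once the deployment controller reports having observed the current spec generation.

# Illustrative sketch of the proposed assumption: only trust the Progressing
# condition once status.observedGeneration has caught up to metadata.generation,
# then time out whenever that condition reports "False".
def deploy_failing_to_progress?
  return false unless @progress_condition.present?
  return false unless @observed_generation == @current_generation
  @progress_condition["status"] == "False"
end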

KnVerey (Contributor, Author) commented Apr 4, 2018

From kubernetes community docs about writing controllers:

If the primary resource your controller is reconciling supports ObservedGeneration in its status, make sure you correctly set it to metadata.Generation whenever the values between the two fields mismatches.

This lets clients know that the controller has processed a resource. Make sure that your controller is the main controller that is responsible for that resource, otherwise if you need to communicate observation via your own controller, you will need to create a different kind of ObservedGeneration in the Status of the resource.

And from the community api conventions doc:

Some resources report the observedGeneration, which is the generation most recently observed by the component responsible for acting upon changes to the desired state of the resource. This can be used, for instance, to ensure that the reported status reflects the most recent desired status.
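For illustration, the two fields could be read off the fetched resource during sync along these lines (instance_data and the exact variable names are assumptions, but the field paths follow the conventions quoted above):

# Hypothetical sync-time extraction: the status is only "current" once the
# controller has observed the generation we just applied.
@current_generation  = instance_data.dig("metadata", "generation")
@observed_generation = instance_data.dig("status", "observedGeneration") || -1
status_reflects_latest_spec = (@observed_generation == @current_generation)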

dturn (Contributor) commented Apr 4, 2018

Not sure if I've fully wrapped my head around this bug, but would it work to use the creationTimestamp of the replica set instead of @deploy_started_at? (We'd have to verify that the current rs generation == current deployment generation.)

KnVerey (Contributor, Author) commented Apr 4, 2018

Do you mean in the original condition, i.e. Time.parse(@progress_condition["lastUpdateTime"]).to_i >= (@latest_rs.created_at - 5.seconds).to_i? The danger there would be that we'd never be able to time out when the deployment controller is failing to create a new replicaset in the first place (which is something we've actually seen in production due to a k8s bug with type: Recreate deployments).

KnVerey (Contributor, Author) commented Apr 6, 2018

@mkobetic can you weigh in on the logic here? i.e. what do you think about having the deploy make this assumption:

This PR proposes a new assumption: that if the status generation matches the resource generation, the conditions in that status must refer to the latest version.

KnVerey (Contributor, Author) commented Apr 10, 2018

Chatted with Martin IRL and he agrees that the generation-based assumption is good. I've documented it in an inline comment. Based on his feedback, I've also removed the check I had added on the reason text. The current list of possible reasons is here, and we'd want to time out on all of the negative ones.

@@ -31,6 +33,7 @@ def sync
@progress_deadline = @definition['spec']['progressDeadlineSeconds']
@desired_replicas = -1
@max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
@observed_generation = -1
Contributor commented:

Should we set a dummy value for @current_generation too?
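For context, the suggestion would presumably look something like this among sync's defaults (a sketch mirroring the -1 placeholder above, not the committed code):

# Placeholder values used before the live Deployment has been fetched; both
# are overwritten once sync reads the resource's metadata and status.
@observed_generation = -1
@current_generation = -1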
