Fix bug where redeploy hangs forever because PDS condition is ignored #262
Conversation
From the Kubernetes community docs about writing controllers:
And from the community API conventions doc:
Not sure if I've fully wrapped my head around this bug, but would it work to use the `creationTimestamp` of the replica set instead of `@deploy_started_at`? (We'd have to verify that the current RS generation matches the current deployment generation.)
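A rough sketch of the idea floated above: time out from the ReplicaSet's `creationTimestamp` rather than `@deploy_started_at`, but only after verifying the ReplicaSet belongs to the deployment's current revision. The method name and the revision-annotation comparison here are illustrative assumptions, not the gem's actual implementation.

```ruby
require 'time'

# Hypothetical: use the latest ReplicaSet's creationTimestamp as the
# progress clock, guarded by a revision match between the ReplicaSet
# and its parent Deployment.
def progress_deadline_exceeded?(replica_set, deployment, progress_deadline)
  rs_revision  = replica_set.dig('metadata', 'annotations', 'deployment.kubernetes.io/revision')
  dep_revision = deployment.dig('metadata', 'annotations', 'deployment.kubernetes.io/revision')
  # Only trust the ReplicaSet's clock if it was produced by the current revision
  return false unless rs_revision && rs_revision == dep_revision

  created_at = Time.parse(replica_set.dig('metadata', 'creationTimestamp'))
  Time.now - created_at > progress_deadline
end
```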
Do you mean in the original condition, i.e.
@mkobetic can you weigh in on the logic here? i.e. what do you think about having the deploy make this assumption:
Chatted with Martin IRL and he agrees that the generation-based assumption is sound. I've documented it in an inline comment. Based on his feedback, I've also removed the check I had added on the reason text. The current list of possible reasons is here, and we'd want to time out on all of the negative ones.
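The generation-based assumption agreed on above can be sketched roughly like this. The class and attribute names are illustrative, not the gem's actual API; the point is simply that a `Progressing: False` condition is only trusted once `status.observedGeneration` has caught up to `metadata.generation`.

```ruby
# Hypothetical sketch: only act on a deployment's Progressing condition
# once the controller has observed the latest spec generation.
class DeploymentStatusCheck
  def initialize(current_generation:, observed_generation:, progress_condition:)
    @current_generation  = current_generation
    @observed_generation = observed_generation
    @progress_condition  = progress_condition
  end

  def failing_to_progress?
    # A stale status (written before the latest spec change was seen)
    # must not be treated as a verdict on the current rollout.
    return false unless @observed_generation == @current_generation

    !@progress_condition.nil? && @progress_condition['status'] == 'False'
  end
end
```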
```diff
@@ -31,6 +33,7 @@ def sync
   @progress_deadline = @definition['spec']['progressDeadlineSeconds']
   @desired_replicas = -1
   @max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
+  @observed_generation = -1
```
Should we set a dummy value for `@current_generation` too?
There's a logic error in `deploy_failing_to_progress?`: it assumes that the deploy currently doing the watching also initiated the latest changes to the deployment. Because of this, no-op redeploys of bad code will hang forever, as seen in this shipit stack:

This PR proposes a new assumption: if the status generation matches the resource generation, the conditions in that status must refer to the latest version. I'm not entirely sure this is a safe assumption to make. It could reintroduce something like #213 (though that case was caused by a Kubernetes bug). Let me know if you have any better ideas. Another thing we could do is always return `false` if `Time.now - deploy_started_at` is less than the PDS, without even checking the condition (like we do for <=1.7.7).
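The fallback idea mentioned at the end can be sketched as a simple time guard: never report failure before the `progressDeadlineSeconds` window has elapsed since this deploy started, regardless of what the condition says. This is a simplified standalone sketch under assumed parameter names, not the gem's actual method.

```ruby
# Hypothetical: suppress any failure verdict until the progress deadline
# window (PDS) has elapsed since our own deploy began, mirroring the
# pre-1.7.8 code path that never consults the condition early.
def deploy_failing_to_progress?(deploy_started_at:, progress_deadline:, condition_failed:)
  # Inside the PDS window, any ProgressDeadlineExceeded condition may
  # belong to a previous rollout, so ignore it.
  return false if Time.now - deploy_started_at < progress_deadline

  condition_failed
end
```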