
Compare generations when available #325

Merged: 1 commit merged into master from the observed_generation branch on Aug 8, 2018

Conversation

@KnVerey (Contributor) commented Aug 7, 2018

Problem

I just observed an instant false-positive success on a web deployment. Relevant lines:

[INFO][2018-08-07 19:27:25 +0000]	Successfully deployed in 1.9s: [...] Deployment/web
[INFO][2018-08-07 19:27:25 +0000]	Deployment/web                                    3 replicas, 3 updatedReplicas, 3 availableReplicas

The rollout definitely did not complete in less than 2s.

Checking the code made me notice that we aren't comparing metadata.generation to status.observedGeneration before using the status to determine success. This means we might be making that determination from stale data, which could explain the behaviour above.

Solution

Compare metadata.generation to status.observedGeneration before making the _succeeded? and _failed? determinations whenever possible. I used the API documentation to figure out which controllers actually populate the observed generation.

Since this is a controller race, I can't think of a way to reproduce it in integration tests. I have added regression tests at the unit test level.
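For context, here is a minimal sketch of the pattern this PR introduces. The accessor bodies, the default values, and the status_indicates_success? helper are illustrative stand-ins rather than this repo's exact code, but metadata.generation and status.observedGeneration are the real Kubernetes fields involved:

# Sketch only: gate status-based checks behind a generation comparison.
def current_generation
  return -1 unless exists?
  @instance_data.dig("metadata", "generation")        # bumped by the API server on each spec change
end

def observed_generation
  return -2 unless exists?                            # distinct default so a missing resource never "matches"
  @instance_data.dig("status", "observedGeneration") || 0
end

def deploy_succeeded?
  return false unless observed_generation == current_generation
  # only now is it safe to trust replica counts and conditions in the status
  status_indicates_success?                           # hypothetical stand-in for the resource-specific check
end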

@Shopify/cloudx

@@ -24,7 +24,8 @@ def deploy_succeeded?
   end

   def deploy_failed?
-    pods.present? && pods.any?(&:deploy_failed?)
+    pods.present? && pods.any?(&:deploy_failed?) &&
+      observed_generation == current_generation
Contributor:

Should this not be !=? Or maybe I just misunderstood it.

Contributor:

As I understand it (and @KnVerey will hopefully correct me if I'm wrong), current_generation gets bumped when we apply the config, while observed_generation gets set by the controller reporting the status. When they don't match, it means we're looking at a status for the previous config, not the current one, i.e. we don't want to say the deployment has failed or succeeded before the status reflects the current config's state.
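To make the race concrete, here is a small illustration with made-up numbers; the hash shape mirrors the Kubernetes fields discussed above and is not taken from a real cluster:

# Illustrative instance data right after an apply, before the controller reconciles the new spec
deployment = {
  "metadata" => { "generation" => 4 },       # bumped by the API server for the new spec
  "status"   => {
    "observedGeneration" => 3,               # the controller has only seen the previous spec
    "availableReplicas"  => 3,               # ...so this count describes generation 3, not 4
  },
}
caught_up = deployment.dig("metadata", "generation") ==
            deployment.dig("status", "observedGeneration")
# caught_up => false: neither success nor failure should be decided from this status yet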

Contributor:

Oh! Thanks for the clarification.


def observed_generation
  return -2 unless exists?
  # populating this is a best practice, but not all controllers actually do it
Contributor:

Out of curiosity, do you have an example of a controller that doesn't populate it?

KnVerey (author):

Since all objects have a metadata generation, theoretically any controller could set observedGeneration, though this is only meaningful if the controller writes a status at all. Examples of resources that have a status but no observedGeneration are Job and CronJob. See also this k8s core issue. As a side note, in 1.11 (I think it reached beta there?), CRs will have a reliable metadata.generation we can use to do this properly in our own controllers. Our generic CR success feature should make use of it when available.

@karanthukral (Contributor) left a comment:

Just the one question for clarification, but the rest looks good to me.

@dturn (Contributor) left a comment:

Out of curiosity: is there a reasonable way to return early from sync when observed_generation != current_generation? It doesn't seem like we get anything useful when that happens.

@KnVerey (author) commented Aug 8, 2018

> Is there a reasonable way to return early from sync when observed_generation != current_generation? Doesn't seem like we get anything useful when that happens.

What do you have in mind that we could skip? The basic sync is just @instance_data = mediator.get_instance(type, name), and we don't know whether the generations match until we've made that request. Maybe this is suggesting an additional state for our tentative state machine implementation, though (names not to be taken seriously): undeployed -> (outdated) -> pending -> succeeded/failed/timed out
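A rough sketch of how that extra pseudo-state could slot in, using the names above (hypothetical only; this state machine does not exist yet at this point):

# Hypothetical sketch: status determination with an explicit "outdated" state
def state
  return :undeployed unless exists?
  return :outdated unless observed_generation == current_generation  # status still describes an old spec
  return :succeeded if deploy_succeeded?
  return :failed if deploy_failed?
  :pending  # timed-out handling omitted from this sketch
end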

@dturn (Contributor) commented Aug 8, 2018

Having to put this check in every class just makes me think we're missing an abstraction layer. The sync method was an attempt, but I think you're right that the state machine is probably the best way forward.

@KnVerey (author) commented Aug 8, 2018

Yeah, it definitely irked me to write that line in so many places. I could technically add that pseudo-state to our current model like this, but we might get unexpected results unless we also make every place that partitions resources into groups aware of the possibility:

[screenshot of the proposed change omitted]

WDYT?

@dturn (Contributor) commented Aug 8, 2018

I think it's better as is than adding a new state to unwind later.

@KnVerey (author) commented Aug 8, 2018

👍 I'll merge as-is then. I added a comment to your WIP so we don't forget about this extra state.

@KnVerey merged commit 10a0c37 into master on Aug 8, 2018
@KnVerey deleted the observed_generation branch on August 8, 2018 at 18:10