
Retry dry-run validation when no error returned #705

Merged · 1 commit · Apr 14, 2020

Conversation

@benlangfeld (Contributor) commented Mar 5, 2020

What are you trying to accomplish with this PR?

Mitigate / fix #683

How is this accomplished?

Retry dry-run failures.

What could go wrong?

?

@benlangfeld benlangfeld requested a review from a team as a code owner March 5, 2020 14:36
@dturn (Contributor) commented Mar 9, 2020

> Retry dry-run failures.

Can you explain a bit more about why retrying on errors other than timeouts fixes this problem? That way, if this comes up again, we'll have something written down that we can look back on.

@benlangfeld (Contributor, Author)

I just removed the whitelist, which has the effect of retrying any kind of error, not just a timeout.

@dturn dturn requested a review from RyanBrushett March 11, 2020 21:44
@@ -582,7 +581,7 @@ def validate_with_local_dry_run(kubectl)
verb = deploy_method == :apply ? "apply" : "create"
command = [verb, "-f", file_path, "--dry-run", "--output=name"]
kubectl.run(*command, log_failure: false, output_is_sensitive: sensitive_template_content?,
retry_whitelist: [:client_timeout], attempts: 3, use_namespace: !global?)
KnVerey (Contributor)

IIRC the reason for needing these whitelists was that the dry runs return a non-zero exit status when the validation ran successfully and found an invalid template. We do not want to re-run it in those cases. Before removing the whitelist, we need to prove I'm wrong about that in all supported versions of kubectl (and add a test showing it).

benlangfeld (Contributor, Author)

@KnVerey What sort of invalid template?

KnVerey (Contributor)

Any kind? We're doing this dry run to catch when the template is malformed, and when such an error is caught, the exit status is (or at least was when we built the feature) non-zero. Dry run success means no errors found in the templates.

benlangfeld (Contributor, Author) commented Mar 30, 2020

Honestly, I'm not sure how I would write that test. As a unit test on KubernetesResource digging into the guts of Kubectl to assert that it only shells out once?

I guess what you want is to separate the cases of non-zero with useful stderr (don't retry) from non-zero with empty stderr (retry). Would you be happy with a flag to Kubectl#run that we assert is passed, testing its behaviour in the Kubectl tests, or do you need a functional test for that, stubbing out the implementation detail of shelling out to kubectl in place of mocking Kubectl#run as we do right now?
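The split described above can be sketched as a small predicate. This is an illustrative sketch only; `retry_dry_run?` is a made-up name, not part of kubernetes-deploy's actual API:

```ruby
# Hedged sketch; retry_dry_run? is illustrative, not kubernetes-deploy's API.
# It separates the two kinds of non-zero dry-run exits discussed here:
#   - non-zero with a useful message on stderr -> legitimate validation
#     failure, run once and report it
#   - non-zero with blank stderr -> bogus sporadic failure, retry
def retry_dry_run?(exit_status, stderr)
  return false if exit_status.zero? # success: nothing to retry
  stderr.to_s.strip.empty?          # retry only when no error was given
end

# A valid template that hits the sporadic bug exits 1 with no output: retry.
puts retry_dry_run?(1, "")                                   # => true
# A genuinely invalid template exits 1 with a message: do not retry.
puts retry_dry_run?(1, 'error: error validating "foo.yaml"') # => false
```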

KnVerey (Contributor)

The first thing to do is to confirm what kubectl's actual behavior is, which you could do by dry-running a template with an error in it and noting the exit code. If it is indeed non-zero, then this PR needs to find another way to fix. As for the test, one idea would be to enable debug-level logging on a simple integration test with a single invalid resource. With debug-level logging enabled, we print every command we run, so you could scan the output to see what actually happened.

benlangfeld (Contributor, Author)

➭ cat foo.yaml
apiVersion: v1
metadata:
  name: foo
➭ kubectl apply -f foo.yaml --dry-run
error: error validating "foo.yaml": error validating data: kind not set; if you choose to ignore these errors, turn validation off with --validate=false
➭ echo $?
1
➭ kubectl apply -f foo.yaml --server-dry-run
error: error validating "foo.yaml": error validating data: kind not set; if you choose to ignore these errors, turn validation off with --validate=false
➭ echo $?
1

> If it is indeed non-zero, then this PR needs to find another way to fix.

I think maybe you misunderstood what I'm trying to fix here, @KnVerey. Sometimes I can dry-run a totally valid resource and get a non-zero exit code with empty stderr/stdout: a bogus failure. When this happens, I want the dry run to be retried. I'm using this change on a real-world project and it has completely solved our deployment failures. The errors happen only sporadically and are bogus, and we shouldn't fail a deployment on them.

What if I retried only the local-dry-run fallback, so as not to repeat connections to the API server for a failed server-dry-run? Would that satisfy your requirement?

KnVerey (Contributor)

I understand what you're trying to accomplish, but the result of the change in this PR is retrying all validation failures, both legitimate and bogus, up to 3 times. In fact, it guarantees that if you have an invalid template, the validation will run 3 times. We have deploys for which local dry run is already noticeably slow, and this change would degrade performance unacceptably. Looking at your bug report again, we could create a new error matcher that requires the string to be entirely blank, and add that to the retry whitelist.
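The matcher idea in this comment could look roughly like the following. This is a hedged sketch under assumed names; `BLANK_ERROR_MATCHER` and `retriable_error?` are illustrative, not the project's real internals:

```ruby
# Hedged sketch of the suggested whitelist entry; names are illustrative.
# A matcher that treats only an entirely blank error string as retriable,
# so real validation errors (non-blank stderr) still fail immediately.
BLANK_ERROR_MATCHER = /\A\s*\z/

def retriable_error?(stderr, matchers)
  matchers.any? { |matcher| matcher.match?(stderr.to_s) }
end

puts retriable_error?("", [BLANK_ERROR_MATCHER])                    # => true
puts retriable_error?("error: kind not set", [BLANK_ERROR_MATCHER]) # => false
```

Adding such a matcher to the existing retry whitelist keeps the current behavior for legitimate validation failures while retrying only the empty-output case from the bug report.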

benlangfeld (Contributor, Author)

@KnVerey Does that last change look ok?

KnVerey (Contributor)

Yep, that's what I had in mind

@KnVerey (Contributor) left a comment

We need to be sure we're not retrying every time templates are invalid

@benlangfeld benlangfeld changed the title Retry dry-run validation Retry dry-run validation when no error provided Apr 13, 2020
This can sometimes happen apparently due to a bug in Kubernetes. Retry to deal with instability.
@benlangfeld benlangfeld changed the title Retry dry-run validation when no error provided Retry dry-run validation when no error returned Apr 13, 2020
@KnVerey KnVerey requested review from KnVerey and dturn April 13, 2020 22:53
@KnVerey KnVerey merged commit 6b3b946 into Shopify:master Apr 14, 2020
@benlangfeld (Contributor, Author)

Thanks @KnVerey . Is there a schedule for cutting the next release?

@KnVerey (Contributor) commented Apr 23, 2020

@benlangfeld we didn't have a schedule, but @RyanBrushett released it for you today 🎉

@benlangfeld (Contributor, Author)

Thanks @RyanBrushett !

Successfully merging this pull request may close these issues:

- Server dry run sometimes fails without an error message (#683)

4 participants