
Retry dry-run validation when no error returned #705

Merged · 1 commit · Apr 14, 2020

Conversation

@benlangfeld (Contributor) commented Mar 5, 2020

What are you trying to accomplish with this PR?

Mitigate / fix #683

How is this accomplished?

Retry dry-run failures.

What could go wrong?

?

@benlangfeld benlangfeld requested a review from a team as a code owner March 5, 2020 14:36
@dturn (Contributor) commented Mar 9, 2020

> Retry dry-run failures.

Can you explain a bit more about why retrying on errors other than timeouts fixes this problem? That way, if this comes up again, we'll have something written down that we can look back on.

@benlangfeld (Contributor, Author)

I just removed the whitelist, which has the effect of retrying any kind of error, not just a timeout.

@dturn dturn requested a review from RyanBrushett March 11, 2020 21:44
@@ -582,7 +581,7 @@ def validate_with_local_dry_run(kubectl)
verb = deploy_method == :apply ? "apply" : "create"
command = [verb, "-f", file_path, "--dry-run", "--output=name"]
kubectl.run(*command, log_failure: false, output_is_sensitive: sensitive_template_content?,
retry_whitelist: [:client_timeout], attempts: 3, use_namespace: !global?)
KnVerey (Contributor)

IIRC the reason for needing these whitelists was that the dry runs return a non-zero exit status when the validation ran successfully and found an invalid template. We do not want to re-run it in those cases. Before removing the whitelist, we need to prove I'm wrong about that in all supported versions of kubectl (and add a test showing it).

benlangfeld (Contributor, Author)

@KnVerey What sort of invalid template?

KnVerey (Contributor)

Any kind? We're doing this dry run to catch when the template is malformed, and when such an error is caught, the exit status is (or at least was when we built the feature) non-zero. Dry run success means no errors found in the templates.

benlangfeld (Contributor, Author) commented Mar 30, 2020

Honestly, I'm not sure how I would write that test. As a unit test on KubernetesResource digging into the guts of Kubectl to assert that it only shells out once?

I guess what you want is to separate the cases of non-zero with useful stderr (don't retry) from non-zero with empty stderr (retry). Would you be happy with a flag to Kubectl#run that we assert is passed, testing its behaviour in the Kubectl tests, or do you need a functional test for that, stubbing out the implementation detail of shelling out to kubectl in place of mocking Kubectl#run as we do right now?
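The split described above can be sketched as a small predicate. This is an illustrative sketch only; `retry_dry_run?` is a made-up name, not part of kubernetes-deploy's actual API:

```ruby
# Hedged sketch; retry_dry_run? is illustrative, not kubernetes-deploy's API.
# It separates the two kinds of non-zero dry-run exits discussed here:
#   - non-zero with a useful message on stderr -> legitimate validation
#     failure, run once and report it
#   - non-zero with blank stderr -> bogus sporadic failure, retry
def retry_dry_run?(exit_status, stderr)
  return false if exit_status.zero? # success: nothing to retry
  stderr.to_s.strip.empty?          # retry only when no error was given
end

# A valid template that hits the sporadic bug exits 1 with no output: retry.
puts retry_dry_run?(1, "")                                   # => true
# A genuinely invalid template exits 1 with a message: do not retry.
puts retry_dry_run?(1, 'error: error validating "foo.yaml"') # => false
```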

KnVerey (Contributor)

The first thing to do is to confirm what kubectl's actual behavior is, which you could do by dry-running a template with an error in it and noting the exit code. If it is indeed non-zero, then this PR needs to find another way to fix. As for the test, one idea would be to enable debug-level logging on a simple integration test with a single invalid resource. With debug-level logging enabled, we print every command we run, so you could scan the output to see what actually happened.

benlangfeld (Contributor, Author)

➭ cat foo.yaml
apiVersion: v1
metadata:
  name: foo
➭ kubectl apply -f foo.yaml --dry-run
error: error validating "foo.yaml": error validating data: kind not set; if you choose to ignore these errors, turn validation off with --validate=false
➭ echo $?
1
➭ kubectl apply -f foo.yaml --server-dry-run
error: error validating "foo.yaml": error validating data: kind not set; if you choose to ignore these errors, turn validation off with --validate=false
➭ echo $?
1

> If it is indeed non-zero, then this PR needs to find another way to fix.

I think maybe you misunderstood what I'm trying to fix here, @KnVerey. Sometimes I can dry-run a totally valid resource and get a non-zero exit code with empty stderr/stdout: a bogus failure. When this happens, I want the dry run to be retried. I'm using this change on a real-world project and it has completely solved our deployment failures. The errors happen only sporadically and are bogus, and we shouldn't fail a deployment on them.

What if I retried only the local-dry-run fallback, so as not to repeat connections to the API server for a failed server-dry-run? Would that satisfy your requirement?

KnVerey (Contributor)

I understand what you're trying to accomplish, but the result of the change in this PR is retrying all validation failures, both legitimate and bogus, up to 3 times. In fact, it guarantees that if you have an invalid template, the validation will run 3 times. We have deploys for which local dry run is already noticeably slow, and this change would degrade performance unacceptably. Looking at your bug report again, we could create a new error matcher that requires the string to be entirely blank, and add that to the retry whitelist.
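The matcher idea in this comment could look roughly like the following. This is a hedged sketch under assumed names; `BLANK_ERROR_MATCHER` and `retriable_error?` are illustrative, not the project's real internals:

```ruby
# Hedged sketch of the suggested whitelist entry; names are illustrative.
# A matcher that treats only an entirely blank error string as retriable,
# so real validation errors (non-blank stderr) still fail immediately.
BLANK_ERROR_MATCHER = /\A\s*\z/

def retriable_error?(stderr, matchers)
  matchers.any? { |matcher| matcher.match?(stderr.to_s) }
end

puts retriable_error?("", [BLANK_ERROR_MATCHER])                    # => true
puts retriable_error?("error: kind not set", [BLANK_ERROR_MATCHER]) # => false
```

Adding such a matcher to the existing retry whitelist keeps the current behavior for legitimate validation failures while retrying only the empty-output case from the bug report.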

benlangfeld (Contributor, Author)

@KnVerey Does that last change look ok?

KnVerey (Contributor)

Yep, that's what I had in mind

@KnVerey (Contributor) left a comment

We need to be sure we're not retrying every time templates are invalid

@benlangfeld benlangfeld changed the title Retry dry-run validation Retry dry-run validation when no error provided Apr 13, 2020
This can sometimes happen apparently due to a bug in Kubernetes. Retry to deal with instability.
@benlangfeld benlangfeld changed the title Retry dry-run validation when no error provided Retry dry-run validation when no error returned Apr 13, 2020
@KnVerey KnVerey requested review from KnVerey and dturn April 13, 2020 22:53
@KnVerey KnVerey merged commit 6b3b946 into Shopify:master Apr 14, 2020
@benlangfeld (Contributor, Author)

Thanks @KnVerey . Is there a schedule for cutting the next release?

@KnVerey (Contributor) commented Apr 23, 2020

@benlangfeld we didn't have a schedule, but @RyanBrushett released it for you today 🎉

@benlangfeld (Contributor, Author)

Thanks @RyanBrushett !

Successfully merging this pull request may close these issues:

- Server dry run sometimes fails without an error message (#683)

4 participants