Retry dry-run validation when no error returned #705
Conversation
Can you explain a bit more about why no longer retrying only on timeouts fixes this problem? That way, if this comes up again, we'll have something written down that we can look back on.
I just removed the whitelist, which has the effect of retrying any kind of error, not just a timeout.
@@ -582,7 +581,7 @@ def validate_with_local_dry_run(kubectl)
      verb = deploy_method == :apply ? "apply" : "create"
      command = [verb, "-f", file_path, "--dry-run", "--output=name"]
      kubectl.run(*command, log_failure: false, output_is_sensitive: sensitive_template_content?,
        retry_whitelist: [:client_timeout], attempts: 3, use_namespace: !global?)
IIRC the reason for needing these whitelists was that the dry runs return a non-zero exit status when the validation ran successfully and found an invalid template. We do not want to re-run it in those cases. Before removing the whitelist, we need to prove I'm wrong about that in all supported versions of kubectl (and add a test showing it).
@KnVerey What sort of invalid template?
Any kind? We're doing this dry run to catch when the template is malformed, and when such an error is caught, the exit status is (or at least was when we built the feature) non-zero. Dry run success means no errors found in the templates.
Honestly, I'm not sure how I would write that test. As a unit test on KubernetesResource, digging into the guts of Kubectl to assert that it only shells out once? I guess what you want is to separate the cases of non-zero with useful stderr (don't retry) from non-zero with empty stderr (retry). Would you be happy with a flag to Kubectl#run that we assert is passed, with its behaviour tested in the Kubectl tests, or do you need a functional test for that, stubbing out the implementation detail of shelling out to kubectl instead of mocking Kubectl#run as we do right now?
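A minimal, self-contained sketch of the test shape being discussed here, assuming a stubbed shell-out; FakeKubectl, dry_run_with_retry, and the retry policy shown are hypothetical illustrations, not the project's real API:

```ruby
require "minitest/autorun"

# All names here are hypothetical stand-ins, not kubernetes-deploy's real API.
class FakeKubectl
  attr_reader :calls

  def initialize(results)
    @results = results # queue of [stdout, stderr, success] triples
    @calls = 0
  end

  def run(*)
    @calls += 1
    @results.shift
  end
end

# The retry policy under discussion: re-run only when the failure produced
# no stderr at all; a real validation error (useful stderr) fails immediately.
def dry_run_with_retry(kubectl, attempts: 3)
  attempts.times do
    out, err, success = kubectl.run("apply", "-f", "template.yaml", "--dry-run")
    return out if success
    break unless err.strip.empty?
  end
  nil
end

class DryRunRetryTest < Minitest::Test
  def test_retries_when_stderr_is_blank
    kubectl = FakeKubectl.new([["", "", false], ["pod/foo", "", true]])
    assert_equal("pod/foo", dry_run_with_retry(kubectl))
    assert_equal(2, kubectl.calls)
  end

  def test_shells_out_only_once_on_a_real_validation_error
    kubectl = FakeKubectl.new([["", "error: kind not set", false]])
    assert_nil(dry_run_with_retry(kubectl))
    assert_equal(1, kubectl.calls) # shelled out exactly once, no retry
  end
end
```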
The first thing to do is to confirm what kubectl's actual behavior is, which you could do by dry-running a template with an error in it and noting the exit code. If it is indeed non-zero, then this PR needs to find another way to fix the problem. As for the test, one idea would be to enable debug-level logging on a simple integration test with a single invalid resource. With debug-level logging enabled, we print every command we run, so you could scan the output to see what actually happened.
➭ cat foo.yaml
apiVersion: v1
metadata:
name: foo
{19:25}~/Downloads [2.7.0]
➭ kubectl apply -f foo.yaml --dry-run
error: error validating "foo.yaml": error validating data: kind not set; if you choose to ignore these errors, turn validation off with --validate=false
{19:25}~/Downloads [2.7.0]
1 ↵ ➭ echo $?
1
{19:25}~/Downloads [2.7.0]
➭ kubectl apply -f foo.yaml --server-dry-run
error: error validating "foo.yaml": error validating data: kind not set; if you choose to ignore these errors, turn validation off with --validate=false
{19:26}~/Downloads [2.7.0]
1 ↵ ➭ echo $?
1
> If it is indeed non-zero, then this PR needs to find another way to fix the problem.
I think maybe you misunderstood what I'm trying to fix here, @KnVerey. Sometimes I can dry-run a totally valid resource and get a non-zero exit code with empty stderr/stdout: a bogus failure. When this happens, I want the dry run to be retried. I'm using this change on a real-world project, and it has completely solved our deployment failures. These errors only happen sporadically and are bogus, and we shouldn't fail a deployment on them.
What if I just retry the local dry-run fallback, so as not to repeat connections to the API server for a failed server dry-run? Would that satisfy your requirement?
I understand what you're trying to accomplish, but the result of the change in this PR is retrying all validation failures, both legitimate and bogus, up to 3 times. In fact, it guarantees that if you have an invalid template, the validation will be run 3 times. We have deploys for which the local dry run is already noticeably slow, and this change would degrade performance unacceptably. Looking at your bug report again, we could create a new error matcher that requires the string to be entirely blank, and add that to the retry whitelist.
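One possible shape for that matcher, as a hedged sketch; the ERROR_MATCHERS hash, the :blank key, and retriable_err? are assumed names for illustration, not confirmed against the project's source:

```ruby
# Hedged sketch only: the hash, the :blank key, and retriable_err? are
# assumed names, not necessarily kubernetes-deploy's real internals.
ERROR_MATCHERS = {
  client_timeout: /Client\.Timeout exceeded while awaiting headers/,
  blank: /\A\s*\z/, # matches only an entirely empty/whitespace stderr
}.freeze

def retriable_err?(err, retry_whitelist)
  retry_whitelist.any? { |name| err.match?(ERROR_MATCHERS.fetch(name)) }
end

retriable_err?("", [:client_timeout, :blank])   # => true  (bogus failure: retry)
retriable_err?("error: kind not set", [:blank]) # => false (real error: fail fast)
```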
@KnVerey Does that last change look ok?
Yep, that's what I had in mind
We need to be sure we're not retrying every time templates are invalid
force-pushed from 8b70ec7 to 6272229
force-pushed from 9d05d05 to 0e8368e
This can sometimes happen apparently due to a bug in Kubernetes. Retry to deal with instability.
force-pushed from 0e8368e to b0fbc5e
Thanks, @KnVerey. Is there a schedule for cutting the next release?
@benlangfeld we didn't have a schedule, but @RyanBrushett released it for you today 🎉
Thanks, @RyanBrushett!
What are you trying to accomplish with this PR?
Mitigate / fix #683
How is this accomplished?
Retry dry-run failures.
What could go wrong?
?