data/aws: 20-minute create timeouts for routes and security groups #1682
Using operation timeouts [1,2,3,4,5]; the route and security-group timeouts were both added in v1.11, so we have them in our v2.2 AWS provider. This should mitigate some of the issues we've been having in our busy CI account, where out of ~1150 jobs in the last 24 hours, we've had the following failures [6]:

    $ curl -s 'http://localhost:8000/search?name=-e2e-aws&.&q=level%3Derror.*timeout+while+waiting+for+state' | jq -r '. | to_entries[].value[] | to_entries[].value[]' | sed 's/(i-[^)]*/(i-.../;s/(igw-[^)]*/(igw-.../;s/\(master\|nat_gw\|private_routing\|route_net\)\.[0-9]/\1.../' | sort | uniq -c | sort -n
          2 level=error msg="\t* aws_instance.master...: Error waiting for instance (i-...) to become ready: timeout while waiting for state
         10 level=error msg="\t* aws_security_group.bootstrap: timeout while waiting for state
         38 level=error msg="\t* aws_route.igw_route: Error creating route: timeout while waiting for state
         58 level=error msg="\t* aws_internet_gateway.igw: error attaching EC2 Internet Gateway (igw-...): timeout while waiting for state
         76 level=error msg="\t* aws_route_table_association.private_routing...: timeout while waiting for state
         90 level=error msg="\t* aws_route_table_association.route_net...: timeout while waiting for state
        164 level=error msg="\t* aws_route.to_nat_gw...: Error creating route: timeout while waiting for state

The 20-minute timeout is much higher than the two-minute route default [2], so that should help a lot with our leading error. The security group default is 10 minutes [4], so this is less of a change there, and we only see that error rarely anyway.

I went with 20 minutes (instead of a higher number) because a single resource (or parallel resources) coming in just under that limit will keep the full Terraform step under the 30 minutes that we've chosen as a timeout for our other steps (waiting for the Kubernetes API, bootstrap completion, and install completion). But obviously we can tune more later if necessary.
[1]: https://www.terraform.io/docs/configuration/resources.html#operation-timeouts
[2]: https://www.terraform.io/docs/providers/aws/r/route.html#timeouts
[3]: hashicorp/terraform-provider-aws#3639 (v1.11.0)
[4]: https://www.terraform.io/docs/providers/aws/r/security_group.html#timeouts
[5]: hashicorp/terraform-provider-aws#3599 (v1.11.0)
[6]: https://github.com/wking/openshift-release/tree/debug-scripts/d3
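For readers unfamiliar with the knob, here is a minimal sketch of what a 20-minute create timeout could look like on a route and a security group. It is not the actual diff from this PR: the resource names `igw_route`, `igw`, and `bootstrap` are borrowed from the error log above, while `aws_route_table.default`, `var.vpc_id`, and every argument value are hypothetical placeholders. The `"${...}"` interpolation style is assumed to match the Terraform version in use at the time.

```hcl
# Illustrative only: arguments are placeholders, not the installer's real config.
resource "aws_route" "igw_route" {
  route_table_id         = "${aws_route_table.default.id}" # hypothetical route table
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = "${aws_internet_gateway.igw.id}"

  # Raise the create timeout above the two-minute route default.
  timeouts {
    create = "20m"
  }
}

resource "aws_security_group" "bootstrap" {
  vpc_id = "${var.vpc_id}" # hypothetical variable

  # Raise the create timeout above the 10-minute security group default.
  timeouts {
    create = "20m"
  }
}
```

Only `create` is set in this sketch, matching the PR title; other operations keep their provider defaults.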
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: smarterclayton, wking
Both e2e-aws and e2e-aws-upgrade got through Terraform, so this is going to be fine.
Now the test suite is breaking down on us 🤷♂️ /retest
This seemed to make a significant difference in install flakes. Tomorrow
will tell whether it holds up when everyone is trying to merge.
On Sat, Apr 27, 2019 at 4:13 AM OpenShift Merge Robot wrote: Merged #1682 into master.
Up from their default 10 minutes, using the knob that dates back to the original network load balancer support [1]. This should help us avoid the [2]:

    Error: timeout while waiting for state to become 'active' (last state: 'provisioning', timeout: 10m0s)

that cropped up again this week. 20m matches the timeout we set for routes and security groups in 246f4a1 (data/aws: 20-minute create timeouts for routes and security groups, 2019-04-26, openshift#1682). Sometimes even 20m will not be enough [3], but it should make us a bit more resilient anyway.

[1]: hashicorp/terraform-provider-aws@1af53b1#diff-f4b0dbdc7e3eede6ba70cd286c834f37R92
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1717604
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1717604#c15
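As a rough illustration of the change described in that commit message, assuming the resource in question is `aws_lb` (the message only says "network load balancer support"), the timeout could be raised as below. The resource name `example`, the load balancer name, and `var.subnet_ids` are hypothetical placeholders, not taken from the installer.

```hcl
# Illustrative only: names and arguments are placeholders.
resource "aws_lb" "example" {
  name               = "example-nlb"
  load_balancer_type = "network"
  subnets            = "${var.subnet_ids}" # hypothetical list variable

  # Raise the create timeout above the 10-minute default quoted in the error.
  timeouts {
    create = "20m"
  }
}
```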
Using operation timeouts from hashicorp/terraform-provider-aws#3599 and hashicorp/terraform-provider-aws#3639, both of which were added in v1.11, so we have them in our v2.2 AWS provider. This should mitigate some of the issues we've been having in our busy CI account, where out of ~1150 jobs in the last 24 hours, we've had the following failures (using this server):
The 20-minute timeout is much higher than the two-minute route default, so that should help a lot with our leading error. The security group default is 10 minutes, so this is less of a change there, and we only see that error rarely anyway. I went with 20 minutes (instead of a higher number) because a single resource (or parallel resources) coming in just under that limit will keep the full Terraform step under the 30 minutes that we've chosen as a timeout for our other steps (waiting for the Kubernetes API, bootstrap completion, and install completion). But obviously we can tune more later if necessary.
This can happen instead of, or together with, openshift/release#3615.