data/aws: 20-minute create timeouts for routes and security groups #1682

@wking wking commented Apr 26, 2019

Using operation timeouts from hashicorp/terraform-provider-aws#3599 and hashicorp/terraform-provider-aws#3639, both of which were added in v1.11, so we have them in our v2.2 AWS provider. This should mitigate some of the issues we've been having in our busy CI account, where out of ~1150 jobs in the last 24 hours, we've had the following failures (using this server):

$ curl -s 'http://localhost:8000/search?name=-e2e-aws&.&q=level%3Derror.*timeout+while+waiting+for+state' | jq -r '. | to_entries[].value[] | to_entries[].value[]' | sed 's/(i-[^)]*/(i-.../;s/(igw-[^)]*/(igw-.../;s/\(master\|nat_gw\|private_routing\|route_net\)\.[0-9]/\1.../' | sort | uniq -c | sort -n
     2 level=error msg="\t* aws_instance.master...: Error waiting for instance (i-...) to become ready: timeout while waiting for state
    10 level=error msg="\t* aws_security_group.bootstrap: timeout while waiting for state
    38 level=error msg="\t* aws_route.igw_route: Error creating route: timeout while waiting for state
    58 level=error msg="\t* aws_internet_gateway.igw: error attaching EC2 Internet Gateway (igw-...): timeout while waiting for state
    76 level=error msg="\t* aws_route_table_association.private_routing...: timeout while waiting for state
    90 level=error msg="\t* aws_route_table_association.route_net...: timeout while waiting for state
   164 level=error msg="\t* aws_route.to_nat_gw...: Error creating route: timeout while waiting for state

The 20-minute timeout is much higher than the two-minute route default, so that should help a lot with our leading error. The security group default is 10 minutes, so this is less of a change there, and we only see that error rarely anyway. I went with 20 minutes (instead of a higher number) because a single resource (or parallel resources) coming in just under that limit will keep the full Terraform step under the 30 minutes that we've chosen as a timeout for our other steps (waiting for the Kubernetes API, bootstrap completion, and install completion). But obviously we can tune more later if necessary.
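
For reference, a minimal sketch of what a 20-minute create timeout looks like in Terraform configuration. The resource names, arguments, and referenced resources below are hypothetical placeholders rather than the installer's actual data/aws files, and the interpolation syntax assumes Terraform 0.11; only the `timeouts` blocks illustrate the change described above.

```hcl
# Hypothetical route; only the timeouts block reflects this change.
resource "aws_route" "example" {
  route_table_id         = "${aws_route_table.example.id}"
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = "${aws_nat_gateway.example.id}"

  timeouts {
    # Provider default for route creation is two minutes.
    create = "20m"
  }
}

# Hypothetical security group; only the timeouts block reflects this change.
resource "aws_security_group" "example" {
  vpc_id = "${aws_vpc.example.id}"

  timeouts {
    # Provider default for security group creation is 10 minutes.
    create = "20m"
  }
}
```

Because Terraform creates independent resources in parallel, several resources each taking just under 20 minutes can still finish inside the 30-minute budget used for the other install steps.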

This can happen instead of, or together with, openshift/release#3615.

Using operation timeouts [1,2,3,4,5]; the provider support [3,5] was
added in v1.11, so we have it in our v2.2 AWS provider.  This should
mitigate some of the issues we've been having in our busy CI account,
where out of ~1150 jobs in the last 24 hours, we've had the following
failures [6]:

  $ curl -s 'http://localhost:8000/search?name=-e2e-aws&.&q=level%3Derror.*timeout+while+waiting+for+state' | jq -r '. | to_entries[].value[] | to_entries[].value[]' | sed 's/(i-[^)]*/(i-.../;s/(igw-[^)]*/(igw-.../;s/\(master\|nat_gw\|private_routing\|route_net\)\.[0-9]/\1.../' | sort | uniq -c | sort -n
       2 level=error msg="\t* aws_instance.master...: Error waiting for instance (i-...) to become ready: timeout while waiting for state
      10 level=error msg="\t* aws_security_group.bootstrap: timeout while waiting for state
      38 level=error msg="\t* aws_route.igw_route: Error creating route: timeout while waiting for state
      58 level=error msg="\t* aws_internet_gateway.igw: error attaching EC2 Internet Gateway (igw-...): timeout while waiting for state
      76 level=error msg="\t* aws_route_table_association.private_routing...: timeout while waiting for state
      90 level=error msg="\t* aws_route_table_association.route_net...: timeout while waiting for state
     164 level=error msg="\t* aws_route.to_nat_gw...: Error creating route: timeout while waiting for state

The 20-minute timeout is much higher than the two-minute route default
[2], so that should help a lot with our leading error.  The security
group default is 10 minutes [4], so this is less of a change there,
and we only see that error rarely anyway.  I went with 20 minutes
(instead of a higher number) because a single resource (or parallel
resources) coming in just under that limit will keep the full
Terraform step under the 30 minutes that we've chosen as a timeout for
our other steps (waiting for the Kubernetes API, bootstrap completion,
and install completion).  But obviously we can tune more later if
necessary.

[1]: https://www.terraform.io/docs/configuration/resources.html#operation-timeouts
[2]: https://www.terraform.io/docs/providers/aws/r/route.html#timeouts
[3]: hashicorp/terraform-provider-aws#3639 (v1.11.0)
[4]: https://www.terraform.io/docs/providers/aws/r/security_group.html#timeouts
[5]: hashicorp/terraform-provider-aws#3599 (v1.11.0)
[6]: https://github.com/wking/openshift-release/tree/debug-scripts/d3
@smarterclayton
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Apr 26, 2019
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 26, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

Needs approval from an approver in each of these files:
  • OWNERS [smarterclayton,wking]


@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 26, 2019
@wking
Member Author

wking commented Apr 26, 2019

Both e2e-aws and e2e-aws-upgrade got through Terraform, so this is going to be fine.

@wking
Member Author

wking commented Apr 26, 2019

e2e-aws-upgrade:

fail [k8s.io/kubernetes/test/e2e/framework/service_util.go:855]: Apr 26 21:57:54.982: Could not reach HTTP service through a05dd12a9686e11e9a874127f4538e9b-982179975.us-east-1.elb.amazonaws.com:80 after 2m0s

Now the test suite is breaking down on us 🤷‍♂️

/retest

@openshift-ci-robot
Contributor

openshift-ci-robot commented Apr 26, 2019

@wking: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-openstack | 246f4a1 | link | /test e2e-openstack |

@openshift-merge-robot openshift-merge-robot merged commit f58d513 into openshift:master Apr 27, 2019
@wking wking deleted the aws-raise-security-group-and-route-timeouts branch April 27, 2019 08:56
@smarterclayton
Contributor

smarterclayton commented Apr 29, 2019 via email

wking added a commit to wking/openshift-installer that referenced this pull request Aug 28, 2019
Up from their default 10 minutes, using the knob that dates back to
the original network load balancer support [1].  This should help us
avoid the [2]:

  Error: timeout while waiting for state to become 'active' (last state: 'provisioning', timeout: 10m0s)

that cropped up again this week.  20m matches the timeout we set for
routes and security groups in 246f4a1 (data/aws: 20-minute create
timeouts for routes and security groups, 2019-04-26, openshift#1682).
Sometimes even 20m will not be enough [3], but it should make us a
bit more resilient anyway.

[1]: hashicorp/terraform-provider-aws@1af53b1#diff-f4b0dbdc7e3eede6ba70cd286c834f37R92
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1717604
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1717604#c15
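
A minimal sketch of the load balancer counterpart of that change, assuming the knob in question is the `timeouts` block on `aws_lb`. The resource name, arguments, and referenced subnet are hypothetical placeholders, not the installer's actual configuration, and the interpolation syntax assumes Terraform 0.11.

```hcl
# Hypothetical network load balancer; only the timeouts block is the point.
resource "aws_lb" "example" {
  name               = "example-nlb"
  load_balancer_type = "network"
  subnets            = ["${aws_subnet.example.id}"]

  timeouts {
    # Up from the 10-minute default mentioned in the commit message.
    create = "20m"
  }
}
```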
wking added a commit to wking/openshift-installer that referenced this pull request Aug 28, 2019
wking added a commit to wking/openshift-installer that referenced this pull request Sep 6, 2019
jhixson74 pushed a commit to jhixson74/installer that referenced this pull request Dec 6, 2019