data/aws: 20-minute create timeouts for routes and security groups #1682

Commits on Apr 26, 2019

  1. data/aws: 20-minute create timeouts for routes and security groups

    This sets create timeouts [1] on routes [2,3] and security groups
    [4,5].  Both were added in v1.11 of the AWS provider, so they are
    available in the v2.2 provider we use.  This should mitigate some of
    the issues we've been having in our busy CI account, where out of
    ~1150 jobs in the last 24 hours, we've had the following failures [6]:
    
      $ curl -s 'http://localhost:8000/search?name=-e2e-aws&.&q=level%3Derror.*timeout+while+waiting+for+state' | jq -r '. | to_entries[].value[] | to_entries[].value[]' | sed 's/(i-[^)]*/(i-.../;s/(igw-[^)]*/(igw-.../;s/\(master\|nat_gw\|private_routing\|route_net\)\.[0-9]/\1.../' | sort | uniq -c | sort -n
           2 level=error msg="\t* aws_instance.master...: Error waiting for instance (i-...) to become ready: timeout while waiting for state
          10 level=error msg="\t* aws_security_group.bootstrap: timeout while waiting for state
          38 level=error msg="\t* aws_route.igw_route: Error creating route: timeout while waiting for state
          58 level=error msg="\t* aws_internet_gateway.igw: error attaching EC2 Internet Gateway (igw-...): timeout while waiting for state
          76 level=error msg="\t* aws_route_table_association.private_routing...: timeout while waiting for state
          90 level=error msg="\t* aws_route_table_association.route_net...: timeout while waiting for state
         164 level=error msg="\t* aws_route.to_nat_gw...: Error creating route: timeout while waiting for state
    
    The 20-minute timeout is much higher than the two-minute route default
    [2], so that should help a lot with our leading error.  The security
    group default is 10 minutes [4], so this is less of a change there, and
    we only see that error rarely anyway.  I went with 20 minutes (instead
    of a higher number) because a single resource (or parallel resources)
    coming in just under that limit will keep the full Terraform step
    under the 30 minutes that we've chosen as a timeout for our other
    steps (waiting for the Kubernetes API, bootstrap completion, and
    install completion).  But obviously we can tune more later if
    necessary.
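
    As a rough illustration of the change described above, a create
    timeout on one route and one security group might look like the
    sketch below.  The resource names are taken from the error log; the
    var.* names and the CIDR block are hypothetical placeholders, not the
    installer's actual configuration.

    ```hcl
    # Sketch only: var.* names are hypothetical, not taken from the installer.
    resource "aws_route" "to_nat_gw" {
      route_table_id         = "${var.private_route_table_id}"
      destination_cidr_block = "0.0.0.0/0"
      nat_gateway_id         = "${var.nat_gateway_id}"

      # Raise the create timeout from the two-minute default [2] to 20 minutes.
      timeouts {
        create = "20m"
      }
    }

    resource "aws_security_group" "bootstrap" {
      vpc_id = "${var.vpc_id}"

      # Raise the create timeout from the 10-minute default [4] to 20 minutes.
      timeouts {
        create = "20m"
      }
    }
    ```

    Only the create timeout is set here, matching the errors above, which
    all occur while resources are being created.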
    
    [1]: https://www.terraform.io/docs/configuration/resources.html#operation-timeouts
    [2]: https://www.terraform.io/docs/providers/aws/r/route.html#timeouts
    [3]: hashicorp/terraform-provider-aws#3639 (v1.11.0)
    [4]: https://www.terraform.io/docs/providers/aws/r/security_group.html#timeouts
    [5]: hashicorp/terraform-provider-aws#3599 (v1.11.0)
    [6]: https://github.com/wking/openshift-release/tree/debug-scripts/d3
    wking committed Apr 26, 2019
    246f4a1