Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tags should retry without time bounds on EC2 throttling #3586

Merged
merged 4 commits into from
Nov 14, 2018

Conversation

domdom82
Copy link
Contributor

@domdom82 domdom82 commented Mar 1, 2018

this PR addresses the "security_group timeout due to tag timeout" part of issue #3128

since tags are not resources in the sense of terraform, they have no configurable timeouts per se.

in order to avoid hard-coded timeouts on tags, I have provided this PR which tries to use the Update timeout of the resource that is being tagged. If no timeout is defined for that resource, the regular default of ResourceData is used. This should at least provide some means of configuring timeouts on tags via the to-be-tagged resource.

@ghost ghost added the size/XS Managed by automation to categorize the size of a PR. label Mar 1, 2018
@bflad bflad added bug Addresses a defect in current functionality. service/ec2 Issues and PRs that pertain to the ec2 service. labels Mar 1, 2018
@domdom82
Copy link
Contributor Author

domdom82 commented Mar 8, 2018

hi @bflad any chance we can get this in for 1.11 ? timeouts on tags are hitting us pretty hard these days.

@bflad
Copy link
Contributor

bflad commented Mar 8, 2018

Can you provide debug logs that show that you're hitting EC2 rate limiting and not masking some other error?

@bflad bflad added the waiting-response Maintainers are waiting on response from community or contributor. label Mar 8, 2018
@domdom82
Copy link
Contributor Author

domdom82 commented Mar 8, 2018

@bflad sure can. I also described in #3128 that we are hitting a 5 minute timeout on a security_group create but its timeout is at 10 minutes:

* aws_security_group.slave: timeout while waiting for state to become 'success' (timeout: 5m0s)

So when digging deeper we found the hard-coded timeout of 5 minutes on tags.setTags:

// Set tags
if len(remove) > 0 {
   err := resource.Retry(5*time.Minute, func() *resource.RetryError {

So then we bumped the hard-coded timeout to 10 minutes - same as the security_group itself - for testing. And it worked just fine repeatedly. This got me to think we could make this a bit smarter than just bumping a hard timeout and instead make it dependent on the resource that wants to be tagged.

The main issue I see currently is that people run into timeouts on certain resources, then bump their timeouts to fix it but then wonder why their deployment still fails because there is another "hidden" timeout on the tag of their resource which they cannot change atm.

@2rs2ts
Copy link
Contributor

2rs2ts commented Mar 13, 2018

It seems to me that this effectively doubles the timeout setting you're actually using for the resource. What do you think of taking the timeout from the schema, minus the time elapsed since initiating the create/update function, and use that for the tag timeout?

@domdom82
Copy link
Contributor Author

@2rs2ts good points. How would you pass the start time? In the ResourceData.meta map?

@2rs2ts
Copy link
Contributor

2rs2ts commented Mar 14, 2018

@domdom82 I don't know, I'm not really familiar with the code, I just thought of the idea. Sorry I'm not of much help 😅

@mildred
Copy link
Contributor

mildred commented Mar 22, 2018

edit: I was probably mistaken to post this debug output or this PR. I have in fact a related problem but that do not appear to be exactly the same. See #3128 (comment)

@bflad I cannot share you the full debug logs I have (since it contains credentials) but I have the following terraform error:

* aws_security_group_rule.base_sg_ingress_services: Error finding matching ingress Security Group Rule (sgrule-4087802998) for Group sg-8eb99cf4

The debug logs tell me terraform performs the following request:

2018-03-22T05:01:16.781Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [INFO] Security Group ID: sg-f284a188
2018-03-22T05:01:16.781Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] Waiting for Security Group (sg-f284a188) to exist
2018-03-22T05:01:16.781Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] Waiting for state to become: [exists]
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] [aws-sdk-go] DEBUG: Request ec2/DescribeSecurityGroups Details:
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: ---[ REQUEST POST-SIGN ]-----------------------------
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: POST / HTTP/1.1
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Host: ec2.eu-west-1.amazonaws.com
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: User-Agent: aws-sdk-go/1.12.62 (go1.9.2; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.11.2
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Content-Length: 70
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Content-Type: application/x-www-form-urlencoded; charset=utf-8
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: X-Amz-Date: 20180322T050116Z
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Accept-Encoding: gzip
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Action=DescribeSecurityGroups&GroupId.1=sg-8eb99cf4&Version=2016-11-15
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: -----------------------------------------------------

And the next response in the logs for ec2/DescribeSecurityGroups I get is:

2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/DescribeSecurityGroups Details:
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: ---[ RESPONSE ]--------------------------------------
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: HTTP/1.1 503 Service Unavailable
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Connection: close
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Transfer-Encoding: chunked
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Date: Thu, 22 Mar 2018 05:01:16 GMT
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Server: AmazonEC2
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: -----------------------------------------------------
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?>
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: <Response><Errors><Error><Code>RequestLimitExceeded</Code><Message>Request limit exceeded.</Message></Error></Errors><RequestID>936c5bbe-1957-4911-a928-8d08d3b7d1d3</RequestID></Response>

I would say this confirms the rate limiting is causing the error

@domdom82
Copy link
Contributor Author

domdom82 commented Apr 4, 2018

@mildred same here. I think it is not the tag creation itself because tags are very small entities that don't take long to create, however if you are rate throttled while you are creating multiple resources at a time (in my case many security groups along with rules and tags) it can happen that you run into an early timeout (in my case the hard-coded 5 minutes on tags) - even though you might have set a longer timeout on the parent resource (e.g. 10 minutes on security groups).

@2rs2ts
Copy link
Contributor

2rs2ts commented Jul 3, 2018

bump, what's the status of this PR?

@domdom82
Copy link
Contributor Author

domdom82 commented Jul 4, 2018

@2rs2ts I'd love to see it merged. Tag timeouts are one of the most annoying things in our CI pipeline right now. It happens especially often on large sets of security groups getting deployed in one TF file.
Since @stjimmy88 approved I don't know what's blocking this merge tbh.

@domdom82
Copy link
Contributor Author

domdom82 commented Nov 2, 2018

@bflad bump for merge

@bflad
Copy link
Contributor

bflad commented Nov 9, 2018

Hi @domdom82 👋 Sorry for the delayed response here.

In #6409, we introduce a helper function (isResourceTimeoutError(err)) that checks to see if the error returned by resource.Retry() is strictly just a timeout error based on time like when the SDK is stuck its own retry logic and never returns (e.g. throttling errors). We could leverage that new function in this scenario here by calling it after the current time-based retry loop to retry the call one last time. This effectively removes the time element (which is guesswork on the operators part) and switches it to SDK-based retries for throttling (we default to 20 which will backoff in excess of 30 minutes in many cases, but is configurable via max_retries per-provider).

What do you think?

@domdom82
Copy link
Contributor Author

domdom82 commented Nov 9, 2018

@bflad I think this is a great idea. Ideally, I wouldn't have to configure timeouts on a per-resource basis but only have a provider-level setting. As you said it, it is guesswork by the operator to tweak those timeouts manually and there is never the right setting.

@bflad
Copy link
Contributor

bflad commented Nov 9, 2018

isResourceTimeoutError() is merged and available now - would you mind tweaking this PR and we can get this into the next release? If you don't have time, no big deal, I can add a commit after yours too. Thanks so much for your help and hopefully this gets less annoying to workaround. 😄

@bflad bflad added this to the v1.44.0 milestone Nov 9, 2018
@ghost ghost added size/XL Managed by automation to categorize the size of a PR. provider Pertains to the provider itself, rather than any interaction with AWS. service/acmpca Issues and PRs that pertain to the acmpca service. service/apigateway Issues and PRs that pertain to the apigateway service. and removed size/XS Managed by automation to categorize the size of a PR. labels Nov 12, 2018
@bflad bflad removed service/elbv2 Issues and PRs that pertain to the elbv2 service. service/emr Issues and PRs that pertain to the emr service. service/firehose Issues and PRs that pertain to the firehose service. service/gamelift Issues and PRs that pertain to the gamelift service. service/glue Issues and PRs that pertain to the glue service. service/neptune Issues and PRs that pertain to the neptune service. service/opsworks Issues and PRs that pertain to the opsworks service. service/rds Issues and PRs that pertain to the rds service. service/redshift Issues and PRs that pertain to the redshift service. service/route53 Issues and PRs that pertain to the route53 service. service/s3 Issues and PRs that pertain to the s3 service. waiting-response Maintainers are waiting on response from community or contributor. labels Nov 14, 2018
aws/tags.go Outdated Show resolved Hide resolved
@ghost ghost added size/S Managed by automation to categorize the size of a PR. and removed size/M Managed by automation to categorize the size of a PR. labels Nov 14, 2018
@domdom82 domdom82 changed the title tags should inherit timeout from tagged resources tags should retry without time bounds on EC2 throttling Nov 14, 2018
@domdom82
Copy link
Contributor Author

@bflad LGTM? I also renamed the PR to match the code change more accurately.

aws/tags.go Show resolved Hide resolved
@domdom82
Copy link
Contributor Author

bumped the beast a final time 🤞

Copy link
Contributor

@bflad bflad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @domdom82! 🚀 (We could return early on !isResourceTimeoutError() to remove the additional nesting but that's more of a nitpick)

(Test failures unrelated)

Tests failed: 2, passed: 245

@bflad bflad merged commit c3c296c into hashicorp:master Nov 14, 2018
bflad added a commit that referenced this pull request Nov 14, 2018
@bflad
Copy link
Contributor

bflad commented Nov 15, 2018

This has been released in version 1.44.0 of the AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

@ghost
Copy link

ghost commented Apr 2, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Apr 2, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Addresses a defect in current functionality. service/ec2 Issues and PRs that pertain to the ec2 service. size/S Managed by automation to categorize the size of a PR. tests PRs: expanded test coverage. Issues: expanded coverage, enhancements to test infrastructure.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants