Retry AWS commands that may fail and increase insufficient timeouts #6775
Conversation
I'd just like to add that we run automated tests for most of our Terraform templates, and we have also seen the recent "increasing flakiness" that @carlossg refers to. It made me realize just how many bugs there are in Terraform around AWS eventual consistency. Excited to see this get merged.
@josh-padnick Are those results something that you could share, or could you create discrete issues per resource where you are experiencing the problems?
@jrnt30 I'd be happy to report the specific issues. But I'd prefer to report them in a single issue since managing duplicate issues is already a challenge. For example, here are relevant issues I found from a single search:
Maybe I'll just create one comment in #5335 per error and periodically update the comments to indicate frequency of occurrence. I'm open to other options.
I understand that. I think it's just easier, from an implementation perspective and for getting PRs merged, if there is a specific resource provider that needs to be adjusted. There are many AWS providers, and the API can respond considerably differently between them. What are the acceptance criteria under which that issue could be closed if we have so many different resources? I think a generalized "tracking" issue for eventual consistency would help, though.
ec2err, ok := err.(awserr.Error)
if !ok {
	log.Printf("[INFO] RetryableError setting tags InternetGateway ID: %s %s", d.Id(), err)
	return resource.RetryableError(err)
If it's not an EC2 error, it's a RetryableError? It seems like this should be the if ok block and just fall through to the NonRetryableError below, unless I'm mistaken here.
Seems like most of your retries follow this pattern; are you assuming this situation arises outside of AWS, and so should be retried?
I copied it from other files IIRC, but can change the logic
My thinking was along these lines: if the error is not nil, we check to see if it's an AWS error. If it is not an AWS error, we return it as non-retryable. I think this is correct; do you agree?
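A minimal sketch of the flow described above, with the classification pulled out into a standalone helper. This is not the PR's exact code: the helper name and the error-code list are illustrative assumptions, using the real helper/resource and awserr APIs.

```go
package aws

import (
	"log"

	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/hashicorp/terraform/helper/resource"
)

// classifyTagError decides whether a failed tagging call should be retried:
// nil means success, a non-AWS error is non-retryable, and only codes that
// look like eventual consistency are retried.
func classifyTagError(id string, err error) *resource.RetryError {
	if err == nil {
		return nil
	}
	ec2err, ok := err.(awserr.Error)
	if !ok {
		// Not an AWS API error at all, so retrying will not help.
		return resource.NonRetryableError(err)
	}
	switch ec2err.Code() {
	case "InvalidInternetGatewayID.NotFound", "InvalidVpcID.NotFound":
		// The resource exists but this endpoint has not caught up yet.
		log.Printf("[INFO] retryable error setting tags on %s: %s", id, err)
		return resource.RetryableError(err)
	default:
		return resource.NonRetryableError(err)
	}
}
```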
Hey all –
Do you have example error/warning output for this? I don't know that I've seen many eventual-consistency issues related to setting tags. Aside: 5 minutes to set tags seems crazy; has anyone seen such times? I didn't see any in #5335. #6813 does mention it, but it's a fairly large case and I think there is more at play there. Specifically, both subnet and route_table have logic to poll for availability.
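For reference, a hedged sketch of the "poll for availability" pattern mentioned here, in the style of the StateChangeConf waits used elsewhere in the AWS provider. The function name, timeout, and refresh details are illustrative assumptions, not the provider's actual code.

```go
package aws

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/hashicorp/terraform/helper/resource"
)

// waitForRouteTable polls DescribeRouteTables until the route table is
// visible, treating a NotFound error as "keep waiting" rather than a failure.
func waitForRouteTable(conn *ec2.EC2, id string) error {
	stateConf := &resource.StateChangeConf{
		Pending: []string{"pending"},
		Target:  []string{"ready"},
		Timeout: 2 * time.Minute,
		Refresh: func() (interface{}, string, error) {
			resp, err := conn.DescribeRouteTables(&ec2.DescribeRouteTablesInput{
				RouteTableIds: []*string{aws.String(id)},
			})
			if err != nil {
				if ec2err, ok := err.(awserr.Error); ok && ec2err.Code() == "InvalidRouteTableID.NotFound" {
					// Not visible yet: stay in the pending state.
					return nil, "pending", nil
				}
				return nil, "", err
			}
			if len(resp.RouteTables) == 0 {
				return nil, "pending", nil
			}
			return resp.RouteTables[0], "ready", nil
		},
	}
	_, err := stateConf.WaitForState()
	return err
}
```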
@catsby the log error is the same as in #6105. The long retries were necessary for 3-4 days about 3 weeks ago; since then AWS has returned to normal. We just don't want to have to retry terraform applies, and having the timeouts hardcoded makes them really difficult to configure and forces us to use longer ones just in case.
Logs from a couple of errors happening again today:
@catsby I have addressed your comments. I need to run some further tests, but please let me know if the code looks good.
Is there an ETA on this? It is hindering adoption of Terraform in my organization.
Summary: There are five steps that should happen when we create a security group:
1. Create the group
2. Revoke the default egress rule
3. Add egress rules
4. Add ingress rules
5. Set tags
A Terraform "create" action consists of all 5, and an "update" action consists of 3-5. This patch makes two changes:
* Wrap steps 2, 3, and 4 in retry logic (1 and 5 are already wrapped)
* Refactor the "create" logic so we don't recheck for SG existence between steps 2 and 3
This revision is based on the branch in the following PR: hashicorp#6775. That branch fixes the internet-gateway flakiness we've been seeing, and also adds retry logic to the "set tags" step for all AWS resources.
Test Plan: make test
Reviewers: hurshal, tyler, areece, carl
Reviewed By: carl
Subscribers: engineering-list
JIRA Issues: AP-499
Differential Revision: https://grizzly.memsql.com/D16013
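As an illustration of the "wrap step 2 in retry logic" change that commit message describes, here is a minimal sketch. The function name, error code, and timeout are assumptions for illustration, not the patch's exact values; the AWS SDK and helper/resource calls are real APIs.

```go
package aws

import (
	"time"

	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/hashicorp/terraform/helper/resource"
)

// revokeDefaultEgress retries the revoke call while the freshly created
// security group is still propagating through AWS.
func revokeDefaultEgress(conn *ec2.EC2, input *ec2.RevokeSecurityGroupEgressInput) error {
	return resource.Retry(2*time.Minute, func() *resource.RetryError {
		_, err := conn.RevokeSecurityGroupEgress(input)
		if err == nil {
			return nil
		}
		if ec2err, ok := err.(awserr.Error); ok && ec2err.Code() == "InvalidGroup.NotFound" {
			// The group was just created; this endpoint may not see it yet.
			return resource.RetryableError(err)
		}
		return resource.NonRetryableError(err)
	})
}
```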
What happened to this PR? I'm still intermittently seeing
Force-pushed from e1c79dd to c52074b
Rebased against master.
@carlossg I noticed that after the rebase, the timeout on
@jwbowler I have increased it again to 4 minutes.
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
AWS has lately been experiencing increasing flakiness, or "eventual consistency", that forces us to retry a lot of operations. Describing the resources until they show as available is not even enough, as a subsequent call to e.g. setTags will fail because the resource doesn't exist yet.
I have added more debug logging, but it could be removed.
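A hedged sketch of the overall approach this describes: wait for the resource via Describe calls, then still give the tagging step its own retry window. The waitUntilAvailable and setTags parameters are hypothetical stand-ins for the provider's real helpers, the ".NotFound" suffix check is an illustrative heuristic, and the 4-minute timeout mirrors the value mentioned earlier in the thread.

```go
package aws

import (
	"strings"
	"time"

	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/hashicorp/terraform/helper/resource"
)

// createThenTag waits for the resource to be describable and then retries the
// tagging step separately, since the two endpoints can disagree for a while.
func createThenTag(waitUntilAvailable func() error, setTags func() error) error {
	// Describe-based wait; per the description above, this alone is not enough.
	if err := waitUntilAvailable(); err != nil {
		return err
	}
	// The tagging call gets its own retry window on NotFound-style errors.
	return resource.Retry(4*time.Minute, func() *resource.RetryError {
		if err := setTags(); err != nil {
			if ec2err, ok := err.(awserr.Error); ok && strings.HasSuffix(ec2err.Code(), ".NotFound") {
				return resource.RetryableError(err)
			}
			return resource.NonRetryableError(err)
		}
		return nil
	})
}
```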