Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

aws_route53_record: add create,update,delete timeout options for the resource #11895

Merged
merged 1 commit into from
Feb 10, 2020

Conversation

abhinavdahiya
Copy link
Contributor

@abhinavdahiya abhinavdahiya commented Feb 4, 2020

Community Note

  • Please vote on this pull request by adding a 👍 reaction to the original pull request comment to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for pull request followers and do not help prioritize the request

ChangeRecordSets now uses the AWS SDKs inbuilt retry method instead of StateConf change retry loop for 5 minutes.

Release note for CHANGELOG:

* ChangeRecordSets now uses the AWS SDKs inbuilt retry method instead of StateConf change retry loop for 5 minutes.

Output from acceptance testing:

$ make testacc TESTARGS='-run=TestAccXXX'

...

Closes #11896

@abhinavdahiya abhinavdahiya requested a review from a team February 4, 2020 22:54
@ghost ghost added size/S Managed by automation to categorize the size of a PR. needs-triage Waiting for first response or review from a maintainer. documentation Introduces or discusses updates to documentation. service/route53 Issues and PRs that pertain to the route53 service. tests PRs: expanded test coverage. Issues: expanded coverage, enhancements to test infrastructure. labels Feb 4, 2020
@bflad bflad added the proposal Proposes new design or functionality. label Feb 5, 2020
Copy link
Contributor

@bflad bflad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @abhinavdahiya 👋 Thank you for submitting this.

We have found in practice that generally when folks are asking for customizable timeouts, that usually there is actually another bug or change needed in the resource logic. This is usually due to how Terraform Plugin SDK helpers expect certain operations to complete (usually immediately) and how the AWS Go SDK causes operations to complete (potentially automatically retrying for extended periods of time without returning during throttling), but also sometimes due to the Terraform AWS Provider needing to tell the AWS Go SDK to automatically retry certain API responses that are not already retried automatically.

We went through a large effort to fix a class of these in #7873. This covered resource.Retry() cases, but did not cover (resource.StateChangeConf).WaitForState() since we had received less of these reports in practice and the fix implementation is markedly different. The fix in those cases was to retry one last time without time bounding so operators do not need to manually figure the appropriate timeframe out in their environment and can instead rely on the AWS Go SDK max retries (still configurable, but by default retries for roughly an hour on throttling).

In this specific case though, we instead can add the PriorRequestNotComplete error code as a retryable error code for the ChangeResourceRecordSets operation in aws/config.go. An example of how this was done for a similar issue with a different resource can be found here:

https://github.com/terraform-providers/terraform-provider-aws/blob/3c84f53f15b4e4d4b9cc744478d37c97e9b9166e/aws/config.go#L591-L597

Something like the below should get us close:

	client.r53conn.Handlers.Retry.PushBack(func(r *request.Request) {
		if r.Operation.Name == "ChangeResourceRecordSets" {
			if isAWSErr(r.Error, route53.ErrCodePriorRequestNotComplete, "") {
				r.Retryable = aws.Bool(true)
			}
		}
	})

Then we can remove the changeRoute53RecordSet usage of StateChangeConf and associated 5 minute timeout completely. If you can change this implementation to the above, we can get this in to hopefully solve your (and others) problem without requiring operator intervention. Thanks!

@bflad bflad self-assigned this Feb 5, 2020
@bflad bflad added waiting-response Maintainers are waiting on response from community or contributor. and removed needs-triage Waiting for first response or review from a maintainer. labels Feb 5, 2020
@ghost ghost added size/M Managed by automation to categorize the size of a PR. and removed size/S Managed by automation to categorize the size of a PR. labels Feb 7, 2020
@abhinavdahiya
Copy link
Contributor Author

Hi @bflad ! thanks for the detailed response and the alternative implementation suggestion..

When i started looking into the adding PriorRequestNotComplete as retryable error, i found out that it was already part of the throttle code set in SDK.. https://github.com/aws/aws-sdk-go/blame/3acad1210f5c4feb4866ff9874b5f3f5895bd96d/aws/request/retryer.go#L93 and has been for some time now.

So the logic from #7873 seemed like a better way to solve the timeout issue. Therefore i updated the functions to use that workflow, allows aws eventual consistency to kick in and then depend on SDK's internal backoff for PriorRequestNotComplete.

PTAL.

@ghost ghost removed the waiting-response Maintainers are waiting on response from community or contributor. label Feb 7, 2020
Copy link
Contributor

@bflad bflad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for those updates, @abhinavdahiya 😄 Two more minor things and this should be good to go. 👍

Comment on lines 748 to 763
var out *route53.ChangeResourceRecordSetsOutput
err := resource.Retry(1*time.Minute, func() *resource.RetryError {
var err error
out, err = conn.ChangeResourceRecordSets(input)
if isAWSErr(err, "InvalidChangeBatch", "") {
// This means the record is already gone
return nil
}
return resource.NonRetryableError(err)
})
if isResourceTimeoutError(err) {
out, err = conn.ChangeResourceRecordSets(input)
if isAWSErr(err, "InvalidChangeBatch", "") {
// This means the record is already gone
return out, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified without resource.Retry() at all since there are no retryable conditions 👍

out, err := conn.ChangeResourceRecordSets(input)

if isAWSErr(err, "InvalidChangeBatch", "") {
  return out, nil
}

return out, err

See also: #11864

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bflad

+1

fixed with f5cf401

log.Print("[DEBUG] Hosted Zone not found, retrying...")
return resource.RetryableError(err)
}
return resource.NonRetryableError(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: In the future we will likely require the usage of resource.NonRetryableError()/resource.RetryableError() with non-nil errors:

Suggested change
return resource.NonRetryableError(err)
if err != nil {
return resource.NonRetryableError(err)
}
return nil

See also: hashicorp/terraform-plugin-sdk#199

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
fixed with f5cf401

@bflad bflad removed the proposal Proposes new design or functionality. label Feb 7, 2020
…DK backoff

`PriorRequestNotComplete` is already part of the list of throttle codes [1], so if we use the SDK inbuilt backoff we shouldn't need custom state change logic that retries for 5 minutes.

So this changes the changeRoute53RecordSet to resource.Retry and last retryOnTimeout from issue 7873 [2], such that we send request in resource.Retry for 1 minute and if there is a timeout, we send another request that then uses the
AWS SDK default retry logic...

while the deleteRoute53RecordSet now directly makes the request using SDK without any external retries.

[1]: https://github.com/aws/aws-sdk-go/blob/3acad1210f5c4feb4866ff9874b5f3f5895bd96d/aws/request/retryer.go#L93
[2]: hashicorp#7873
@bflad bflad added this to the v2.49.0 milestone Feb 10, 2020
Copy link
Contributor

@bflad bflad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks @abhinavdahiya 🚀

Output from acceptance testing:

--- PASS: TestAccAWSRoute53Record_disappears (138.04s)
--- PASS: TestAccAWSRoute53Record_Alias_S3 (140.04s)
--- PASS: TestAccAWSRoute53Record_txtSupport (144.45s)
--- PASS: TestAccAWSRoute53Record_caaSupport (157.92s)
--- PASS: TestAccAWSRoute53Record_generatesSuffix (160.83s)
--- PASS: TestAccAWSRoute53Record_Alias_Elb (166.20s)
--- PASS: TestAccAWSRoute53Record_underscored (169.23s)
--- PASS: TestAccAWSRoute53Record_Alias_Uppercase (178.33s)
--- PASS: TestAccAWSRoute53Record_basic_fqdn (184.78s)
--- PASS: TestAccAWSRoute53Record_geolocation_basic (184.91s)
--- PASS: TestAccAWSRoute53Record_failover (192.05s)
--- PASS: TestAccAWSRoute53Record_weighted_basic (197.08s)
--- PASS: TestAccAWSRoute53Record_latency_basic (197.18s)
--- PASS: TestAccAWSRoute53Record_weighted_to_simple_basic (200.10s)
--- PASS: TestAccAWSRoute53Record_spfSupport (202.82s)
--- PASS: TestAccAWSRoute53Record_basic (212.78s)
--- PASS: TestAccAWSRoute53Record_wildcard (223.55s)
--- PASS: TestAccAWSRoute53Record_disappears_MultipleRecords (271.22s)
--- PASS: TestAccAWSRoute53Record_empty (119.67s)
--- PASS: TestAccAWSRoute53Record_longTXTrecord (122.34s)
--- PASS: TestAccAWSRoute53Record_multivalue_answer_basic (126.24s)
--- PASS: TestAccAWSRoute53Record_AliasChange (159.71s)
--- PASS: TestAccAWSRoute53Record_TypeChange (167.90s)
--- PASS: TestAccAWSRoute53Record_SetIdentifierChange (168.08s)
--- PASS: TestAccAWSRoute53Record_allowOverwrite (159.12s)
--- PASS: TestAccAWSRoute53Record_weighted_alias (351.02s)
--- PASS: TestAccAWSRoute53Record_Alias_VpcEndpoint (485.16s)

@bflad bflad added the bug Addresses a defect in current functionality. label Feb 10, 2020
@bflad bflad merged commit fd754cb into hashicorp:master Feb 10, 2020
bflad added a commit that referenced this pull request Feb 10, 2020
@abhinavdahiya abhinavdahiya deleted the aws_record_timeout branch February 10, 2020 23:51
@ghost
Copy link

ghost commented Feb 14, 2020

This has been released in version 2.49.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!

abhinavdahiya added a commit to abhinavdahiya/installer that referenced this pull request Feb 19, 2020
brings in fix for aws_record timeout [1] and other fixes mentioned in CHANGELOG [2]

updated using

```
go get github.com/terraform-providers/terraform-provider-aws@f0f304894df67616dfbd675bc9687a7db266ad41
```

Using the tag failed with error
```
go get github.com/terraform-providers/terraform-provider-aws@v2.49.0
go: finding github.com/terraform-providers/terraform-provider-aws v2.49.0
go: finding github.com/terraform-providers/terraform-provider-aws v2.49.0
go get github.com/terraform-providers/terraform-provider-aws@v2.49.0: github.com/terraform-providers/terraform-provider-aws@v2.49.0: invalid version: module contains a go.mod file, so major version must be compatible: should be v0 or v1, not v2
```

Also removes the indirect dependecy on `aws-iam-authenticator` because of [3]

[1]: hashicorp/terraform-provider-aws#11895
[2]: https://github.com/terraform-providers/terraform-provider-aws/blob/v2.49.0/CHANGELOG.md#2490-february-14-2020
[3]: hashicorp/terraform-provider-aws#11822
abhinavdahiya added a commit to abhinavdahiya/installer that referenced this pull request Feb 19, 2020
brings in fix for aws_record timeout [1] and other fixes mentioned in CHANGELOG [2]

updated using

```
go get github.com/terraform-providers/terraform-provider-aws@f0f304894df67616dfbd675bc9687a7db266ad41
```

Using the tag failed with error
```
go get github.com/terraform-providers/terraform-provider-aws@v2.49.0
go: finding github.com/terraform-providers/terraform-provider-aws v2.49.0
go: finding github.com/terraform-providers/terraform-provider-aws v2.49.0
go get github.com/terraform-providers/terraform-provider-aws@v2.49.0: github.com/terraform-providers/terraform-provider-aws@v2.49.0: invalid version: module contains a go.mod file, so major version must be compatible: should be v0 or v1, not v2
```

Also removes the indirect dependecy on `aws-iam-authenticator` because of [3]

[1]: hashicorp/terraform-provider-aws#11895
[2]: https://github.com/terraform-providers/terraform-provider-aws/blob/v2.49.0/CHANGELOG.md#2490-february-14-2020
[3]: hashicorp/terraform-provider-aws#11822
LorbusChris pushed a commit to LorbusChris/installer that referenced this pull request Feb 22, 2020
brings in fix for aws_record timeout [1] and other fixes mentioned in CHANGELOG [2]

updated using

```
go get github.com/terraform-providers/terraform-provider-aws@f0f304894df67616dfbd675bc9687a7db266ad41
```

Using the tag failed with error
```
go get github.com/terraform-providers/terraform-provider-aws@v2.49.0
go: finding github.com/terraform-providers/terraform-provider-aws v2.49.0
go: finding github.com/terraform-providers/terraform-provider-aws v2.49.0
go get github.com/terraform-providers/terraform-provider-aws@v2.49.0: github.com/terraform-providers/terraform-provider-aws@v2.49.0: invalid version: module contains a go.mod file, so major version must be compatible: should be v0 or v1, not v2
```

Also removes the indirect dependecy on `aws-iam-authenticator` because of [3]

[1]: hashicorp/terraform-provider-aws#11895
[2]: https://github.com/terraform-providers/terraform-provider-aws/blob/v2.49.0/CHANGELOG.md#2490-february-14-2020
[3]: hashicorp/terraform-provider-aws#11822
@ghost
Copy link

ghost commented Mar 27, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Mar 27, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Addresses a defect in current functionality. documentation Introduces or discusses updates to documentation. service/route53 Issues and PRs that pertain to the route53 service. size/M Managed by automation to categorize the size of a PR. tests PRs: expanded test coverage. Issues: expanded coverage, enhancements to test infrastructure.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

aws_route53_record: timeouts out creating record in a busy route53 zone
2 participants