-
Notifications
You must be signed in to change notification settings - Fork 9.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
r/aws_vpc_endpoint_subnet_association: Fix issue with concurrent calls to 'ModifyVpcEndpoint' #3418
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ewbankkit! Thanks as usual for submitting a bug fix pull request. 😄 I left a comment below, can you let us know what you think?
// See https://github.com/terraform-providers/terraform-provider-aws/issues/3382. | ||
// Prevent concurrent subnet association requests and delay between requests. | ||
mk := "vpc_endpoint_subnet_association_" + endpointId | ||
awsMutexKV.Lock(mk) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we prefer to implement solutions to this problem with resource.Retry()
instead of a mutex for a few reasons:
- Handles cases where the concurrent modification might be happening outside Terraform
- Better handles EC2 eventual consistency as we do not want to guess about the potential delay (it might be shorter than a minute for the API call to work)
- Slightly simpler to understand logic (you just simply declare what error(s) you are retrying on with a timeout)
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have to refresh my memory of what the underlying EC2 API's behavior is in this case.
I think the behavior was that the ModifyVpcEndpoint
API call just hangs waiting for a response in these situations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, it'd definitely be good to check the behavior on that from the debug logs. If that is the case, where I'm guessing the SDK is automatically retrying when it probably shouldn't, we should probably fix that in by adding a handler in config.go
. 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I manually perform the test steps in the AWS console I notice that the VPC Endpoint goes into a pending state while the subnets are being associated. The code doesn't take that into consideration. I'm going to add a WaitForState
call and see if that improves things.
Adding a call to {
"eventVersion": "1.05",
"userIdentity": {
"type": "IAMUser",
"principalId": "XXXXXXXXXXXXXXXXXXXXX",
"arn": "arn:aws:iam::000000000000:user/kit",
"accountId": "000000000000",
"accessKeyId": "XXXXXXXXXXXXXXXXXXXXX",
"userName": "kit"
},
"eventTime": "2018-07-13T18:32:34Z",
"eventSource": "ec2.amazonaws.com",
"eventName": "ModifyVpcEndpoint",
"awsRegion": "us-east-2",
"sourceIPAddress": "96.255.71.4",
"userAgent": "aws-sdk-go/1.14.24 (go1.9.2; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.11.7",
"errorCode": "Server.Unavailable",
"requestParameters": {
"ModifyVpcEndpointRequest": {
"AddSubnetId": {
"tag": 1,
"content": "subnet-3cf40270"
},
"VpcEndpointId": "vpce-04a7f5432dbef0961"
}
},
"responseElements": null,
"requestID": "e967ccf1-763c-41f7-8fc2-f5a203faa2a4",
"eventID": "1b215c5f-9d67-4170-8b4e-ffad1c0503b7",
"eventType": "AwsApiCall",
"recipientAccountId": "000000000000"
} followed by a sequence of events like {
"eventVersion": "1.05",
"userIdentity": {
"type": "IAMUser",
"principalId": "XXXXXXXXXXXXXXXXXXXXX",
"arn": "arn:aws:iam::000000000000:user/kit",
"accountId": "000000000000",
"accessKeyId": "XXXXXXXXXXXXXXXXXXXXX",
"userName": "kit"
},
"eventTime": "2018-07-13T18:32:35Z",
"eventSource": "ec2.amazonaws.com",
"eventName": "ModifyVpcEndpoint",
"awsRegion": "us-east-2",
"sourceIPAddress": "96.255.71.4",
"userAgent": "aws-sdk-go/1.14.24 (go1.9.2; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.11.7",
"errorCode": "Server.InternalError",
"errorMessage": "An internal error has occurred",
"requestParameters": {
"ModifyVpcEndpointRequest": {
"AddSubnetId": {
"tag": 1,
"content": "subnet-3cf40270"
},
"VpcEndpointId": "vpce-04a7f5432dbef0961"
}
},
"responseElements": null,
"requestID": "8bebfc56-2b7c-40d1-a234-1624533b62a2",
"eventID": "a2764df4-e1ff-479f-9615-64478665c0e8",
"eventType": "AwsApiCall",
"recipientAccountId": "000000000000"
} The Terraform log looks like
I'll investigate the handler suggestion. |
The problem with adding something like client.ec2conn.Handlers.Retry.PushBack(func(r *request.Request) {
if r.Operation.Name == "ModifyVpcEndpoint" {
// HTTP 503 Server.Unavailable
// HTTP 500 Server.InternalError
if isAWSErr(r.Error, "Unavailable", "") || isAWSErr(r.Error, "InternalError", "") {
r.Retryable = aws.Bool(false)
}
}
}) to |
Rebased to remove conflict, made acceptance tests region-agnostic, added configurable timeouts to Acceptance tests: $ make testacc TEST=./aws/ TESTARGS='-run=TestAccAWSVpcEndpoint_'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./aws/ -v -run=TestAccAWSVpcEndpoint_ -timeout 120m
=== RUN TestAccAWSVpcEndpoint_importBasic
--- PASS: TestAccAWSVpcEndpoint_importBasic (29.30s)
=== RUN TestAccAWSVpcEndpoint_gatewayBasic
--- PASS: TestAccAWSVpcEndpoint_gatewayBasic (27.80s)
=== RUN TestAccAWSVpcEndpoint_gatewayWithRouteTableAndPolicy
--- PASS: TestAccAWSVpcEndpoint_gatewayWithRouteTableAndPolicy (49.50s)
=== RUN TestAccAWSVpcEndpoint_interfaceBasic
--- PASS: TestAccAWSVpcEndpoint_interfaceBasic (33.21s)
=== RUN TestAccAWSVpcEndpoint_interfaceWithSubnetAndSecurityGroup
--- PASS: TestAccAWSVpcEndpoint_interfaceWithSubnetAndSecurityGroup (206.40s)
=== RUN TestAccAWSVpcEndpoint_interfaceNonAWSService
--- PASS: TestAccAWSVpcEndpoint_interfaceNonAWSService (273.67s)
=== RUN TestAccAWSVpcEndpoint_removed
--- PASS: TestAccAWSVpcEndpoint_removed (26.15s)
PASS $ make testacc TEST=./aws/ TESTARGS='-run=TestAccAWSVpcEndpointSubnetAssociation_'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./aws/ -v -run=TestAccAWSVpcEndpointSubnetAssociation_ -timeout 120m
=== RUN TestAccAWSVpcEndpointSubnetAssociation_basic
--- PASS: TestAccAWSVpcEndpointSubnetAssociation_basic (211.91s)
=== RUN TestAccAWSVpcEndpointSubnetAssociation_multiple
--- PASS: TestAccAWSVpcEndpointSubnetAssociation_multiple (515.38s)
PASS |
If we don't want to merge this PR (waiting for Amazon to fix upstream), the non-concurrency related modifications are probably worth splitting into their own PR. |
Hey @ewbankkit I think we would like to get this PR in with all of its goodies, but I'm confused if its okay as-is given your last comment. 😄 Thanks for all your investigative work here. |
@bflad Yes, this PR is OK as-is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much, @ewbankkit! 🚀
9 tests passed (all tests)
=== RUN TestAccAWSVpcEndpoint_removed
--- PASS: TestAccAWSVpcEndpoint_removed (16.14s)
=== RUN TestAccAWSVpcEndpoint_interfaceBasic
--- PASS: TestAccAWSVpcEndpoint_interfaceBasic (21.66s)
=== RUN TestAccAWSVpcEndpoint_gatewayBasic
--- PASS: TestAccAWSVpcEndpoint_gatewayBasic (21.65s)
=== RUN TestAccAWSVpcEndpoint_importBasic
--- PASS: TestAccAWSVpcEndpoint_importBasic (22.84s)
=== RUN TestAccAWSVpcEndpoint_gatewayWithRouteTableAndPolicy
--- PASS: TestAccAWSVpcEndpoint_gatewayWithRouteTableAndPolicy (38.57s)
=== RUN TestAccAWSVpcEndpointSubnetAssociation_basic
--- PASS: TestAccAWSVpcEndpointSubnetAssociation_basic (188.50s)
=== RUN TestAccAWSVpcEndpoint_interfaceWithSubnetAndSecurityGroup
--- PASS: TestAccAWSVpcEndpoint_interfaceWithSubnetAndSecurityGroup (234.26s)
=== RUN TestAccAWSVpcEndpoint_interfaceNonAWSService
--- PASS: TestAccAWSVpcEndpoint_interfaceNonAWSService (260.89s)
=== RUN TestAccAWSVpcEndpointSubnetAssociation_multiple
--- PASS: TestAccAWSVpcEndpointSubnetAssociation_multiple (487.58s)
This has been released in version 1.28.0 of the AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading. |
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks! |
Fixes #3382.
Requires both a lock to prevent multiple concurrent calls to
ModifyVpcEndpoint
and a delay between the calls to ensure success.Acceptance tests: