Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

r/aws_vpc_endpoint_subnet_association: Fix issue with concurrent calls to 'ModifyVpcEndpoint' #3418

Merged
merged 6 commits into from
Jul 18, 2018

Conversation

ewbankkit
Copy link
Contributor

Fixes #3382.
Requires both a lock to prevent multiple concurrent calls to ModifyVpcEndpoint and a delay between the calls to ensure success.

Acceptance tests:

make testacc TEST=./aws/ TESTARGS='-run=TestAccAwsVpcEndpointSubnetAssociation_basic'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./aws/ -v -run=TestAccAwsVpcEndpointSubnetAssociation_basic -timeout 120m
=== RUN   TestAccAwsVpcEndpointSubnetAssociation_basic
--- PASS: TestAccAwsVpcEndpointSubnetAssociation_basic (113.23s)
PASS
ok  	github.com/terraform-providers/terraform-provider-aws/aws	113.246s
make testacc TEST=./aws/ TESTARGS='-run=TestAccAwsVpcEndpointSubnetAssociation_multiple'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./aws/ -v -run=TestAccAwsVpcEndpointSubnetAssociation_multiple -timeout 120m
=== RUN   TestAccAwsVpcEndpointSubnetAssociation_multiple
--- PASS: TestAccAwsVpcEndpointSubnetAssociation_multiple (250.18s)
PASS
ok  	github.com/terraform-providers/terraform-provider-aws/aws	250.191s

@ghost ghost added the size/L Managed by automation to categorize the size of a PR. label Feb 16, 2018
@radeksimko radeksimko added bug Addresses a defect in current functionality. service/ec2 Issues and PRs that pertain to the ec2 service. labels Feb 16, 2018
Copy link
Contributor

@bflad bflad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ewbankkit! Thanks as usual for submitting a bug fix pull request. 😄 I left a comment below, can you let us know what you think?

// See https://github.com/terraform-providers/terraform-provider-aws/issues/3382.
// Prevent concurrent subnet association requests and delay between requests.
mk := "vpc_endpoint_subnet_association_" + endpointId
awsMutexKV.Lock(mk)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we prefer to implement solutions to this problem with resource.Retry() instead of a mutex for a few reasons:

  • Handles cases where the concurrent modification might be happening outside Terraform
  • Better handles EC2 eventual consistency as we do not want to guess about the potential delay (it might be shorter than a minute for the API call to work)
  • Slightly simpler to understand logic (you just simply declare what error(s) you are retrying on with a timeout)

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to refresh my memory of what the underlying EC2 API's behavior is in this case.
I think the behavior was that the ModifyVpcEndpoint API call just hangs waiting for a response in these situations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, it'd definitely be good to check the behavior on that from the debug logs. If that is the case, where I'm guessing the SDK is automatically retrying when it probably shouldn't, we should probably fix that in by adding a handler in config.go. 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I manually perform the test steps in the AWS console I notice that the VPC Endpoint goes into a pending state while the subnets are being associated. The code doesn't take that into consideration. I'm going to add a WaitForState call and see if that improves things.

@bflad bflad added the waiting-response Maintainers are waiting on response from community or contributor. label Jun 29, 2018
@ewbankkit
Copy link
Contributor Author

ewbankkit commented Jul 13, 2018

Adding a call to WaitForState did not resolve the concurrent calls issue.
However, when I look in CloudTrail at the ModifyVpcEndpoint calls that keep retrying I see single event like

{
    "eventVersion": "1.05",
    "userIdentity": {
        "type": "IAMUser",
        "principalId": "XXXXXXXXXXXXXXXXXXXXX",
        "arn": "arn:aws:iam::000000000000:user/kit",
        "accountId": "000000000000",
        "accessKeyId": "XXXXXXXXXXXXXXXXXXXXX",
        "userName": "kit"
    },
    "eventTime": "2018-07-13T18:32:34Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "ModifyVpcEndpoint",
    "awsRegion": "us-east-2",
    "sourceIPAddress": "96.255.71.4",
    "userAgent": "aws-sdk-go/1.14.24 (go1.9.2; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.11.7",
    "errorCode": "Server.Unavailable",
    "requestParameters": {
        "ModifyVpcEndpointRequest": {
            "AddSubnetId": {
                "tag": 1,
                "content": "subnet-3cf40270"
            },
            "VpcEndpointId": "vpce-04a7f5432dbef0961"
        }
    },
    "responseElements": null,
    "requestID": "e967ccf1-763c-41f7-8fc2-f5a203faa2a4",
    "eventID": "1b215c5f-9d67-4170-8b4e-ffad1c0503b7",
    "eventType": "AwsApiCall",
    "recipientAccountId": "000000000000"
}

followed by a sequence of events like

{
    "eventVersion": "1.05",
    "userIdentity": {
        "type": "IAMUser",
        "principalId": "XXXXXXXXXXXXXXXXXXXXX",
        "arn": "arn:aws:iam::000000000000:user/kit",
        "accountId": "000000000000",
        "accessKeyId": "XXXXXXXXXXXXXXXXXXXXX",
        "userName": "kit"
    },
    "eventTime": "2018-07-13T18:32:35Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "ModifyVpcEndpoint",
    "awsRegion": "us-east-2",
    "sourceIPAddress": "96.255.71.4",
    "userAgent": "aws-sdk-go/1.14.24 (go1.9.2; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.11.7",
    "errorCode": "Server.InternalError",
    "errorMessage": "An internal error has occurred",
    "requestParameters": {
        "ModifyVpcEndpointRequest": {
            "AddSubnetId": {
                "tag": 1,
                "content": "subnet-3cf40270"
            },
            "VpcEndpointId": "vpce-04a7f5432dbef0961"
        }
    },
    "responseElements": null,
    "requestID": "8bebfc56-2b7c-40d1-a234-1624533b62a2",
    "eventID": "a2764df4-e1ff-479f-9615-64478665c0e8",
    "eventType": "AwsApiCall",
    "recipientAccountId": "000000000000"
}

The Terraform log looks like

2018-07-13T15:59:56.204-0400 [DEBUG] plugin.terraform-provider-aws: 2018/07/13 15:59:56 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/ModifyVpcEndpoint Details:
2018-07-13T15:59:56.204-0400 [DEBUG] plugin.terraform-provider-aws: ---[ RESPONSE ]--------------------------------------
2018-07-13T15:59:56.204-0400 [DEBUG] plugin.terraform-provider-aws: HTTP/1.1 503 Service Unavailable
2018-07-13T15:59:56.204-0400 [DEBUG] plugin.terraform-provider-aws: Connection: close
2018-07-13T15:59:56.204-0400 [DEBUG] plugin.terraform-provider-aws: Transfer-Encoding: chunked
2018-07-13T15:59:56.204-0400 [DEBUG] plugin.terraform-provider-aws: Date: Fri, 13 Jul 2018 19:59:56 GMT
2018-07-13T15:59:56.204-0400 [DEBUG] plugin.terraform-provider-aws: Server: AmazonEC2
2018-07-13T15:59:56.204-0400 [DEBUG] plugin.terraform-provider-aws: 
2018-07-13T15:59:56.204-0400 [DEBUG] plugin.terraform-provider-aws: 
2018-07-13T15:59:56.204-0400 [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------
2018-07-13T15:59:56.204-0400 [DEBUG] plugin.terraform-provider-aws: 2018/07/13 15:59:56 [DEBUG] [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?>
2018-07-13T15:59:56.204-0400 [DEBUG] plugin.terraform-provider-aws: <Response><Errors><Error><Code>Unavailable</Code><Message/></Error></Errors><RequestID>436f2207-4432-4bca-b4e7-5ea20a377ea5</RequestID></Response>
2018-07-13T15:59:56.757-0400 [DEBUG] plugin.terraform-provider-aws: 2018/07/13 15:59:56 [DEBUG] [aws-sdk-go] DEBUG: Validate Response ec2/ModifyVpcEndpoint failed, will retry, error Unavailable:
2018-07-13T15:59:56.757-0400 [DEBUG] plugin.terraform-provider-aws: 	status code: 503, request id: 436f2207-4432-4bca-b4e7-5ea20a377ea5
2018-07-13T15:59:56.757-0400 [DEBUG] plugin.terraform-provider-aws: 2018/07/13 15:59:56 [DEBUG] [aws-sdk-go] DEBUG: Retrying Request ec2/ModifyVpcEndpoint, attempt 1
2018-07-13T15:59:56.757-0400 [DEBUG] plugin.terraform-provider-aws: 2018/07/13 15:59:56 [DEBUG] [aws-sdk-go] DEBUG: Request ec2/ModifyVpcEndpoint Details:
2018-07-13T15:59:56.757-0400 [DEBUG] plugin.terraform-provider-aws: ---[ REQUEST POST-SIGN ]-----------------------------
2018-07-13T15:59:56.757-0400 [DEBUG] plugin.terraform-provider-aws: POST / HTTP/1.1
2018-07-13T15:59:56.757-0400 [DEBUG] plugin.terraform-provider-aws: Host: ec2.us-east-2.amazonaws.com
2018-07-13T15:59:56.757-0400 [DEBUG] plugin.terraform-provider-aws: User-Agent: aws-sdk-go/1.14.24 (go1.9.2; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.11.7
2018-07-13T15:59:56.757-0400 [DEBUG] plugin.terraform-provider-aws: Content-Length: 110
2018-07-13T15:59:56.757-0400 [DEBUG] plugin.terraform-provider-aws: Authorization: AWS4-HMAC-SHA256 Credential=XXXXXXXXXXXXXXXXXXXX/20180713/us-east-2/ec2/aws4_request, SignedHeaders=content-length;content-type;host;x-amz-date, Signature=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
2018-07-13T15:59:56.757-0400 [DEBUG] plugin.terraform-provider-aws: Content-Type: application/x-www-form-urlencoded; charset=utf-8
2018-07-13T15:59:56.757-0400 [DEBUG] plugin.terraform-provider-aws: X-Amz-Date: 20180713T195956Z
2018-07-13T15:59:56.757-0400 [DEBUG] plugin.terraform-provider-aws: Accept-Encoding: gzip
2018-07-13T15:59:56.757-0400 [DEBUG] plugin.terraform-provider-aws: 
2018-07-13T15:59:56.757-0400 [DEBUG] plugin.terraform-provider-aws: Action=ModifyVpcEndpoint&AddSubnetId.1=subnet-3cf40270&Version=2016-11-15&VpcEndpointId=vpce-0dc643f1a3910ab36
2018-07-13T15:59:56.757-0400 [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------
2018-07-13T15:59:57.148-0400 [DEBUG] plugin.terraform-provider-aws: 2018/07/13 15:59:57 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/ModifyVpcEndpoint Details:
2018-07-13T15:59:57.148-0400 [DEBUG] plugin.terraform-provider-aws: ---[ RESPONSE ]--------------------------------------
2018-07-13T15:59:57.148-0400 [DEBUG] plugin.terraform-provider-aws: HTTP/1.1 500 Internal Server Error
2018-07-13T15:59:57.148-0400 [DEBUG] plugin.terraform-provider-aws: Connection: close
2018-07-13T15:59:57.148-0400 [DEBUG] plugin.terraform-provider-aws: Transfer-Encoding: chunked
2018-07-13T15:59:57.148-0400 [DEBUG] plugin.terraform-provider-aws: Date: Fri, 13 Jul 2018 19:59:57 GMT
2018-07-13T15:59:57.148-0400 [DEBUG] plugin.terraform-provider-aws: Server: AmazonEC2
2018-07-13T15:59:57.148-0400 [DEBUG] plugin.terraform-provider-aws: 
2018-07-13T15:59:57.148-0400 [DEBUG] plugin.terraform-provider-aws: 
2018-07-13T15:59:57.148-0400 [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------
2018-07-13T15:59:57.148-0400 [DEBUG] plugin.terraform-provider-aws: 2018/07/13 15:59:57 [DEBUG] [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?>
2018-07-13T15:59:57.148-0400 [DEBUG] plugin.terraform-provider-aws: <Response><Errors><Error><Code>InternalError</Code><Message>An internal error has occurred</Message></Error></Errors><RequestID>e659212a-c106-4327-a4fd-c415ff4ad05e</RequestID></Response>
2018-07-13T15:59:57.254-0400 [DEBUG] plugin.terraform-provider-aws: 2018/07/13 15:59:57 [DEBUG] [aws-sdk-go] DEBUG: Validate Response ec2/ModifyVpcEndpoint failed, will retry, error InternalError: An internal error has occurred
2018-07-13T15:59:57.254-0400 [DEBUG] plugin.terraform-provider-aws: 	status code: 500, request id: e659212a-c106-4327-a4fd-c415ff4ad05e

I'll investigate the handler suggestion.

@ewbankkit
Copy link
Contributor Author

ewbankkit commented Jul 13, 2018

The problem with adding something like

	client.ec2conn.Handlers.Retry.PushBack(func(r *request.Request) {
		if r.Operation.Name == "ModifyVpcEndpoint" {
			// HTTP 503 Server.Unavailable
			// HTTP 500 Server.InternalError
			if isAWSErr(r.Error, "Unavailable", "") || isAWSErr(r.Error, "InternalError", "") {
				r.Retryable = aws.Bool(false)
			}
		}
	})

to config.go is that even when the Server.Unavailable or Server.InternalError is propagated back up to the resource's code, phantom ENIs have been created in the VPC and some of them cannot be deleted.

@ghost ghost added the size/L Managed by automation to categorize the size of a PR. label Jul 13, 2018
@ewbankkit
Copy link
Contributor Author

Rebased to remove conflict, made acceptance tests region-agnostic, added configurable timeouts to aws_vpc_endpoint and aws_vpc_endpoint_subnet_association and ensured that the VPC endpoint is back in available state after modifying associated subnets.

Acceptance tests:

$ make testacc TEST=./aws/ TESTARGS='-run=TestAccAWSVpcEndpoint_'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./aws/ -v -run=TestAccAWSVpcEndpoint_ -timeout 120m
=== RUN   TestAccAWSVpcEndpoint_importBasic
--- PASS: TestAccAWSVpcEndpoint_importBasic (29.30s)
=== RUN   TestAccAWSVpcEndpoint_gatewayBasic
--- PASS: TestAccAWSVpcEndpoint_gatewayBasic (27.80s)
=== RUN   TestAccAWSVpcEndpoint_gatewayWithRouteTableAndPolicy
--- PASS: TestAccAWSVpcEndpoint_gatewayWithRouteTableAndPolicy (49.50s)
=== RUN   TestAccAWSVpcEndpoint_interfaceBasic
--- PASS: TestAccAWSVpcEndpoint_interfaceBasic (33.21s)
=== RUN   TestAccAWSVpcEndpoint_interfaceWithSubnetAndSecurityGroup
--- PASS: TestAccAWSVpcEndpoint_interfaceWithSubnetAndSecurityGroup (206.40s)
=== RUN   TestAccAWSVpcEndpoint_interfaceNonAWSService
--- PASS: TestAccAWSVpcEndpoint_interfaceNonAWSService (273.67s)
=== RUN   TestAccAWSVpcEndpoint_removed
--- PASS: TestAccAWSVpcEndpoint_removed (26.15s)
PASS
$ make testacc TEST=./aws/ TESTARGS='-run=TestAccAWSVpcEndpointSubnetAssociation_'
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./aws/ -v -run=TestAccAWSVpcEndpointSubnetAssociation_ -timeout 120m
=== RUN   TestAccAWSVpcEndpointSubnetAssociation_basic
--- PASS: TestAccAWSVpcEndpointSubnetAssociation_basic (211.91s)
=== RUN   TestAccAWSVpcEndpointSubnetAssociation_multiple
--- PASS: TestAccAWSVpcEndpointSubnetAssociation_multiple (515.38s)
PASS

@ewbankkit
Copy link
Contributor Author

If we don't want to merge this PR (waiting for Amazon to fix upstream), the non-concurrency related modifications are probably worth splitting into their own PR.

@bflad
Copy link
Contributor

bflad commented Jul 17, 2018

Hey @ewbankkit I think we would like to get this PR in with all of its goodies, but I'm confused if its okay as-is given your last comment. 😄 Thanks for all your investigative work here.

@ewbankkit
Copy link
Contributor Author

@bflad Yes, this PR is OK as-is.

@bflad bflad removed the waiting-response Maintainers are waiting on response from community or contributor. label Jul 18, 2018
@bflad bflad added this to the v1.28.0 milestone Jul 18, 2018
Copy link
Contributor

@bflad bflad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much, @ewbankkit! 🚀

9 tests passed (all tests)
=== RUN   TestAccAWSVpcEndpoint_removed
--- PASS: TestAccAWSVpcEndpoint_removed (16.14s)
=== RUN   TestAccAWSVpcEndpoint_interfaceBasic
--- PASS: TestAccAWSVpcEndpoint_interfaceBasic (21.66s)
=== RUN   TestAccAWSVpcEndpoint_gatewayBasic
--- PASS: TestAccAWSVpcEndpoint_gatewayBasic (21.65s)
=== RUN   TestAccAWSVpcEndpoint_importBasic
--- PASS: TestAccAWSVpcEndpoint_importBasic (22.84s)
=== RUN   TestAccAWSVpcEndpoint_gatewayWithRouteTableAndPolicy
--- PASS: TestAccAWSVpcEndpoint_gatewayWithRouteTableAndPolicy (38.57s)
=== RUN   TestAccAWSVpcEndpointSubnetAssociation_basic
--- PASS: TestAccAWSVpcEndpointSubnetAssociation_basic (188.50s)
=== RUN   TestAccAWSVpcEndpoint_interfaceWithSubnetAndSecurityGroup
--- PASS: TestAccAWSVpcEndpoint_interfaceWithSubnetAndSecurityGroup (234.26s)
=== RUN   TestAccAWSVpcEndpoint_interfaceNonAWSService
--- PASS: TestAccAWSVpcEndpoint_interfaceNonAWSService (260.89s)
=== RUN   TestAccAWSVpcEndpointSubnetAssociation_multiple
--- PASS: TestAccAWSVpcEndpointSubnetAssociation_multiple (487.58s)

@bflad bflad merged commit 1081b30 into hashicorp:master Jul 18, 2018
bflad added a commit that referenced this pull request Jul 18, 2018
@ewbankkit ewbankkit deleted the issue-3382 branch July 18, 2018 15:57
@bflad
Copy link
Contributor

bflad commented Jul 18, 2018

This has been released in version 1.28.0 of the AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

@ghost
Copy link

ghost commented Apr 4, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Apr 4, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Addresses a defect in current functionality. service/ec2 Issues and PRs that pertain to the ec2 service. size/L Managed by automation to categorize the size of a PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

using aws_vpc_endpoint_subnet_association with count for multiple subnet hangs
3 participants