resource/iam_server_certificate: Increase deletion timeout #907

radeksimko · 2017-06-19T13:47:25Z

This is to address the following test failure:

=== RUN   TestAccAWSLBSSLNegotiationPolicy_missingLB
--- FAIL: TestAccAWSLBSSLNegotiationPolicy_missingLB (718.52s)
    testing.go:492: Error destroying resource! WARNING: Dangling resources
        may exist. The full state and error is shown below.
        
        Error: Error applying: 1 error(s) occurred:
        
        * aws_iam_server_certificate.test_cert (destroy): 1 error(s) occurred:
        
        * aws_iam_server_certificate.test_cert: DeleteConflict: Certificate: ASCAJXDEAGR64DZQSW564 is currently in use by arn:aws:elasticloadbalancing:us-west-2:*******:loadbalancer/tf-test-lb-wsxm9. Please remove it first before deleting it from IAM.
            status code: 409, request id: 200c2905-53f5-11e7-a79c-5921a71de6df
        
        State: aws_iam_server_certificate.test_cert:
          ID = ASCAJXDEAGR64DZQSW564
          arn = arn:aws:iam::*******:server-certificate/tf-acctest-xdb2fjbm3b
          certificate_body = 32c8d055f078a2a9fd053c0da56307b2041a5fc8
          certificate_chain = 62d98850b90a08fe0a8fa48803099cbab71913bf
          name = tf-acctest-xdb2fjbm3b
          path = /
          private_key = ed98e97354e5d9d1ebe5b2cb595c94ae6d4c3d69

genevieve · 2017-11-21T22:50:28Z

Would we be able to make a PR to make this timeout configurable? 15 minutes is still not enough and we would like to support the timeouts parameter.

Ninir · 2017-11-22T08:38:40Z

Hi @genevievelesperance

Yes we would :) Do you feel you could throw a PR for that?

genevieve · 2017-11-22T17:47:45Z

Yes

radeksimko · 2017-11-26T10:42:54Z

@genevievelesperance I've been previously hesitant in adding configurable timeouts to resources which have no reasons for variable creation/update/deletion times. Good reasons for variable times usually exist for EC2/db instances with varying instance sizes and similar, where we want the default timeout to cover most common instance sizes so that most users don't need to set custom timeout and allow users to cover edge-case scenarios (huge instances with long provisioning time) via configurable timeouts.

I don't see any reason for certificates to vary in the same way - unless the reason is varying volume of traffic on the LB at the time of deletion (which it may be).

I feel like 15mins is already quite high and I think we could/should do something smarter/cleaner here, similar to what we did in ALB/ELB: #1036 and #1427

I saw some of our acceptance tests are intermittently failing, so this clearly isn't a problem unique to you @genevievelesperance - for that reason I'd rather dig into the problem and figure out why it's taking so long and what can we do to avoid increasing the timeout further.

It's possible that we won't find any better/reasonable solution and we may end up just bumping the timeout, but that should come as the last option and I'm already feeling bad for raising it here in this PR.

genevieve · 2017-11-28T00:57:37Z

@radeksimko Are you suggesting that the load balancer clean up it's certificates?

The clean up of the network interface has a hardcoded 10 minute timeout in that PR. If we were to do a similar function for the certificate, is that relocating where the delete timeout is? Would that timeout be configurable since the relocation of the deletion won't change the time it takes AWS to confirm the cert has been deleted?

radeksimko · 2017-11-28T04:42:40Z

@genevievelesperance that's timeout for detaching a single network interface, not for retrying deletion on error code and hoping that things will magically detach, which is a significant difference.

Although the EC2 API is eventually consistent we know what we're waiting for in that context and the operation has high chance of succeeding - as opposed to retrying delete operation on DeleteConflict which may be a genuine conflict. Informing the user early about genuine conflict is another strong reason for avoiding blind retry & high timeouts as that would lead to poor user experience in such context (users waiting for ages to see the error).

The timeout for detaching NI isn't configurable because the default one was clearly sufficient for most users in aws_network_interface and/or elsewhere in the codebase - feel free to correct me if I'm wrong though.

I hope the explanation makes sense. I understand this is annoying. I also want to fix this, but do so without causing troubles to other users in other context. 😉

genevieve · 2017-11-28T04:49:01Z

I agree. Instead of increasing the default timeout like this PR, users that are not having genuine conflicts but are only dealing with a slow API can configure it instead of regular failures to destroy only that resource.

genevieve · 2017-11-28T20:38:09Z

We can try something similar to your ENI PR's for the cert and let you know how it goes.

genevieve · 2017-11-28T22:04:38Z

Thinking about this a bit more, I'm not sure it makes sense to couple the deletion of the certificate to the deletion of the load balancer.

There is no traffic to our load balancers in the tests that are failing. In the terraform destroy output, we see the load balancers be successfully destroyed and then it attempts to destroy the server certificate before timing out after 15 minutes. Nothing else is using the certificate in our tests.

It seems the only way it continues to retry is if it's seeing a DeleteConflict with a currently in use by arn message.

If we explicitly coupled the deletion of the certificate to the call the delete the load balancer, I don't see how we wouldn't run into the same issue.

Additionally, if we want to couple deleting the certificate to the load balancer, it may cause issues for others if the certificate is being used by something else.

cc @mcwumbly

genevieve · 2017-11-29T00:04:20Z

One more piece of information:

While our CI server was hanging, with this output:

aws_iam_server_certificate.lb_cert: Still destroying... (ID: ASCAIPJPTO6UFSSDBNYBW, 7m20s elapsed)
aws_iam_server_certificate.lb_cert: Still destroying... (ID: ASCAIPJPTO6UFSSDBNYBW, 7m30s elapsed)
aws_iam_server_certificate.lb_cert: Still destroying... (ID: ASCAIPJPTO6UFSSDBNYBW, 7m40s elapsed)

We went ahead and deleted the certificate manually with the aws cli, which did not return any error:

aws iam delete-server-certificate --server-certificate-name=bbl-ci-cre-5c33d6320171128234901664200000004

Then, the CI job immediately moved on and terraform destroy succeeded:

...
aws_iam_server_certificate.lb_cert: Still destroying... (ID: ASCAIPJPTO6UFSSDBNYBW, 7m40s elapsed)
aws_iam_server_certificate.lb_cert: Destruction complete after 7m43s

radeksimko · 2017-11-29T13:18:57Z

Thanks for providing more details. It may be an effect of IAM eventual consistency and your AWS CLI may be just connecting to a different data centre where the certificate deletion was already propagated, but it's hard to prove this.

I'm thinking we can put some extra logging in place - e.g. parse the ARN of the load balancer from the error message and subsequently ask the LB API about that load balancer. That in combination with iam/DeleteServerCertificate request/response may give us more evidence.

I took some time to also investigate this by looking into debug logs produced by failed tests and I can see the sequence of API calls is as follows:

elasticloadbalancing/DeleteLoadBalancer (200 OK)
elasticloadbalancing/DescribeLoadBalancers (400 Bad Request, LoadBalancerNotFound)
iam/DeleteServerCertificate (409 Conflict)

which suggest that the theory about eventual consistency may be right.

It seems the only way it continues to retry is if it's seeing a DeleteConflict with a currently in use by arn message.

It seems that's what we may end up doing, unfortunately - but before we do that I'd like to take a moment to add the logging as mentioned above and collect some more data before jumping to the radical solution.

@genevievelesperance Is there any more details you can provide? e.g. region of your LB, how often do you get to see this error? Does it occur with any LB configuration or do you have more tests with different configurations and it only occurs in some?

genevieve · 2017-11-30T17:45:19Z

The region is usually us-west-1. We see this every 1/3 runs. We have one LB configured with the certificate and it looks like this: https://github.com/cloudfoundry/bosh-bootloader/blob/master/terraform/aws/templates/cf_lb.tf#L154-L191

genevieve · 2017-12-01T19:29:37Z

Hey Radek, would you like to us to make a PR for logging to collect more data on both our ends?

radeksimko · 2017-12-01T21:42:31Z

@genevievelesperance PR would be greatly appreciated. Thanks.

genevieve · 2017-12-04T17:08:07Z

Hey @radeksimko. Added logging to query the lb api for the load balancer causing the delete conflict in #2533

genevieve · 2017-12-20T22:33:23Z

Hey @radeksimko. We are continuing to see this timeout. Both the aws cli and terraform requests are receiving the DeleteConflict on trying to delete the server certificate, and both are showing the load balancer as gone. Would it be possible to add the configurable delete timeout (and possibly only use this extended delete timeout in the case that the load balancer endpoint is reporting the lb as gone and we know we are only retrying in the case that the api is slow to be consistent)?

radeksimko · 2018-01-09T08:20:19Z

@genevievelesperance I'm more than happy to increase the default timeout as long as we have a debug log which proves this is necessary - hopefully with the record of the API call you kindly added in #2533

Do you mind providing a log from any of the mentioned runs (minus any secret/sensitive data)?

Thanks.

ghost · 2020-04-08T17:55:44Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

resource/iam_server_certificate: Increase deletion timeout

4c77f98

radeksimko added the bug Addresses a defect in current functionality. label Jun 19, 2017

stack72 approved these changes Jun 19, 2017

View reviewed changes

radeksimko merged commit 862aa75 into master Jun 19, 2017

radeksimko deleted the b-iam-server-cert-timeout branch June 19, 2017 14:28

genevieve mentioned this pull request Dec 4, 2017

Add logs for iam server certificate delete conflict #2533

Merged

ghost locked and limited conversation to collaborators Apr 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

resource/iam_server_certificate: Increase deletion timeout #907

resource/iam_server_certificate: Increase deletion timeout #907

radeksimko commented Jun 19, 2017

genevieve commented Nov 21, 2017

Ninir commented Nov 22, 2017

genevieve commented Nov 22, 2017

radeksimko commented Nov 26, 2017

genevieve commented Nov 28, 2017 •

edited

Loading

radeksimko commented Nov 28, 2017

genevieve commented Nov 28, 2017

genevieve commented Nov 28, 2017

genevieve commented Nov 28, 2017

genevieve commented Nov 29, 2017

radeksimko commented Nov 29, 2017

genevieve commented Nov 30, 2017

genevieve commented Dec 1, 2017

radeksimko commented Dec 1, 2017

genevieve commented Dec 4, 2017

genevieve commented Dec 20, 2017 •

edited

Loading

radeksimko commented Jan 9, 2018

ghost commented Apr 8, 2020

resource/iam_server_certificate: Increase deletion timeout #907

resource/iam_server_certificate: Increase deletion timeout #907

Conversation

radeksimko commented Jun 19, 2017

genevieve commented Nov 21, 2017

Ninir commented Nov 22, 2017

genevieve commented Nov 22, 2017

radeksimko commented Nov 26, 2017

genevieve commented Nov 28, 2017 • edited Loading

radeksimko commented Nov 28, 2017

genevieve commented Nov 28, 2017

genevieve commented Nov 28, 2017

genevieve commented Nov 28, 2017

genevieve commented Nov 29, 2017

radeksimko commented Nov 29, 2017

genevieve commented Nov 30, 2017

genevieve commented Dec 1, 2017

radeksimko commented Dec 1, 2017

genevieve commented Dec 4, 2017

genevieve commented Dec 20, 2017 • edited Loading

radeksimko commented Jan 9, 2018

ghost commented Apr 8, 2020

genevieve commented Nov 28, 2017 •

edited

Loading

genevieve commented Dec 20, 2017 •

edited

Loading