-
Notifications
You must be signed in to change notification settings - Fork 9.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
resource/iam_server_certificate: Increase deletion timeout #907
Conversation
Would we be able to make a PR to make this timeout configurable? 15 minutes is still not enough and we would like to support the |
Hi @genevievelesperance Yes we would :) Do you feel you could throw a PR for that? |
Yes |
@genevievelesperance I've been previously hesitant in adding configurable timeouts to resources which have no reasons for variable creation/update/deletion times. Good reasons for variable times usually exist for EC2/db instances with varying instance sizes and similar, where we want the default timeout to cover most common instance sizes so that most users don't need to set custom timeout and allow users to cover edge-case scenarios (huge instances with long provisioning time) via configurable timeouts. I don't see any reason for certificates to vary in the same way - unless the reason is varying volume of traffic on the LB at the time of deletion (which it may be). I feel like 15mins is already quite high and I think we could/should do something smarter/cleaner here, similar to what we did in ALB/ELB: #1036 and #1427 I saw some of our acceptance tests are intermittently failing, so this clearly isn't a problem unique to you @genevievelesperance - for that reason I'd rather dig into the problem and figure out why it's taking so long and what can we do to avoid increasing the timeout further. It's possible that we won't find any better/reasonable solution and we may end up just bumping the timeout, but that should come as the last option and I'm already feeling bad for raising it here in this PR. |
@radeksimko Are you suggesting that the load balancer clean up it's certificates? The clean up of the network interface has a hardcoded 10 minute timeout in that PR. If we were to do a similar function for the certificate, is that relocating where the delete timeout is? Would that timeout be configurable since the relocation of the deletion won't change the time it takes AWS to confirm the cert has been deleted? |
@genevievelesperance that's timeout for detaching a single network interface, not for retrying deletion on error code and hoping that things will magically detach, which is a significant difference. Although the EC2 API is eventually consistent we know what we're waiting for in that context and the operation has high chance of succeeding - as opposed to retrying delete operation on The timeout for detaching NI isn't configurable because the default one was clearly sufficient for most users in I hope the explanation makes sense. I understand this is annoying. I also want to fix this, but do so without causing troubles to other users in other context. 😉 |
I agree. Instead of increasing the default timeout like this PR, users that are not having genuine conflicts but are only dealing with a slow API can configure it instead of regular failures to destroy only that resource. |
We can try something similar to your ENI PR's for the cert and let you know how it goes. |
Thinking about this a bit more, I'm not sure it makes sense to couple the deletion of the certificate to the deletion of the load balancer. There is no traffic to our load balancers in the tests that are failing. In the terraform destroy output, we see the load balancers be successfully destroyed and then it attempts to destroy the server certificate before timing out after 15 minutes. Nothing else is using the certificate in our tests. It seems the only way it continues to retry is if it's seeing a If we explicitly coupled the deletion of the certificate to the call the delete the load balancer, I don't see how we wouldn't run into the same issue. Additionally, if we want to couple deleting the certificate to the load balancer, it may cause issues for others if the certificate is being used by something else. cc @mcwumbly |
One more piece of information: While our CI server was hanging, with this output:
We went ahead and deleted the certificate manually with the aws cli, which did not return any error:
Then, the CI job immediately moved on and terraform destroy succeeded:
|
Thanks for providing more details. It may be an effect of IAM eventual consistency and your AWS CLI may be just connecting to a different data centre where the certificate deletion was already propagated, but it's hard to prove this. I'm thinking we can put some extra logging in place - e.g. parse the ARN of the load balancer from the error message and subsequently ask the LB API about that load balancer. That in combination with I took some time to also investigate this by looking into debug logs produced by failed tests and I can see the sequence of API calls is as follows:
which suggest that the theory about eventual consistency may be right.
It seems that's what we may end up doing, unfortunately - but before we do that I'd like to take a moment to add the logging as mentioned above and collect some more data before jumping to the radical solution. @genevievelesperance Is there any more details you can provide? e.g. region of your LB, how often do you get to see this error? Does it occur with any LB configuration or do you have more tests with different configurations and it only occurs in some? |
The region is usually |
Hey Radek, would you like to us to make a PR for logging to collect more data on both our ends? |
@genevievelesperance PR would be greatly appreciated. Thanks. |
Hey @radeksimko. Added logging to query the lb api for the load balancer causing the delete conflict in #2533 |
Hey @radeksimko. We are continuing to see this timeout. Both the aws cli and terraform requests are receiving the DeleteConflict on trying to delete the server certificate, and both are showing the load balancer as gone. Would it be possible to add the configurable delete timeout (and possibly only use this extended delete timeout in the case that the load balancer endpoint is reporting the lb as gone and we know we are only retrying in the case that the api is slow to be consistent)? |
@genevievelesperance I'm more than happy to increase the default timeout as long as we have a debug log which proves this is necessary - hopefully with the record of the API call you kindly added in #2533 Do you mind providing a log from any of the mentioned runs (minus any secret/sensitive data)? Thanks. |
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks! |
This is to address the following test failure: