resource/aws_route: route is not saved in the state when it fails to be available (in 2m) on creation #21032
Comments
I'm having this issue intermittently as well. I can't reliably reproduce it though. |
I get the same error. However, when I run 'terraform apply' again, it works. |
I'm also seeing this in us-east-1 today. It's intermittent since most of the time AWS doesn't take 2 minutes to build a route. @ialidzhikov - how does your team recover from this state besides manually deleting the routes that were ultimately created and trying to |
The really bizarre thing is according to the code I found the timeout should be 5 minutes. In both the provider and the vpc module the timeout is set to 5m so I'm not sure where the 2m is coming from. I didn't do a deep dive so maybe someone with more expertise can figure out what's going on here. |
According to the docs the creation timeout is 2m, whereas the deletion is 5m: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/route |
Good point @BrandonALXEllisSS - that may be a mitigation but as the OP points out, this is a bug because it leaves the state broken when the route creation times out. |
Hi, I believe the confusion comes from the following pull request: #21161. I don't know the AWS SLA regarding creation time, but I can assure you that it takes longer than 2 minutes most of the time. |
@ccourtoyaxis are you saying that PR fixes this issue? If so I can upgrade to the latest AWS provider. It's definitely causing us issues because when it does happen it breaks the state and manual intervention is required to fix it. |
@ddvdozuki, it is quite the opposite: what I am saying is that this PR is the cause of the initial failure. The fact that it is unrecoverable is a different problem. The root cause of the route creation timing out is the default timeout value, which changed from 5 minutes to 2 minutes and is too short. I have excluded version 3.62.0 of the AWS provider (the version which contains the PR) and have not had the issue since. I have only done 2 deployments so far, but this week it failed almost every time.
terraform {
required_version = ">= 1.0.0"
backend "s3" {
bucket = "tf-state-acs-portal"
key = "reservation.feature/terraform"
region = "us-east-1"
dynamodb_table = "acs-portal-tf-state"
encrypt = true
}
required_providers {
aws = {
"source" = "hashicorp/aws"
"version" = ">= 3.34.0, != 3.62.0"
}
}
}
That being said, the issue about the failure not being stored in the state is in fact a real problem. As mentioned in earlier comments, you must delete the routes manually; otherwise on a subsequent run you get the resource already exists error. So this puts the user in a catch-22. |
@ccourtoyaxis Ok that's weird then because I first encountered this issue in 3.56 of the AWS provider and right now in us-east-1 I'm unable to deploy at all because of the timeout issue. Even deleting the route manually and re-running the terraform fails again due to timeout. I've never seen it this bad before. I upgraded to 3.62 but as you said it's still there.
In that version string does terraform default to the newest version that's not 3.62 or does it just install 3.34 since you have >=? |
@ddvdozuki I reverted to 3.38. I encountered the issue again even with 3.61. I changed my provider config as follows:
required_providers {
aws = {
"source" = "hashicorp/aws"
"version" = ">= 3.34.0, <= 3.38.0"
}
}
Since then I deployed 5 times without a glitch. |
I am running into this issue as well, and it is quite cumbersome to have to manually delete created routes (or import them into the state). The 2-minute timeout is too low and should be adjusted. |
Using the latest release (3.64.2), it seems the timeout I specify is still being ignored and the default of 2m30 is used. This requires manual intervention because the route creation eventually succeeds, and then Terraform complains that it already exists. |
It seems that #21161 (included in 3.62.0) makes the
Previously
To my understanding upgrading to terraform-provider-aws
resource "aws_route" "foo" {
// ...
timeouts {
create = "5m"
}
}
Of course, the general issue that the route is not saved in the state when it fails to become available within the given timeout is still present. |
@nevelis are you sure that you specify create timeout and you use terraform-provider-aws |
Hitting this issue as well and we're using 3.64.2 (latest release I believe). |
Do you specify a custom create timeout and what is its value? |
@ialidzhikov the only timeout we specify is on our aws_vpc_endpoint like so:
|
@cdancy but then it's clear that you are hit (because route creation often takes longer than the default 2m timeout). Your vpc timeout is not related to the aws_route timeout, the latter can be configured since 3.62.0, you could try that. |
@dguendisch @ialidzhikov we're using the latest vpc module 3.64.2 and have put together our Terraform script as described in THIS ISSUE. We should have the default 5m timeout at this point if I'm not mistaken. Is there a way to easily configure the route timeouts other than what is noted in that ISSUE (and below)?
|
Issue may actually be solved when the release for this PR is made #21683 |
@justinretzolk I can confirm transitioning to version 3.65.0 does in fact fix the issue for me. |
@justinretzolk @benbense moving to latest and greatest did not solve the issue for us and it looks like at least for one other person as well |
@cdancy After checking with a colleague of mine seems like the issue wasn't fixed for him also, maybe it has to do with the Terraform version also? I'm running 1.0.11. |
@benbense ohhh ... yeah please respond back if that does fix things. We're on 1.0.8 |
@cdancy unfortunately it didn't help |
Temporary Solution : move to respective route table and delete the entry 0.0.0.0/0 and run terraform apply. |
@sandyvanam what do you mean by "move to respective route table"? |
Please make aws_route respect the timeouts defined in configuration. It's failing every time after 2m30s. I have tried a dozen version combinations and still get the same result: 21 tries and a 2m30s timeout. |
Try out with 3.66.0 (ref #21831). |
@ialidzhikov still failing for us:
also manifests for us like so:
I'd be up for a screen-share to show you what we're doing to see if you can spot anything. Just let me know. EDIT: is there a specific EDIT-2: ran a handful of times and we keep seeing the failure. We do see a random success every now and then but it's few and far between. |
yes. I can confirm that v3.66 has fixed this issue after @ialidzhikov changes. Big thanks to @ialidzhikov for updating route retries to 1000. cheers. |
From 2 minutes to 5 minutes to address hashicorp#21032
This functionality has been released in v3.70.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading. For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you! |
This is still an issue even with 1001 retries. The problem is that although Terraform creates the route, it doesn't update the state and keeps retrying until the timeout and/or the retry limit is reached. |
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. |
Community Note
Terraform CLI and Terraform AWS Provider Version
terraform version - 0.12.31
provider-aws version - 3.54.0
Affected Resource(s)
Terraform Configuration Files
Please include all Terraform configurations required to reproduce the bug. Bug reports without a functional reproduction may be closed without investigation.
Debug Output
N/A
Panic Output
N/A
Expected Behavior
I would expect terraform-provider-aws to save the route in the state and wait for it to become ready, so that terraform apply fails when the route cannot become available (in 2m) but the route is saved in the state and no state is lost/leaked.
Or alternatively terraform-provider-aws does not save the route in the state but ensures that the corresponding route (that cannot become available in 2m) is deleted afterwards.
I don't have much experience with handling of such cases, just a rough idea from my side.
Most probably the other resources handle this already in a better way and something similar is probably applicable for the aws_route.
For example I compared the subnet and route creation.
On subnet creation we call d.SetId(subnetId) (I think this is the part that ensures that the resource is saved in the state) right after the subnet create call and before the wait until the subnet gets ready. I think that with this handling we don't lose the subnet state if, for example, the subnet cannot get ready.
terraform-provider-aws/aws/resource_aws_subnet.go
Lines 150 to 176 in 9ff6122
On the other hand, on route creation we call d.SetId(tfec2.RouteCreateID(routeTableID, destination)) after the wait until the route becomes available. So if the route does not become available, we don't save the resource state.
terraform-provider-aws/aws/resource_aws_route.go
Lines 233 to 253 in 9ff6122
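The difference between the two orderings can be sketched with a self-contained model. This is not the provider's code: `state`, `waitUntilAvailable`, `createIDAfterWait`, `createIDBeforeWait`, and the example ID are hypothetical stand-ins, and the waiter is hard-coded to time out so that only the ordering of "record the ID" versus "wait" is being compared.

```go
package main

import (
	"errors"
	"fmt"
)

// state is a stand-in for the Terraform state: the set of resource IDs it tracks.
type state map[string]bool

var errTimeout = errors.New("timeout while waiting for resource to become available")

// waitUntilAvailable is a hypothetical waiter that always times out,
// modelling a resource that does not become available in time.
func waitUntilAvailable() error { return errTimeout }

// createIDAfterWait models the ordering described for aws_route:
// the ID is recorded only if the wait succeeds.
func createIDAfterWait(s state, id string) error {
	if err := waitUntilAvailable(); err != nil {
		return err // the resource exists remotely but is never recorded
	}
	s[id] = true
	return nil
}

// createIDBeforeWait models the ordering described for aws_subnet:
// the ID is recorded right after creation, before the wait.
func createIDBeforeWait(s state, id string) error {
	s[id] = true
	return waitUntilAvailable()
}

func main() {
	const id = "example-route-id" // hypothetical ID, for illustration only

	after, before := state{}, state{}
	errAfter := createIDAfterWait(after, id)
	errBefore := createIDBeforeWait(before, id)

	fmt.Printf("ID set after wait:  err=%v, tracked=%v\n", errAfter, after[id])
	fmt.Printf("ID set before wait: err=%v, tracked=%v\n", errBefore, before[id])
}
```

Both variants return the timeout error, but only the second leaves the ID tracked, so a later run has something to refresh or clean up instead of attempting a fresh create against a route that already exists.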
Actual Behavior
When applying the above partial terraform configuration, we notice that the aws_route is not saved in the state ("leaks") when the route fails to become available (in 2m) on creation.
The corresponding failure is:
Afterwards any subsequent terraform apply run fails with reason RouteAlreadyExists:
Background: we have a lot of automation on top of terraform apply. Such inconsistencies cause us trouble because terraform apply itself is not able to recover from a state where a resource is leaked and a new one cannot be created because of it. Such cases require manual intervention to fix the terraform state and the state in the infrastructure.
Steps to Reproduce
terraform apply
Ensure that the first terraform apply can potentially fail because the route cannot become available within 2m
terraform apply fails with reason RouteAlreadyExists.
Important Factoids
N/A
References