resource/aws_route: route is not saved in the state when it fails to be available (in 2m) on creation #21032

Closed
ialidzhikov opened this issue Sep 24, 2021 · 36 comments · Fixed by #21531
Labels
bug Addresses a defect in current functionality. service/ec2 Issues and PRs that pertain to the ec2 service.
Milestone

Comments

@ialidzhikov
Contributor

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform CLI and Terraform AWS Provider Version

terraform version - 0.12.31
provider-aws version - 3.54.0

Affected Resource(s)

  • aws_route

Terraform Configuration Files

Please include all Terraform configurations required to reproduce the bug. Bug reports without a functional reproduction may be closed without investigation.

provider "aws" {
  access_key = var.ACCESS_KEY_ID
  secret_key = var.SECRET_ACCESS_KEY
  region     = "eu-west-1"
}

resource "aws_vpc" "vpc" {
  cidr_block           = "10.222.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
}

resource "aws_subnet" "public_utility_z0" {
  vpc_id            = aws_vpc.vpc.id
  cidr_block        = "10.222.96.0/26"
  availability_zone = "eu-west-1a"
}

resource "aws_eip" "eip_natgw_z0" {
  vpc = true
}

resource "aws_nat_gateway" "natgw_z0" {
  allocation_id = aws_eip.eip_natgw_z0.id
  subnet_id     = aws_subnet.public_utility_z0.id
}

resource "aws_route_table" "routetable_private_utility_z0" {
  vpc_id = aws_vpc.vpc.id
}

resource "aws_route" "private_utility_z0_nat" {
  route_table_id         = aws_route_table.routetable_private_utility_z0.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.natgw_z0.id
}

Debug Output

N/A

Panic Output

N/A

Expected Behavior

I would expect terraform-provider-aws to save the route in the state and then wait for it to become ready. If the route does not become available (in 2m), terraform apply should fail, but the route should still be saved in the state so that no state is lost/leaked.
Alternatively, terraform-provider-aws could leave the route out of the state but ensure that the corresponding route (the one that did not become available in 2m) is deleted afterwards.
I don't have much experience with handling of such cases, just a rough idea from my side.

Most probably the other resources handle this already in a better way and something similar is probably applicable for the aws_route.

For example I compared the subnet and route creation.

On subnet creation we call d.SetId(subnetId) (I think this is the part that ensures that the resource is saved in the state) right after the subnet create call and before the wait until the subnet becomes ready. I think that with this handling we don't lose the subnet state if, for example, the subnet never becomes ready.

var err error
resp, err := conn.CreateSubnet(createOpts)
if err != nil {
	return fmt.Errorf("error creating subnet: %w", err)
}
// Get the ID and store it
subnet := resp.Subnet
subnetId := aws.StringValue(subnet.SubnetId)
d.SetId(subnetId)
log.Printf("[INFO] Subnet ID: %s", subnetId)
// Wait for the Subnet to become available
log.Printf("[DEBUG] Waiting for subnet (%s) to become available", subnetId)
stateConf := &resource.StateChangeConf{
	Pending: []string{ec2.SubnetStatePending},
	Target:  []string{ec2.SubnetStateAvailable},
	Refresh: SubnetStateRefreshFunc(conn, subnetId),
	Timeout: d.Timeout(schema.TimeoutCreate),
}
_, err = stateConf.WaitForState()
if err != nil {
	return fmt.Errorf("error waiting for subnet (%s) to become ready: %w", d.Id(), err)
}

On the other hand, on route creation we call d.SetId(tfec2.RouteCreateID(routeTableID, destination)) only after the wait for the route to become available. So if the route does not become available in time, we don't save the resource state.

log.Printf("[DEBUG] Creating Route: %s", input)
_, err = tfresource.RetryWhenAwsErrCodeEquals(
	d.Timeout(schema.TimeoutCreate),
	func() (interface{}, error) {
		return conn.CreateRoute(input)
	},
	tfec2.ErrCodeInvalidParameterException,
	tfec2.ErrCodeInvalidTransitGatewayIDNotFound,
)
if err != nil {
	return fmt.Errorf("error creating Route in Route Table (%s) with destination (%s): %w", routeTableID, destination, err)
}
_, err = waiter.RouteReady(conn, routeFinder, routeTableID, destination)
if err != nil {
	return fmt.Errorf("error waiting for Route in Route Table (%s) with destination (%s) to become available: %w", routeTableID, destination, err)
}
d.SetId(tfec2.RouteCreateID(routeTableID, destination))
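
To make the difference between the two orderings concrete, here is a minimal, self-contained sketch. It is not provider code: resourceData is a hypothetical stand-in for the SDK's *schema.ResourceData, the CreateRoute call is assumed to have succeeded, and the wait is injected so the example runs without AWS. It only shows that the ID survives a failed wait when it is set before waiting; as far as I understand the plugin SDK, a resource whose ID is set when create returns an error is still written to the state.

```go
package main

import (
	"errors"
	"fmt"
)

// resourceData is a hypothetical stand-in for *schema.ResourceData;
// only the ID handling matters for this sketch.
type resourceData struct{ id string }

func (d *resourceData) SetId(id string) { d.id = id }
func (d *resourceData) Id() string      { return d.id }

var errTimeout = errors.New("timeout while waiting for state to become 'ready'")

// createRoute mimics the two orderings compared above. The CreateRoute API
// call is assumed to have succeeded already; waitReady is injected so the
// sketch can simulate a route that never becomes ready.
func createRoute(d *resourceData, routeID string, setIDBeforeWait bool, waitReady func() error) error {
	if setIDBeforeWait {
		// Subnet-style ordering: record the ID before waiting.
		d.SetId(routeID)
	}
	if err := waitReady(); err != nil {
		return fmt.Errorf("error waiting for Route (%s) to become available: %w", routeID, err)
	}
	// Route-style ordering: the ID is only recorded once the wait succeeded.
	d.SetId(routeID)
	return nil
}

func main() {
	alwaysTimeout := func() error { return errTimeout }

	current := &resourceData{}
	_ = createRoute(current, "r-example", false, alwaysTimeout)
	fmt.Printf("ID set after wait:  state ID = %q\n", current.Id())

	proposed := &resourceData{}
	_ = createRoute(proposed, "r-example", true, alwaysTimeout)
	fmt.Printf("ID set before wait: state ID = %q\n", proposed.Id())
}
```

With the route-style ordering the ID stays empty after the timeout, which matches the leak described in this issue.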

Actual Behavior

When applying the above partial terraform configuration, we notice that the aws_route is not saved in the state ("leaks") when the route fails to become available (in 2m) on creation.

The corresponding failure is:

error waiting for Route in Route Table (rtb-07269a2e5202b1234) with destination (0.0.0.0/0) to become available: timeout while waiting for state to become 'ready' (timeout: 2m0s)
  on tf/main.tf line 354, in resource "aws_route" "private_utility_z0_nat":
 354: resource "aws_route" "private_utility_z0_nat" {

Afterwards any subsequent terraform apply run fails with reason RouteAlreadyExists:

error creating Route in Route Table (rtb-07269a2e5202b1234) with destination (0.0.0.0/0): RouteAlreadyExists: The route identified by 0.0.0.0/0 already exists.
	status code: 400, request id: <omitted>
  on tf/main.tf line 354, in resource "aws_route" "private_utility_z0_nat":
 354: resource "aws_route" "private_utility_z0_nat" {

Background: we have a lot of automation on top of terraform apply. Such inconsistencies cause us trouble because terraform apply itself cannot recover from a state where a resource is leaked and a new one cannot be created because of it. Such cases require manual intervention to fix both the terraform state and the state in the infrastructure.

Steps to Reproduce

  1. terraform apply

Make sure that the first terraform apply fails because the route does not become available within 2m:

error waiting for Route in Route Table (rtb-07269a2e5202b1234) with destination (0.0.0.0/0) to become available: timeout while waiting for state to become 'ready' (timeout: 2m0s)
  on tf/main.tf line 354, in resource "aws_route" "private_utility_z0_nat":
 354: resource "aws_route" "private_utility_z0_nat" {
  2. Ensure that the corresponding route is leaked and any subsequent terraform apply fails with reason RouteAlreadyExists.

Important Factoids

N/A

References

  • #0000
@github-actions github-actions bot added needs-triage Waiting for first response or review from a maintainer. service/ec2 Issues and PRs that pertain to the ec2 service. labels Sep 24, 2021
@ialidzhikov ialidzhikov changed the title resource/aws_route: route is not saved in the state when route fails to be available (in 2m) on creation resource/aws_route: route is not saved in the state when it fails to be available (in 2m) on creation Sep 24, 2021
@justinretzolk justinretzolk added bug Addresses a defect in current functionality. and removed needs-triage Waiting for first response or review from a maintainer. labels Sep 24, 2021
@ddvdozuki

I'm having this issue intermittently as well. I can't reliably reproduce it though.

@gtantachuco

I get the same error. However, when I run 'terraform apply' again, it works.

@dwyerk

dwyerk commented Oct 6, 2021

I'm also seeing this in us-east-1 today. It's intermittent since most of the time AWS doesn't take 2 minutes to build a route.

@ialidzhikov - how does your team recover from this state besides manually deleting the routes that were ultimately created and trying to apply again?

@ddvdozuki

The really bizarre thing is according to the code I found the timeout should be 5 minutes. In both the provider and the vpc module the timeout is set to 5m so I'm not sure where the 2m is coming from. I didn't do a deep dive so maybe someone with more expertise can figure out what's going on here.

@BrandonALXEllisSS

According to the docs the creation timeout is 2m, whereas the deletion is 5m: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/route

@dwyerk

dwyerk commented Oct 7, 2021

Good point @BrandonALXEllisSS - that may be a mitigation but as the OP points out, this is a bug because it leaves the state broken when the route creation times out.

@ccourtoyaxis

Hi, I believe the confusion comes from the following pull request: #21161.
I also concur that the 2m timeout is too low. In our case the creation fails more than 50% of the time. I think that increasing it to 5 minutes is more than just a mitigation, it is a must. Otherwise, we get inconsistent results.

I don't know the AWS SLA regarding creation time, but I can assure you that it takes longer than 2 minutes most of the time.

@ddvdozuki

@ccourtoyaxis are you saying that PR fixes this issue? If so I can upgrade to the latest AWS provider. It's definitely causing us issues because when it does happen it breaks the state and manual intervention is required to fix it.

@ccourtoyaxis

@ddvdozuki, it is quite the opposite: what I am saying is that this PR is the cause of the initial failure. The fact that the failure is unrecoverable is a different problem. But the root cause of the route creation timing out is the default timeout value, which changed from 5 minutes to 2 minutes and is too short. I have excluded version 3.62.0 of the AWS provider (the version which contains the PR) and have not had the issue since. I have only done 2 deployments so far, but this week it failed almost every time.

terraform {
  required_version = ">= 1.0.0"
  backend "s3" {
    bucket = "tf-state-acs-portal"
    key    = "reservation.feature/terraform"
    region = "us-east-1"
    dynamodb_table = "acs-portal-tf-state"
    encrypt        = true
  }

  required_providers {
    aws = {
      "source"  = "hashicorp/aws"
      "version" = ">= 3.34.0, != 3.62.0"
    }
  }
}

That being said, the problem that the route is not stored in the state after a failure is in fact a real one. As mentioned in earlier comments, you must delete the routes manually; otherwise a subsequent run fails with the resource already exists error. So this puts the user in a catch-22.

@ddvdozuki

ddvdozuki commented Oct 14, 2021

@ccourtoyaxis Ok that's weird then because I first encountered this issue in 3.56 of the AWS provider and right now in us-east-1 I'm unable to deploy at all because of the timeout issue. Even deleting the route manually and re-running the terraform fails again due to timeout. I've never seen it this bad before. I upgraded to 3.62 but as you said it's still there.

  required_providers {
    aws = {
      "source"  = "hashicorp/aws"
      "version" = ">= 3.34.0, != 3.62.0"
    }
  }

In that version string does terraform default to the newest version that's not 3.62 or does it just install 3.34 since you have >=?
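
On that question: as far as I know, Terraform selects the newest available provider version that satisfies every listed constraint, so ">= 3.34.0, != 3.62.0" does not pin to 3.34.0. If a dependency lock file (Terraform 0.14 and later) already records a version, that version is reused until terraform init -upgrade is run. Below is a minimal Go sketch of that selection rule. The constraint is hard-coded, the helper names are made up for illustration, and the list of available versions is just the versions mentioned in this thread, not an authoritative list of releases.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parse turns "3.62.0" into [3 62 0]. Malformed input is not handled
// in this sketch; unparsable parts simply become 0.
func parse(v string) [3]int {
	var out [3]int
	for i, p := range strings.SplitN(v, ".", 3) {
		out[i], _ = strconv.Atoi(p)
	}
	return out
}

// compare returns -1, 0 or 1 when a is lower than, equal to or higher than b.
func compare(a, b [3]int) int {
	for i := 0; i < 3; i++ {
		if a[i] < b[i] {
			return -1
		}
		if a[i] > b[i] {
			return 1
		}
	}
	return 0
}

// allowed hard-codes the constraint ">= 3.34.0, != 3.62.0".
func allowed(v string) bool {
	pv := parse(v)
	return compare(pv, parse("3.34.0")) >= 0 && compare(pv, parse("3.62.0")) != 0
}

// newestAllowed returns the highest version in available that satisfies
// allowed, or "" if none does.
func newestAllowed(available []string) string {
	best := ""
	for _, v := range available {
		if allowed(v) && (best == "" || compare(parse(v), parse(best)) > 0) {
			best = v
		}
	}
	return best
}

func main() {
	// Versions mentioned in this thread, used only as sample input.
	available := []string{"3.34.0", "3.38.0", "3.61.0", "3.62.0", "3.64.2"}
	fmt.Println(newestAllowed(available))
}
```

With this sample input the sketch picks 3.64.2; it would only fall back to 3.34.0 if nothing newer were allowed.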

@ccourtoyaxis

@ddvdozuki I reverted to 3.38. I encountered the issue again even with 3.61. I changed my provider config as follows:

required_providers {
  aws = {
    "source"  = "hashicorp/aws"
    "version" = ">= 3.34.0, <= 3.38.0"
  }
}

Since then I deployed 5 times without a glitch

@huguesalary
Contributor

I am running into this issue as well, and it is quite cumbersome to have to manually delete the created routes (or import them into the state).

The 2 minutes timeout is too low and should be adjusted.

@nevelis

nevelis commented Nov 10, 2021

Using latest release (3.64.2), seems timeout I specify is still being ignored & using default of 2m30, which requires manual intervention because it eventually succeeds, then complains it already exists.

@ialidzhikov
Contributor Author

It seems that #21161 (included in 3.62.0) makes the RouteReady func use a configurable timeout - TimeoutCreate. TimeoutCreate still defaults to 2m, but at least the timeout is configurable now.

_, err = waiter.RouteReady(conn, routeFinder, routeTableID, destination, d.Timeout(schema.TimeoutCreate))

Previously RouteReady was using a hard-coded timeout (PropagationTimeout, defaults to 2m) which was not configurable.

To my understanding upgrading to terraform-provider-aws >= 3.62.0 and specifying a high timeout should mitigate this issue:

resource "aws_route" "foo" {
  // ...  

  timeouts {
    create = "5m"
  }
}

Of course, the general issue that the route is not saved in the state when it fails to become available within the given timeout is still present.
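
To illustrate why the configurable timeout only mitigates the problem, here is a rough, self-contained Go sketch of a poll-until-ready loop with a default and an overridable timeout. It is not the provider's waiter: effectiveTimeout and waitUntilReady are hypothetical helpers, the readiness check is simulated, and the durations in main are scaled down to milliseconds so the example finishes quickly.

```go
package main

import (
	"fmt"
	"time"
)

// defaultCreateTimeout mirrors the 2m default discussed in this issue.
const defaultCreateTimeout = 2 * time.Minute

// effectiveTimeout returns the configured timeout, or the default when
// nothing was configured (represented here by a zero duration).
func effectiveTimeout(configured time.Duration) time.Duration {
	if configured > 0 {
		return configured
	}
	return defaultCreateTimeout
}

// waitUntilReady polls isReady every interval until it reports true or
// the timeout has elapsed.
func waitUntilReady(isReady func() bool, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if isReady() {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timeout while waiting for state to become 'ready' (timeout: %s)", timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulated route that becomes visible 30ms from now.
	readyAt := time.Now().Add(30 * time.Millisecond)
	isReady := func() bool { return time.Now().After(readyAt) }

	// A timeout shorter than the propagation delay fails...
	fmt.Println("short timeout:", waitUntilReady(isReady, 10*time.Millisecond, 5*time.Millisecond))
	// ...while a longer one succeeds against the same simulated route.
	fmt.Println("long timeout: ", waitUntilReady(isReady, 200*time.Millisecond, 5*time.Millisecond))

	fmt.Println("default:", effectiveTimeout(0), "configured:", effectiveTimeout(5*time.Minute))
}
```

A longer timeout makes the failure less likely, but whenever the wait does fail the route still exists in AWS, which is why the state handling matters independently of the timeout value.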

@ialidzhikov
Contributor Author

Using latest release (3.64.2), seems timeout I specify is still being ignored & using default of 2m30, which requires manual intervention because it eventually succeeds, then complains it already exists.

@nevelis are you sure that you specify create timeout and you use terraform-provider-aws >= 3.62.0? I hope that #21032 (comment) helps you to mitigate the issue. If it doesn't, please share.

@cdancy

cdancy commented Nov 10, 2021

Hitting this issue as well and we're using 3.64.2 (latest release I believe).

@ialidzhikov
Contributor Author

Hitting this issue as well and we're using 3.64.2 (latest release I believe).

Do you specify a custom create timeout and what is its value?

@cdancy

cdancy commented Nov 11, 2021

@ialidzhikov the only timeout we specify is on our aws_vpc_endpoint like so:

resource "aws_vpc_endpoint" "dynamodb" {
  count             = length(module.vpc.private_route_table_ids) > 0 ? 1 : 0
  vpc_endpoint_type = "Gateway"
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.${data.aws_region.current.name}.dynamodb"
  tags              = local.tags
  timeouts {
    create = "30m"
    update = "30m"
    delete = "30m"
  }
}

@dguendisch

@cdancy but then it's clear that you are hit (because route creation often takes longer than the default 2m timeout). Your vpc timeout is not related to the aws_route timeout, the latter can be configured since 3.62.0, you could try that.

@cdancy

cdancy commented Nov 11, 2021

@dguendisch @ialidzhikov we're using latest vpc module 3.64.2 and have put together our terraform script as described in THIS ISSUE. We should have the default 5m timeout at this point if I'm not mistaken.

Is there a way to easily configure the route timeouts other than what is noted in that ISSUE (and below)?

resource "aws_default_route_table" "rtb-default" {
  default_route_table_id = module.vpc.default_vpc_main_route_table_id
  timeouts {
    create = "30m"
    update = "30m"
  }
}

@cdancy

cdancy commented Nov 12, 2021

Issue may actually be solved when the release for this PR is made #21683

@justinretzolk
Member

Hi all 👋 This PR made it in to the 3.65.0 release of the provider, and as @cdancy mentioned, I'm curious as to whether that PR may have resolved this issue as well. Can someone who was previously experiencing this test with this new provider version and see if that corrected the issue?

@benbense

@justinretzolk I can confirm transitioning to version 3.65.0 does in fact fix the issue for me.

@cdancy

cdancy commented Nov 16, 2021

@justinretzolk @benbense moving to latest and greatest did not solve the issue for us and it looks like at least for one other person as well

@benbense

@cdancy After checking with a colleague of mine, it seems the issue wasn't fixed for him either. Maybe it has to do with the Terraform version as well? I'm running 1.0.11.
He's running an older version and will try to update it today.

@cdancy

cdancy commented Nov 16, 2021

@benbense ohhh ... yeah please respond back if that does fix things. We're on 1.0.8

@benbense

@cdancy unfortunately it didn't help

@sandyvanam

sandyvanam commented Nov 17, 2021

Temporary Solution : move to respective route table and delete the entry 0.0.0.0/0 and run terraform apply.

@cdancy

cdancy commented Nov 17, 2021

@sandyvanam what do you mean by "move to respective route table"?

@i-engy

i-engy commented Nov 29, 2021

Please make aws_route respect timeouts defined in configuration. It's failing every time after 2m30s. I have tried a dozen version combinations and still get the same result: 21 tries and a 2m30s timeout.

@ialidzhikov
Contributor Author

Please make aws_route respect timeouts defined in configuration. It's failing every time after 2m30s. I have tried a dozen version combinations and still get the same result: 21 tries and a 2m30s timeout.

Try out with 3.66.0 (ref #21831).

@cdancy

cdancy commented Nov 29, 2021

@ialidzhikov still failing for us:

10:23:17          	            	    - Error: "terraform apply -auto-approve -no-color -input=false plan.json" command failed with code 1
10:23:17          	            	               
10:23:17          	            	               Error: error reading Route Table Association (rtbassoc-0bf64424a72d7ab4f): empty result
10:23:17          	            	               
10:23:17          	            	                 with module.vpc.aws_route_table_association.public[2],
10:23:17          	            	                 on .terraform/modules/vpc/main.tf line 1204, in resource "aws_route_table_association" "public":
10:23:17          	            	               1204: resource "aws_route_table_association" "public" {
10:23:17          	            	               
10:23:17          	            	               
10:23:17          	            	               Error: error reading Route Table Association (rtbassoc-05f19b75d7e7b114a): empty result
10:23:17          	            	               
10:23:17          	            	                 with module.vpc.aws_route_table_association.public[1],
10:23:17          	            	                 on .terraform/modules/vpc/main.tf line 1204, in resource "aws_route_table_association" "public":
10:23:17          	            	               1204: resource "aws_route_table_association" "public" {

also manifests for us like so:

11:21:47          	            	    - Error: "terraform apply -auto-approve -no-color -input=false plan.json" command failed with code  1
11:21:47          	            	               
11:21:47          	            	               Error: error reading Route Table (rtb-00722ec3e4d58322d): couldn't find resource
11:21:47          	            	               
11:21:47          	            	                 with module.vpc.aws_route_table.public[0],
11:21:47          	            	                 on .terraform/modules/vpc/main.tf line 203, in resource "aws_route_table" "public":
11:21:47          	            	                203: resource "aws_route_table" "public" {

I'd be up for a screen-share to show you what we're doing to see if you can spot anything. Just let me know.

EDIT: is there a specific vpc module version we should be at least using? We're on v3.6.0 and I see latest is v3.11.0. Not sure if that helps one way or another.

EDIT-2: ran a handful of times and we keep seeing the failure. We do see a random success every now and then but it's few and far between.

@i-engy

i-engy commented Nov 30, 2021

Yes, I can confirm that v3.66 has fixed this issue after @ialidzhikov's changes. Big thanks to @ialidzhikov for updating route retries to 1000. Cheers.

huguesalary added a commit to huguesalary/terraform-provider-aws that referenced this issue Dec 15, 2021
From 2 minutes to 5 minutes to address hashicorp#21032
@github-actions github-actions bot added this to the v3.70.0 milestone Dec 16, 2021
@github-actions

This functionality has been released in v3.70.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

@Riinkesh

This is still an issue even with 1001 retries. Although Terraform creates the route, it doesn't update the state and keeps retrying until the timeout and/or the retry limit is reached.
Terraform should check the route status on every retry and update the state accordingly.

@github-actions

github-actions bot commented May 6, 2022

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 6, 2022