resource/aws_route: route is not saved in the state when it fails to be available (in 2m) on creation #21032

ialidzhikov · 2021-09-24T09:54:10Z

Terraform CLI and Terraform AWS Provider Version

terraform version - 0.12.31
provider-aws version - 3.54.0

Affected Resource(s)

aws_route

Terraform Configuration Files

provider "aws" {
  access_key = var.ACCESS_KEY_ID
  secret_key = var.SECRET_ACCESS_KEY
  region     = "eu-west-1"
}

resource "aws_vpc" "vpc" {
  cidr_block           = "10.222.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
}

resource "aws_subnet" "public_utility_z0" {
  vpc_id            = aws_vpc.vpc.id
  cidr_block        = "10.222.96.0/26"
  availability_zone = "eu-west-1a"
}

resource "aws_eip" "eip_natgw_z0" {
  vpc = true
}

resource "aws_nat_gateway" "natgw_z0" {
  allocation_id = aws_eip.eip_natgw_z0.id
  subnet_id     = aws_subnet.public_utility_z0.id
}

resource "aws_route_table" "routetable_private_utility_z0" {
  vpc_id = aws_vpc.vpc.id
}

resource "aws_route" "private_utility_z0_nat" {
  route_table_id         = aws_route_table.routetable_private_utility_z0.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.natgw_z0.id
}

Debug Output

N/A

Panic Output

N/A

Expected Behavior

I would expect terraform-provider-aws to save the route in the state and then wait for it to become ready. terraform apply would still fail when the route does not become available (in 2m), but the route would be saved in the state and no state would be lost/leaked.
Alternatively, terraform-provider-aws does not save the route in the state but ensures that the corresponding route (the one that did not become available in 2m) is deleted afterwards.
I don't have much experience with handling such cases; this is just a rough idea from my side.

Most probably the other resources handle this already in a better way and something similar is probably applicable for the aws_route.

For example I compared the subnet and route creation.

On subnet creation we call d.SetId(subnetId) (I think this is the part that ensures that the resource is saved in the state) right after the subnet create call and before waiting for the subnet to become ready. I think that with this handling we don't lose the subnet state if, for example, the subnet never becomes ready.

terraform-provider-aws/aws/resource_aws_subnet.go

Lines 150 to 176 in 9ff6122

    
var err error
resp, err := conn.CreateSubnet(createOpts)
if err != nil {
	return fmt.Errorf("error creating subnet: %w", err)
}

// Get the ID and store it
subnet := resp.Subnet
subnetId := aws.StringValue(subnet.SubnetId)
d.SetId(subnetId)
log.Printf("[INFO] Subnet ID: %s", subnetId)

// Wait for the Subnet to become available
log.Printf("[DEBUG] Waiting for subnet (%s) to become available", subnetId)
stateConf := &resource.StateChangeConf{
	Pending: []string{ec2.SubnetStatePending},
	Target:  []string{ec2.SubnetStateAvailable},
	Refresh: SubnetStateRefreshFunc(conn, subnetId),
	Timeout: d.Timeout(schema.TimeoutCreate),
}

_, err = stateConf.WaitForState()
if err != nil {
	return fmt.Errorf("error waiting for subnet (%s) to become ready: %w", d.Id(), err)
}

On the other hand, on route creation we call d.SetId(tfec2.RouteCreateID(routeTableID, destination)) only after waiting for the route to become available. So if the route does not become available, we don't save the resource state.

terraform-provider-aws/aws/resource_aws_route.go

Lines 233 to 253 in 9ff6122

    
log.Printf("[DEBUG] Creating Route: %s", input)
_, err = tfresource.RetryWhenAwsErrCodeEquals(
	d.Timeout(schema.TimeoutCreate),
	func() (interface{}, error) {
		return conn.CreateRoute(input)
	},
	tfec2.ErrCodeInvalidParameterException,
	tfec2.ErrCodeInvalidTransitGatewayIDNotFound,
)

if err != nil {
	return fmt.Errorf("error creating Route in Route Table (%s) with destination (%s): %w", routeTableID, destination, err)
}

_, err = waiter.RouteReady(conn, routeFinder, routeTableID, destination)

if err != nil {
	return fmt.Errorf("error waiting for Route in Route Table (%s) with destination (%s) to become available: %w", routeTableID, destination, err)
}

d.SetId(tfec2.RouteCreateID(routeTableID, destination))
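
To make the difference between the two orderings concrete, here is a minimal, self-contained Go sketch. It does not use the provider or the Terraform SDK: resourceData, createThenWait, setIDThenWait and the "r-example" ID are hypothetical stand-ins, and the wait is stubbed to always time out. It only shows which ordering leaves an ID behind when the wait fails; how the SDK then persists that ID is outside the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// resourceData is a hypothetical stand-in for the SDK's *schema.ResourceData.
// Only the ID handling is modelled here.
type resourceData struct{ id string }

func (d *resourceData) SetId(id string) { d.id = id }
func (d *resourceData) Id() string      { return d.id }

var errTimeout = errors.New("timeout while waiting for state to become 'ready'")

// createThenWait follows the ordering quoted from resource_aws_route.go:
// the ID is set only after the wait has succeeded.
func createThenWait(d *resourceData, id string, wait func() error) error {
	if err := wait(); err != nil {
		return err // d.Id() is still empty at this point
	}
	d.SetId(id)
	return nil
}

// setIDThenWait follows the ordering quoted from resource_aws_subnet.go:
// the ID is set right after the create call, before the wait.
func setIDThenWait(d *resourceData, id string, wait func() error) error {
	d.SetId(id)
	return wait() // on failure the ID stays set
}

func main() {
	// Assumption for the demo: the wait always times out.
	alwaysTimeout := func() error { return errTimeout }

	late := &resourceData{}
	_ = createThenWait(late, "r-example", alwaysTimeout)
	fmt.Printf("ID set after wait:  %q\n", late.Id())

	early := &resourceData{}
	_ = setIDThenWait(early, "r-example", alwaysTimeout)
	fmt.Printf("ID set before wait: %q\n", early.Id())
}
```

With the first ordering the ID is empty after the failed wait, with the second it is still present, which matches the observation above that the subnet is not lost while the route is.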

Actual Behavior

When applying the above partial terraform configuration, we notice that the aws_route is not saved in the state ("leaks") when the route fails to be available (in 2m) on creation.

The corresponding failure is:

error waiting for Route in Route Table (rtb-07269a2e5202b1234) with destination (0.0.0.0/0) to become available: timeout while waiting for state to become 'ready' (timeout: 2m0s)
  on tf/main.tf line 354, in resource "aws_route" "private_utility_z0_nat":
 354: resource "aws_route" "private_utility_z0_nat" {

Afterwards, any subsequent terraform apply run fails with reason RouteAlreadyExists:

error creating Route in Route Table (rtb-07269a2e5202b1234) with destination (0.0.0.0/0): RouteAlreadyExists: The route identified by 0.0.0.0/0 already exists.
	status code: 400, request id: <omitted>
  on tf/main.tf line 354, in resource "aws_route" "private_utility_z0_nat":
 354: resource "aws_route" "private_utility_z0_nat" {

Background: we have a lot of automation on top of terraform apply. Such inconsistencies cause us trouble because terraform apply cannot recover by itself from a state where a resource is leaked and a new one cannot be created because of it. Such cases require manual intervention to fix both the Terraform state and the infrastructure.

Steps to Reproduce

terraform apply

Make sure the first terraform apply fails because the route does not become available within 2m:

error waiting for Route in Route Table (rtb-07269a2e5202b1234) with destination (0.0.0.0/0) to become available: timeout while waiting for state to become 'ready' (timeout: 2m0s)
  on tf/main.tf line 354, in resource "aws_route" "private_utility_z0_nat":
 354: resource "aws_route" "private_utility_z0_nat" {

Verify that the corresponding route is leaked and that any subsequent terraform apply fails with reason RouteAlreadyExists.

Important Factoids

N/A

ddvdozuki · 2021-09-25T11:17:51Z

I'm having this issue intermittently as well. I can't reliably reproduce it though.

gtantachuco · 2021-10-04T22:51:48Z

I get the same error. However, when I run 'terraform apply' again, it works.

dwyerk · 2021-10-06T20:44:56Z

I'm also seeing this in us-east-1 today. It's intermittent since most of the time AWS doesn't take 2 minutes to build a route.

@ialidzhikov - how does your team recover from this state besides manually deleting the routes that were ultimately created and trying to apply again?

ddvdozuki · 2021-10-06T21:16:48Z

The really bizarre thing is according to the code I found the timeout should be 5 minutes. In both the provider and the vpc module the timeout is set to 5m so I'm not sure where the 2m is coming from. I didn't do a deep dive so maybe someone with more expertise can figure out what's going on here.

BrandonALXEllisSS · 2021-10-07T16:16:54Z

According to the docs the creation timeout is 2m, whereas the deletion is 5m: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/route

dwyerk · 2021-10-07T18:50:26Z

Good point @BrandonALXEllisSS - that may be a mitigation but as the OP points out, this is a bug because it leaves the state broken when the route creation times out.

ccourtoyaxis · 2021-10-13T20:31:14Z

Hi, I believe the confusion comes from the following pull request: #21161.
I also agree that the 2m timeout is too low. In our case the creation fails more than 50% of the time. I think that increasing it to 5 minutes is more than just a mitigation; it is a must. Otherwise, we get inconsistent results.

I don't know the AWS SLA regarding creation time, but I can assure you that it takes longer than 2 minutes most of the time.

ddvdozuki · 2021-10-13T23:02:23Z

@ccourtoyaxis are you saying that PR fixes this issue? If so I can upgrade to the latest AWS provider. It's definitely causing us issues because when it does happen it breaks the state and manual intervention is required to fix it.

ccourtoyaxis · 2021-10-14T00:47:29Z

@ddvdozuki, quite the opposite: what I am saying is that this PR is the cause of the initial failure. The fact that it is unrecoverable is a different problem. But the root cause of the route creation timing out is the default timeout value, which changed from 5 minutes to 2 minutes and is now too short. I have excluded version 3.62.0 of the AWS provider (the version that contains the PR) and have not had the issue since. I have only done 2 deployments so far, but this week it failed almost every time.

terraform {
  required_version = ">= 1.0.0"
  backend "s3" {
    bucket         = "tf-state-acs-portal"
    key            = "reservation.feature/terraform"
    region         = "us-east-1"
    dynamodb_table = "acs-portal-tf-state"
    encrypt        = true
  }

  required_providers {
    aws = {
      "source"  = "hashicorp/aws"
      "version" = ">= 3.34.0, != 3.62.0"
    }
  }
}

That being said, the issue of the failed route not being stored in the state is a real problem. As mentioned in earlier comments, you must delete the routes manually; otherwise, on a subsequent run you get the resource already exists error. So this puts the user in a catch-22.

ddvdozuki · 2021-10-14T00:52:27Z

@ccourtoyaxis Ok that's weird then because I first encountered this issue in 3.56 of the AWS provider and right now in us-east-1 I'm unable to deploy at all because of the timeout issue. Even deleting the route manually and re-running the terraform fails again due to timeout. I've never seen it this bad before. I upgraded to 3.62 but as you said it's still there.

  required_providers {
    aws = {
      "source"  = "hashicorp/aws"
      "version" = ">= 3.34.0, != 3.62.0"
    }
  }

In that version string does terraform default to the newest version that's not 3.62 or does it just install 3.34 since you have >=?
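
For reference, terraform init normally selects the newest available provider version that satisfies every constraint (unless a dependency lock file already pins one), so ">=" does not mean the lower bound gets installed. The rule can be sketched with a toy Go program. This is not Terraform's actual resolver: the helper names are made up for illustration, only the >=, <= and != operators are handled, and the list of available versions is just a sample.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// version holds major, minor and patch numbers.
type version [3]int

// parse turns "3.62.0" into a version; malformed parts become 0.
func parse(s string) version {
	var v version
	for i, p := range strings.SplitN(s, ".", 3) {
		v[i], _ = strconv.Atoi(p)
	}
	return v
}

// compare returns -1, 0 or 1 when a is lower than, equal to or higher than b.
func compare(a, b version) int {
	for i := range a {
		if a[i] != b[i] {
			if a[i] < b[i] {
				return -1
			}
			return 1
		}
	}
	return 0
}

type constraint struct {
	op string
	v  version
}

// allows reports whether v satisfies this single constraint.
func (c constraint) allows(v version) bool {
	switch cmp := compare(v, c.v); c.op {
	case ">=":
		return cmp >= 0
	case "<=":
		return cmp <= 0
	case "!=":
		return cmp != 0
	}
	return false
}

// parseConstraints splits a string such as ">= 3.34.0, != 3.62.0".
func parseConstraints(spec string) []constraint {
	var cs []constraint
	for _, part := range strings.Split(spec, ",") {
		if f := strings.Fields(part); len(f) == 2 {
			cs = append(cs, constraint{op: f[0], v: parse(f[1])})
		}
	}
	return cs
}

// newestAllowed returns the highest available version allowed by spec,
// or "" when nothing matches.
func newestAllowed(available []string, spec string) string {
	cs := parseConstraints(spec)
	best := ""
	for _, a := range available {
		v, ok := parse(a), true
		for _, c := range cs {
			if !c.allows(v) {
				ok = false
				break
			}
		}
		if ok && (best == "" || compare(v, parse(best)) > 0) {
			best = a
		}
	}
	return best
}

func main() {
	// Illustrative sample of released versions, not a complete list.
	available := []string{"3.34.0", "3.38.0", "3.61.0", "3.62.0", "3.63.0"}
	fmt.Println(newestAllowed(available, ">= 3.34.0, != 3.62.0")) // 3.63.0
	fmt.Println(newestAllowed(available, ">= 3.34.0, <= 3.38.0")) // 3.38.0
}
```

So with ">= 3.34.0, != 3.62.0" the sample resolves to the newest version other than 3.62.0, not to 3.34.0.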

ccourtoyaxis · 2021-10-15T15:03:17Z

@ddvdozuki I reverted to 3.38. I encountered the issue again even with 3.61. I changed my provider config as follows:

required_providers {
  aws = {
    "source"  = "hashicorp/aws"
    "version" = ">= 3.34.0, <= 3.38.0"
  }
}

Since then I deployed 5 times without a glitch

huguesalary · 2021-10-28T05:15:21Z

I am running into this issue as well, and it is quite cumbersome to have to manually delete created routes (or import them into the state).

The 2 minutes timeout is too low and should be adjusted.

nevelis · 2021-11-10T18:30:51Z

Using latest release (3.64.2), seems timeout I specify is still being ignored & using default of 2m30, which requires manual intervention because it eventually succeeds, then complains it already exists.

ialidzhikov · 2021-11-10T21:50:08Z

It seems that #21161 (included in 3.62.0) makes the RouteReady func use a configurable timeout, TimeoutCreate. TimeoutCreate still defaults to 2m, but at least the timeout is configurable now.

terraform-provider-aws/aws/resource_aws_route.go

Line 247 in fce7062

    
_, err = waiter.RouteReady(conn, routeFinder, routeTableID, destination, d.Timeout(schema.TimeoutCreate))

Previously RouteReady was using a hard-coded timeout (PropagationTimeout, defaults to 2m) which was not configurable.

To my understanding upgrading to terraform-provider-aws >= 3.62.0 and specifying a high timeout should mitigate this issue:

resource "aws_route" "foo" {
  // ...  

  timeouts {
    create = "5m"
  }
}

Of course, the general issue that the route is not saved in the state when it fails to become available within the given timeout is still present.
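
The effect of passing the timeout into the waiter instead of hard-coding it can be sketched in plain Go. This is not the provider's waiter.RouteReady: waitReady, simulate and the millisecond-scale durations are assumptions chosen so that the example runs quickly, with milliseconds standing in for minutes.

```go
package main

import (
	"fmt"
	"time"
)

// waitReady polls check every interval until it returns true or the
// caller-supplied timeout elapses.
func waitReady(check func() bool, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if check() {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timeout while waiting for state to become 'ready' (timeout: %s)", timeout)
		}
		time.Sleep(interval)
	}
}

// simulate waits for a fake route that becomes ready readyAfterMs
// milliseconds after the call starts, giving up after timeoutMs.
func simulate(readyAfterMs, timeoutMs int) error {
	start := time.Now()
	ready := func() bool {
		return time.Since(start) >= time.Duration(readyAfterMs)*time.Millisecond
	}
	return waitReady(ready, time.Duration(timeoutMs)*time.Millisecond, 5*time.Millisecond)
}

func main() {
	// The fake route needs 50ms to become ready.
	fmt.Println("short timeout:", simulate(50, 20))
	fmt.Println("long timeout: ", simulate(50, 500))
}
```

The same fake route fails with the short timeout and succeeds with the longer one, which is why raising the create timeout helps with the initial failure but does nothing for a route that has already been leaked.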

ialidzhikov · 2021-11-10T21:53:32Z

> Using latest release (3.64.2), seems timeout I specify is still being ignored & using default of 2m30, which requires manual intervention because it eventually succeeds, then complains it already exists.

@nevelis are you sure that you specify create timeout and you use terraform-provider-aws >= 3.62.0? I hope that #21032 (comment) helps you to mitigate the issue. If it doesn't, please share.

cdancy · 2021-11-10T23:12:58Z

Hitting this issue as well and we're using 3.64.2 (latest release I believe).

ialidzhikov · 2021-11-11T06:19:16Z

> Hitting this issue as well and we're using 3.64.2 (latest release I believe).

Do you specify a custom create timeout and what is its value?

cdancy · 2021-11-11T12:07:25Z

@ialidzhikov the only timeout we specify is on our aws_vpc_endpoint like so:

resource "aws_vpc_endpoint" "dynamodb" {
  count             = length(module.vpc.private_route_table_ids) > 0 ? 1 : 0
  vpc_endpoint_type = "Gateway"
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.${data.aws_region.current.name}.dynamodb"
  tags              = local.tags
  timeouts {
    create = "30m"
    update = "30m"
    delete = "30m"
  }
}

dguendisch · 2021-11-11T12:27:23Z

@cdancy then it's clear why you are hit: route creation often takes longer than the default 2m timeout. Your aws_vpc_endpoint timeout is not related to the aws_route timeout. The latter can be configured since 3.62.0; you could try that.

cdancy · 2021-11-11T19:26:50Z

@dguendisch @ialidzhikov we're using latest vpc module 3.64.2 and have put together our terraform script as described in THIS ISSUE. We should have the default 5m timeout at this point if I'm not mistaken.

Is there a way to easily configure the route timeouts other than what is noted in that ISSUE (and below)?

resource "aws_default_route_table" "rtb-default" {
  default_route_table_id = module.vpc.default_vpc_main_route_table_id
  timeouts {
    create = "30m"
    update = "30m"
  }
}

cdancy · 2021-11-12T02:16:01Z

Issue may actually be solved when the release for this PR is made #21683

justinretzolk · 2021-11-15T19:32:23Z

Hi all 👋 This PR made it in to the 3.65.0 release of the provider, and as @cdancy mentioned, I'm curious as to whether that PR may have resolved this issue as well. Can someone who was previously experiencing this test with this new provider version and see if that corrected the issue?

benbense · 2021-11-15T22:39:12Z

@justinretzolk I can confirm transitioning to version 3.65.0 does in fact fix the issue for me.

cdancy · 2021-11-16T13:39:13Z

@justinretzolk @benbense moving to latest and greatest did not solve the issue for us and it looks like at least for one other person as well

benbense · 2021-11-16T13:43:25Z

@cdancy After checking with a colleague of mine, it seems the issue wasn't fixed for him either. Maybe it also has to do with the Terraform version? I'm running 1.0.11.
He's running an older version and will try to update it today.

cdancy · 2021-11-16T13:46:07Z

@benbense ohhh ... yeah please respond back if that does fix things. We're on 1.0.8

benbense · 2021-11-16T15:10:27Z

@cdancy unfortunately it didn't help

sandyvanam · 2021-11-17T14:00:22Z

Temporary Solution : move to respective route table and delete the entry 0.0.0.0/0 and run terraform apply.

cdancy · 2021-11-17T14:11:04Z

@sandyvanam what do you mean by "move to respective route table"?

i-engy · 2021-11-29T13:51:39Z

Please make aws_route respect timeouts defined in configuration. its failing all the times after 2m30s. I have tried a dozen version combinations and still same result. 21 tries and 2m30s timeout.

ialidzhikov · 2021-11-29T14:14:19Z

> Please make aws_route respect timeouts defined in configuration. its failing all the times after 2m30s. I have tried a dozen version combinations and still same result. 21 tries and 2m30s timeout.

Try out with 3.66.0 (ref #21831).

cdancy · 2021-11-29T18:37:12Z

@ialidzhikov still failing for us:

10:23:17          	            	    - Error: "terraform apply -auto-approve -no-color -input=false plan.json" command failed with code 1
10:23:17          	            	               
10:23:17          	            	               Error: error reading Route Table Association (rtbassoc-0bf64424a72d7ab4f): empty result
10:23:17          	            	               
10:23:17          	            	                 with module.vpc.aws_route_table_association.public[2],
10:23:17          	            	                 on .terraform/modules/vpc/main.tf line 1204, in resource "aws_route_table_association" "public":
10:23:17          	            	               1204: resource "aws_route_table_association" "public" {
10:23:17          	            	               
10:23:17          	            	               
10:23:17          	            	               Error: error reading Route Table Association (rtbassoc-05f19b75d7e7b114a): empty result
10:23:17          	            	               
10:23:17          	            	                 with module.vpc.aws_route_table_association.public[1],
10:23:17          	            	                 on .terraform/modules/vpc/main.tf line 1204, in resource "aws_route_table_association" "public":
10:23:17          	            	               1204: resource "aws_route_table_association" "public" {

also manifests for us like so:

11:21:47          	            	    - Error: "terraform apply -auto-approve -no-color -input=false plan.json" command failed with code  1
11:21:47          	            	               
11:21:47          	            	               Error: error reading Route Table (rtb-00722ec3e4d58322d): couldn't find resource
11:21:47          	            	               
11:21:47          	            	                 with module.vpc.aws_route_table.public[0],
11:21:47          	            	                 on .terraform/modules/vpc/main.tf line 203, in resource "aws_route_table" "public":
11:21:47          	            	                203: resource "aws_route_table" "public" {

I'd be up for a screen-share to show you what we're doing to see if you can spot anything. Just let me know.

EDIT: is there a specific vpc module version we should be at least using? We're on v3.6.0 and I see latest is v3.11.0. Not sure if that helps one way or another.

EDIT-2: ran a handful of times and we keep seeing the failure. We do see a random success every now and then but it's few and far between.

i-engy · 2021-11-30T15:29:52Z

yes. I can confirm that v3.66 has fixed this issue after @ialidzhikov changes. Big thanks to @ialidzhikov for updating route retries to 1000. cheers.

github-actions · 2021-12-16T23:58:26Z

This functionality has been released in v3.70.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

Riinkesh · 2022-03-23T15:54:08Z

This is still an issue even with 1001 retries. Although Terraform creates the route, it doesn't update the state and keeps retrying until the timeout and/or the retry limit is reached.
Terraform should be able to check the route status on every retry and update the state accordingly.

github-actions · 2022-05-06T02:28:02Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot added needs-triage Waiting for first response or review from a maintainer. service/ec2 Issues and PRs that pertain to the ec2 service. labels Sep 24, 2021

ialidzhikov changed the title ~~resource/aws_route: route is not saved in the state when route fails to be available (in 2m) on creation~~ resource/aws_route: route is not saved in the state when it fails to be available (in 2m) on creation Sep 24, 2021

justinretzolk added bug Addresses a defect in current functionality. and removed needs-triage Waiting for first response or review from a maintainer. labels Sep 24, 2021

ialidzhikov mentioned this issue Sep 30, 2021

Terraform does not respect creation timeout gardener/gardener-extension-provider-aws#419

Closed

ccourtoyaxis mentioned this issue Oct 13, 2021

private_nat_gateway fail to create on time causing Terraform to fail and it is unrecoverable terraform-aws-modules/terraform-aws-vpc#699

Closed

ialidzhikov mentioned this issue Oct 20, 2021

Update hashicorp/terraform-provider-aws version gardener/gardener-extension-provider-aws#433

Closed

4 tasks

huguesalary mentioned this issue Oct 28, 2021

Increase route_table Create timeout #21531

Merged

huguesalary added a commit to huguesalary/terraform-provider-aws that referenced this issue Dec 15, 2021

Increase route_table Create timeout

6ede9dd

From 2 minutes to 5 minutes to address hashicorp#21032

anGie44 closed this as completed in #21531 Dec 16, 2021

github-actions bot added this to the v3.70.0 milestone Dec 16, 2021

github-actions bot locked as resolved and limited conversation to collaborators May 6, 2022
