RDS InvalidDBInstanceState: Instance cannot currently reboot due to an in-progress management operation #11905

Closed
kbaldyga opened this issue Feb 5, 2020 · 15 comments · Fixed by #22178

kbaldyga commented Feb 5, 2020


Terraform Version

Terraform v0.12.20
provider.aws v2.46.0

Affected Resource(s)

  • aws_db_instance
  • aws_db_parameter_group

Terraform Configuration Files

terraform {
  required_providers {
    aws = "= 2.46.0"
  }
}
provider "aws" { region = "us-west-1" }

data "aws_vpc" "wailupes-main" {
  filter {
    name   = "tag:Name"
    values = ["wailupes-main"]
  }
}
data "aws_iam_role" "enhanced_monitoring" {
  name = "staging-enhanced-monitoring"
}

resource "aws_db_instance" "rds" {
  identifier                   = "test-rds"
  allocated_storage            = 100
  engine                       = "postgres"
  engine_version               = "11.1"
  instance_class               = "db.m4.large"
  name                         = "testdb"
  username                     = "testuser"
  password                     = "testpassword"
  db_subnet_group_name         = "wailupes-rds"
  parameter_group_name         = "postgres-11-tuned-staging"
  multi_az                     = false
  storage_type                 = "gp2"
  storage_encrypted            = false
  auto_minor_version_upgrade   = false
  apply_immediately            = true
  deletion_protection          = false
  kms_key_id                   = ""
  performance_insights_enabled = false
  backup_retention_period      = 1
  ca_cert_identifier           = "rds-ca-2019"
  monitoring_interval          = 30
  monitoring_role_arn          = data.aws_iam_role.enhanced_monitoring.arn
  skip_final_snapshot          = true

  timeouts {
    update = "120m"
  }
}

resource "aws_db_instance" "rds-read" {
  identifier                 = "test-rds-read-0"
  allocated_storage          = 100
  engine                     = "postgres"
  engine_version             = "11.1"
  instance_class             = "db.m4.large"
  username                   = "testuser"
  parameter_group_name       = "postgres-11-tuned-staging"
  storage_type               = "gp2"
  storage_encrypted          = false
  replicate_source_db        = aws_db_instance.rds.id
  auto_minor_version_upgrade = false
  apply_immediately          = true

  monitoring_interval          = 30
  monitoring_role_arn          = data.aws_iam_role.enhanced_monitoring.arn
  kms_key_id                   = ""
  performance_insights_enabled = false
  skip_final_snapshot          = true
  ca_cert_identifier           = "rds-ca-2019"
}

Debug Output

Shortened debug output here: https://gist.github.com/kbaldyga/825f0239776463a69969b847f35d53bd

Expected Behavior

When adding a read replica to an existing RDS instance with a custom DB parameter group, enhanced monitoring, and ca_cert_identifier, Terraform should create the replica and the apply should complete without errors.

Actual Behavior

When adding a read replica to an existing RDS instance, the Terraform AWS provider performs multiple steps:

  1. It creates the read replica (rds/CreateDBInstanceReadReplica in the log file), then waits (rds/DescribeDBInstances) for the instance to become available.
  2. Next it calls ModifyDBInstance (see the attached log file), which again calls rds/DescribeDBInstances multiple times and waits for the instance to become available.
  3. Once the instance is available, Terraform calls rds/RebootDBInstance. But in the meantime AWS has decided to apply changes to the instance, and the call to rds/RebootDBInstance fails.

As a result, Terraform randomly fails with Instance cannot currently reboot due to an in-progress management operation. The read replica is eventually created correctly, but the resource is marked as tainted and Terraform returns an error exit code.

Because this all depends on timing, it's difficult to reproduce the issue consistently. But after spending some time with various configurations, I am pretty confident it's the combination of all three settings in the resource "aws_db_instance" "rds-read" that causes the issue: enhanced monitoring, ca_cert_identifier, and a custom parameter group.
As a workaround we decided to remove ca_cert_identifier from our Terraform configuration for now, since "rds-ca-2019" is the new default anyway.
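
For illustration, the same three-step sequence can be driven outside Terraform; below is a minimal sketch using aws-sdk-go v1 (identifiers taken from the config above; the provider's real code is more involved, and error handling here is reduced to log.Fatal):

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/rds"
)

func main() {
	svc := rds.New(session.Must(session.NewSession(aws.NewConfig().WithRegion("us-west-1"))))
	describe := &rds.DescribeDBInstancesInput{DBInstanceIdentifier: aws.String("test-rds-read-0")}

	// Step 1: create the read replica, then wait for "available".
	_, err := svc.CreateDBInstanceReadReplica(&rds.CreateDBInstanceReadReplicaInput{
		DBInstanceIdentifier:       aws.String("test-rds-read-0"),
		SourceDBInstanceIdentifier: aws.String("test-rds"),
	})
	if err != nil {
		log.Fatal(err)
	}
	if err := svc.WaitUntilDBInstanceAvailable(describe); err != nil {
		log.Fatal(err)
	}

	// Step 2: apply the settings that are sent separately via
	// ModifyDBInstance (custom parameter group, CA certificate), then wait again.
	_, err = svc.ModifyDBInstance(&rds.ModifyDBInstanceInput{
		DBInstanceIdentifier:    aws.String("test-rds-read-0"),
		DBParameterGroupName:    aws.String("postgres-11-tuned-staging"),
		CACertificateIdentifier: aws.String("rds-ca-2019"),
		ApplyImmediately:        aws.Bool(true),
	})
	if err != nil {
		log.Fatal(err)
	}
	if err := svc.WaitUntilDBInstanceAvailable(describe); err != nil {
		log.Fatal(err)
	}

	// Step 3: reboot so the new parameter group takes effect. This is the
	// call that intermittently fails with InvalidDBInstanceState, because
	// RDS may flip back to "modifying" moments after reporting "available".
	if _, err := svc.RebootDBInstance(&rds.RebootDBInstanceInput{
		DBInstanceIdentifier: aws.String("test-rds-read-0"),
	}); err != nil {
		log.Fatal(err)
	}
}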

@ghost ghost added the service/rds Issues and PRs that pertain to the rds service. label Feb 5, 2020
@github-actions github-actions bot added the needs-triage Waiting for first response or review from a maintainer. label Feb 5, 2020
@ghost ghost added service/ec2 Issues and PRs that pertain to the ec2 service. service/iam Issues and PRs that pertain to the iam service. labels Feb 5, 2020
roscoecairney commented Mar 2, 2020

I think the ca_cert_identifier is the cause of this. In govcloud, this value needs to be set to "rds-ca-2017". Each time the provider attempts to create an aws_rds_cluster_instance in govcloud, the apply fails with

Error: error rebooting DB Instance (xxxx-gov-dev-3): InvalidDBInstanceState: Instance cannot currently reboot due to an in-progress management operation.
	status code: 400, request id: 597cafa8-c9dd-4ee9-9678-9e3d6e11efd5

It is possible to get past this by untainting the resource and running the apply again.

Reproduced on provider version 2.51.0.
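
(For anyone else hitting this: the recovery is terraform untaint <resource address>, using the address from the plan output, e.g. terraform untaint aws_db_instance.rds-read for the configuration in the original report, followed by another terraform apply.)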

nijave (Contributor) commented Apr 27, 2020

Seeing this as well in us-east-1 (the OP is in us-west-1). It looks like Terraform should probably just retry in the face of these errors.
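
As a rough sketch, the retry could look like this (aws-sdk-go v1; the rebootWithRetry name, attempt count, and sleep interval are illustrative, not the provider's actual code):

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/rds"
)

// rebootWithRetry retries RebootDBInstance while RDS reports that the
// instance is still inside an in-progress management operation.
func rebootWithRetry(svc *rds.RDS, id string) error {
	input := &rds.RebootDBInstanceInput{DBInstanceIdentifier: aws.String(id)}
	for attempt := 0; attempt < 20; attempt++ {
		_, err := svc.RebootDBInstance(input)
		if err == nil {
			return nil
		}
		// InvalidDBInstanceState is the transient error from this issue;
		// anything else is treated as a real failure.
		if aerr, ok := err.(awserr.Error); ok && aerr.Code() == "InvalidDBInstanceState" {
			time.Sleep(30 * time.Second)
			continue
		}
		return err
	}
	return fmt.Errorf("DB instance %s never became ready to reboot", id)
}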

nijave added a commit to nijave/terraform-provider-aws that referenced this issue Apr 27, 2020
nijave added a commit to Root-App/terraform-provider-aws that referenced this issue May 19, 2020
nijave added a commit to Root-App/terraform-provider-aws that referenced this issue Aug 14, 2020
rgonzales5-chwy commented Nov 24, 2020

I have faced this error every time I try to create new replica instances; not even once has it succeeded. Each time, the replica instance is created successfully in the console, but the Terraform state still marks it as a tainted resource. This is a very nasty bug that prevents the creation of new RDS replicas on AWS, and it can also break your service configuration when an RDS replica creation runs alongside other configuration changes. Until this is fixed, I would not suggest running any other type of configuration change along with the creation of RDS replicas. My workaround was to import the tainted RDS replica resources into my state, since the resources had been created successfully in the console. Terraform, please fix this!

nijave (Contributor) commented Nov 24, 2020

> I have faced this error every time I try to create new replica instances; not even once has it succeeded. […]

You're seeing this because on Amazon's side some of the configuration is done as separate operations, but on the Terraform side it's all represented as a single object. I know enhanced monitoring works like this, so I imagine people with more config spread across more API calls hit this more often.

@rgonzales5-chwy replied:

> You're seeing this because on Amazon's side some of the configuration is done as separate operations, but on the Terraform side it's all represented as a single object. […]

No, really: this issue happened to me every time, even when I attempted to create new replicas by themselves with no other resources.

@rgonzales5-chwy commented:

> I think the ca_cert_identifier is the cause of this. In govcloud, this value needs to be set to "rds-ca-2017". […] Reproduced on provider version 2.51.0.

@roscoecairney what would be the explanation behind ca_cert_identifier being the issue, and how did you identify it? Please provide more details.

adv4000 (Contributor) commented Dec 17, 2020

Strange behavior here too; after removing ca_cert_identifier it worked fine.

mgtrrz commented Feb 23, 2021

Can confirm this issue still exists with the latest version of the provider (v3.29.1 at the time of this comment). It only occurs on creation of new read replicas with enhanced monitoring enabled. Like the others, we didn't need to specify ca_cert_identifier, so removing it fixed it for us.

AlonAvrahami commented Mar 17, 2021

Any news on this issue? I'm facing the same behavior.
I'm not using ca_cert_identifier or enhanced monitoring, but I'm still facing this problem.

When running apply I get this error:
Error: error rebooting DB Instance (RDS_NAME): InvalidDBInstanceState: Instance cannot currently reboot due to an in-progress management operation.

And when running plan again, it wants to replace the resource:
# module.replica[0].module.db_instance.aws_db_instance.this[0] is tainted, so must be replaced

EDIT:
After updating the module to version 2.31.0 the problem was solved (this included removing ca_cert_identifier and enhanced monitoring).

Please advise.

@blongman-snapdocs commented:
Still running into this, on Terraform 0.12.28. It would be nice if the provider had backoff/retry logic for this.

justinretzolk (Member) commented:
Hey y'all 👋 Thank you for taking the time to file this issue, and for the ongoing discussion. Given that there's been a number of AWS provider releases since the last update, can anyone confirm whether you're still experiencing this behavior?

@justinretzolk justinretzolk added waiting-response Maintainers are waiting on response from community or contributor. and removed needs-triage Waiting for first response or review from a maintainer. labels Nov 18, 2021
@danhooper commented:
We saw this error on version 3.64.1 just the other day.

@github-actions github-actions bot removed the waiting-response Maintainers are waiting on response from community or contributor. label Nov 19, 2021
@justinretzolk justinretzolk added the bug Addresses a defect in current functionality. label Nov 19, 2021
bpazdziur commented Nov 29, 2021

I have the same issue with provider versions 3.67.0 and 3.37.0.

I opened a case about this issue with AWS Support and got the following response:

Please refer to the below timeline with regard to your test of the instance named *** according to the attached log file.
2021-11-19 14:01:37 UTC DeleteDBInstance
2021-11-19 14:00:38 UTC RebootDBInstance >>Here, reboot command launched 32 seconds after the ModifyInstance with error 'cannot currently reboot due to an in-progress management operation.'
2021-11-19 14:00:06 UTC ModifyDBInstance
2021-11-19 13:53:26 UTC CreateDBInstance
2021-11-19 12:19:34 UTC DeleteDBInstance
2021-11-19 12:17:29 UTC RebootDBInstance >>>Here, reboot command launched 65 seconds after the Modify
2021-11-19 12:16:24 UTC ModifyDBInstance
2021-11-19 12:09:58 UTC CreateDBInstance
2021-11-19 11:29:25 UTC DeleteDBInstance
2021-11-19 11:27:13 UTC RebootDBInstance

As you can see, your first reboot was successful. The command was launched 65 seconds after the Modify.
The second reboot failed. The command was launched 32 seconds after the Modify.

Possible reason --
If you modify an instance from the AWS console, you may notice this kind of phenomenon:
you submit a modification, then refresh the page while watching the status of the instance.
You will see that the status is still Available, and only becomes Modifying after a few seconds.
This is how it works by design.

Suggestions to you --
The error message and the API log clearly demonstrate that the instance was not in a state to accept the reboot.
I believe this is the reason without a doubt.
You can try to improve your code, if possible, in the following two ways:

  1. Simply add a 60s sleep between your ModifyDBInstance and RebootDBInstance calls.
  2. I guess your waitUntilDBInstanceAvailableAfterUpdate function has a loop that checks the instance status every N seconds. You could consider moving to the next step only after N consecutive successful checks (N > 3).
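
Suggestion 2 translates to roughly the following sketch (aws-sdk-go v1; the waitForStableInstance name, parameters, and limits are illustrative, not the provider's actual waiter):

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/rds"
)

// waitForStableInstance returns once DescribeDBInstances has reported
// "available" n times in a row. A brief available -> modifying flap, as in
// the timeline above, resets the count instead of passing the wait.
func waitForStableInstance(svc *rds.RDS, id string, n, maxChecks int, interval time.Duration) error {
	in := &rds.DescribeDBInstancesInput{DBInstanceIdentifier: aws.String(id)}
	consecutive := 0
	for i := 0; i < maxChecks; i++ {
		out, err := svc.DescribeDBInstances(in)
		if err != nil {
			return err
		}
		if len(out.DBInstances) == 1 &&
			aws.StringValue(out.DBInstances[0].DBInstanceStatus) == "available" {
			consecutive++
			if consecutive >= n {
				return nil
			}
		} else {
			consecutive = 0
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("DB instance %s did not stabilize after %d checks", id, maxChecks)
}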

@gdavison gdavison self-assigned this Nov 30, 2021
github-actions (bot) commented:
This functionality has been released in v4.0.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

github-actions (bot) commented:
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 14, 2022