
terraform attempts to destroy AWS ECS cluster before Deleting ECS Service #4852

Closed
ghost opened this issue Jun 16, 2018 · 33 comments
Labels
bug Addresses a defect in current functionality. service/ecs Issues and PRs that pertain to the ecs service.

Comments

@ghost

ghost commented Jun 16, 2018

This issue was originally opened by @jaloren as hashicorp/terraform#18263. It was migrated here as a result of the provider split. The original body of the issue is below.


I am using the aws_cloudformation_stack resource to provision an AWS Elastic Container Service (ECS) cluster and one or more services in that cluster. I used terraform graph -type=plan-destroy to verify that I successfully set up a dependency relationship in Terraform between the TF resource that creates the service and the TF resource that creates the ECS cluster.

According to graphviz, the service is a child node of the ECS cluster node. Given that, I am expecting TF to delete the service and then delete the cluster. However, this seems to happen out of order, which causes the deletion of the ECS cluster to fail, since you can't delete a cluster that still has services in it.

Terraform Version

Terraform v0.11.8

Expected Behavior

Terraform successfully deletes the AWS ECS cluster and its associated services.

Actual Behavior

Terraform successfully deleted the service in the ECS cluster but failed to delete the ECS cluster itself with the following error:

* aws_cloudformation_stack.ecs-cluster: DELETE_FAILED: ["The following resource(s) failed to delete: [ECSCluster]. " "The Cluster cannot be deleted while Services are active. (Service: AmazonECS; Status Code: 400; Error Code: ClusterContainsServicesException; Request ID: 7bcbeae4-70ab-11e8-bd0b-3d3254c7f7d3)"]

Steps to Reproduce


  1. terraform init
  2. terraform plan
  3. terraform apply
@avengers009

This should work if we stop the ECS services and then try deleting the ECS cluster.

@radeksimko
Member

@avengers009 you're right, but ideally Terraform should be able to schedule these actions accordingly where possible or, if that isn't possible, the user should be able to hint Terraform via depends_on. TL;DR: users shouldn't need to manually touch the infrastructure in order to run apply or destroy successfully.

@jaloren Do you mind sharing the configs with us so we can understand the relationships between resources and reproduce the problem?

Thanks.
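To illustrate the kind of hint being described, here is a minimal sketch (the resource names are made up, and the task definition is assumed to be defined elsewhere):

resource "aws_ecs_cluster" "example" {
  name = "example"
}

resource "aws_ecs_service" "example" {
  name            = "example"
  cluster         = "${aws_ecs_cluster.example.id}"          # implicit dependency: the service is destroyed before the cluster
  task_definition = "${aws_ecs_task_definition.example.arn}" # assumed to exist elsewhere
  desired_count   = 1

  # explicit hint; normally redundant when the cluster is referenced above
  depends_on = ["aws_ecs_cluster.example"]
}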

@radeksimko radeksimko added bug Addresses a defect in current functionality. waiting-response Maintainers are waiting on response from community or contributor. service/ecs Issues and PRs that pertain to the ecs service. labels Jun 21, 2018
@Kartstig

I am also seeing this issue:

Error: Error applying plan:

1 error(s) occurred:

* aws_ecs_cluster.ecs (destroy): 1 error(s) occurred:

* aws_ecs_cluster.ecs: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.
	status code: 400, request id: 30e1e812-854c-11e8-bec1-397064633d2b

Here is my configuration:

ecs_service

resource "aws_ecs_service" "authenticator" {
  name            = "authenticator"
  cluster         = "${aws_ecs_cluster.ecs.id}"
  task_definition = "${aws_ecs_task_definition.authenticator.arn}"
  desired_count   = 2

  load_balancer {
    target_group_arn = "${aws_lb_target_group.authenticator.arn}"
    container_name   = "authenticator"
    container_port   = 3030
  }
}

ecs_cluster

resource "aws_ecs_cluster" "ecs" {
  name = "${local.safe_name_prefix}"
}

@bflad
Contributor

bflad commented Jul 12, 2018

@Kartstig is that error occurring for you after 10 minutes or so of trying?

@Kartstig

Yes, it does. I usually make an attempt to destroy twice to account for any timeouts.

@shusak

shusak commented Jul 19, 2018

I'm seeing very similar behavior with Terraform 0.11.7/AWS provider 1.19. I frequently (but not every time) hit this:

00:12:27.512 aws_ecs_cluster.ecs_cluster: Still destroying... (ID: arn:aws:ecs:us-east-1:<MYACCOUNT>:cluster/my-service, 9m50s elapsed)
00:12:36.041 
00:12:36.042 Error: Error applying plan:
00:12:36.043 
00:12:36.044 1 error(s) occurred:
00:12:36.045 
00:12:36.045 * aws_ecs_cluster.ecs_cluster (destroy): 1 error(s) occurred:
00:12:36.046 
00:12:36.046 * aws_ecs_cluster.ecs_cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.
00:12:36.047 	status code: 400, request id: b920a9e3-8b45-11e8-8e1a-0751c6fe0d1a

@jaloren

jaloren commented Aug 18, 2018

@radeksimko I am not sure how much of the configs you would like to see. It's a little bit involved, but here's the key part of the main.tf in the root module.

Each module is nothing but a wrapper around a CloudFormation template, so by referring to the output of one module as input to another, I am establishing a dependency between the resources encapsulated in each module. Ergo, on a destroy I am expecting the cluster to be deleted after the service, since the service depends on the cluster.

module "public_load_balancer" {
  source         = "../../modules/aws/network/load_balancer/alb"
  environment    = "${var.environment}"
  security_group = "${module.network_acls.load_balancer_security_group}"
  vpc            = "${module.network.vpc}"
  subnets        = "${module.network.public_one_subnet},${module.network.public_two_subnet}"
}

module "ecs_cluster" {
  source      = "../../modules/aws/ecs/cluster"
  environment = "${var.environment}"
}

module "log_group" {
  source        = "../../modules/aws/logs/log_group"
  environment   = "${var.environment}"
  log_retention = 3
}

module "ecs_application" {
  source                   = "../../modules/aws/ecs/services/ecsapp"
  subnets                  = "${module.network.ecs_traffic_one},${module.network.ecs_traffic_two}"
  target_group             = "${module.public_load_balancer.enrollment_api_target_group}"
  environment              = "${var.environment}"
  security_group           = "${module.network_acls.container_security_group}"
  vpc                      = "${module.network.vpc}"
  tag                      = "v1.0.0"
  log_group                = "${module.log_group.id}"
  cluster_name             = "${module.ecs_cluster.name}"
}

@bflad bflad removed the waiting-response Maintainers are waiting on response from community or contributor. label Oct 9, 2018
@swagatata

Any update on this issue? Is there a plan to fix this? Or at least provide/output a machine readable list of services to be destroyed before destroying the instances?

@orlando

orlando commented Oct 26, 2018

I think Terraform should stop/terminate the instances as part of the destroy process; right now you have to manually terminate the instances in order for the destroy action to finish.

@swagatata

swagatata commented Oct 31, 2018

Hey, we are trying to automate this destruction of instances instead of doing it manually. Is there a recommended way to automate this? Our application code is in Java.

One way to do this could be to parse the plan generated by the terraform destroy command. Can you help us find a way to parse the Terraform plan to identify which instances/clusters need to be destroyed?

@sozay

sozay commented Apr 25, 2019

You can prevent that situation by splitting your Terraform project into at least two projects, using remote_state to connect them. If you put the ECS cluster and the service creation into two different projects, then when you want to destroy, you can first run the destroy for the service project; the ECS cluster can then be destroyed without any problem.
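A minimal sketch of that split, assuming the cluster project exposes a cluster_arn output and keeps its state in an S3 bucket (bucket, key, and output names here are illustrative; Terraform 0.12+ syntax):

# In the service project: read the cluster project's state and reference its output.
data "terraform_remote_state" "ecs_cluster" {
  backend = "s3"

  config = {
    bucket = "my-terraform-state"
    key    = "ecs-cluster/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_ecs_service" "app" {
  name    = "app"
  cluster = data.terraform_remote_state.ecs_cluster.outputs.cluster_arn
  # ...
}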

@aaronjhoffman

Is there any solution here? Terraform was working great for me and now I'm having the same error "The Cluster cannot be deleted while Services are active" and don't understand why I need to manually stop/terminate the instances...

@neXussT

neXussT commented Aug 27, 2019

I am seeing this with 0.12.7 in my company's production environment intermittently. Is there any way to specify a "depends_on" or "teardown_first" which works for teardown?

@archenroot

I am seeing this still on latest version...

@aaronsteers

aaronsteers commented Feb 5, 2020

I'm here for the same issue - has anyone found a workaround? Or can anyone confirm that this sometimes works (even after n retries)? Otherwise, it seems the aws_ecs_service resource is broken. The core promise is that terraform apply followed by terraform destroy will just work.

Hoping to better understand if this never works or if it's just a retry/interim issue or an issue particular to a set of configs.

UPDATE: In my particular instance, I can confirm that upon retry terraform destroy does not list the ECS cluster as something to be destroyed - meaning the destroy of the ECS service failed at some point but was logged as destroyed anyway. (Or conversely, I guess, it could have been created and not correctly confirmed as created.) I will post back here if I have additional test results.

@fadhlirahim

+1 having the same issue here. Latest version on Terraform Cloud

@soumialeghzaoui

I have the same issue with terraform 0.12.19

@yingw787

yingw787 commented May 6, 2020

Hey everyone, I'm using AWS CloudFormation and I'm experiencing this issue as well. I currently suspect that it's not an issue with either CloudFormation or Terraform, but possibly with the underlying EC2 AMI. I'm using the Amazon Linux 2 AMI, while an example I'm referencing uses Amazon Linux 1; the Amazon Linux 1 example deletes fine while mine does not (even with an explicit DependsOn and Refs sprinkled throughout). There were a good number of changes in Amazon Linux 2, which I'm guessing may have included a change to cfn-bootstrap that might impact /opt/aws/cfn-signal behavior. I haven't tested this out though.

@mikalai-t

Not sure if this is the right place to complain, but probably the same issue here:

Error: Error draining autoscaling group: Group still has 1 instances

Error: Error deleting ECS cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining

Surprisingly, two things stand out:

  • I logged into the AWS console, noticed that the ECS container instance was in the "Active" state, but was able to remove the ECS cluster immediately, without any warning/error! That EC2 instance kept running until I terminated it manually.
  • somehow, sometimes, it worked before!

Terraform v0.12.20 is being used; the relevant code (abbreviated):

data "aws_ami" "amazon2_ecs_optimized" {}

resource "aws_launch_template" "this" {}

resource "aws_autoscaling_group" "this" {}

resource "aws_ecs_task_definition" "this" {}

resource "aws_ecs_service" "default" {
  #  ...
  depends_on = [
    # consider note at https://www.terraform.io/docs/providers/aws/r/ecs_service.html
    aws_iam_role_policy.ecs_service
  ]
  # ...
}

resource "aws_ecs_cluster" "application" {}

p.s. I will try to build a workaround with a null_resource and a local-exec provisioner using the when = destroy strategy, running the AWS CLI to find and deregister the ECS EC2 instances... but it's sad in terms of "reliable" cloud services.
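A rough, untested sketch of that workaround (it assumes the AWS CLI is on the PATH, and it passes the cluster name through triggers because destroy-time provisioners can only reference self):

resource "null_resource" "deregister_container_instances" {
  triggers = {
    cluster_name = aws_ecs_cluster.application.name
  }

  provisioner "local-exec" {
    when    = destroy
    command = <<EOT
for arn in $(aws ecs list-container-instances --cluster ${self.triggers.cluster_name} --query 'containerInstanceArns[]' --output text); do
  aws ecs deregister-container-instance --cluster ${self.triggers.cluster_name} --container-instance "$arn" --force
done
EOT
  }
}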

@Amit30891

I have also faced the exact same issue as raised by @mikalai-t.
@mikalai-t, would you mind sharing what steps you followed as a workaround?

@mikalai-t

I still haven't implemented a workaround, but... I noticed that sometimes even the termination process took a while, so I assumed our application became unresponsive and consumed too much CPU, and therefore the EC2 instance failed to respond in time.
I just configured t3a.small instead of t3a.micro and the issue hasn't appeared since then. Not sure if this is a final solution, but you can start by analyzing your application's behavior on a different instance type.
Also, I would recommend checking the current instance's "protect from scale-in" setting (see the sketch below). I had a similar issue when I stopped using the ECS capacity provider and forgot to set this back to false.
btw... Even with a capacity provider configured in the cluster I faced timeouts when destroying the ASG, but after a couple of repeated attempts it was always successful.
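Illustrative only: the scale-in protection mentioned above maps to this Auto Scaling group argument (the resource name is a placeholder):

resource "aws_autoscaling_group" "this" {
  # ...
  # Per the note above: if you stop using an ECS capacity provider, set this back
  # to false so the instances can drain and the group can be destroyed.
  protect_from_scale_in = false
}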

@anGie44 anGie44 removed this from the v2.69.0 milestone Jun 30, 2020
emileswarts added a commit to ministryofjustice/staff-device-dns-dhcp-infrastructure that referenced this issue Sep 14, 2020
Forcing this has no effect, and it is a known bug in Terraform.

hashicorp/terraform-provider-aws#4852
emileswarts added a commit to ministryofjustice/staff-device-dns-dhcp-infrastructure that referenced this issue Sep 14, 2020
* Upgrade Terraform to version 0.13

We are seeing dependency issues when running `terraform destroy`
Two issues are preventing a clean destroy:

1. Terraform attempts to destroy network resources before other
resources. This fails because you cannot destroy a VPC when you have
services running in it.

2. Terraform attempts to destroy the ECS cluster before the auto scaling
group that serves as the compute for the capacity provider.

This PR addresses the first issue, by leveraging the module `depends_on`
feature in Terraform 0.13.

The second issue still needs to be addressed by extracting the auto
scaling group into its own module and having the ECS cluster depend on
it. hashicorp/terraform-provider-aws#4852

To use this for local development, run `make init`, which will
reconfigure the state to use the new version of Terraform.

A PR following this will remove the `-reconfigure` flag from the
Makefile once everyone has upgraded.

* Manually remove auto scaling groups before destroy

Due to a bug in Terraform, ECS is unable to delete before the auto
scaling group has been removed.

Use the aws command line in combination with your current workspace to
delete the auto scaling group as a separate step before running
terraform destroy.

This is wrapped up in `make destroy`, and `terraform destroy` should not
be used.

Because calling aws from the command line is unable to assume a role
unless the arn is known, the `aws-vault` commands need to be hardcoded
within the Makefile.
@Zogoo

Zogoo commented Dec 3, 2020

It's still happening on Terraform 0.12.26 with AWS provider 3.19.
Error:

Error: Error deleting ECS cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.
Error: Error waiting for internet gateway (igw-0cab*******25) to detach: timeout while waiting for state to become 'detached' (last state: 'detaching', timeout: 15m0s)

Reason:
In my case I was using an AWS capacity provider for my ECS cluster with the target capacity set to 90% (anything below 100% behaves the same). As a result, the instances were still running even though the ECS services and tasks had already been deleted.

Workaround:
Set the desired size and minimum size of the Auto Scaling group to 0 via the AWS CLI, then run terraform destroy:

aws autoscaling update-auto-scaling-group --auto-scaling-group-name "my-auto-scaling-group-name" --min-size 0 --desired-capacity 0

But I think this action should be handled by the AWS provider during terraform destroy. (The capacity-provider arguments involved are sketched below.)
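For context, the capacity-provider settings being discussed correspond to these provider arguments; this is a sketch with placeholder names, not a confirmed fix:

resource "aws_ecs_capacity_provider" "this" {
  name = "example"

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.this.arn

    # "ENABLED" requires scale-in protection on the ASG instances and can leave
    # them registered during destroy; "DISABLED" avoids that coupling.
    managed_termination_protection = "DISABLED"

    managed_scaling {
      status          = "ENABLED"
      target_capacity = 100 # the 90% vs. 100% value discussed above
    }
  }
}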

@jm4games

jm4games commented Dec 4, 2020

This issue still repros on Terraform v0.14.0 and AWS provider >= 3.16. Something I have noticed is that it spins on deleting the capacity provider. If I manually delete the capacity provider (from the AWS console), the deletion completes right away. Maybe Terraform is making an improper call to the AWS API?

@deeco

deeco commented Dec 9, 2020

Same issue in v0.14.0 for me also; I get it when more than one service and task definition is defined and created.

mslipets added a commit to mslipets/terraform-aws-ecs that referenced this issue Jan 15, 2021
(The Cluster cannot be deleted/renamed while Container Instances are active or draining. )
+ attempt to inverse dependencies on efs_sg_ids and efs_id for ASG aws_launch_configuration
@tiberiu89

Any updates on this? I'm having one of the issues mentioned above: Terraform cannot delete the ECS cluster while it has active container instances. I'm using the ECS-managed ASG setup. I think the order of destruction is correct: the ASG is created before the ECS cluster, and the cluster depends on the ASG ARN. When running destroy, Terraform tries to destroy the ECS cluster first. Are there any means to bypass this check when destroying, or maybe force the cluster to be removed so that the ASG removal can kick in? Right now I have to manually delete the ASG when Terraform tries to remove the cluster.

lazzurs added a commit to lazzurs/terraform-aws-ecs that referenced this issue Feb 2, 2021
@Axent96

Axent96 commented Feb 5, 2021

I have the same problem...

@matt-brewster

We intermittently get this error too when destroying our infrastructure. We have a retry built into our wrapper scripts and on Friday our failure looked like this:

2021-02-12 18:52:56 Error: Error deleting ECS cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.

<< retry the destroy>>
<< 20 minutes of module.ecs_cluster.aws_ecs_capacity_provider.this: Still destroying... >>
<< then >>

2021-02-12 19:13:15 Error: error waiting for ECS Capacity Provider (arn:aws:ecs:eu-west-2:XXXXXXXX:capacity-provider/my-asg) to delete: timeout while waiting for state to become 'INACTIVE' (last state: 'ACTIVE', timeout: 20m0s)

@jybaek

jybaek commented Feb 17, 2021

same issue in v0.14.6 😭

@baztian

baztian commented Apr 8, 2021

For me, the workaround from @Zogoo did the trick.

aws autoscaling update-auto-scaling-group --auto-scaling-group-name "my-auto-scaling-group-name" --min-size 0 --desired-capacity 0

The other workaround from @jm4games also works. To do it from the AWS CLI:

aws ecs put-cluster-capacity-providers --cluster my-cluster --capacity-providers [] --default-capacity-provider-strategy []

@brikis98
Contributor

Having this issue too. On destroy, I get the error:

Error deleting ECS cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.

This started around Terraform 0.12, and we added retries to work around it. We're now upgrading to 0.15, and the retries no longer seem to help, so this is a blocker.

@bharti8085

I also have the same issue and received the error below.

Error: error waiting for ECS Capacity Provider (arn:aws:ecs:eu-west-1:account-id:capacity-provider/asg-ec2-cp) to delete: timeout while waiting for resource to be gone (last state: 'ACTIVE', timeout: 20m0s)

Error: Error deleting ECS cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.

Do we have any fix for this?

@justinretzolk
Member

Hi all 👋 Thanks for taking the time to submit this issue and for the ongoing discussion. It looks like this is a duplicate of #11409. We like to try to keep discussions consolidated, and while this issue was filed first, the other one has more reactions (something we use to help gauge community interest in an issue/PR), and a suggested workaround. With that in mind, we’re going to close this new issue in favor of #11409.

lucamilanesio pushed a commit to GerritCodeReview/aws-gerrit that referenced this issue Nov 30, 2021
Deleting stacks using ECS clusters having capacityProviders (i.e.
dual-primary and primary-replica recipes), fails with:

```
The Cluster cannot be deleted while Container Instances are active
or draining.
```

This is an issue that manifests itself as well via terraform [1] or CDK
[2].

Explicitly deleting the Autoscaling Groups _before_ the ECS cluster
deletion fixes the problem, since it ensures that no instances are
active or draining, as the error suggests.

This is safe to do, because prior to deleting the Autoscaling Groups,
every ECS service has already been destroyed, thus no instance is
actually running.

[1] hashicorp/terraform-provider-aws#4852
[2] aws/aws-cdk#14732
Bug: Issue 14698
Change-Id: I216307ef88bd7b7317706d2dc0a6a6e6fb367bd4

Change-Id: I27ece0f6971b157a474d91d7f3d9243dcff596e6
@github-actions

github-actions bot commented Jun 3, 2022

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 3, 2022