
aws_ecs_cluster with capacity_providers cannot be destroyed #11409

Closed
lukedd opened this issue Dec 22, 2019 · 18 comments · Fixed by #22672
Labels
breaking-change Introduces a breaking change in current functionality; usually deferred to the next major release. bug Addresses a defect in current functionality. new-resource Introduces a new resource. service/ecs Issues and PRs that pertain to the ecs service.

@lukedd

lukedd commented Dec 22, 2019

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Relates #5278
Relates #11351
Relates #11531
Relates #22672
Relates #22754

Maintainer Note

  • Fixing this problem involves several pieces:
    1. Creating a new resource that avoids the spurious capacity providers dependency chain (and allows associating capacity providers with existing clusters), which is New Resource: aws_ecs_cluster_capacity_providers #22672
    2. Deprecating the capacity_providers and default_capacity_provider_strategy arguments of aws_ecs_cluster (aws_ecs_cluster: Deprecate capacity_providers and default_capacity_provider_strategy #22754)
    3. Removing the capacity_providers and default_capacity_provider_strategy arguments from aws_ecs_cluster, which is a breaking change
  • While the complete solution includes a breaking change, that doesn't prevent us from moving forward with items 1 and 2 above (v4.0) and then keeping item 3 in mind for v5.0.

Terraform Version

Terraform v0.12.18

  • provider.aws v2.43.0

Affected Resource(s)

  • aws_ecs_cluster
  • aws_ecs_capacity_provider
  • aws_autoscaling_group

Terraform Configuration Files

resource "aws_ecs_cluster" "indestructable" { 
  name = "show_tf_cp_flaw"

  capacity_providers = [aws_ecs_capacity_provider.cp.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.cp.name
  }
}

resource "aws_ecs_capacity_provider" "cp" {
  name = "show_tf_cp_flaw"

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.asg.arn

    managed_scaling {
      status          = "ENABLED"
      target_capacity = 80
    }
  }
}

resource "aws_autoscaling_group" "asg" {
  min_size = 2
  ....
}

Debug Output

Panic Output

Expected Behavior

terraform destroy should be able to destroy an aws_ecs_cluster which has capacity_providers set.

Actual Behavior

Error: Error deleting ECS cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.

The problem is that the new capacity_providers argument on aws_ecs_cluster introduces a new dependency chain:
aws_ecs_cluster
depends on aws_ecs_capacity_provider
depends on aws_autoscaling_group

This causes Terraform to destroy the ECS cluster before the autoscaling group, which is the wrong way around: the autoscaling group must be destroyed first, because the cluster must contain zero container instances before it can be deleted.

A possible solution may be to introduce a new resource type representing the attachment of a capacity provider to a cluster (inspired by aws_iam_role_policy_attachment which is the attachment of an IAM policy to a role).

This would allow the following dependency graph, which would work beautifully:
  • aws_ecs_capacity_provider_cluster_attachment
    depends on aws_ecs_cluster and aws_ecs_capacity_provider;
  • aws_ecs_capacity_provider
    depends on aws_autoscaling_group,
    which depends on aws_launch_template,
    which in turn depends on aws_ecs_cluster (e.g. via the user_data property, which needs to set the ECS_CLUSTER environment variable to the name of the cluster).
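
For illustration, here is a hypothetical sketch of what such an attachment-style configuration could look like. The resource name and arguments below are made up for this example (the resource that eventually shipped via #22672 is named aws_ecs_cluster_capacity_providers):

# Hypothetical resource, shown only to illustrate the proposed attachment pattern.
resource "aws_ecs_capacity_provider_cluster_attachment" "example" {
  cluster           = aws_ecs_cluster.indestructable.name
  capacity_provider = aws_ecs_capacity_provider.cp.name
}

With the dependency pointing this way, terraform destroy would remove the attachment first and the cluster last, after the autoscaling group and its instances are already gone.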

Steps to Reproduce

  1. terraform apply
  2. terraform destroy

Important Factoids

References

@ghost ghost added service/autoscaling Issues and PRs that pertain to the autoscaling service. service/ecs Issues and PRs that pertain to the ecs service. labels Dec 22, 2019
@github-actions github-actions bot added the needs-triage Waiting for first response or review from a maintainer. label Dec 22, 2019
@lukedd
Author

lukedd commented Dec 22, 2019

Meanwhile, here is a nasty workaround using a destroy provisioner that worked for me to allow the aws_ecs_cluster to be destroyed:

resource "aws_ecs_cluster" "cluster" {
  name = local.cluster_name

  capacity_providers = [aws_ecs_capacity_provider.cp.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.cp.name
  }

  # We need to terminate all instances before the cluster can be destroyed.
  # (Terraform would handle this automatically if the autoscaling group depended
  #  on the cluster, but we need to have the dependency in the reverse
  #  direction due to the capacity_providers field above).
  provisioner "local-exec" {
    when = destroy

    command = <<CMD
      # Get the list of capacity providers associated with this cluster
      CAP_PROVS="$(aws ecs describe-clusters --clusters "${self.arn}" \
        --query 'clusters[*].capacityProviders[*]' --output text)"

      # Now get the list of autoscaling groups from those capacity providers
      ASG_ARNS="$(aws ecs describe-capacity-providers \
        --capacity-providers "$CAP_PROVS" \
        --query 'capacityProviders[*].autoScalingGroupProvider.autoScalingGroupArn' \
        --output text)"

      if [ -n "$ASG_ARNS" ] && [ "$ASG_ARNS" != "None" ]
      then
        for ASG_ARN in $ASG_ARNS
        do
          ASG_NAME=$(echo $ASG_ARN | cut -d/ -f2-)

          # Set the autoscaling group size to zero
          aws autoscaling update-auto-scaling-group \
            --auto-scaling-group-name "$ASG_NAME" \
            --min-size 0 --max-size 0 --desired-capacity 0

          # Remove scale-in protection from all instances in the asg
          INSTANCES="$(aws autoscaling describe-auto-scaling-groups \
            --auto-scaling-group-names "$ASG_NAME" \
            --query 'AutoScalingGroups[*].Instances[*].InstanceId' \
            --output text)"
          aws autoscaling set-instance-protection --instance-ids $INSTANCES \
            --auto-scaling-group-name "$ASG_NAME" \
            --no-protected-from-scale-in
        done
      fi
CMD
  }
}

@bflad bflad added needs-triage Waiting for first response or review from a maintainer. and removed needs-triage Waiting for first response or review from a maintainer. labels Jan 2, 2020
@bflad bflad added new-resource Introduces a new resource. and removed needs-triage Waiting for first response or review from a maintainer. service/autoscaling Issues and PRs that pertain to the autoscaling service. labels Apr 9, 2020
@bflad bflad self-assigned this Apr 9, 2020
@kkost

kkost commented Jun 16, 2020

Any updates here? This is terribly annoying to deal with. (The workaround does not work in my particular case)

@carlitos081

Any news? I'm still waiting for this issue to be fixed.

@tomelliff
Contributor

If you don't link the cluster to the capacity provider as a dependency and just use the name as a string, does that fix the issue? It's not great, but as long as you can delete a capacity provider while the ASG it's linked to still has instances, that would work.
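
A minimal sketch of that suggestion, reusing the names from the original config and hard-coding the capacity provider name so the cluster carries no dependency on it (as the next comment points out, this runs into a different error at create time):

resource "aws_ecs_cluster" "cluster" {
  name = "show_tf_cp_flaw"

  # Plain string instead of aws_ecs_capacity_provider.cp.name, so Terraform
  # records no dependency from the cluster to the capacity provider.
  capacity_providers = ["show_tf_cp_flaw"]

  default_capacity_provider_strategy {
    capacity_provider = "show_tf_cp_flaw"
  }
}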

@kaueraal

@tomelliff Unfortunately this has another issue during construction:
Error: InvalidParameterException: The specified capacity provider 'XXX' is not in an ACTIVE state. Specify a valid capacity provider and try again.

@jobinmarygeorge

Any updates here?

@brikis98
Contributor

Having this issue too. On destroy, I get the error:

Error deleting ECS cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.

This started around Terraform 0.12, and we added retries to work around it. We're now upgrading to 0.15, and the retries no longer seem to help, so this is a blocker.

@brikis98
Contributor

Ah, turns out this is precisely the issue described in #11531. In short, the design of capacity providers is broken in Terraform right now, as it creates an invalid dependency chain: aws_ecs_cluster -> aws_ecs_capacity_provider -> aws_autoscaling_group. This chain isn't valid, because on destroy, Terraform will try to delete aws_ecs_cluster first, but it can't, because the aws_autoscaling_group hasn't been deleted. So we need an aws_ecs_capacity_provider_attachment to use capacity providers without such a dependency chain.

@jdales-mt

For a less invasive update, the aws_ecs_cluster resource could just implement the process provided by Luke in his provisioner shell script when the cluster has capacity providers.

Perhaps not quite as elegant as introducing a new attachment kind of resource, but it would fix the problem.

It seems that the introduction of capacity providers has created all sorts of issues for AWS CloudFormation as well as Terraform engineers.

@ericb-summit

Same as #4852. Someone should consolidate all these; this is really noisy.

@samyak-jain

The workaround does not work if you're using terraform-aws-modules, since we cannot add provisioners to modules.

@YakDriver
Member

Thank you for the input on this issue! We are carefully considering work on this in the near future. (No guarantees on an exact date.) In order to facilitate the implementation, I've outlined some thoughts below.

After looking through this, I agree with the suggested way forward (see the Maintainer Note in the issue description above).

Please provide any feedback, yay or nay.

@breathingdust
Member

Hi all 👋 Just letting you know that this issue is featured on this quarter's roadmap. If a PR exists to close the issue, a maintainer will review and either make changes directly or work with the original author to get the contribution merged. If you have written a PR to resolve the issue, please ensure the "Allow edits from maintainers" box is checked. Thanks for your patience, and we are looking forward to getting this merged soon!

@breathingdust breathingdust added this to the Roadmap milestone Nov 10, 2021
@YakDriver YakDriver added the breaking-change Introduces a breaking change in current functionality; usually deferred to the next major release. label Jan 24, 2022
@YakDriver YakDriver self-assigned this Jan 24, 2022
@YakDriver
Member

YakDriver commented Jan 24, 2022

I'm having trouble reproducing this issue against the latest version. Below is the configuration I used, which is based on the OP's config. If you have suggestions on reproducing the issue, let me know; otherwise, we'll plan to close this issue as resolved. (See update below.) Thanks to @roberth-k, this is now a complete config to reproduce that does not rely on the default VPC.

provider "aws" {}

locals {
  // cluster_name is a local to avoid the cyclical dependency:
  // cluster -> capacity provider -> asg -> launch template -> user data -> cluster.
  cluster_name = random_pet.name.id
}

data "aws_availability_zones" "current" {
  state = "available"

  filter {
    name   = "opt-in-status"
    values = ["opt-in-not-required"]
  }
}

resource "random_pet" "name" {}

data "aws_ami" "test" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-ecs-hvm-2.0.*-x86_64-ebs"]
  }
}

resource "aws_vpc" "test" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name = random_pet.name.id
  }
}

resource "aws_subnet" "test" {
  vpc_id                  = aws_vpc.test.id
  cidr_block              = "10.0.0.0/24"
  map_public_ip_on_launch = true

  tags = {
    Name = random_pet.name.id
  }
}

resource "aws_route_table" "main" {
  vpc_id = aws_vpc.test.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.test.id
}

resource "aws_route_table_association" "main" {
  route_table_id = aws_route_table.main.id
  subnet_id      = aws_subnet.test.id
}

resource "aws_security_group" "test" {
  vpc_id = aws_vpc.test.id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = random_pet.name.id
  }
}

resource "aws_ecs_cluster" "test" {
  name = local.cluster_name

  capacity_providers = [aws_ecs_capacity_provider.test.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.test.name
  }
}

resource "aws_ecs_capacity_provider" "test" {
  name = random_pet.name.id

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.test.arn
  }
}

resource "aws_iam_role" "test" {
  name = random_pet.name.id

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = {
      Effect = "Allow"
      Principal = {
        Service = "ec2.amazonaws.com"
      }
      Action = "sts:AssumeRole"
    }
  })
}

resource "aws_iam_role_policy_attachment" "test" {
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
  role       = aws_iam_role.test.id
}

resource "aws_iam_instance_profile" "test" {
  depends_on = [aws_iam_role_policy_attachment.test]
  role       = aws_iam_role.test.name
}

resource "aws_launch_template" "test" {
  image_id                             = data.aws_ami.test.id
  instance_type                        = "t3.micro"
  instance_initiated_shutdown_behavior = "terminate"
  vpc_security_group_ids               = [aws_security_group.test.id]

  iam_instance_profile {
    name = aws_iam_instance_profile.test.name
  }

  user_data = base64encode(<<EOL
#!/bin/bash
echo "ECS_CLUSTER=${local.cluster_name}" >> /etc/ecs/ecs.config
EOL
  )
}

resource "aws_autoscaling_group" "test" {
  desired_capacity    = 1
  max_size            = 2
  min_size            = 1
  name                = random_pet.name.id
  vpc_zone_identifier = [aws_subnet.test.id]

  instance_refresh {
    strategy = "Rolling"
  }

  launch_template {
    id      = aws_launch_template.test.id
    version = aws_launch_template.test.latest_version
  }

  tags = [{
    key                 = "foo"
    value               = "bar"
    propagate_at_launch = true
  }]
}

terraform apply works fine but terraform destroy times out and then gives Error: Error deleting ECS cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.

@YakDriver YakDriver added the waiting-response Maintainers are waiting on response from community or contributor. label Jan 25, 2022
@edmundcraske-bjss
Contributor

Is it possible that your test did not hit the issue because the EC2 instances were not actually registering with the ECS cluster?

@github-actions github-actions bot removed the waiting-response Maintainers are waiting on response from community or contributor. label Jan 25, 2022
@YakDriver
Member

YakDriver commented Jan 25, 2022

@edmundcraske-bjss Yes, you are absolutely correct! Thank you.

At this point, the best way forward looks like #22672. That will deliver the OP's recommended solution of an attachment-style resource (though named aws_ecs_cluster_capacity_providers instead). It will solve the main problem here of using capacity providers but not being able to destroy the cluster, and it will also solve the problem of not being able to associate capacity providers with existing clusters.
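
For reference, a minimal sketch of how the original configuration could be restructured around the new resource (resource names taken from the OP's example; see the provider documentation for aws_ecs_cluster_capacity_providers for the full argument reference):

resource "aws_ecs_cluster" "cluster" {
  # No capacity_providers argument here, so the cluster no longer depends on
  # the capacity provider or, transitively, on the autoscaling group.
  name = "show_tf_cp_flaw"
}

resource "aws_ecs_cluster_capacity_providers" "cluster" {
  cluster_name       = aws_ecs_cluster.cluster.name
  capacity_providers = [aws_ecs_capacity_provider.cp.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.cp.name
  }
}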

@github-actions

This functionality has been released in v3.74.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 17, 2022