
aws_ecs_cluster with capacity_providers cannot be destroyed #11409

Closed
lukedd opened this issue Dec 22, 2019 · 18 comments · Fixed by #22672
Labels
breaking-change Introduces a breaking change in current functionality; usually deferred to the next major release. bug Addresses a defect in current functionality. new-resource Introduces a new resource. service/ecs Issues and PRs that pertain to the ecs service.

@lukedd

lukedd commented Dec 22, 2019

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Relates #5278
Relates #11351
Relates #11531
Relates #22672
Relates #22754

Maintainer Note

  • Fixing this problem involves several pieces:
    1. Creating a new resource that avoids the spurious capacity providers dependency chain (and allows associating capacity providers with existing clusters), which is New Resource: aws_ecs_cluster_capacity_providers #22672
    2. Deprecating the capacity_providers and default_capacity_provider_strategy arguments of aws_ecs_cluster (aws_ecs_cluster: Deprecate capacity_providers and default_capacity_provider_strategy #22754)
    3. Removing the capacity_providers and default_capacity_provider_strategy arguments from aws_ecs_cluster, which is a breaking change
  • While the complete solution includes a breaking change, that doesn't prevent us from moving forward with items 1 and 2 above (v4.0) and then keeping item 3 in mind for v5.0.

Terraform Version

Terraform v0.12.18

  • provider.aws v2.43.0

Affected Resource(s)

  • aws_ecs_cluster
  • aws_ecs_capacity_provider
  • aws_autoscaling_group

Terraform Configuration Files

resource "aws_ecs_cluster" "indestructable" { 
  name = "show_tf_cp_flaw"

  capacity_providers = [aws_ecs_capacity_provider.cp.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.cp.name
  }
}

resource "aws_ecs_capacity_provider" "cp" {
  name = "show_tf_cp_flaw"

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.asg.arn

    managed_scaling {
      status          = "ENABLED"
      target_capacity = 80
    }
  }
}

resource "aws_autoscaling_group" "asg" {
  min_size = 2
  ....
}

Debug Output

Panic Output

Expected Behavior

terraform destroy should be able to destroy an aws_ecs_cluster which has capacity_providers set.

Actual Behavior

Error: Error deleting ECS cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.

The problem is that the new capacity_providers argument on aws_ecs_cluster introduces a new dependency chain:
aws_ecs_cluster
depends on aws_ecs_capacity_provider
depends on aws_autoscaling_group

This causes Terraform to destroy the ECS cluster before the autoscaling group, which is the wrong way around: the autoscaling group must be destroyed first, because the cluster must contain zero container instances before it can be deleted.

A possible solution may be to introduce a new resource type representing the attachment of a capacity provider to a cluster (inspired by aws_iam_role_policy_attachment which is the attachment of an IAM policy to a role).

This would allow the following dependency graph, which would work beautifully:
  • aws_ecs_capacity_provider_cluster_attachment
    depends on aws_ecs_cluster and aws_ecs_capacity_provider;
  • aws_ecs_capacity_provider
    depends on aws_autoscaling_group,
    which depends on aws_launch_template,
    which in turn depends on aws_ecs_cluster (e.g. via the user_data property, which needs to set the ECS_CLUSTER environment variable to the name of the cluster).
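
For illustration, here is a hypothetical sketch of what such an attachment-style configuration could look like. The resource name and arguments below are made up for this example (the resource that eventually shipped via #22672 is named aws_ecs_cluster_capacity_providers):

# Hypothetical resource, shown only to illustrate the proposed attachment pattern.
resource "aws_ecs_capacity_provider_cluster_attachment" "example" {
  cluster           = aws_ecs_cluster.indestructable.name
  capacity_provider = aws_ecs_capacity_provider.cp.name
}

With the dependency pointing this way, terraform destroy would remove the attachment first and the cluster last, after the autoscaling group and its instances are already gone.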

Steps to Reproduce

  1. terraform apply
  2. terraform destroy

Important Factoids

References

@ghost ghost added service/autoscaling Issues and PRs that pertain to the autoscaling service. service/ecs Issues and PRs that pertain to the ecs service. labels Dec 22, 2019
@github-actions github-actions bot added the needs-triage Waiting for first response or review from a maintainer. label Dec 22, 2019
@lukedd
Author

lukedd commented Dec 22, 2019

Meanwhile, here is a nasty workaround using a destroy provisioner that worked for me to allow the aws_ecs_cluster to be destroyed:

resource "aws_ecs_cluster" "cluster" {
  name = local.cluster_name

  capacity_providers = [aws_ecs_capacity_provider.cp.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.cp.name
  }

  # We need to terminate all instances before the cluster can be destroyed.
  # (Terraform would handle this automatically if the autoscaling group depended
  #  on the cluster, but we need to have the dependency in the reverse
  #  direction due to the capacity_providers field above).
  provisioner "local-exec" {
    when = destroy

    command = <<CMD
      # Get the list of capacity providers associated with this cluster
      CAP_PROVS="$(aws ecs describe-clusters --clusters "${self.arn}" \
        --query 'clusters[*].capacityProviders[*]' --output text)"

      # Now get the list of autoscaling groups from those capacity providers
      ASG_ARNS="$(aws ecs describe-capacity-providers \
        --capacity-providers "$CAP_PROVS" \
        --query 'capacityProviders[*].autoScalingGroupProvider.autoScalingGroupArn' \
        --output text)"

      if [ -n "$ASG_ARNS" ] && [ "$ASG_ARNS" != "None" ]
      then
        for ASG_ARN in $ASG_ARNS
        do
          ASG_NAME=$(echo $ASG_ARN | cut -d/ -f2-)

          # Set the autoscaling group size to zero
          aws autoscaling update-auto-scaling-group \
            --auto-scaling-group-name "$ASG_NAME" \
            --min-size 0 --max-size 0 --desired-capacity 0

          # Remove scale-in protection from all instances in the asg
          INSTANCES="$(aws autoscaling describe-auto-scaling-groups \
            --auto-scaling-group-names "$ASG_NAME" \
            --query 'AutoScalingGroups[*].Instances[*].InstanceId' \
            --output text)"
          aws autoscaling set-instance-protection --instance-ids $INSTANCES \
            --auto-scaling-group-name "$ASG_NAME" \
            --no-protected-from-scale-in
        done
      fi
CMD
  }
}

@bflad bflad added needs-triage Waiting for first response or review from a maintainer. and removed needs-triage Waiting for first response or review from a maintainer. labels Jan 2, 2020
@bflad bflad added new-resource Introduces a new resource. and removed needs-triage Waiting for first response or review from a maintainer. service/autoscaling Issues and PRs that pertain to the autoscaling service. labels Apr 9, 2020
@bflad bflad self-assigned this Apr 9, 2020
@kkost

kkost commented Jun 16, 2020

Any updates here? This is terribly annoying to deal with. (The workaround does not work in my particular case)

@carlitos081

Any news? I'm still waiting for this issue to be fixed.

@tomelliff
Contributor

If you don't link the cluster to the capacity provider as a dependency and just use the name as a string, does that fix the issue? It's not great, but as long as you can delete a capacity provider while the ASG it's linked to still has instances, that would work.
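
A minimal sketch of that suggestion, reusing the names from the original config and hard-coding the capacity provider name so the cluster carries no dependency on it (as the next comment points out, this runs into a different error at create time):

resource "aws_ecs_cluster" "cluster" {
  name = "show_tf_cp_flaw"

  # Plain string instead of aws_ecs_capacity_provider.cp.name, so Terraform
  # records no dependency from the cluster to the capacity provider.
  capacity_providers = ["show_tf_cp_flaw"]

  default_capacity_provider_strategy {
    capacity_provider = "show_tf_cp_flaw"
  }
}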

@kaueraal

@tomelliff Unfortunately this has another issue during construction:
Error: InvalidParameterException: The specified capacity provider 'XXX' is not in an ACTIVE state. Specify a valid capacity provider and try again.

@jobinmarygeorge

Any updates here?

@brikis98
Contributor

Having this issue too. On destroy, I get the error:

Error deleting ECS cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.

This started around Terraform 0.12, and we added retries to work around it. We're now upgrading to 0.15, and the retries no longer seem to help, so this is a blocker.

@brikis98
Contributor

Ah, turns out this is precisely the issue described in #11531. In short, the design of capacity providers is broken in Terraform right now, as it creates an invalid dependency chain: aws_ecs_cluster -> aws_ecs_capacity_provider -> aws_autoscaling_group. This chain isn't valid, because on destroy, Terraform will try to delete aws_ecs_cluster first, but it can't, because the aws_autoscaling_group hasn't been deleted. So we need an aws_ecs_capacity_provider_attachment to use capacity providers without such a dependency chain.

@jdales-mt

For a less invasive update, the aws_ecs_cluster resource could just implement the process provided by Luke in his provisioner shell script when the cluster has capacity providers.

Perhaps not quite as elegant as introducing a new attachment kind of resource, but it would fix the problem.

It seems that the introduction of capacity providers has created all sorts of issues for AWS CloudFormation as well as Terraform engineers.

@ericb-summit

Same as #4852. Someone should consolidate all these; this is really noisy.

@samyak-jain

The workaround does not work if you're using terraform-aws-modules, since we cannot add provisioners to modules.

@YakDriver
Member

Thank you for the input on this issue! We are carefully considering work on this in the near future. (No guarantees on an exact date.) In order to facilitate the implementation, I've outlined some thoughts below.

After looking through this, I agree with the suggested way forward (see the Maintainer Note in the issue description above).

Please provide any feedback, yay or nay.

@breathingdust
Member

Hi all 👋 Just letting you know that this issue is featured on this quarter's roadmap. If a PR exists to close the issue, a maintainer will review and either make changes directly or work with the original author to get the contribution merged. If you have written a PR to resolve the issue, please ensure the "Allow edits from maintainers" box is checked. Thanks for your patience, and we are looking forward to getting this merged soon!

@breathingdust breathingdust added this to the Roadmap milestone Nov 10, 2021
@YakDriver YakDriver added the breaking-change Introduces a breaking change in current functionality; usually deferred to the next major release. label Jan 24, 2022
@YakDriver YakDriver self-assigned this Jan 24, 2022
@YakDriver
Member

YakDriver commented Jan 24, 2022

I'm having trouble reproducing this issue against the latest version. Below is the configuration I used, which is based on the OP's config. If you have suggestions on reproducing the issue, let me know; otherwise, we'll plan to close this issue as resolved. (See update below.) Thanks to @roberth-k, this is now a complete config to reproduce that does not rely on the default VPC.

provider "aws" {}

locals {
  // cluster_name is a local to avoid the cyclical dependency:
  // cluster -> capacity provider -> asg -> launch template -> user data -> cluster.
  cluster_name = random_pet.name.id
}

data "aws_availability_zones" "current" {
  state = "available"

  filter {
    name   = "opt-in-status"
    values = ["opt-in-not-required"]
  }
}

resource "random_pet" "name" {}

data "aws_ami" "test" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-ecs-hvm-2.0.*-x86_64-ebs"]
  }
}

resource "aws_vpc" "test" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name = random_pet.name.id
  }
}

resource "aws_subnet" "test" {
  vpc_id                  = aws_vpc.test.id
  cidr_block              = "10.0.0.0/24"
  map_public_ip_on_launch = true

  tags = {
    Name = random_pet.name.id
  }
}

resource "aws_route_table" "main" {
  vpc_id = aws_vpc.test.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.test.id
}

resource "aws_route_table_association" "main" {
  route_table_id = aws_route_table.main.id
  subnet_id      = aws_subnet.test.id
}

resource "aws_security_group" "test" {
  vpc_id = aws_vpc.test.id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = random_pet.name.id
  }
}

resource "aws_ecs_cluster" "test" {
  name = local.cluster_name

  capacity_providers = [aws_ecs_capacity_provider.test.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.test.name
  }
}

resource "aws_ecs_capacity_provider" "test" {
  name = random_pet.name.id

  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.test.arn
  }
}

resource "aws_iam_role" "test" {
  name = random_pet.name.id

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = {
      Effect = "Allow"
      Principal = {
        Service = "ec2.amazonaws.com"
      }
      Action = "sts:AssumeRole"
    }
  })
}

resource "aws_iam_role_policy_attachment" "test" {
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
  role       = aws_iam_role.test.id
}

resource "aws_iam_instance_profile" "test" {
  depends_on = [aws_iam_role_policy_attachment.test]
  role       = aws_iam_role.test.name
}

resource "aws_launch_template" "test" {
  image_id                             = data.aws_ami.test.id
  instance_type                        = "t3.micro"
  instance_initiated_shutdown_behavior = "terminate"
  vpc_security_group_ids               = [aws_security_group.test.id]

  iam_instance_profile {
    name = aws_iam_instance_profile.test.name
  }

  user_data = base64encode(<<EOL
#!/bin/bash
echo "ECS_CLUSTER=${local.cluster_name}" >> /etc/ecs/ecs.config
EOL
  )
}

resource "aws_autoscaling_group" "test" {
  desired_capacity    = 1
  max_size            = 2
  min_size            = 1
  name                = random_pet.name.id
  vpc_zone_identifier = [aws_subnet.test.id]

  instance_refresh {
    strategy = "Rolling"
  }

  launch_template {
    id      = aws_launch_template.test.id
    version = aws_launch_template.test.latest_version
  }

  tags = [{
    key                 = "foo"
    value               = "bar"
    propagate_at_launch = true
  }]
}

terraform apply works fine but terraform destroy times out and then gives Error: Error deleting ECS cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.

@YakDriver YakDriver added the waiting-response Maintainers are waiting on response from community or contributor. label Jan 25, 2022
@edmundcraske-bjss
Contributor

Is it possible that your test did not hit the issue because the EC2 instances were not actually registering with the ECS cluster?

@github-actions github-actions bot removed the waiting-response Maintainers are waiting on response from community or contributor. label Jan 25, 2022
@YakDriver
Member

YakDriver commented Jan 25, 2022

@edmundcraske-bjss Yes, you are absolutely correct! Thank you.

At this point, the best way forward looks like #22672. That will deliver the OP's recommended solution of an attachment-style resource (though named aws_ecs_cluster_capacity_providers instead). It will solve the main problem here of using capacity providers but not being able to destroy the cluster, and it will also solve the problem of not being able to associate capacity providers with existing clusters.
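
For reference, a minimal sketch of how the original configuration could be restructured around the new resource (resource names taken from the OP's example; see the provider documentation for aws_ecs_cluster_capacity_providers for the full argument reference):

resource "aws_ecs_cluster" "cluster" {
  # No capacity_providers argument here, so the cluster no longer depends on
  # the capacity provider or, transitively, on the autoscaling group.
  name = "show_tf_cp_flaw"
}

resource "aws_ecs_cluster_capacity_providers" "cluster" {
  cluster_name       = aws_ecs_cluster.cluster.name
  capacity_providers = [aws_ecs_capacity_provider.cp.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.cp.name
  }
}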

@github-actions

This functionality has been released in v3.74.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 17, 2022