Destroy aws_ecs_service.service on Fargate gets stuck #3414

Open
varas opened this issue Feb 16, 2018 · 12 comments
Labels
bug Addresses a defect in current functionality. service/ecs Issues and PRs that pertain to the ecs service.

Comments

@varas

varas commented Feb 16, 2018

Destroy gets stuck on resource aws_ecs_service on Fargate until you manually stop all the tasks.

Terraform Version

Terraform v0.11.3

  • provider.aws v1.9.0
  • provider.template v1.0.0

Affected Resource(s)

Please list the resources as a list, for example:

  • aws_ecs_service

Terraform Configuration Files

resource "aws_ecs_service" "service" {
  name            = "..."
  cluster         = "${aws_ecs_cluster.cluster.id}"
  task_definition = "${aws_ecs_task_definition.task.arn}"
  desired_count   = 1
  health_check_grace_period_seconds = 1

  load_balancer = {
    target_group_arn = "${aws_alb_target_group.main.arn}"
    container_name   = "..."
    container_port   = 5555
  }

  launch_type = "FARGATE"

  network_configuration {
    security_groups = ["${aws_security_group.awsvpc_sg.id}"]
    subnets         = ["${module.vpc.private_subnets}"]
  }

  depends_on = ["aws_alb_listener.main"]
}

Debug Output

aws_ecs_service.service: Still destroying... (ID: arn:aws:ecs:us-east-1:218277271359:service/blink, 10s elapsed)
aws_ecs_service.service: Still destroying... (ID: arn:aws:ecs:us-east-1:218277271359:service/blink, 20s elapsed)
...

Expected Behavior

To destroy the Fargate ECS service, Terraform should stop all of the service's tasks.

Actual Behavior

It gets stuck trying to destroy the resource.

Steps to Reproduce

Simply launch a Fargate service using launch_type = "FARGATE":

  1. terraform apply
  2. terraform destroy
@bflad added the service/ecs and bug labels and removed the service/ecs label on Feb 21, 2018
@rnemec-ng

Is this going to be looked into? Are there any workarounds? (preferably without manual intervention)
Thanks

@marcotesch
Contributor

The ecs_service resource delete operation still does a draining of tasks within a service.

This might not be an open issue anymore? @bflad ?

@nwade615

This is happening to me, as well. The CLI gets stuck on aws_ecs_service.api: Still destroying.... In the AWS console, the ECS service appears destroyed, but the running tasks remain. Strangely, it only happens with one of my Fargate services, not all of them. I must manually stop the tasks in the console for the destroy to continue.

Terraform v0.11.13
provider.aws v2.2.0

@bavibm

bavibm commented Aug 14, 2019

Hello all, I'm also getting this issue with Terraform v0.12.6 and AWS provider v2.23.0.

This is my ECS configuration, excluding load balancer and other network-related resources (replacing details with "X"):

# ecs.tf

resource "aws_ecs_cluster" "X" {
  name = var.name_prefix
}

resource "aws_ecs_task_definition" "X" {
  family                   = "${var.name_prefix}-X"
  execution_role_arn       = "arn:aws:iam::X:role/ecsTaskExecutionRole"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]

  cpu                   = 1024
  memory                = 2048
  container_definitions = file("${path.module}/task-definitions/X.json")

}

resource "aws_ecs_service" "X" {
  name            = "${var.name_prefix}-X"
  cluster         = aws_ecs_cluster.X.id
  task_definition = aws_ecs_task_definition.X.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    security_groups = [aws_security_group.X.id]
    subnets         = [var.service_subnet_id]
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.X.id
    container_name   = "X"
    container_port   = var.X
  }

  depends_on = [aws_lb_listener.X]
}

Once it starts to destroy my ecs resources, it hangs at aws_ecs_service.X still destroying...
I have to manually go into the ECS management console and stop the running tasks in this service, cancel my Terraform destroy, and re-issue the command for it to work.

I am currently looking into using the local-exec provisioner to execute AWS CLI commands on the destroy stage for the service resource in order to automatically stop all tasks running in it as a workaround.

@bavibm

bavibm commented Aug 15, 2019

So I managed to get the aforementioned workaround working for my specific case. I created a shell script that Terraform executes to stop the ECS tasks before destroying the service, so that the destroy won't get stuck. This requires the AWS CLI to be installed and configured on the same machine.

Here is what it looks like:

#!/usr/bin/env bash

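# REGION, CLUSTER, and SERVICE must be set by the caller (here, the provisioner).
if [ -z "${REGION}" ] || [ -z "${CLUSTER}" ] || [ -z "${SERVICE}" ]; then
  echo "Please specify a region, cluster name, and service name..." >&2
  exit 1
fi

if ! [ -x "$(command -v aws)" ]; then
  echo "The AWS CLI is not installed..." >&2
  exit 1
else
  echo "AWS CLI found!"
fi

# List the service's tasks; in text output the ARNs are on lines starting with TASKARNS.
aws ecs list-tasks \
  --region "${REGION}" \
  --cluster "${CLUSTER}" \
  --service-name "${SERVICE}" \
  --output text \
  >"$(dirname "$0")/.tasks"

IFS=$'\n'
arns=($(awk '/TASKARNS/ {print $2}' "$(dirname "$0")/.tasks"))

rm "$(dirname "$0")/.tasks"

# Stop each listed task so the service destroy is not left waiting on it.
for arn in "${arns[@]}"; do
  aws ecs stop-task --region "${REGION}" --cluster "${CLUSTER}" --task "${arn}"
done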

Copy and paste the script somewhere in your module or root such as in {module}/scripts/stop-tasks.sh and inside your ecs service resource add the local-exec provisioner so it looks something like this:

resource "aws_ecs_service" "X" {

  ...

  provisioner "local-exec" {
    when = "destroy"
    command = "${path.module}/scripts/stop-tasks.sh > ${path.module}/scripts/stop-tasks.out"
    environment = {
      REGION = var.region,
      CLUSTER = aws_ecs_cluster.X.name,
      SERVICE = aws_ecs_service.X.name
    }
  }
}

I haven't tested it in other situations, but feel free to use and modify at your leisure! I hope this issue gets fixed soon

@sethhochberg

sethhochberg commented Nov 18, 2020

Following up with another possible workaround, for any who need it. We took inspiration from @bavibm's solution and implemented a destroy provisioner on the cluster resource which stops all tasks, idles the service, and waits for things to reach a state where the cluster itself can be destroyed.

The important part of your script:

# Group "grep || true" so its output still flows into sed when grep matches
SERVICES="$(aws ecs list-services --cluster "${CLUSTER}" | { grep "${CLUSTER}" || true; } | sed -e 's/"//g' -e 's/,//')"
for SERVICE in $SERVICES ; do
  # Idle the service that spawns tasks
  aws ecs update-service --cluster "${CLUSTER}" --service "${SERVICE}" --desired-count 0

  # Stop running tasks
  TASKS="$(aws ecs list-tasks --cluster "${CLUSTER}" --service-name "${SERVICE}" | { grep "${CLUSTER}" || true; } | sed -e 's/"//g' -e 's/,//')"
  for TASK in $TASKS; do
    aws ecs stop-task --cluster "${CLUSTER}" --task "$TASK"
  done

  # Delete the service, then wait for it to become inactive
  aws ecs delete-service --cluster "${CLUSTER}" --service "${SERVICE}"
  aws ecs wait services-inactive --cluster "${CLUSTER}" --services "${SERVICE}"
done

Your cluster definition:

resource "aws_ecs_cluster" "whatevername" {
  name = "whatever_cluster_name"

  provisioner "local-exec" {
    when = destroy
    command = "${path.module}/scripts/stop-tasks.sh"
    environment = {
      CLUSTER = self.name
    }
  }
}

Because of hashicorp/terraform#23679, we are only relying on self references to pass into the cleanup task, and discover the rest based on the cluster data available via the AWS CLI. Our AWS profile and region are set via other configuration on the host that executes the script.

@bclabs-kylian

This happened to me as well. I had to manually remove the ECS security group from the RDS security group.

@moazzamk

This is happening to me. If I try to delete the security group manually (through Amazon console), it says it is being used by a network interface. If I try to delete the network interface, it says it is being used by the security group.

@davidbudnick

davidbudnick commented Feb 12, 2024

Still happening. Any solution?

module.ecs.aws_ecs_service.keep_ui_service_staging: Still destroying... [id=arn:aws:ecs:us-east-1:905418292571:serv...luster-staging/keep-ui-service-staging, 1m0s elapsed]
module.ecs.aws_ecs_service.keep_ui_service_staging: Still destroying... [id=arn:aws:ecs:us-east-1:905418292571:serv...luster-staging/keep-ui-service-staging, 1m10s elapsed]

(It hit almost 6 mins before I manually killed the job)

I had to manually run:
terraform state rm module.ecs.aws_ecs_service.keep_ui_service_staging

@davidbudnick

davidbudnick commented Feb 12, 2024

Update:

It seems the team is aware of the issue and suggests adding a depends_on for the IAM role policy:
REF: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/ecs_service

I was able to get it working here without having to add any extra scripts. Finished in around 2m40s. (Could be related to the timeout of the container while draining)

Edit: As per the docs:

The following target group attributes are supported. You can modify these attributes only if the target group type is instance or ip. If the target group type is alb, these attributes always use their default values.
RE: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html

So the deregistration delay will be 300 seconds by default, but the bonus is that the resource is deleted without having to manually stop the job 🥳
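If the wait comes from the target group's deregistration delay, it can be shortened on the target group itself. A minimal sketch, assuming an IP-type target group for a Fargate service; the resource name, var.vpc_id, and the 30-second value are placeholders for illustration, not taken from anyone's configuration in this thread:

```hcl
# Hypothetical target group for a Fargate service (awsvpc tasks register by IP).
resource "aws_lb_target_group" "example" {
  name        = "example"
  port        = 5555
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = var.vpc_id # assumed variable

  # Seconds the load balancer waits before completing deregistration.
  # The default is 300; a lower value shortens draining during destroy.
  deregistration_delay = 30
}
```

Whether a short delay is acceptable depends on how long in-flight requests to the containers can take.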

Overall if you don't add the depends_on for the policy it will never finish.
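For anyone looking for what that looks like, here is a minimal sketch of the depends_on approach. It assumes the role policy the service relies on is managed as an aws_iam_role_policy in the same configuration; all resource names, var.subnet_ids, and the security group are hypothetical:

```hcl
# Hypothetical service; the referenced cluster, task definition, security
# group, and role policy are assumed to be defined elsewhere in the config.
resource "aws_ecs_service" "example" {
  name            = "example"
  cluster         = aws_ecs_cluster.example.id
  task_definition = aws_ecs_task_definition.example.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.subnet_ids # assumed variable
    security_groups = [aws_security_group.example.id]
  }

  # Terraform destroys in reverse dependency order, so the policy is
  # removed only after the service has been deleted.
  depends_on = [aws_iam_role_policy.example]
}
```

The point of the explicit dependency is ordering on destroy: without it, the policy may be destroyed first and the service can get stuck draining.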

Most likely the issue can be closed 📕

@javierguzman

javierguzman commented Apr 22, 2024

I "fixed" this by setting the desired count to zero, similar to what others have previously done:

provisioner "local-exec" {
    when = destroy
    command = <<EOF
    echo "Update service desired count to 0 before destroy."
    REGION=${split(":", self.cluster)[3]}
    aws ecs update-service --region $REGION --cluster ${self.cluster} --service ${self.name} --desired-count 0 --force-new-deployment
    echo "Update service command executed successfully."
    EOF
  }

  timeouts {
    delete = "5m"
  }

I guess this should be done automatically by the provider.

@julianevanneeleman

julianevanneeleman commented Jul 18, 2024

If your service has a static desired_count, an alternative work-around could be to use an aws_appautoscaling_target:

resource "aws_appautoscaling_target" "static_capacity" {
  service_namespace  = "ecs"
  resource_id        = "service/${aws_ecs_cluster.my_cluster.name}/${aws_ecs_service.my_service.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 1
  max_capacity       = 1
}

Set min_capacity and max_capacity to whatever value you want desired_count to be. Make sure you remove the desired_count value from your aws_ecs_service (or set it to 0), and add the following to prevent configuration drift:

lifecycle {
  ignore_changes = [desired_count]
}

This should make terraform destroy succeed in a single pass.
