Destroying aws_ecs_service fails with timeout #2902

chelarua · 2015-07-31T08:48:20Z

Hello,
I am having some trouble destroying an ECS-based setup. It always fails when trying to destroy the aws_ecs_service.
The error is:

aws_ecs_service.web_server: Destroying...
aws_ecs_service.web_server: Error: 1 error(s) occurred:

* timeout while waiting for state to become 'INACTIVE'
Error applying plan:

1 error(s) occurred:

* timeout while waiting for state to become 'INACTIVE'

Retrying the destroy multiple times still ends with a timeout every time.

The other weird thing is that if I clean up everything manually and do a refresh from Terraform, it sees everything else as gone except for the service, although listing the services from the AWS CLI shows nothing:

aws ecs list-services
{
    "serviceArns": []
}

The setup contains an ECS cluster, a container instance, a task definition, an ECS service, a load balancer connected to the ECS service, and a VPC.


radeksimko · 2015-07-31T08:59:21Z

Thanks for the report,
it should make reproduction easier if you get the same error all the time - that's better than intermittent errors.

Could you please provide the full debug log (TF_LOG=1 TF_LOG_PATH=tf.log terraform destroy) and the Terraform code used, minus any secrets? If you don't have time to separate the secrets from the TF code, the debug log alone will still be very helpful.

Also which Terraform version are you using at the moment?

This might be just a timeout issue that can be fixed by increasing the timeout (currently 5 minutes), or it might be dependency hell, so I'd like to reproduce it first.
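The 5-minute wait mentioned here appears to have been a fixed value in the provider code at the time. Later releases of the AWS provider let `aws_ecs_service` take a `timeouts` block, so the wait for deletion can be lengthened per resource. The sketch below assumes such a provider version and Terraform 0.12+ syntax; the `20m` value is an arbitrary illustration, and the resource documentation for your provider version is the authority on which timeouts it supports.

```hcl
# Sketch only: assumes an AWS provider version whose aws_ecs_service
# supports a timeouts block; "20m" is an illustrative value.
resource "aws_ecs_service" "rq_web_api_elx_server" {
  name            = "rq_web_api_elx_server"
  cluster         = aws_ecs_cluster.rq_ecs_cluster.id
  task_definition = aws_ecs_task_definition.rq_web_api_elx_server.arn
  desired_count   = 1

  # load_balancer and iam_role settings omitted for brevity.

  # How long Terraform waits for the service deletion to finish
  # before reporting a timeout.
  timeouts {
    delete = "20m"
  }
}
```

A longer timeout only helps when the service is merely slow to drain. If it is stuck in DRAINING for another reason, such as missing IAM permissions, no timeout value will let the destroy finish.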

chelarua · 2015-07-31T10:05:33Z

Hi,
thanks for the swift response.

I must retract my previous statement that this happens all the time. This morning it happened 5 times in a row (full setup creation, destroy timing out, then manual deletion), but now it doesn't reproduce anymore. It might indeed have been caused by some slowness on the AWS side.

I'm using Terraform v0.6.1.

This is my code related to ecs:

resource "aws_ecs_service" "rq_web_api_elx_server" {
  name = "rq_web_api_elx_server"
  cluster = "${aws_ecs_cluster.rq_ecs_cluster.id}"
  task_definition = "${aws_ecs_task_definition.rq_web_api_elx_server.arn}"
  desired_count = 1
  iam_role = "${aws_iam_role.rq_ecs_role.id}"

  load_balancer {
    elb_name = "${aws_elb.rq_load_balancer.id}"
    container_name = "rq_web_api_elx_server"
    container_port = "${var.rq_web_api_elx_port}"
  }
  depends_on = ["aws_ecs_cluster.rq_ecs_cluster"]

}

resource "aws_elb" "rq_load_balancer" {
  name = "rqloadbalancer"

  security_groups = ["${aws_security_group.rq_elb_security_group.id}"]
  cross_zone_load_balancing = true
  subnets = ["${aws_subnet.rq_subnet.id}"]

  listener {
    instance_port = "${var.rq_web_api_elx_port}"
    instance_protocol = "http"
    ssl_certificate_id = "${var.api_ssl_certificate_id}"
    lb_port = 443
    lb_protocol = "https"
  }

  health_check {
    healthy_threshold = 2
    unhealthy_threshold = 5
    timeout = 10
    target = "TCP:${var.rq_web_api_elx_port}"
    interval = 30
  }
}

resource "aws_ecs_task_definition" "rq_web_api_elx_server" {
  family = "rq_web_api_elx_server"
  # container_definitions (a required argument) is not shown in this post
  depends_on = ["aws_ecs_cluster.rq_ecs_cluster", "aws_elb.rq_load_balancer"]
}

resource "aws_ecs_cluster" "rq_ecs_cluster" {
  name = "rq_ecs_cluster"
}

resource "aws_instance" "rq_container_instance" {
  ami = "${lookup(var.ecs_amis, var.region)}"
  availability_zone = "${var.availability_zone}"
  instance_type = "t2.micro"
  key_name = "${var.ssh_key_name}"
  security_groups = ["${aws_security_group.rq_ecs_security_group.id}"]
  subnet_id = "${aws_subnet.rq_subnet.id}"
  associate_public_ip_address = true
  source_dest_check = false
  iam_instance_profile = "rq_ecs_profile"
  user_data = "${file("config-ecs")}"
  tags {
    Name = "rq_container_instance"
  }
  depends_on = ["aws_iam_instance_profile.rq_ecs_profile"]
}

I'll get back with the log if I see this happening again.

chelarua · 2015-07-31T11:52:35Z

Managed to reproduce it again, this is the debug log
https://gist.github.com/chelarua/658ad6a3e1b9be871756

radeksimko · 2015-08-23T16:59:52Z

@chelarua When this happens again, can you try and check

aws ecs describe-services --cluster=<your-cluster-name> --services=<your-service-name> --region=<your-aws-region>

for me and see what's inside "events"?

I did manage to reproduce this when creating & destroying the whole stack very quickly (e.g. in acceptance tests). ECS service remains in DRAINING state, having this in events:

"events": [
                {
                    "message": "(service sampletest) failed to describe instance health on (elb foobar-terraform-test) with (error User: arn:aws:sts::714610209185:assumed-role/EcsService/ecs-service-scheduler is not authorized to perform: elasticloadbalancing:DescribeInstanceHealth)",
                    "id": "2c45138a-2512-457d-904d-a9f1c2c63169",
                    "createdAt": 1440348878.758
                }
]

It is not possible to remove that service (i.e. get it into INACTIVE state) until I add the IAM policy back.

The only simple solution I can think of is being added in #3061, specifically in MeredithCorpOSS@9c2a3e7.
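Apart from the provider-side change in #3061, one configuration-level way to avoid this ordering problem is to make the ECS service depend explicitly on the IAM policy that grants the scheduler its ELB permissions. Terraform destroys resources in reverse dependency order, so the service is then deleted before the policy is removed. This is a sketch: the `rq_ecs_service_policy` resource name and the exact action list are assumptions (the thread does not show how the role's policy was defined), and it follows the interpolation style of the configuration above; newer Terraform versions expect unquoted references in `depends_on`.

```hcl
# Hypothetical policy resource: the name and action list are assumptions,
# modelled on what the ECS scheduler needs to manage a classic ELB.
resource "aws_iam_role_policy" "rq_ecs_service_policy" {
  name = "rq_ecs_service_policy"
  role = "${aws_iam_role.rq_ecs_role.id}"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:Describe*",
        "elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
        "elasticloadbalancing:DescribeInstanceHealth",
        "elasticloadbalancing:DescribeLoadBalancers",
        "elasticloadbalancing:RegisterInstancesWithLoadBalancer"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}

resource "aws_ecs_service" "rq_web_api_elx_server" {
  name            = "rq_web_api_elx_server"
  cluster         = "${aws_ecs_cluster.rq_ecs_cluster.id}"
  task_definition = "${aws_ecs_task_definition.rq_web_api_elx_server.arn}"
  desired_count   = 1
  iam_role        = "${aws_iam_role.rq_ecs_role.id}"

  load_balancer {
    elb_name       = "${aws_elb.rq_load_balancer.id}"
    container_name = "rq_web_api_elx_server"
    container_port = "${var.rq_web_api_elx_port}"
  }

  # The explicit dependency on the policy means Terraform creates the service
  # after the policy and destroys it before the policy, so the scheduler keeps
  # its ELB permissions while the service drains.
  depends_on = ["aws_ecs_cluster.rq_ecs_cluster", "aws_iam_role_policy.rq_ecs_service_policy"]
}
```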

dmikalova · 2017-10-22T17:09:25Z

I was able to solve the inactive task definition issue with the example in the ECS task definition data source documentation. You set up the ECS service resource to use the max revision of either what your Terraform resource has created or what is in AWS (as retrieved by the data source).

The one downside to this is if someone changes the task definition, Terraform will not realign that to what's defined in code.
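The pattern described in this comment, adapted from the example in the `aws_ecs_task_definition` data source documentation, looks roughly like this with the resource names used earlier in the thread. It is a sketch: data sources require Terraform 0.7 or later, so it would not work with the 0.6.1 release used in this issue, and the service's other arguments (load balancer, IAM role) are omitted.

```hcl
# Looks up the latest ACTIVE revision of the family, including revisions
# registered outside Terraform (e.g. from the console or a deploy script).
data "aws_ecs_task_definition" "rq_web_api_elx_server" {
  task_definition = "${aws_ecs_task_definition.rq_web_api_elx_server.family}"
}

resource "aws_ecs_service" "rq_web_api_elx_server" {
  name          = "rq_web_api_elx_server"
  cluster       = "${aws_ecs_cluster.rq_ecs_cluster.id}"
  desired_count = 1

  # Use whichever revision is newer: the one Terraform created or the
  # latest ACTIVE one found by the data source.
  task_definition = "${aws_ecs_task_definition.rq_web_api_elx_server.family}:${max("${aws_ecs_task_definition.rq_web_api_elx_server.revision}", "${data.aws_ecs_task_definition.rq_web_api_elx_server.revision}")}"
}
```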

ghost · 2020-04-06T02:34:32Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

radeksimko added the bug, waiting-response, and provider/aws labels Jul 31, 2015

radeksimko removed the waiting-response label Jul 31, 2015

radeksimko mentioned this issue Aug 23, 2015

Various ECS bugfixes (IAM, destroy timeout) #3061

Merged

radeksimko closed this as completed in 9c2a3e7 Aug 25, 2015

CyrusNajmabadi mentioned this issue Mar 11, 2019

Wait for service to come up before tearing it down pulumi/examples#257

Merged

ghost locked and limited conversation to collaborators Apr 6, 2020
