Destroying aws_ecs_service fails with timeout #2902

Closed
chelarua opened this issue Jul 31, 2015 · 6 comments
Comments

@chelarua
Contributor

Hello,
I am having some trouble destroying an ECS-based setup. It always fails when trying to destroy the aws_ecs_service.
The error is:

aws_ecs_service.web_server: Destroying...
aws_ecs_service.web_server: Error: 1 error(s) occurred:

* timeout while waiting for state to become 'INACTIVE'
Error applying plan:

1 error(s) occurred:

* timeout while waiting for state to become 'INACTIVE'

Retrying the destroy multiple times still ends in a timeout every time.

The other weird thing is that if I clean up everything manually and run a refresh from Terraform, it sees everything else as gone except for the service, although listing the services with the AWS CLI shows nothing:

aws ecs list-services
{
    "serviceArns": []
}

The setup contains an ECS cluster, a container instance, a task definition, an ECS service, a load balancer connected to the ECS service, and a VPC.

@radeksimko
Member

Thanks for the report.
Getting the same error every time should make reproduction easier; that's better than intermittent errors.

Could you please provide the full debug log (TF_LOG=1 TF_LOG_PATH=tf.log terraform destroy) plus the Terraform code used, minus any secrets? If you don't have time to separate the secrets from the TF code, the debug log alone will still be very helpful.

Also, which Terraform version are you using at the moment?

This might be a simple timeout issue that can be fixed by increasing the timeout (currently 5 mins), or it might be a dependency problem. Either way, I'd like to reproduce it first.

@radeksimko added the bug, waiting-response, and provider/aws labels Jul 31, 2015
@chelarua
Contributor Author

Hi,
thanks for the swift response.

I must retract my previous statement that this happens all the time. This morning it happened 5 times in a row (full setup creation, then manual deletion after the destroy timed out), but now it doesn't reproduce anymore. It may indeed have been caused by some slowness on the AWS side.

I'm using Terraform v0.6.1.

This is my code related to ECS:

resource "aws_ecs_service" "rq_web_api_elx_server" {
  name = "rq_web_api_elx_server"
  cluster = "${aws_ecs_cluster.rq_ecs_cluster.id}"
  task_definition = "${aws_ecs_task_definition.rq_web_api_elx_server.arn}"
  desired_count = 1
  iam_role = "${aws_iam_role.rq_ecs_role.id}"

  load_balancer {
    elb_name = "${aws_elb.rq_load_balancer.id}"
    container_name = "rq_web_api_elx_server"
    container_port = "${var.rq_web_api_elx_port}"
  }
  depends_on = ["aws_ecs_cluster.rq_ecs_cluster"]

}

resource "aws_elb" "rq_load_balancer" {
  name = "rqloadbalancer"

  security_groups = ["${aws_security_group.rq_elb_security_group.id}"]
  cross_zone_load_balancing = true
  subnets = ["${aws_subnet.rq_subnet.id}"]

  listener {
    instance_port = "${var.rq_web_api_elx_port}"
    instance_protocol = "http"
    ssl_certificate_id = "${var.api_ssl_certificate_id}"
    lb_port = 443
    lb_protocol = "https"
  }

  health_check {
    healthy_threshold = 2
    unhealthy_threshold = 5
    timeout = 10
    target = "TCP:${var.rq_web_api_elx_port}"
    interval = 30
  }
}

resource "aws_ecs_task_definition" "rq_web_api_elx_server" {
  family = "rq_web_api_elx_server"
  depends_on = ["aws_ecs_cluster.rq_ecs_cluster", "aws_elb.rq_load_balancer"]
}

resource "aws_ecs_cluster" "rq_ecs_cluster" {
  name = "rq_ecs_cluster"
}

resource "aws_instance" "rq_container_instance" {
  ami = "${lookup(var.ecs_amis, var.region)}"
  availability_zone = "${var.availability_zone}"
  instance_type = "t2.micro"
  key_name = "${var.ssh_key_name}"
  security_groups = ["${aws_security_group.rq_ecs_security_group.id}"]
  subnet_id = "${aws_subnet.rq_subnet.id}"
  associate_public_ip_address = true
  source_dest_check = false
  iam_instance_profile = "rq_ecs_profile"
  user_data = "${file("config-ecs")}"
  tags {
    Name = "rq_container_instance"
  }
  depends_on = ["aws_iam_instance_profile.rq_ecs_profile"]
}

I'll get back with the log if I see this happening again.

@chelarua
Contributor Author

I managed to reproduce it again. This is the debug log:
https://gist.github.com/chelarua/658ad6a3e1b9be871756

@radeksimko removed the waiting-response label Jul 31, 2015
@radeksimko
Member

@chelarua When this happens again, can you try and check

aws ecs describe-services --cluster=<your-cluster-name> --services=<your-service-name> --region=<your-aws-region>

for me and see what's inside "events"?

I did manage to reproduce this when creating and destroying the whole stack very quickly (e.g. in acceptance tests). The ECS service remains in the DRAINING state, with this in events:

"events": [
    {
        "message": "(service sampletest) failed to describe instance health on (elb foobar-terraform-test) with (error User: arn:aws:sts::714610209185:assumed-role/EcsService/ecs-service-scheduler is not authorized to perform: elasticloadbalancing:DescribeInstanceHealth)",
        "id": "2c45138a-2512-457d-904d-a9f1c2c63169",
        "createdAt": 1440348878.758
    }
]

The IAM policy had already been destroyed while the service was still draining, so the ECS scheduler could no longer query the ELB. It is not possible to remove that service (i.e. get it into the INACTIVE state) until I add the IAM policy back.

The only simple solution I can think of is being added in #3061, specifically in MeredithCorpOSS@9c2a3e7
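One way to keep the IAM policy in place until the service has finished draining is an explicit depends_on from the service to the policy, so that Terraform destroys the service first. The following is a minimal sketch, not necessarily what #3061 implements. It assumes the role's permissions come from an aws_iam_role_policy; the resource name rq_ecs_role_policy and its policy statement are hypothetical and do not appear in the configuration posted above.

```hcl
# Hypothetical policy resource: the name and the statement are illustrative only.
resource "aws_iam_role_policy" "rq_ecs_role_policy" {
  name = "rq_ecs_role_policy"
  role = "${aws_iam_role.rq_ecs_role.id}"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticloadbalancing:Describe*",
        "elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
        "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
        "ec2:Describe*",
        "ec2:AuthorizeSecurityGroupIngress"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}

resource "aws_ecs_service" "rq_web_api_elx_server" {
  name = "rq_web_api_elx_server"
  cluster = "${aws_ecs_cluster.rq_ecs_cluster.id}"
  task_definition = "${aws_ecs_task_definition.rq_web_api_elx_server.arn}"
  desired_count = 1
  iam_role = "${aws_iam_role.rq_ecs_role.id}"

  load_balancer {
    elb_name = "${aws_elb.rq_load_balancer.id}"
    container_name = "rq_web_api_elx_server"
    container_port = "${var.rq_web_api_elx_port}"
  }

  # Depending on the policy means the service is destroyed before the policy,
  # so the ECS scheduler keeps its ELB permissions while the service drains.
  depends_on = [
    "aws_ecs_cluster.rq_ecs_cluster",
    "aws_iam_role_policy.rq_ecs_role_policy"
  ]
}
```

Because destroy order is the reverse of the dependency order, the policy is only removed after the service has reached INACTIVE or the delete has failed.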

@dmikalova

I was able to solve the inactive task definition issue with the example in the ECS task definition data source documentation. You set up the ECS service resource to use the max revision of either what your Terraform resource has created or what is currently in AWS, which the data source retrieves.

The one downside to this is that if someone changes the task definition outside Terraform, Terraform will not realign it to what's defined in code.
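A sketch of that pattern, adapted to the resource names used earlier in this thread. It assumes a Terraform version that supports data sources and the max() interpolation function, neither of which is available in v0.6.1; the service arguments not relevant to the pattern are left out.

```hcl
# Looks up the latest revision of the task definition family registered in AWS.
data "aws_ecs_task_definition" "rq_web_api_elx_server" {
  task_definition = "${aws_ecs_task_definition.rq_web_api_elx_server.family}"
}

resource "aws_ecs_service" "rq_web_api_elx_server" {
  name = "rq_web_api_elx_server"
  cluster = "${aws_ecs_cluster.rq_ecs_cluster.id}"
  desired_count = 1

  # Use whichever revision is newer: the one Terraform created or the one
  # already registered in AWS (for example by a deployment tool).
  task_definition = "${aws_ecs_task_definition.rq_web_api_elx_server.family}:${max("${aws_ecs_task_definition.rq_web_api_elx_server.revision}", "${data.aws_ecs_task_definition.rq_web_api_elx_server.revision}")}"
}
```

The data source references the resource's family attribute, so Terraform reads it only after the task definition exists.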

@ghost

ghost commented Apr 6, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 6, 2020