
Applying ECS ServiceAutoScaling is failing sometimes #920

Closed
liorchen opened this issue Jun 20, 2017 · 5 comments · Fixed by #1650
Labels
bug Addresses a defect in current functionality.

Comments

@liorchen

Hi there,

I'm experiencing an issue using aws_appautoscaling_policy integrated with CloudWatch. The error occurs on roughly 50% of apply calls.

The error looks like this:

Error applying plan:

1 error(s) occurred:

* aws_appautoscaling_policy.down: Error retrieving scaling policies: FailedResourceAccessException: Unable to retrieve alarms for scaling policy arn:aws:autoscaling:us-east-1:XXXXXXXXXX:scalingPolicy:87ec26c2-09e1-4d98-9ac0-3d4dc0722ce5:resource/ecs/service/serverapidevcluster/serverapidev:policyName/serverapidev-scale-down due to reason: The security token included in the request is invalid. (Service: AmazonCloudWatch; Status Code: 403; Error Code: InvalidClientTokenId; Request ID: 952ec10b-55aa-11e7-a05e-213270703f2f)
	status code: 400, request id: 9525c12f-55aa-11e7-98c9-0934451f5c16

I cannot find the root cause of this problem. Is it something in my configuration or something in AWS? The fact that it is not consistent suggests some sort of race condition (which could be resolved with depends_on), but I have already checked the relevant places and added depends_on where I thought it should go.
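
For example, one placement I have not tried yet would be an explicit depends_on from each scaling policy to the IAM role policy, in case the role has not finished propagating when the policy is created. This is only a sketch (the resource names are the ones from my configuration below), not something I have verified:

resource "aws_appautoscaling_policy" "down" {
  # ... same arguments as in the configuration below ...

  # Sketch: wait for both the scaling target and the IAM role policy
  # before creating the scaling policy.
  depends_on = [
    "aws_appautoscaling_target.ecs_target",
    "aws_iam_role_policy.app_scaling_ecs_service",
  ]
}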


Terraform Version

Terraform v0.8.8

Affected Resource(s)


  • aws_appautoscaling_policy
  • aws_appautoscaling_target
  • aws_cloudwatch_metric_alarm

Terraform Configuration Files

module "ecs_asg" {
  source = "../asg"
  availability_zones = "${var.availability_zones}"
  aws_region = "${var.aws_region}"
  docker_auth = "${var.docker_auth}"
  ecs_cluster_name = "${aws_ecs_cluster.main.name}"
  ecs_service_name = "${var.ecs_service_name}"
  instance_sg_id = "${aws_security_group.instance_sg.id}"
  key_name = "${var.key_name}"
  subnet_ids = ["${var.public_subnet_ids}"]
  asg_desired = "${var.asg_desired}"
  asg_min = "${var.asg_min}"
  asg_max = "${var.asg_max}"
  elb_name = "${aws_elb.main.name}"
  instance_type = "${var.instance_type}"
}

## ELB
resource "aws_elb" "main" {
  name     = "${format("%selb", var.ecs_service_name)}"
  subnets = ["${var.public_subnet_ids}"]
  security_groups = ["${aws_security_group.lb_sg.id}"]
  cross_zone_load_balancing = "${length(var.public_subnet_ids) > 1}"

  tags {
    Env = "${var.env}"
  }

  listener {
    instance_port = 8080
    instance_protocol = "HTTP"
    lb_port = 80
    lb_protocol = "HTTP"
  }
}

//### Security

resource "aws_security_group" "lb_sg" {
  description = "controls access to the application ELB"

  vpc_id = "${var.vpc_id}"
  name   = "${format("%s.ecs-lbsg", var.ecs_service_name)}"

  ingress {
    protocol    = "tcp"
    from_port   = 80
    to_port     = 80
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    protocol    = "tcp"
    from_port   = 443
    to_port     = 443
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"

    cidr_blocks = [
      "0.0.0.0/0",
    ]
  }
}

resource "aws_security_group" "instance_sg" {
  description = "controls direct access to application instances"
  vpc_id      = "${var.vpc_id}"
  name        = "tf-ecs-instsg-${var.ecs_service_name}"

  ingress {
    protocol  = "tcp"
    from_port = 22
    to_port   = 22

    cidr_blocks = [
      "${var.admin_cidr_ingress}",
    ]
  }

  ingress {
    protocol  = "tcp"
    from_port = 8080
    to_port   = 8080

    security_groups = [
      "${aws_security_group.lb_sg.id}",
    ]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

## ECS

resource "aws_appautoscaling_target" "ecs_target" {
  max_capacity       = 10
  min_capacity       = 1
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.ecs_service.name}"
  role_arn           = "${aws_iam_role.app_scaling_ecs_service.arn}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"

  depends_on = [
    "aws_ecs_service.ecs_service",
    "aws_iam_role_policy.app_scaling_ecs_service"
  ]
}

resource "aws_appautoscaling_policy" "up" {
  name = "${var.ecs_service_name}-scale-up"
  service_namespace = "ecs"
  resource_id = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.ecs_service.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  adjustment_type = "ChangeInCapacity"
  cooldown = 60
  metric_aggregation_type = "Maximum"

  step_adjustment {
    metric_interval_lower_bound = 0
    scaling_adjustment = 1
  }

  depends_on = ["aws_appautoscaling_target.ecs_target"]
}

resource "aws_appautoscaling_policy" "down" {
  name = "${var.ecs_service_name}-scale-down"
  service_namespace = "ecs"
  resource_id = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.ecs_service.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  adjustment_type = "ChangeInCapacity"
  cooldown = 60
  metric_aggregation_type = "Maximum"

  step_adjustment {
    metric_interval_lower_bound = 0
    scaling_adjustment = -1
  }

  depends_on = ["aws_appautoscaling_target.ecs_target"]
}

resource "aws_cloudwatch_metric_alarm" "service_cpu_high" {
  alarm_name = "${var.ecs_service_name}-cpuutilization-high"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods = "2"
  metric_name = "CPUUtilization"
  namespace = "AWS/ECS"
  period = "60"
  statistic = "Maximum"
  threshold = "85"

  dimensions {
    ClusterName = "${aws_ecs_cluster.main.name}"
    ServiceName = "${aws_ecs_service.ecs_service.name}"
  }

  alarm_actions = [
    "${aws_appautoscaling_policy.up.arn}",
    "${module.ecs_asg.scale_up_policy}"
  ]
  ok_actions = [
    "${aws_appautoscaling_policy.down.arn}",
    "${module.ecs_asg.scale_down_policy}"
  ]
}

resource "aws_cloudwatch_metric_alarm" "service_memory_high" {
  alarm_name = "${var.ecs_service_name}-memutilization-high"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods = "2"
  metric_name = "MemoryUtilization"
  namespace = "AWS/ECS"
  period = "60"
  statistic = "Maximum"
  threshold = "65"

  dimensions {
    ClusterName = "${aws_ecs_cluster.main.name}"
    ServiceName = "${aws_ecs_service.ecs_service.name}"
  }

  alarm_actions = [
    "${aws_appautoscaling_policy.up.arn}",
    "${module.ecs_asg.scale_up_policy}"
  ]
  ok_actions = [
    "${aws_appautoscaling_policy.down.arn}",
    "${module.ecs_asg.scale_down_policy}"
  ]

}

resource "aws_ecs_cluster" "main" {
  name     = "${format("%scluster", var.ecs_service_name)}"
}


data "template_file" "task_definition" {
  template = "${file("${var.task_definition_file}")}"
  vars = "${merge(var.docker_replacements, map("log_group_region", var.aws_region, "log_group_name", module.ecs_asg.app_log_group_name, "lb_dns", aws_elb.main.dns_name, "lb_zone_id", aws_elb.main.zone_id))}"
}

resource "aws_ecs_task_definition" "task_definition" {
  family   =  "${format("%std", var.ecs_service_name)}"
  container_definitions = "${data.template_file.task_definition.rendered}"
}


resource "aws_ecs_service" "ecs_service" {
  name            = "${var.ecs_service_name}"
  cluster         = "${aws_ecs_cluster.main.id}"
  task_definition = "${aws_ecs_task_definition.task_definition.arn}"
  desired_count   = 1
  iam_role        = "${aws_iam_role.ecs_service.name}"

  load_balancer {
    elb_name = "${aws_elb.main.name}"
    container_name   = "${var.traffic_in_container}"
    container_port   = "${var.traffic_in_port}"
  }

  depends_on = [
    "aws_iam_role_policy.ecs_service",
    "aws_elb.main",
  ]
}



resource "aws_iam_role" "ecs_service" {
  name = "${format("%s-iam-role", var.ecs_service_name)}"

  assume_role_policy = <<EOF
{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}


resource "aws_iam_role_policy" "ecs_service" {
  name = "${format("%s-ecs_policy", var.ecs_service_name)}"
  role = "${aws_iam_role.ecs_service.name}"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
        "elasticloadbalancing:DeregisterTargets",
        "elasticloadbalancing:Describe*",
        "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
        "elasticloadbalancing:RegisterTargets"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}



resource "aws_iam_role" "app_scaling_ecs_service" {
  name = "${format("%s-app-autoscaling-iam-role", var.ecs_service_name)}"
  assume_role_policy = <<EOF
{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "application-autoscaling.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

resource "aws_iam_role_policy" "app_scaling_ecs_service" {
  name = "${format("%s-app-scaling-ecs_policy", var.ecs_service_name)}"
  role = "${aws_iam_role.app_scaling_ecs_service.name}"

  policy = <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1456535218000",
            "Effect": "Allow",
            "Action": [
                "ecs:DescribeServices",
                "ecs:UpdateService"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "Stmt1456535243000",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:DescribeAlarms"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
EOF
}

There is a dependency on a module here which is not provided; I wonder if, just by looking at this configuration, you would be able to help me find the problem.

Thanks
Lior

@stack72 stack72 added the bug Addresses a defect in current functionality. label Jul 7, 2017
@liorchen
Author

?

@anosulchik

Let me add that this problem can also be reproduced in Terraform v0.10.6. At the same time, we were able to work around it by adding depends_on to the aws_cloudwatch_metric_alarm that triggers ECS scaling.

* aws_appautoscaling_policy.up: aws_appautoscaling_policy.up: Error retrieving scaling policies: FailedResourceAccessException: Unable to retrieve alarms for scaling policy arn:aws:autoscaling:us-east-1:00000000000:scalingPolicy:f5f99814-1902-4c68-8cc4-13070f2c75c6:resource/ecs/service/tf-mycluster-dev/tf-mywebservice-dev:policyName/mywebservice-dev-scale-up due to reason: The security token included in the request is invalid. (Service: AmazonCloudWatch; Status Code: 403; Error Code: InvalidClientTokenId; Request ID: c72fc5fa-9f01-11e7-bf89-e755133ddece)
	status code: 400, request id: c726ec5d-9f01-11e7-92e6-4536773d46e6

This version produced the occasional errors described in this ticket:

resource "aws_cloudwatch_metric_alarm" "ecs_cpu_load_high" {
  alarm_name = "tf-${var.service_name}-${var.environment_name}-cpu-high"
  ...
  alarm_description = "This alarm triggers when CPU load in ECS service ${aws_ecs_service.ecs_service.name} is high."
  alarm_actions = [
    "${aws_appautoscaling_policy.up.arn}",
    "${data.terraform_remote_state.cluster.lowprio_notifs_sns_topic_arn}"
  ]
  ...

}

This one "fixed" the problem:

resource "aws_cloudwatch_metric_alarm" "ecs_cpu_load_high" {
  alarm_name = "tf-${var.service_name}-${var.environment_name}-cpu-high"
  ...
  alarm_description = "This alarm triggers when CPU load in ECS service ${aws_ecs_service.ecs_service.name} is high."
  alarm_actions = [
    "${aws_appautoscaling_policy.up.arn}",
    "${data.terraform_remote_state.cluster.lowprio_notifs_sns_topic_arn}"
  ]
  ...

  depends_on = [ "aws_appautoscaling_policy.up" ]
}

@anosulchik

I'd like to update my last post: it seems that the workaround I reported doesn't completely get rid of the error mentioned by @liorchen. It still happens from time to time on Terraform v0.10.6 (the latest at this moment).
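
Since the error surfaces while Terraform reads the scaling policy back (Application Auto Scaling calls CloudWatch with the just-created IAM role), another thing that could be worth experimenting with is an artificial delay between the IAM role policy and the scaling policies. This is only a rough sketch, not a confirmed fix: the resource names follow the original configuration above, and the 30-second sleep is arbitrary.

resource "null_resource" "wait_for_scaling_role" {
  # Give the freshly created IAM role/policy time to propagate before
  # Application Auto Scaling tries to use it. The sleep value is arbitrary.
  provisioner "local-exec" {
    command = "sleep 30"
  }

  depends_on = ["aws_iam_role_policy.app_scaling_ecs_service"]
}

resource "aws_appautoscaling_policy" "up" {
  # ... arguments as in the original configuration ...

  depends_on = [
    "aws_appautoscaling_target.ecs_target",
    "null_resource.wait_for_scaling_role",
  ]
}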

@liorchen
Author

thanks!!

@ghost

ghost commented Apr 11, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Apr 11, 2020