
Applying ECS ServiceAutoScaling is failing sometimes #920

Closed
liorchen opened this issue Jun 20, 2017 · 5 comments · Fixed by #1650
Labels
bug Addresses a defect in current functionality.

Comments

@liorchen

Hi there,

I'm experiencing an issue using aws_appautoscaling_policy integrated with CloudWatch. The error occurs on roughly 50% of apply calls.

The error looks like this:

Error applying plan:

1 error(s) occurred:

* aws_appautoscaling_policy.down: Error retrieving scaling policies: FailedResourceAccessException: Unable to retrieve alarms for scaling policy arn:aws:autoscaling:us-east-1:XXXXXXXXXX:scalingPolicy:87ec26c2-09e1-4d98-9ac0-3d4dc0722ce5:resource/ecs/service/serverapidevcluster/serverapidev:policyName/serverapidev-scale-down due to reason: The security token included in the request is invalid. (Service: AmazonCloudWatch; Status Code: 403; Error Code: InvalidClientTokenId; Request ID: 952ec10b-55aa-11e7-a05e-213270703f2f)
	status code: 400, request id: 9525c12f-55aa-11e7-98c9-0934451f5c16

I cannot find the root cause of this problem. Is it something in my configuration or something in AWS? The fact that it is not consistent suggests some sort of race condition (which could be resolved with depends_on), but I have already checked the relevant places and added depends_on where I thought it should go.
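
For example, one placement I have not tried yet would be an explicit depends_on from each scaling policy to the IAM role policy, in case the role has not finished propagating when the policy is created. This is only a sketch (the resource names are the ones from my configuration below), not something I have verified:

resource "aws_appautoscaling_policy" "down" {
  # ... same arguments as in the configuration below ...

  # Sketch: wait for both the scaling target and the IAM role policy
  # before creating the scaling policy.
  depends_on = [
    "aws_appautoscaling_target.ecs_target",
    "aws_iam_role_policy.app_scaling_ecs_service",
  ]
}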


Terraform Version

Terraform v0.8.8

Affected Resource(s)


  • aws_appautoscaling_policy
  • aws_appautoscaling_target
  • aws_cloudwatch_metric_alarm

Terraform Configuration Files

module "ecs_asg" {
  source = "../asg"
  availability_zones = "${var.availability_zones}"
  aws_region = "${var.aws_region}"
  docker_auth = "${var.docker_auth}"
  ecs_cluster_name = "${aws_ecs_cluster.main.name}"
  ecs_service_name = "${var.ecs_service_name}"
  instance_sg_id = "${aws_security_group.instance_sg.id}"
  key_name = "${var.key_name}"
  subnet_ids = ["${var.public_subnet_ids}"]
  asg_desired = "${var.asg_desired}"
  asg_min = "${var.asg_min}"
  asg_max = "${var.asg_max}"
  elb_name = "${aws_elb.main.name}"
  instance_type = "${var.instance_type}"
}

## ELB
resource "aws_elb" "main" {
  name     = "${format("%selb", var.ecs_service_name)}"
  subnets = ["${var.public_subnet_ids}"]
  security_groups = ["${aws_security_group.lb_sg.id}"]
  cross_zone_load_balancing = "${length(var.public_subnet_ids) > 1}"

  tags {
    Env = "${var.env}"
  }

  listener {
    instance_port = 8080
    instance_protocol = "HTTP"
    lb_port = 80
    lb_protocol = "HTTP"
  }
}

//### Security

resource "aws_security_group" "lb_sg" {
  description = "controls access to the application ELB"

  vpc_id = "${var.vpc_id}"
  name   = "${format("%s.ecs-lbsg", var.ecs_service_name)}"

  ingress {
    protocol    = "tcp"
    from_port   = 80
    to_port     = 80
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    protocol    = "tcp"
    from_port   = 443
    to_port     = 443
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"

    cidr_blocks = [
      "0.0.0.0/0",
    ]
  }
}

resource "aws_security_group" "instance_sg" {
  description = "controls direct access to application instances"
  vpc_id      = "${var.vpc_id}"
  name        = "tf-ecs-instsg-${var.ecs_service_name}"

  ingress {
    protocol  = "tcp"
    from_port = 22
    to_port   = 22

    cidr_blocks = [
      "${var.admin_cidr_ingress}",
    ]
  }

  ingress {
    protocol  = "tcp"
    from_port = 8080
    to_port   = 8080

    security_groups = [
      "${aws_security_group.lb_sg.id}",
    ]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

## ECS

resource "aws_appautoscaling_target" "ecs_target" {
  max_capacity       = 10
  min_capacity       = 1
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.ecs_service.name}"
  role_arn           = "${aws_iam_role.app_scaling_ecs_service.arn}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"

  depends_on = [
    "aws_ecs_service.ecs_service",
    "aws_iam_role_policy.app_scaling_ecs_service"
  ]
}

resource "aws_appautoscaling_policy" "up" {
  name = "${var.ecs_service_name}-scale-up"
  service_namespace = "ecs"
  resource_id = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.ecs_service.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  adjustment_type = "ChangeInCapacity"
  cooldown = 60
  metric_aggregation_type = "Maximum"

  step_adjustment {
    metric_interval_lower_bound = 0
    scaling_adjustment = 1
  }

  depends_on = ["aws_appautoscaling_target.ecs_target"]
}

resource "aws_appautoscaling_policy" "down" {
  name = "${var.ecs_service_name}-scale-down"
  service_namespace = "ecs"
  resource_id = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.ecs_service.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  adjustment_type = "ChangeInCapacity"
  cooldown = 60
  metric_aggregation_type = "Maximum"

  step_adjustment {
    metric_interval_lower_bound = 0
    scaling_adjustment = -1
  }

  depends_on = ["aws_appautoscaling_target.ecs_target"]
}

resource "aws_cloudwatch_metric_alarm" "service_cpu_high" {
  alarm_name = "${var.ecs_service_name}-cpuutilization-high"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods = "2"
  metric_name = "CPUUtilization"
  namespace = "AWS/ECS"
  period = "60"
  statistic = "Maximum"
  threshold = "85"

  dimensions {
    ClusterName = "${aws_ecs_cluster.main.name}"
    ServiceName = "${aws_ecs_service.ecs_service.name}"
  }

  alarm_actions = [
    "${aws_appautoscaling_policy.up.arn}",
    "${module.ecs_asg.scale_up_policy}"
  ]
  ok_actions = [
    "${aws_appautoscaling_policy.down.arn}",
    "${module.ecs_asg.scale_down_policy}"
  ]
}

resource "aws_cloudwatch_metric_alarm" "service_memory_high" {
  alarm_name = "${var.ecs_service_name}-memutilization-high"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods = "2"
  metric_name = "MemoryUtilization"
  namespace = "AWS/ECS"
  period = "60"
  statistic = "Maximum"
  threshold = "65"

  dimensions {
    ClusterName = "${aws_ecs_cluster.main.name}"
    ServiceName = "${aws_ecs_service.ecs_service.name}"
  }

  alarm_actions = [
    "${aws_appautoscaling_policy.up.arn}",
    "${module.ecs_asg.scale_up_policy}"
  ]
  ok_actions = [
    "${aws_appautoscaling_policy.down.arn}",
    "${module.ecs_asg.scale_down_policy}"
  ]

}

resource "aws_ecs_cluster" "main" {
  name     = "${format("%scluster", var.ecs_service_name)}"
}


data "template_file" "task_definition" {
  template = "${file("${var.task_definition_file}")}"
  vars = "${merge(var.docker_replacements, map("log_group_region", var.aws_region, "log_group_name", module.ecs_asg.app_log_group_name, "lb_dns", aws_elb.main.dns_name, "lb_zone_id", aws_elb.main.zone_id))}"
}

resource "aws_ecs_task_definition" "task_definition" {
  family   =  "${format("%std", var.ecs_service_name)}"
  container_definitions = "${data.template_file.task_definition.rendered}"
}


resource "aws_ecs_service" "ecs_service" {
  name            = "${var.ecs_service_name}"
  cluster         = "${aws_ecs_cluster.main.id}"
  task_definition = "${aws_ecs_task_definition.task_definition.arn}"
  desired_count   = 1
  iam_role        = "${aws_iam_role.ecs_service.name}"

  load_balancer {
    elb_name = "${aws_elb.main.name}"
    container_name   = "${var.traffic_in_container}"
    container_port   = "${var.traffic_in_port}"
  }

  depends_on = [
    "aws_iam_role_policy.ecs_service",
    "aws_elb.main",
  ]
}



resource "aws_iam_role" "ecs_service" {
  name = "${format("%s-iam-role", var.ecs_service_name)}"

  assume_role_policy = <<EOF
{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}


resource "aws_iam_role_policy" "ecs_service" {
  name = "${format("%s-ecs_policy", var.ecs_service_name)}"
  role = "${aws_iam_role.ecs_service.name}"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
        "elasticloadbalancing:DeregisterTargets",
        "elasticloadbalancing:Describe*",
        "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
        "elasticloadbalancing:RegisterTargets"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}



resource "aws_iam_role" "app_scaling_ecs_service" {
  name = "${format("%s-app-autoscaling-iam-role", var.ecs_service_name)}"
  assume_role_policy = <<EOF
{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "application-autoscaling.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

resource "aws_iam_role_policy" "app_scaling_ecs_service" {
  name = "${format("%s-app-scaling-ecs_policy", var.ecs_service_name)}"
  role = "${aws_iam_role.app_scaling_ecs_service.name}"

  policy = <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1456535218000",
            "Effect": "Allow",
            "Action": [
                "ecs:DescribeServices",
                "ecs:UpdateService"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "Stmt1456535243000",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:DescribeAlarms"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
EOF
}

There is a dependency on a module here which is not provided; I wonder if, just by looking at this configuration, you would be able to help me find the problem.

Thanks
Lior

@stack72 stack72 added the bug Addresses a defect in current functionality. label Jul 7, 2017
@liorchen
Author

?

@anosulchik

Let me add that this problem can also be reproduced in Terraform v0.10.6. At the same time, we were able to work around it by adding depends_on to the aws_cloudwatch_metric_alarm that triggers ECS scaling.

* aws_appautoscaling_policy.up: aws_appautoscaling_policy.up: Error retrieving scaling policies: FailedResourceAccessException: Unable to retrieve alarms for scaling policy arn:aws:autoscaling:us-east-1:00000000000:scalingPolicy:f5f99814-1902-4c68-8cc4-13070f2c75c6:resource/ecs/service/tf-mycluster-dev/tf-mywebservice-dev:policyName/mywebservice-dev-scale-up due to reason: The security token included in the request is invalid. (Service: AmazonCloudWatch; Status Code: 403; Error Code: InvalidClientTokenId; Request ID: c72fc5fa-9f01-11e7-bf89-e755133ddece)
	status code: 400, request id: c726ec5d-9f01-11e7-92e6-4536773d46e6

This version produced the occasional errors described in this ticket:

resource "aws_cloudwatch_metric_alarm" "ecs_cpu_load_high" {
  alarm_name = "tf-${var.service_name}-${var.environment_name}-cpu-high"
  ...
  alarm_description = "This alarm triggers when CPU load in ECS service ${aws_ecs_service.ecs_service.name} is high."
  alarm_actions = [
    "${aws_appautoscaling_policy.up.arn}",
    "${data.terraform_remote_state.cluster.lowprio_notifs_sns_topic_arn}"
  ]
  ...

}

This one "fixed" the problem:

resource "aws_cloudwatch_metric_alarm" "ecs_cpu_load_high" {
  alarm_name = "tf-${var.service_name}-${var.environment_name}-cpu-high"
  ...
  alarm_description = "This alarm triggers when CPU load in ECS service ${aws_ecs_service.ecs_service.name} is high."
  alarm_actions = [
    "${aws_appautoscaling_policy.up.arn}",
    "${data.terraform_remote_state.cluster.lowprio_notifs_sns_topic_arn}"
  ]
  ...

  depends_on = [ "aws_appautoscaling_policy.up" ]
}

@anosulchik

I'd like to update my last post: it seems that the workaround I reported doesn't completely get rid of the error mentioned by @liorchen. It still happens from time to time on Terraform v0.10.6 (the latest at this moment).
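
Since the error surfaces while Terraform reads the scaling policy back (Application Auto Scaling calls CloudWatch with the just-created IAM role), another thing that could be worth experimenting with is an artificial delay between the IAM role policy and the scaling policies. This is only a rough sketch, not a confirmed fix: the resource names follow the original configuration above, and the 30-second sleep is arbitrary.

resource "null_resource" "wait_for_scaling_role" {
  # Give the freshly created IAM role/policy time to propagate before
  # Application Auto Scaling tries to use it. The sleep value is arbitrary.
  provisioner "local-exec" {
    command = "sleep 30"
  }

  depends_on = ["aws_iam_role_policy.app_scaling_ecs_service"]
}

resource "aws_appautoscaling_policy" "up" {
  # ... arguments as in the original configuration ...

  depends_on = [
    "aws_appautoscaling_target.ecs_target",
    "null_resource.wait_for_scaling_role",
  ]
}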

@liorchen
Author

thanks!!

@ghost

ghost commented Apr 11, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Apr 11, 2020