
The recommended workaround for ignoring task definition changes causes the service's container_definitions to be overwritten on every Terraform apply, even ones that don't touch the service. #165

andrewedstrom opened this issue Feb 12, 2024 · 14 comments

@andrewedstrom

Description

My team triggers deploys to ECS via GitHub Actions, and we did not want to have to update Terraform every time we deploy.

We could not use ignore_task_definition_changes, because that also (surprisingly) causes changes to the load balancer configuration to be ignored (see #154 and #152).
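For context on why that flag is all-or-nothing: Terraform requires the ignore_changes list to be a static literal inside the resource, so the module can only ship one fixed list. Here is a simplified sketch of the pattern I believe is behind the flag (an inference from #152/#154, not the module's actual source):

# The module effectively keeps an "ignore" variant of the service
# resource whose lifecycle block is pinned at write time. Because the
# list is static, task_definition and load_balancer can only be
# ignored together.
resource "aws_ecs_service" "ignore_task_definition" {
  name            = var.name
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn

  lifecycle {
    ignore_changes = [
      task_definition, # the intended effect of the flag
      load_balancer,   # swept up in the same static list (see #152, #154)
    ]
  }
}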

According to the design doc, there is an officially supported workaround:

As an alternative, this module does provide a workaround that would support an external party making changes to something like the image of the container definition. In a scenario where the service, task definition, and container definition are all managed by Terraform, the following configuration could be used to allow an external party to change the image of the container definition without conflicting with Terraform, provided that the external party is also updating the image tag in a shared location that can be retrieved by Terraform (here, we are using SSM Parameter Store): [example follows]
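The elided example boils down to CI writing the deployed tag to a shared SSM parameter and Terraform reading it back at plan time, so both sides render the same image string. A minimal sketch with hypothetical names (my full version is under Reproduction Code below):

# CI pushes the image, deploys it, then writes the tag to this
# parameter; Terraform re-reads it on every plan.
data "aws_ssm_parameter" "image_tag" {
  name = "/my-app/image-tag" # hypothetical parameter name
}

module "ecs_service" {
  source  = "terraform-aws-modules/ecs/aws//modules/service"
  version = "5.8.0"
  # ...
  container_definitions = {
    app = {
      # Renders to whatever CI last deployed, so Terraform and CI agree.
      image = "${local.ecr_repository}:${data.aws_ssm_parameter.image_tag.value}"
    }
  }
}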

We ended up going with this workaround; however, it has an undesirable bug. Now, after a deploy via GitHub Actions, any change to our Terraform (even one that doesn't touch the ECS service in any way) causes the container_definitions of the service to be recreated.

This does not cause downtime, but it adds noise to every change we make to our Terraform, and it makes it hard to tell during a terraform plan whether the container definition has actually changed or is just being recreated with identical contents.

  • ✋ I have searched the open/closed issues and my issue is not listed.

Versions

  • Module version [Required]: 5.8.0

  • Terraform version: v1.5.7

  • Provider version(s):

    • provider registry.terraform.io/hashicorp/archive v2.4.2
    • provider registry.terraform.io/hashicorp/aws v5.34.0
    • provider registry.terraform.io/hashicorp/random v3.6.0
    • provider registry.terraform.io/hashicorp/tls v4.0.5

Reproduction Code [Required]

Unfortunately, I do not have time to create a fully reproducible example. However, the official docs don't provide a full example of how to implement the workaround, so I think it's fair to share an example that is much more complete than the official docs' and still shows the general shape of how this works.

Based on the comments in #152, I get the impression that many teams want to use a similar setup, deploying with GitHub Actions, so it may be helpful to actually add something like this to the readme or examples/ directory. Feel free to borrow from the code I've shared below as needed.

ECS Terraform:

# Cluster
module "ecs_cluster" {
  source       = "terraform-aws-modules/ecs/aws//modules/cluster"
  version      = "5.8.0"
  cluster_name = local.name

  default_capacity_provider_use_fargate = true
  fargate_capacity_providers            = {
    FARGATE = {
      default_capacity_provider_strategy = {
        weight = 100
        base   = 1
      }
    }
  }

  tags = local.tags
}

# Service
data "aws_ssm_parameter" "api_image_tag" {
  name = local.api_image_tag_parameter_name
}

module "api_ecs_service" {
  source      = "terraform-aws-modules/ecs/aws//modules/service"
  version     = "5.8.0"
  name        = "${local.name}-api-service"
  cluster_arn = module.ecs_cluster.arn

  cpu    = 512
  memory = 1024

  subnet_ids = local.private_subnets

  enable_execute_command = false

  # At least 2 tasks should be running at all times;
  # autoscaling_min_capacity is what actually guarantees that.
  # desired_count is ignored by the module and setting it is a no-op
  # (https://github.com/terraform-aws-modules/terraform-aws-ecs/blob/master/docs/README.md#service-1),
  # but we set it anyway out of superstition.
  desired_count            = 2
  autoscaling_min_capacity = 2

  load_balancer = {
    service = {
      target_group_arn = module.alb.target_groups["ecs_api_server"].arn
      container_name   = local.container_name
      container_port   = local.container_port
    }
  }
  security_group_rules = {
    alb_ingress_3000 = {
      type                     = "ingress"
      from_port                = local.container_port
      to_port                  = local.container_port
      protocol                 = "tcp"
      description              = "Service port"
      source_security_group_id = module.alb.security_group_id
    }
    egress_all = {
      type        = "egress"
      from_port   = 0
      to_port     = 0
      protocol    = "-1"
      cidr_blocks = ["0.0.0.0/0"]
    }
  }
  health_check_grace_period_seconds = 30

  ignore_task_definition_changes = false

  container_definitions = {
    (local.container_name) = {
      enable_cloudwatch_logging = true
      name                      = local.container_name
      image                     = "${local.ecr_repository}:${data.aws_ssm_parameter.api_image_tag.value}"
      cpu                       = 512
      memory                    = 1024
      privileged                = false
      essential                 = true

      port_mappings = [
        {
          name          = local.container_name
          containerPort = local.container_port
          hostPort      = local.container_port
          protocol      = "tcp"
        }
      ]
      environment = []
      secrets     = []
    }
  }
}

# Load Balancer
module "alb" {
  source  = "terraform-aws-modules/alb/aws"
  version = "~> 9.0"

  name = "api-load-balancer"

  load_balancer_type = "application"

  vpc_id  = local.vpc_id
  subnets = local.public_subnets

  enable_deletion_protection   = false
  security_group_ingress_rules = {
    all_http = {
      from_port   = 80
      to_port     = 80
      ip_protocol = "tcp"
      cidr_ipv4   = "0.0.0.0/0"
    }
    all_https = {
      from_port   = 443
      to_port     = 443
      ip_protocol = "tcp"
      description = "HTTPS web traffic"
      cidr_ipv4   = "0.0.0.0/0"
    }
  }
  security_group_egress_rules = {
    all = {
      ip_protocol = "-1"
      cidr_ipv4   = local.vpc_cidr_block
    }
  }

  listeners = {
    api_server_http_https_redirect = {
      port     = 80
      protocol = "HTTP"
      redirect = {
        port        = "443"
        protocol    = "HTTPS"
        status_code = "HTTP_301"
      }
    }
    api_server_https = {
      port            = 443
      protocol        = "HTTPS"
      certificate_arn = aws_acm_certificate.ecs_api.arn

      forward = {
        target_group_key = "ecs_api_server"
      }
    }
  }

  web_acl_arn = local.api_acl_arn

  target_groups = {
    ecs_api_server = {
      backend_protocol                  = "HTTP"
      backend_port                      = local.container_port
      target_type                       = "ip"
      deregistration_delay              = 5
      load_balancing_cross_zone_enabled = true

      health_check = {
        enabled             = true
        healthy_threshold   = 5
        interval            = 30
        matcher             = "200"
        path                = "/health"
        port                = "traffic-port"
        protocol            = "HTTP"
        timeout             = 5
        unhealthy_threshold = 4
      }

      # There's nothing to attach here in this definition. Instead,
      # ECS will attach the IPs of the tasks to this target group
      create_attachment = false
    }
  }
}

GitHub Actions workflow to deploy:

name: Staging Deploy
on:
  push:
    branches:
      - main

concurrency:
  group: staging-deploy
  cancel-in-progress: false

permissions:
  id-token: write
  contents: read

env:
  ENV_NAME: ...
  ECS_CLUSTER: ...
  ECS_SERVICE: ...
  AWS_REGION: ...
  ROLE_TO_ASSUME: ...
  ECR_IMAGE: ...
  IMAGE_TAG_PARAMETER_NAME: ...
  CONTAINER_NAME: ... # from the container_definitions section of the task definition
  TASK_DEFINITION_FILE: task-definition.json # Not committed to source. Actually downloaded from ECS during the job

jobs:
  build:
    name: Build Docker Image
    runs-on: ubuntu-latest
    outputs:
      image: ${{ steps.build-image.outputs.image }}
      tag: ${{ steps.build-image.outputs.tag }}

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Build image
        id: build-image
        run: |
          TAG="staging-$(git rev-parse --short HEAD)"
          FULL_IMAGE_NAME="${ECR_IMAGE}:${TAG}"
          docker build -t "${FULL_IMAGE_NAME}" .
          echo "image=${FULL_IMAGE_NAME}" >>  "${GITHUB_OUTPUT}"
          echo "tag=${TAG}" >>  "${GITHUB_OUTPUT}"

      - name: Configure AWS credentials for image push
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ env.ROLE_TO_ASSUME }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to Amazon ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Push image to Amazon ECR
        run: |
          docker push "${ECR_IMAGE}" --all-tags

  deploy-backend:
    name: Deploy API Server to ECS
    runs-on: ubuntu-latest
    needs: build
    env:
      AWS_REGION: us-east-2

    steps:
      - name: Configure AWS credentials for deployment
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ env.ROLE_TO_ASSUME }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Download existing task definition
        run: |
          aws ecs describe-task-definition --task-definition "${ECS_SERVICE}" --query taskDefinition > "${TASK_DEFINITION_FILE}"

      - name: Fill in the new image ID in the Amazon ECS task definition
        id: task-def
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: ${{ env.TASK_DEFINITION_FILE }}
          container-name: ${{ env.CONTAINER_NAME }}
          image: ${{ needs.build.outputs.image }}

      - name: Deploy Amazon ECS task definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true

      - name: Update SSM parameter of image
        run: |
          aws ssm put-parameter --name "${{ env.IMAGE_TAG_PARAMETER_NAME }}" --value "${{ needs.build.outputs.tag }}" --type String --overwrite

Expected behavior

After applying this Terraform and then deploying from GitHub, if I change an unrelated .tf file in the same Terraform project, then terraform plan should show no changes to my container definition.

Actual behavior

The container definitions are recreated, forcing the task definition to be replaced and the service to be updated.

I am 100% sure this is not due to task definition drift outside of GitHub Actions and Terraform, as the task definition has never been updated except by GitHub Actions and Terraform. We deploy to multiple environments using the same scripts, and this bug appears in all of them.

Terminal Output Screenshot(s)

I cannot share a screenshot, but when I run terraform plan, it shows these two resources changing:

  • module.api_ecs_service.aws_ecs_service.this[0]
    • The task_definition is changed.
  • module.api_ecs_service.aws_ecs_task_definition.this[0]
    • The container_definitions are updated, forcing replacement. I cannot see which part of the definition changed, since the entire thing is marked as a sensitive value. That said, after investigating what's in AWS, I think the update happens because the registeredAt, registeredBy, and revision properties of the task definition always change after a deployment by GitHub Actions.

Additional context

  • As an alternative to fixing this bug, it would be incredible to just update ignore_task_definition_changes so that it doesn't also cause load balancer configuration to be ignored, per Ignoring load_balancer is not always what one want for CD #152. We'd love to completely manage the Task Definition via GitHub actions, but that is unworkable for us because ignore_task_definition_changes also causes the load balancer config to be ignored.
  • Another option would be to improve the documentation for the workaround, actually showing an example of how to connect the load balancer to the service when ignore_task_definition_changes = true. I've filed a separate ticket for that suggestion: Expand README section on ignore_task_definition_changes and Blue/Green deployments #154
@andrewedstrom andrewedstrom changed the title The recommended workaround for ignoring task definition changes causes the service's container_definitions to be overwritten on every Terraform apply, even when the container definitions have not changed.. The recommended workaround for ignoring task definition changes causes the service's container_definitions to be overwritten on every Terraform apply, even ones that don't touch the service. Feb 12, 2024
@bolshakoff

I can relate to you. 🙂
I ended up simply running this before working with terraform:

terraform apply -refresh-only

This reconciles the state with reality a bit, and so the subsequent terraform apply does not cause re-deploy.

@bolshakoff

Another option is to keep an eye on #171.


This issue has been automatically marked as stale because it has been open 30 days
with no activity. Remove stale label or comment or this issue will be closed in 10 days

@github-actions github-actions bot added the stale label Mar 26, 2024
@bryantbiggs bryantbiggs added wip and removed stale labels Mar 26, 2024
@jakub-pietrzak

I can relate to you. 🙂 I ended up simply running this before working with terraform:

terraform apply -refresh-only

This reconciles the state with reality a bit, and so the subsequent terraform apply does not cause re-deploy.

This workaround does not seem to work for me.

Here is the output of the refresh-only command:

Note: Objects have changed outside of Terraform

Terraform detected the following changes made outside of Terraform since the last "terraform apply" which may have affected this plan:

  # module.app_env.module.ecs.module.service_images["xxx"].aws_ssm_parameter.ignore_value[0] has changed
  ~ resource "aws_ssm_parameter" "ignore_value" {
        id             = "xxx"
      ~ insecure_value = "xxx.dkr.ecr.eu-central-1.amazonaws.com/xxx:version1" -> "xxx.dkr.ecr.eu-central-1.amazonaws.com/xxx:version2"
        name           = "xxx"
        tags           = {}
      ~ version        = 4 -> 5
        # (5 unchanged attributes hidden)
    }

  # module.app_env.module.ecs.module.ecs.module.service["xxx"].aws_ecs_service.this[0] has changed
  ~ resource "aws_ecs_service" "this" {
        id                                 = "arn:aws:ecs:eu-central-1:xxx:service/xxx/xxx"
        name                               = "xxx"
        tags                               = {}
      ~ task_definition                    = "xxx:18" -> "xxx:19"
        # (14 unchanged attributes hidden)

        # (6 unchanged blocks hidden)
    }

When I run it again, Terraform detects no changes; however, a regular plan/apply still tries to update the service and task definition:

  # module.app_env.module.ecs.module.ecs.module.service["xxx"].data.aws_ecs_task_definition.this[0] will be read during apply
  # (depends on a resource or a module with changes pending)
 <= data "aws_ecs_task_definition" "this" {
      + arn                  = (known after apply)
      + arn_without_revision = (known after apply)
      + execution_role_arn   = (known after apply)
      + family               = (known after apply)
      + id                   = (known after apply)
      + network_mode         = (known after apply)
      + revision             = (known after apply)
      + status               = (known after apply)
      + task_definition      = "xxx"
      + task_role_arn        = (known after apply)
    }

  # module.app_env.module.ecs.module.ecs.module.service["xxx"].aws_ecs_service.this[0] will be updated in-place
  ~ resource "aws_ecs_service" "this" {
        id                                 = "arn:aws:ecs:eu-central-1:xxx:service/xxx/xxx"
        name                               = "xxx"
        tags                               = {}
      ~ task_definition                    = "xxx:19" -> (known after apply)
        # (14 unchanged attributes hidden)

        # (6 unchanged blocks hidden)
    }

  # module.app_env.module.ecs.module.ecs.module.service["xxx"].aws_ecs_task_definition.this[0] must be replaced
+/- resource "aws_ecs_task_definition" "this" {
      ~ arn                      = "arn:aws:ecs:eu-central-1:xxx:task-definition/xxx:18" -> (known after apply)
      ~ arn_without_revision     = "arn:aws:ecs:eu-central-1:xxx:task-definition/xxx" -> (known after apply)
      ~ container_definitions    = jsonencode(
          ~ [
              ~ {
                  ~ image                  = "xxx.dkr.ecr.eu-central-1.amazonaws.com/xxx-xxx:version1" -> "xxx.dkr.ecr.eu-central-1.amazonaws.com/xxx-xxx:version2"
                    name                   = "xxx"
                    # (17 unchanged attributes hidden)
                },
            ] # forces replacement
        )
      ~ id                       = "xxx" -> (known after apply)
      ~ revision                 = 18 -> (known after apply)
      - tags                     = {} -> null
        # (10 unchanged attributes hidden)

        # (2 unchanged blocks hidden)
    }

Even if I manually register a new task definition in the AWS console, the following change still causes the service to be redeployed:

  # module.app_env.module.ecs.module.ecs.module.service["xxx"].aws_ecs_service.this[0] will be updated in-place
  ~ resource "aws_ecs_service" "this" {
        id                                 = "arn:aws:ecs:eu-central-1:xxx:service/xxx/xxx"
        name                               = "xxx"
        tags                               = {}
      ~ task_definition                    = "xxx:20" -> "xxx:21"
        # (14 unchanged attributes hidden)

        # (6 unchanged blocks hidden)
    }

FYI, I also use the example setup with the latest image version stored in SSM.
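The plan output above also hints at the mechanism. Judging by the resource addresses (this is an inference, not the module's actual source), the workaround pairs the managed task definition with a data source that resolves the latest ACTIVE revision, and the service takes whichever revision is newer, roughly:

# Inferred, simplified sketch of the module's internal pattern.
data "aws_ecs_task_definition" "this" {
  # Resolves the latest ACTIVE revision of the family, including
  # revisions registered externally by CI.
  task_definition = aws_ecs_task_definition.this.family
  depends_on      = [aws_ecs_task_definition.this]
}

resource "aws_ecs_service" "this" {
  name    = var.name
  cluster = var.cluster_arn
  # Take the newer of Terraform's revision and the externally
  # registered one. Whenever the data source is deferred to apply time
  # ("will be read during apply", as in the plan above), this whole
  # expression becomes (known after apply) and the plan shows a diff.
  task_definition = "${aws_ecs_task_definition.this.family}:${max(
    aws_ecs_task_definition.this.revision,
    data.aws_ecs_task_definition.this.revision,
  )}"
}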

@pgusa

pgusa commented Aug 13, 2024

I am working on a similar task where our CI pipeline updates the image and creates a new revision in the console. Since Terraform is not aware of this change and still has an old state file with some old revision, it tries to force-replace the task definition.

I am currently trying to implement logic to ignore these changes in my task definition module, but I cannot do it cleanly, as I want the user to choose whether these changes are ignored. The lifecycle block is useless here, and I am looking for suggestions if anyone has implemented a solution like this.
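For what it's worth, the reason the lifecycle block is useless there is that ignore_changes must be a static literal; it cannot reference variables. The usual escape hatch, and essentially what this module does internally, is to keep two mutually exclusive copies of the resource and toggle between them with count. A sketch with hypothetical variable names:

variable "ignore_container_changes" {
  type    = bool
  default = false
}

# Tracked variant: Terraform manages container_definitions normally.
resource "aws_ecs_task_definition" "tracked" {
  count                 = var.ignore_container_changes ? 0 : 1
  family                = var.family
  container_definitions = var.container_definitions_json
}

# Ignored variant: identical, except drift in container_definitions is
# ignored. ignore_changes must be static, hence the duplicated resource.
resource "aws_ecs_task_definition" "ignored" {
  count                 = var.ignore_container_changes ? 1 : 0
  family                = var.family
  container_definitions = var.container_definitions_json

  lifecycle {
    ignore_changes = [container_definitions]
  }
}

Note that flipping the toggle replaces the task definition, since the resource address changes.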

@tetienne

@andrewedstrom Did you find a workaround since February? I'm facing exactly the same situation.

@guidodobboletta

I'm also facing the same issue as OP. I have a centralized place where Terraform and my CI/CD grab the same image value, yet Terraform reports changes all the time.

I compared the task definition generated by CI/CD with the one generated by Terraform, and they are exactly the same; the only differences are registeredAt, registeredBy, and revision. I tried adding tofu refresh before the actual terraform plan and we still see the same problem.

@tetienne

tetienne commented Aug 30, 2024

@bryantbiggs Sorry to ping you, but do you have any idea how we can deal with this issue? Is there any workaround to this workaround? Being able to deploy through CI while managing the task definition through Terraform would be perfect.

@bryantbiggs
Member

Unfortunately, I think we have exhausted the various hacks and avenues for dealing with this at the module/resource level. The solution ultimately needs to come from the ECS API; I don't know if there is anything that could be done at the Terraform AWS provider level.

@tetienne

tetienne commented Sep 2, 2024

Thanks @bryantbiggs for your feedback. Is there any place where we can ask for such an improvement?
Out of curiosity, was this workaround working at some point?

@bryantbiggs
Member

I would highly encourage you to reach out to your AWS account team and provide them feedback on what you are trying to do with ECS and the challenges you are facing. This is how Amazon works (working backwards from the customer), and it's how we will drive actual change from the service side, instead of pursuing hacks through the limited set of functionality we have at our disposal at the resource/module level.

@tetienne

tetienne commented Sep 4, 2024

Thank you for the recommendation, I've sent a message to our AWS support team.
Regarding this issue, here's another discussion thread where it's being discussed extensively: hashicorp/terraform-provider-aws#632

@tetienne

tetienne commented Sep 9, 2024

@bryantbiggs So AWS support asked me to create an issue in the ECS roadmap repository.
The trouble is, I don't know what to ask for. I understand there is a gap in the ECS API, but which gap exactly? May I ask you to create the issue? Perhaps the fact that you work at AWS could help increase its priority, or help us find a workaround.

@bryantbiggs
Member

There are already various existing requests on the containers roadmap; I've listed a few below that I believe are related/relevant:
