cdk deploy in endless loop cause of Fargate Service cant fire up task #7746

Closed
logemann opened this issue May 1, 2020 · 24 comments
Labels
bug This issue is a bug. needs-triage This issue or PR still needs to be triaged.

Comments

@logemann

logemann commented May 1, 2020

I am deploying a CodePipeline stack with deployment to a Fargate service. The problem is that when there is an issue starting the Fargate task, the deployment never returns because Fargate tries to start the task over and over again (roughly every minute).

Roughly my code is:

public createEcsDeployAction(vpc: Vpc, ecrRepo: ecr.Repository, buildOutput : Artifact): EcsDeployAction {
    return new EcsDeployAction({
      actionName: 'EcsDeployAction',
      service: this.createLoadBalancedFargateService(this, vpc, ecrRepo).service,
      input: buildOutput,
    })
  };


  createLoadBalancedFargateService(scope: Construct, vpc: Vpc, ecrRepository: ecr.Repository) {
    return new ecspatterns.ApplicationLoadBalancedFargateService(scope, 'myLbFargateService', {
      vpc: vpc,
      serviceName: "HelloWorldFargateService",
      memoryLimitMiB: 512,
      cpu: 256,
      taskImageOptions: {
        image: ecs.ContainerImage.fromEcrRepository(ecrRepository, "latest"),
      },
    });
  }

My problem could be that I define an image in the ApplicationLoadBalancedFargateService which isn't available during deployment of the stack, because CodePipeline hasn't run yet. I don't know for sure.

The question remains whether it is wise for "cdk deploy" to never terminate because of never-ending attempts to start a task in the backend.

Reproduction Steps

hard to reproduce out of context.

Error Log

No error in the console on cdk deploy. It is hard to find the real error. I tried to find it via the AWS console without success.

Environment

  • CLI Version : aws-cli/2.0.10 Python/3.8.2 Darwin/19.4.0 botocore/2.0.0dev14
  • Framework Version: 1.36.1 (build 4df7dac)
  • OS : Mac OS X

This is 🐛 Bug Report

@logemann logemann added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels May 1, 2020
@jonny-rimek

jonny-rimek commented May 1, 2020

I have the same issue with the ApplicationLoadBalancedFargateService pattern. I use fromAsset(pathName) and deploy via cdk deploy locally. Looking at the task page, there was an error message that it couldn't download the image from ECR. To me, that sounds like an error in the preconfigured IAM permissions. I launched the task via the console and the task successfully reached the status RUNNING.

		const vpc = new ec2.Vpc(this, 'Vpc', {
			subnetConfiguration: [{
				name: 'publicSubnet',
				subnetType: ec2.SubnetType.PUBLIC,
			}],
			natGateways: 0,
		})

		const postgres = new rds.DatabaseInstance(this, 'Postgres', {
			engine: rds.DatabaseInstanceEngine.POSTGRES,
			instanceClass: ec2.InstanceType.of(ec2.InstanceClass.BURSTABLE2, ec2.InstanceSize.MICRO),
			masterUsername: 'postgres',
			deletionProtection: false,
			vpc,
			vpcPlacement: { subnetType: ec2.SubnetType.PUBLIC }
		})

		postgres.connections.allowFromAnyIpv4(ec2.Port.tcp(5432))

		const loadBalancedFargateService = new ecsPatterns.ApplicationLoadBalancedFargateService(this, 'Service', {
			vpc,
			memoryLimitMiB: 512,
			cpu: 256,
			desiredCount: 1,
			taskImageOptions: {
				image: ecs.ContainerImage.fromAsset('services/api'),
			},
		});

CDK version 1.36.1

@logemann
Author

logemann commented May 2, 2020

@jonny-rimek that's a different scenario than mine. I don't have IAM problems.

I dug deeper, and to me it looks like a chicken-and-egg problem. When I remove my ECS deployment stage from CodePipeline and deploy my stack from scratch, everything works. Of course, this gets me a Docker image in my ECR repo (because CodePipeline runs). When I then re-add the ECS deployment stage in my code and re-deploy the stack, everything works because now there is a Docker image in the ECR repo. Subsequent CodePipeline runs triggered by GitHub repo changes work too, and I get full auto-deployment.

So currently I must deploy my stack in two steps: first without the deployment stage and then with it included. That looks wrong to me.

IMO the problem is that ApplicationLoadBalancedFargateService directly wants to bootstrap an image via:

taskImageOptions: {
        image: ecs.ContainerImage.fromEcrRepository(ecrRepository, "latest"),
      },

It doesn't know that it is embedded in an EcsDeployAction, where it should act only when there is an imagedefinitions.json in the input artifact.
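
For readers unfamiliar with the file mentioned above: imagedefinitions.json is a JSON array in which each entry names a container and the image URI it should run. Below is a minimal sketch of how a build step might render it. The helper name `renderImageDefinitions`, the account ID and the repository name are placeholders for illustration and do not come from this thread.

```typescript
// Shape of one entry in imagedefinitions.json.
interface ImageDefinition {
  name: string;     // container name as declared in the task definition
  imageUri: string; // full image URI including the tag
}

// Hypothetical helper: renders the file content a build step would write.
function renderImageDefinitions(
  containerName: string,
  repositoryUri: string,
  tag: string,
): string {
  const definitions: ImageDefinition[] = [
    { name: containerName, imageUri: `${repositoryUri}:${tag}` },
  ];
  return JSON.stringify(definitions);
}

// Placeholder values; substitute your own container name and repository URI.
const fileContent = renderImageDefinitions(
  "hello-world-webapp",
  "123456789012.dkr.ecr.eu-central-1.amazonaws.com/hello-world-webapp",
  "latest",
);
console.log(fileContent);
```

The container name has to match the one in the task definition, which is why several snippets in this thread set containerName explicitly.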

@jonny-rimek

Do you deploy via cdk deploy inside a CodeBuild step, or do you use CodeDeploy?

@jonny-rimek

jonny-rimek commented May 2, 2020

		const fargateTask = new ecs.FargateTaskDefinition(this, 'FargateTask', {
			cpu: 256,
			memoryLimitMiB: 512,
		})

		fargateTask.addContainer("GinContainer", {
			image: ecs.ContainerImage.fromAsset('services/api')
		})

		const cluster = new ecs.Cluster(this, 'Cluster', {
			containerInsights: true,
			vpc
		})

		const fargateService = new ecs.FargateService(this, 'FargateService', {
			cluster,
			taskDefinition: fargateTask,
			desiredCount: 1,
			assignPublicIp: true,
			platformVersion: ecs.FargatePlatformVersion.VERSION1_4
		})

I dropped the ECS pattern and did everything from scratch, and the deployment works just fine (no ALB yet).

If I remove assignPublicIp I get the following error message, and the deployment is back to being stuck:

Stopped reason ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 1 time(s): RequestError: send request failed caused by: Post https://api.ecr....

@logemann
Author

logemann commented May 2, 2020

I need to correct my previous statement. It is indeed an IAM issue, but I don't understand why. This is what I've seen in the task logs:

Status reason CannotPullECRContainerError: AccessDeniedException: User: arn:aws:sts::xxxxxxxxxxx:assumed-role/ContDeployStack-myLbFargateServiceTaskDefExecution-10XL6RI11AJ8J/f81d47b2-fa8e-4e14-ab0c-56a172e7d825 is not authorized to perform: ecr:GetAuthorizationToken

Before, I thought it was a chicken-and-egg problem and tried to circumvent it by using:

taskImageOptions: {
        containerName: repoName,
        //image: ecs.ContainerImage.fromEcrRepository(ecrRepository, "latest"),
        image: ecs.ContainerImage.fromRegistry("hello-world"),
        containerPort: 8080,
      },

Here "hello-world" is the tiniest image on Docker Hub I could find; it acts as a placeholder until my CodePipeline has run.

Now a clean "cdk deploy" finishes OK, but the problem is that when my CodePipeline finishes, the new image won't be pulled into ECS. The scary thing is that my pipeline worked yesterday without problems; I could easily trigger new builds and ECS was updated accordingly.

@logemann
Author

logemann commented May 2, 2020

OK, I am facing a two-headed snake here. It looks like if I use the "placeholder" image approach with fromRegistry("hello-world"), CDK can't know that I want to pull from ECR once my CodePipeline finishes, so the task execution role lacks the required permissions. I can fix that with:

fargateService.taskDefinition.executionRole?.addManagedPolicy((ManagedPolicy.fromAwsManagedPolicyName('AmazonEC2ContainerRegistryPowerUser')));

When I use the fromEcrRepository(ecrRepository, "latest") approach, I still think the problem is that no image is available at first deploy time, which leaves me with the endless deploy loop. Here I don't think it is a permission problem, because CDK knows that I want to interact with ECR and should create the default task execution role accordingly.

I will try both approaches from scratch now to see if my findings hold. Testing always takes ages, of course, because destroying and deploying complex stacks takes a while.
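
As a point of comparison with the AmazonEC2ContainerRegistryPowerUser managed policy used above (which also allows pushing images), here is a sketch of a pull-only policy document for the execution role. The function name `ecrPullPolicy` and the repository ARN are illustrative assumptions. CDK's ECR Repository construct also offers a grantPull method, which should grant a comparable set of permissions, but check the documentation for your CDK version.

```typescript
interface PolicyStatement {
  Effect: "Allow";
  Action: string[];
  Resource: string;
}

interface PolicyDocument {
  Version: string;
  Statement: PolicyStatement[];
}

// Hypothetical helper: builds the IAM permissions a task execution role
// needs in order to pull images from a single ECR repository.
function ecrPullPolicy(repositoryArn: string): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [
      // GetAuthorizationToken cannot be scoped to a repository, so it uses "*".
      { Effect: "Allow", Action: ["ecr:GetAuthorizationToken"], Resource: "*" },
      {
        Effect: "Allow",
        Action: [
          "ecr:BatchCheckLayerAvailability",
          "ecr:GetDownloadUrlForLayer",
          "ecr:BatchGetImage",
        ],
        Resource: repositoryArn,
      },
    ],
  };
}

// Placeholder ARN; substitute the ARN of your own repository.
const pullPolicy = ecrPullPolicy(
  "arn:aws:ecr:eu-central-1:123456789012:repository/hello-world-webapp",
);
console.log(JSON.stringify(pullPolicy, null, 2));
```

The first statement is the permission named in the CannotPullECRContainerError shown earlier in this thread.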

@logemann
Author

logemann commented May 2, 2020

OK. This is the detailed error when directly referencing a non-existing image with fromEcrRepository():

CannotPullContainerError: Error response from daemon: manifest for 985582282849.dkr.ecr.eu-central-1.amazonaws.com/hello-world-webapp:latest not found

So to me it looks like a placeholder image for the first deployment is the only way to go. If you do it this way, you need to add a policy as mentioned in my previous post, because otherwise the CDK-created task execution role does not have enough permissions.

I hope I have not put too much information in here, but this way other people can get an idea of what to do. To the AWS CDK dev team: is there an elegant way to solve this?

@skinny85
Contributor

skinny85 commented May 4, 2020

Hey @logemann ,

Yes, this is an issue. Basically, the problem is that we're currently missing a concept in the CDK that represents "an image that doesn't exist yet, but will be created when the CodePipeline runs".

In a demo project we've done a long time ago, we have a class that represents exactly that. This is how it is used: [1], [2].

Would adding this class to the main CDK project solve your issue @logemann ? If so, I will convert this issue to a feature request.

Thanks,
Adam

@skinny85 skinny85 self-assigned this May 4, 2020
@logemann
Author

logemann commented May 4, 2020

hey @skinny85 ,

Thanks for commenting. I just checked the project and the class you mentioned. It is quite a lot of code (not only the class but also its surroundings) to solve this particular issue. I would rather use my placeholder image than go the mentioned way.

You might check the tutorial I've just finished on this issue to see how I approached it.
https://medium.com/aws-factory/continuous-delivery-of-a-java-spring-boot-webapp-with-aws-cdk-codepipeline-and-docker-56e045812bd2
(any feedback would be awesome too quite frankly :-))

From a dev standpoint it would be super nice if ApplicationLoadBalancedFargateService were smart enough to know it is wrapped in an EcsDeployAction and somehow initialized differently. But I know far too little about the inner workings of CloudFormation to say whether this is even possible.

@skinny85
Contributor

skinny85 commented May 4, 2020

Well, this would solve the following issue that you talk about in your tutorial:

This is a small node express webserver i provided on Dockerhub which only displays a “Waiting for Codepipeline Docker image” webpage. But why do we need this? On the very first deployment of our stack, we cant reference our “to be build” image from our Codepipeline because it wont be there on startup of the stack. So we can’t code something like fromEcrRepository(ecrRepository, “latest”). This would result in an endless deployment loop on the console because AWS tries to start this Service unlimited times and the deployment will just wait on the console for the successful startup which will never happen.

Instead, you would simply do this:

createLoadBalancedFargateService(scope: Construct, vpc: Vpc, ecrRepository: ecr.Repository, pipelineProject: PipelineProject) {
  var fargateService = new ecspatterns.ApplicationLoadBalancedFargateService(scope, 'myLbFargateService', {
     // ...
     taskImageOptions: {
        containerName: repoName,
        image: new PipelineContainerImage(ecrRepository),
        containerPort: 8080,
     },
  });
 
  fargateService.taskDefinition.executionRole?.addManagedPolicy(
    ManagedPolicy.fromAwsManagedPolicyName('AmazonEC2ContainerRegistryPowerUser'));
  return fargateService;
}

And no weird workarounds are needed... isn't that strictly better?

@logemann
Author

logemann commented May 5, 2020

Indeed... I somehow didn't fully understand the project you mentioned, because I thought you needed to use CloudFormationCreateUpdateStackAction and some other things to get PipelineContainerImage up and running.

Somehow I couldn't see that PipelineContainerImage alone would suffice. To me that's definitely worth a feature request then. Can you explain in two short sentences what imageName in PipelineContainerImage gets resolved to, and why it doesn't have the same problem (not finding an image)?

Thanks.

Note: I will then update the tutorial to say that something might be coming that makes this even better. But for that I should at least understand it, haha.

Update: From what I can gather from the class, there is some lazy evaluation going on with regard to the imageName in the ECR repo. But I really don't get what PipelineParam is, and I can't figure out the purpose of the (in my case) unused methods like paramName().

@skinny85
Contributor

skinny85 commented May 6, 2020

The trick with imageName is that it's turned into a CloudFormation parameter - that's what PipelineParam is.

As you can see, the parameter is then filled in by the CloudFormationCreateUpdateStackAction with the URL of the image that was pushed to the ECR repo inside the CodeBuild job.

If you don't want to use CloudFormationCreateUpdateStackAction, but EcsDeployAction, I guess the situation is even simpler: you don't need PipelineContainerImage; your workaround of the dummy web server image (probably any other dummy image containing a server would do; I don't think you need a dedicated one) works fine.

However, there is a problem: the action will update the image "out-of-band", causing an intentional drift in the CloudFormation state (the actual image will be different from the image parameter in CloudFormation). This might prove problematic (for example, an update to your service's properties might revert the image to its original), and in general is a bad practice.

For those reasons, I would advise against using EcsDeployAction, and would instead use CloudFormationCreateUpdateStackAction with PipelineContainerImage.

Does this make sense?
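
To make the parameter idea more concrete, here is a deliberately simplified model in plain TypeScript. It is not CDK or CloudFormation code: the `Template` type, the `${Name}` placeholder syntax and `resolveImage` are assumptions made for illustration, and the sketch assumes the parameter carries only the image tag. The point it tries to show is that the template never names a concrete image; a value is supplied only when the pipeline deploys the stack, after the build has pushed the image.

```typescript
// Simplified model of a template: the image tag is a parameter rather than a
// literal, so the template can be written before any image exists.
interface Template {
  Parameters: Record<string, { Type: "String" }>;
  imageTemplate: string; // contains ${Name} placeholders
}

const template: Template = {
  Parameters: { ImageTag: { Type: "String" } },
  // Placeholder repository URI; the ${ImageTag} part is resolved later.
  imageTemplate:
    "123456789012.dkr.ecr.eu-central-1.amazonaws.com/hello-world-webapp:${ImageTag}",
};

// Hypothetical stand-in for parameter resolution at deploy time.
function resolveImage(t: Template, overrides: Record<string, string>): string {
  return t.imageTemplate.replace(/\$\{(\w+)\}/g, (_match: string, name: string) => {
    if (!(name in t.Parameters)) throw new Error(`unknown parameter ${name}`);
    const value = overrides[name];
    if (value === undefined) throw new Error(`no value supplied for ${name}`);
    return value;
  });
}

// The pipeline's deploy step would pass the tag produced by the build step.
const resolvedImage = resolveImage(template, { ImageTag: "build-42" });
console.log(resolvedImage);
```

In this model a deployment without an override fails immediately instead of starting a service that points at a missing image.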

@logemann
Author

logemann commented May 7, 2020

Yeah, that makes sense, and so I understood correctly that you can't use PipelineContainerImage in isolation. But I still don't think it's developer-friendly. Using CloudFormationCreateUpdateStackAction feels like quite a big workaround too if in the end you just want to use EcsDeployAction. I think we can close this one; adding PipelineContainerImage to the distribution would only make sense if there were a ton of documentation on how to use it in conjunction with CloudFormationCreateUpdateStackAction as a kind of replacement for EcsDeployAction for this specific use case. A use case which is IMO quite mainstream.

@logemann logemann closed this as completed May 7, 2020
@ChrisLahaye

@logemann were you able to solve this issue?

skinny85 added a commit to skinny85/aws-cdk that referenced this issue Nov 30, 2020
…ant to be used in CodePipeline

While CDK Pipelines is the idiomatic way of deploying ECS applications in CDK,
it does not handle the case where the application's source code is kept in a separate source code repository from the CDK infrastructure code.
This adds a new class to the ECS module,
`TagParameterContainerImage`, that allows deploying a service managed that way through CodePipeline.

Related to aws#1237
Related to aws#7746
mergify bot pushed a commit that referenced this issue Dec 7, 2020
… be used in CodePipeline (#11795)

While CDK Pipelines is the idiomatic way of deploying ECS applications in CDK,
it does not handle the case where the application's source code is kept in a separate source code repository from the CDK infrastructure code.
This adds a new class to the ECS module,
`TagParameterContainerImage`, that allows deploying a service managed that way through CodePipeline.

Related to #1237
Related to #7746

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
@ianpogi5

ianpogi5 commented Jul 9, 2021

		const fargateTask = new ecs.FargateTaskDefinition(this, 'FargateTask', {
			cpu: 256,
			memoryLimitMiB: 512,
		})

		fargateTask.addContainer("GinContainer", {
			image: ecs.ContainerImage.fromAsset('services/api')
		})

		const cluster = new ecs.Cluster(this, 'Cluster', {
			containerInsights: true,
			vpc
		})

		const fargateService = new ecs.FargateService(this, 'FargateService', {
			cluster,
			taskDefinition: fargateTask,
			desiredCount: 1,
			assignPublicIp: true,
			platformVersion: ecs.FargatePlatformVersion.VERSION1_4
		})

I dropped the ecs pattern and did everything from scratch and the deployment works just fine(no ALB yet)

If I remove assignPublicIp I get the following error message Stopped reason ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 1 time(s): RequestError: send request failed caused by: Post https://api.ecr.... and the deployment is back to being stuck

@jonny-rimek did you find a solution to this? I also dropped the ECS pattern and now I am facing the same issue.

@hansfpc

hansfpc commented Sep 19, 2021

same problem here

@dineshtrivedi

I am having a similar problem with ecs_pattern and QueueProcessingFargateService:

ResourceInitializationError: unable to pull secrets or registry auth: pull command failed: : signal: killed

@yoshinorihisakawa

I can't solve this problem....

@univerze

Hi, I had the same problem, I managed to get a workaround with the pattern. It's more code though

    const vpc = new ec2.Vpc(this, "myvpc", {
      maxAzs: 3
    });

    const cluster = new ecs.Cluster(this, "mycluster", {
      vpc: vpc
    });

    const repo = new ecr.Repository(this, 'apprepo', {
      repositoryName: 'app-repo',
    });

    const execRole = new iam.Role(this, 'taskexecutionrole', {
      assumedBy: new iam.ServicePrincipal('ecs-tasks.amazonaws.com'),
    });
    execRole.addManagedPolicy(iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEC2ContainerRegistryPowerUser'));

    const taskDef = new ecs.FargateTaskDefinition(this, 'mytaskdef', {
      family: 'app-taskdef',
      executionRole: execRole,
    });
    taskDef.addContainer('myappimage', {
      image: ecs.ContainerImage.fromRegistry("amazon/amazon-ecs-sample"),
      containerName: 'nodejs',
      portMappings: [
        { containerPort: 80 }
      ]
    });

    new ecs_patterns.ApplicationLoadBalancedFargateService(this, "myecsservice", {
      cluster: cluster,
      publicLoadBalancer: true,
      desiredCount: 0,
      taskDefinition: taskDef,
    });

At this point it may be better to drop the pattern and create everything manually. With this, though, I can use GitHub Actions to push to the ECR repo with no issues, and the Node.js app starts.

@shresthapradip

The trick with imageName is that it's turned into a CloudFormation parameter - that's what PipelineParam is.

As you can see, the parameter is then filled in by the CloudFormationCreateUpdateStackAction with the URL of the image that was pushed to the ECR repo inside the CodeBuild job.

If you don't want to use CloudFormationCreateUpdateStackAction, but EcsDeployAction, I guess the situation is even simpler: you don't need PipelineContainerImage; your workaround of the dummy web server image (probably any other dummy image containing a server would do; I don't think you need a dedicated one) works fine.

However, there is a problem: the action will update the image "out-of-band", causing an intentional drift in the CloudFormation state (the actual image will be different from the image parameter in CloudFormation). This might prove problematic (for example, an update to your service's properties might revert the image to its original), and in general is a bad practice.

For those reasons, I would advise against using EcsDeployAction, and would instead use CloudFormationCreateUpdateStackAction with PipelineContainerImage.

Does this make sense?

This is still happening in a newer CDK version: 2.41.
And the class PipelineContainerImage is no longer compatible.

@rafaelmarques7

this is happening to me as well

@0xBradock

0xBradock commented Jul 3, 2024

Hello,

I am still trying to deploy the most basic stack just to get ApplicationLoadBalancedFargateService working.
I took the example from the documentation and tried to deploy it with cdk deploy stack-app.

It hangs.
I didn't manage to get any logs.

import { Construct } from 'constructs';
import { App, Stack, StackProps } from 'aws-cdk-lib';
import { Vpc } from 'aws-cdk-lib/aws-ec2';
import { Cluster, ContainerImage } from 'aws-cdk-lib/aws-ecs';
import { ApplicationLoadBalancedFargateService } from 'aws-cdk-lib/aws-ecs-patterns';

const app = new App();

export class ApiSixStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const vpc = Vpc.fromLookup(this, 'VPC', { isDefault: true });

    const cluster = new Cluster(this, 'Cluster', { vpc });

    new ApplicationLoadBalancedFargateService(this, 'Service', {
      cluster,
      memoryLimitMiB: 1024,
      desiredCount: 1,
      cpu: 512,
      taskImageOptions: { image: ContainerImage.fromRegistry('amazon/amazon-ecs-sample') },
      loadBalancerName: 'application-lb-name',
    });
  }
}

const env = {
  account: process.env.CDK_DEFAULT_ACCOUNT,
  region: process.env.CDK_DEFAULT_REGION,
};

new ApiSixStack(app, 'stack-app', { env });

app.synth();

Any help is appreciated,

@skinny85
Contributor

skinny85 commented Jul 3, 2024

@0xBradock check out the container status in ECS. I would bet the port amazon/amazon-ecs-sample listens on is different (the default is 80, and it probably listens on 8080, or something).

@wjes

wjes commented Sep 2, 2024

@0xBradock I think it's because the load balancer cannot reach the Fargate service's health checks. The default VPC you're using doesn't have private subnets, and since assignPublicIp is false by default (a.k.a. "use private subnets"), your Fargate tasks end up isolated in limbo.

You can solve it in two ways:

  1. Set assignPublicIp to true in ApplicationLoadBalancedFargateService. This way Fargate will live in the public subnet and the load balancer should be able to reach it. The downside is that, besides going through the load balancer, the Fargate tasks can be reached directly via that public IP. You'll also have to pay for that IP.

  2. Create a private subnet in your default VPC. Even better, create a new VPC with public and private subnets. The load balancer should live in the public subnet and Fargate in the private subnet. The downside is that Fargate will be completely isolated from the internet and only accessible through the load balancer. If the apps running in Fargate need to call an external API, you'll have to add a NAT gateway to the mix (which may be expensive, so make sure to set natGateways to 1 in your VPC constructor, otherwise one NAT gateway will be created in each public subnet).
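
The two options above, together with the earlier assignPublicIp report in this thread, can be summarised as a reachability rule: a Fargate task needs some outbound path to the registry in order to pull its image. The sketch below is a simplified model of that rule in plain TypeScript. The type and function names are invented for illustration, it assumes VPC endpoints are the only alternative to internet access, and it ignores security groups and route table details.

```typescript
type SubnetKind = "public" | "private-with-nat" | "isolated";

interface TaskNetworking {
  subnet: SubnetKind;
  assignPublicIp: boolean;
  // Assumed to mean the ECR interface endpoints plus the S3 gateway endpoint.
  hasEcrVpcEndpoints: boolean;
}

// Hypothetical helper: can a task with this networking pull its image?
// "ecr" means a private ECR repository; "internet" means a public registry
// such as Docker Hub.
function canPullImage(n: TaskNetworking, registry: "ecr" | "internet"): boolean {
  const hasInternet =
    n.subnet === "private-with-nat" || (n.subnet === "public" && n.assignPublicIp);
  if (hasInternet) return true;
  // Without internet access, only ECR remains reachable, and only via endpoints.
  return registry === "ecr" && n.hasEcrVpcEndpoints;
}

// Default VPC (public subnets only) with assignPublicIp left at its default.
const defaultVpcTask: TaskNetworking = {
  subnet: "public",
  assignPublicIp: false,
  hasEcrVpcEndpoints: false,
};
console.log(canPullImage(defaultVpcTask, "internet")); // false
console.log(canPullImage({ ...defaultVpcTask, assignPublicIp: true }, "internet")); // true
```

Under this model the stack posted above would never get a running task, which would match the hanging cdk deploy.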
