aws-ecs-patterns (QueueProcessingFargateService): non-editable Scaling Policy causes race conditions & dropped tasks #20706
Comments
It is the responsibility of the queue worker, when it pulls a message from the queue, to hide it from the rest of the workers (with the maximum visibility timeout being 12 hours). However, what I'm hearing from this ticket is that a sufficiently complex task (e.g. video de/encode) which pegs CPU or memory usage for a prolonged time will be culled early because of the scaling configuration, is that correct?
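For context, the "hiding" here is the SQS visibility timeout, which a worker can extend mid-processing via ChangeMessageVisibility. A minimal sketch using the AWS SDK v3 for JavaScript (the helper name and arguments are illustrative):

```ts
import { SQSClient, ChangeMessageVisibilityCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Keep an in-flight message hidden from other workers while processing
// continues; 43200 seconds (12 hours) is the SQS maximum.
async function extendVisibility(queueUrl: string, receiptHandle: string): Promise<void> {
  await sqs.send(new ChangeMessageVisibilityCommand({
    QueueUrl: queueUrl,
    ReceiptHandle: receiptHandle, // from the ReceiveMessage response
    VisibilityTimeout: 43200,
  }));
}
```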
Yes, kind of like -
An analogy would be the SQS-Lambda pattern: you do not see any memory- or CPU-based scaling there, because it would have the same effect as above; scaling should be flexible and have a priority. Lambda just times out on high-memory and high-CPU issues. QueueProcessingFargateService is a solution for the SQS-Lambda type of pattern where we know the task will take longer than Lambda's 15-minute timeout to process and we do not want to maintain our own EC2 instances. ASGs/ECS also have the same issue, but EC2 instances have a feature called Termination Protection, with which you can disable scale-in of an EC2 instance until the process has completed and no more activity is required.
I have the same problem. In my case I have a service using QueueProcessingFargateService for text parsing. Sometimes there are more than 1000 tasks in the SQS queue, each with low CPU load. When my queue-depth scaling rule adds a new instance, the CPU scaling rule stops it within a minute; the queue-depth rule then adds an instance again, and the CPU rule stops that one too. To solve it I manually deleted the CPU scaling rule in the AWS console, but I think doing it manually in the console is not a good solution.
Sharing how I am working around this issue at the moment. I created a class that derives from QueueProcessingFargateService and overrides its autoscaling configuration:

```ts
// Imports assume CDK v2 ("aws-cdk-lib"); adjust the module paths for CDK v1.
import { BaseService } from "aws-cdk-lib/aws-ecs";
import { QueueProcessingFargateService } from "aws-cdk-lib/aws-ecs-patterns";

// see: https://github.com/aws/aws-cdk/issues/20706
export class QueueAndMessageBasedScalingOnlyFargateService extends QueueProcessingFargateService {
  /**
   * Configure autoscaling based only on the number of messages visible in the SQS queue.
   *
   * @param service the ECS/Fargate service to which the autoscaling rules are applied
   */
  protected configureAutoscalingForService(service: BaseService) {
    const scalingTarget = service.autoScaleTaskCount({
      maxCapacity: this.maxCapacity,
      minCapacity: this.minCapacity,
    });
    scalingTarget.scaleOnMetric("QueueMessagesVisibleScaling", {
      metric: this.sqsQueue.metricApproximateNumberOfMessagesVisible(),
      scalingSteps: this.scalingSteps,
    });
  }
}
```

Then I can simply use the derived class (QueueAndMessageBasedScalingOnlyFargateService) anywhere I would otherwise use QueueProcessingFargateService.
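For illustration, a minimal instantiation might look like this (the scope, image name, and scaling numbers are hypothetical):

```ts
import * as ecs from "aws-cdk-lib/aws-ecs";

// Hypothetical usage sketch: a drop-in replacement for QueueProcessingFargateService.
// "this" is the surrounding Stack or Construct scope.
new QueueAndMessageBasedScalingOnlyFargateService(this, "Worker", {
  image: ecs.ContainerImage.fromRegistry("amazon/amazon-ecs-sample"), // placeholder image
  memoryLimitMiB: 2048,
  cpu: 1024,
  minScalingCapacity: 1,
  maxScalingCapacity: 10,
  scalingSteps: [
    { upper: 0, change: -1 },   // queue empty: remove a task
    { lower: 100, change: +1 }, // backlog building: add a task
    { lower: 500, change: +5 }, // large backlog: add several tasks
  ],
});
```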
For others who are running into this problem while we work on the linked PR, there's an alternative. ECS now offers task-level scale-in protection via the newly released task scale-in protection endpoint. Individual tasks are now able to protect themselves during long-running compute work. If your worker detects that it's going to be processing a long-running job, it can call the ECS agent URI (automatically injected into all containers as the ECS_AGENT_URI environment variable) to enable, say, 60 minutes of scale-in protection for itself.
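A minimal sketch of that call from a Node 18+ TypeScript worker (the endpoint path follows the documented ECS task-protection API; the helper name is mine):

```ts
// Enable scale-in protection for the current task via the ECS agent's
// task-protection endpoint. ECS_AGENT_URI is injected into every container
// by the ECS agent; the global fetch requires Node 18+.
async function protectTask(expiresInMinutes = 60): Promise<void> {
  const res = await fetch(`${process.env.ECS_AGENT_URI}/task-protection/v1/state`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ ProtectionEnabled: true, ExpiresInMinutes: expiresInMinutes }),
  });
  if (!res.ok) {
    throw new Error(`task-protection request failed with status ${res.status}`);
  }
}
```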
The ExpiresInMinutes parameter is optional and defaults to 2 hours of protection. The downside is that this requires updates to your service code, but it does give you control over which tasks the scheduler is allowed to kill.
Running into this scaler thrashing in production. If you look at the ECS events, one alarm is increasing the desired count because there are a lot of objects on the queue, and one is decreasing the count because the task is using little CPU. This repeats every couple of minutes and no work gets done. Our task makes sequential HTTP requests and is mainly IO- and memory-bound, so it's hard to saturate the CPU. I'm not sure what the original intention of sneaking the CPU scaler into this construct was, but it seems like a bug that the two scalers conflict like this. Sad that the fix in #23310 died a slow death (it seems there was some uncertainty about whether this was legitimate behavior or not).
@keenangraham I was just looking at this yesterday, thinking no one had resolved this yet. I added a new PR; hopefully it will get reviewed and merged.
…rget utilization (#28315) Added an optional parameter, defaulting to false, for disabling the CPU-based scaling policy that conflicts with the queue-visible-messages-based policy. When the CPU policy is disabled, this stops the race-condition issue mentioned in #20706 by scaling only on the number of messages on the queue, similar to the SQS-Lambda pattern. Note: if CPU-based scaling remains enabled (the default), this bug will crop up again and the user has to handle container termination manually. Updated integration tests and unit tests are passing. Closes #20706. ---- *By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
Nice. Thanks @AnuragMohapatra!
Describe the bug
Current Scenario
For the scaling policy of the queue-processing Fargate service, two parts are added: a target-tracking policy on CPU utilization and a step-scaling policy on the approximate number of visible messages in the queue.
This can be found here -
aws-cdk/packages/@aws-cdk/aws-ecs-patterns/lib/base/queue-processing-service-base.ts
Line 344 in fd5808f
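The method in question looks roughly like the sketch below (paraphrased, not the verbatim source; the 50% CPU target is an assumed default):

```ts
// Paraphrased sketch of QueueProcessingServiceBase.configureAutoscalingForService:
// both policies are attached unconditionally, with no way to opt out.
protected configureAutoscalingForService(service: BaseService) {
  const scalingTarget = service.autoScaleTaskCount({
    maxCapacity: this.maxCapacity,
    minCapacity: this.minCapacity,
  });
  // Target tracking on CPU: scales in whenever CPU is low, even mid-job.
  scalingTarget.scaleOnCpuUtilization("CpuScaling", {
    targetUtilizationPercent: 50, // assumed default
  });
  // Step scaling on queue depth: scales out as messages accumulate.
  scalingTarget.scaleOnMetric("QueueMessagesVisibleScaling", {
    metric: this.sqsQueue.metricApproximateNumberOfMessagesVisible(),
    scalingSteps: this.scalingSteps,
  });
}
```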
Issue
The CPU-based scaling does not seem appropriate in a queue-processing Fargate service; the service should only scale out or in depending on the number of messages in the queue, not on the CPU utilization of the tasks.
Because of the CPU-based scaling, the auto-scaling group may start a new task that will process the same message again, if the message triggers a CPU-intensive process that does not complete before the scaling alarm fires.
Also, if the process is memory-intensive rather than CPU-intensive, the CPU-based scaling will always be in alarm, causing the auto-scaling group to remove tasks until it reaches the desired capacity.
The same scenarios apply to the memory utilization metric when the running task is actually CPU-intensive.
Since there is no task-level termination protection, and a disable-scale-in option is missing from the patterns, this can cause the ASG to terminate a task that is mid-execution.
Expected Behavior
When a queue-processing Fargate service has been set up to scale out only on the approximate number of messages in the queue, and scale-in has been disabled, it should not terminate tasks.
Current Behavior
The ASG on the queue-processing Fargate service starts terminating tasks when a task is memory-intensive and has a long processing time, because a CloudWatch scale-in alarm is triggered by the CPUUtilization scaling policy, thus terminating a random task mid-execution.
Reproduction Steps
CDK code along the following lines -
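A reconstructed minimal example (the stack name, sizes, and sample image are illustrative; imports assume CDK v2):

```ts
import * as cdk from "aws-cdk-lib";
import * as ecs from "aws-cdk-lib/aws-ecs";
import { QueueProcessingFargateService } from "aws-cdk-lib/aws-ecs-patterns";

const app = new cdk.App();
const stack = new cdk.Stack(app, "QueueWorkerStack"); // hypothetical stack name

// Minimal usage: the construct creates the queue, the cluster, and
// both scaling policies on its own.
new QueueProcessingFargateService(stack, "Worker", {
  image: ecs.ContainerImage.fromRegistry("amazon/amazon-ecs-sample"), // placeholder image
  memoryLimitMiB: 2048,
  cpu: 1024,
  maxScalingCapacity: 10,
});
```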
This will create a new QueueProcessingFargateService with two scaling policies: a TargetTrackingScaling policy on CPU utilization and a StepScaling policy on ApproximateNumberOfMessagesVisible,
which causes the two alarms to conflict and fire continuously: the queue-depth alarm scales the service out while the low-CPU alarm immediately scales it back in.
Possible Solution
The issue is with this method in the queue-processing Fargate service pattern base -
aws-cdk/packages/@aws-cdk/aws-ecs-patterns/lib/base/queue-processing-service-base.ts
Line 344 in fd5808f
It adds a default CPUUtilizationScalingPolicy that cannot be removed, edited, or disabled.
Solution 1
Remove the CPU utilization scaling policy if it is not strictly required.
Solution 2
Add optional properties that let the user disable scale-in on the CPU utilization metric, or modify the policy's values as per the user's needs.
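This is roughly the shape the fix in PR #28315 later took. A hypothetical usage sketch (prop name per that PR; verify it against your CDK version):

```ts
// Same stack shape as the reproduction sketch above, with the CPU-based
// policy opted out so only queue depth drives scaling.
new QueueProcessingFargateService(stack, "Worker", {
  image: ecs.ContainerImage.fromRegistry("amazon/amazon-ecs-sample"), // placeholder image
  memoryLimitMiB: 2048,
  cpu: 1024,
  maxScalingCapacity: 10,
  disableCpuBasedScaling: true, // the opt-out added by #28315
});
```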
Additional Information/Context
No response
CDK CLI Version
2.27
Framework Version
No response
Node.js Version
16.14.2
OS
Linux
Language
Typescript
Language Version
3.9.7
Other information
No response