aws-ecs-patterns (QueueProcessingFargateService): non-editable Scaling Policy causes race conditions & dropped tasks #20706
Comments
It is the responsibility of the queue worker, when it pulls a message from the queue, to hide it from the rest of the workers (with the maximum visibility timeout being 12 hours). However, what I'm hearing from this ticket is that a sufficiently complex task (e.g. video de/encode) which pegs CPU or memory usage for a prolonged time will be culled early because of the scaling configuration, is that correct?
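For context, the "hiding" here is the SQS visibility timeout, which a worker can extend mid-processing via ChangeMessageVisibility. A minimal sketch using the AWS SDK v3 for JavaScript (the helper name and arguments are illustrative):

```ts
import { SQSClient, ChangeMessageVisibilityCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Keep an in-flight message hidden from other workers while processing
// continues; 43200 seconds (12 hours) is the SQS maximum.
async function extendVisibility(queueUrl: string, receiptHandle: string): Promise<void> {
  await sqs.send(new ChangeMessageVisibilityCommand({
    QueueUrl: queueUrl,
    ReceiptHandle: receiptHandle, // from the ReceiveMessage response
    VisibilityTimeout: 43200,
  }));
}
```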
Yes, kind of like -
An analogy would be the SQS-Lambda pattern: you do not see any memory- or CPU-based scaling there, because it would have the same effect as above; scaling should be flexible and have a priority. Lambda just times out on high-memory and high-CPU issues. QueueProcessingFargateService is a solution for the SQS-Lambda type of pattern where we know the task will take longer than Lambda's 15-minute timeout to process and we do not want to maintain our own EC2 instances. ASGs/ECS also have the same issue, but EC2 instances have a feature called Termination Protection, with which you can disable scale-in of an EC2 instance until the process has completed and no more activity is required.
I have the same problem. In my case I have a service using QueueProcessingFargateService for text parsing. Sometimes there are more than 1000 tasks in the SQS queue, each with low CPU load. When my queue-depth scaling rule adds a new instance, the CPU scaling rule stops it within a minute; the queue-depth rule then adds an instance again, and the CPU rule stops that one too. To solve it I manually deleted the CPU scaling rule in the AWS console, but I think doing it manually in the console is not a good solution.
Sharing how I am working around this issue at the moment. I created a class that derives from QueueProcessingFargateService and overrides its autoscaling configuration:

```ts
// Imports assume CDK v2 ("aws-cdk-lib"); adjust the module paths for CDK v1.
import { BaseService } from "aws-cdk-lib/aws-ecs";
import { QueueProcessingFargateService } from "aws-cdk-lib/aws-ecs-patterns";

// see: https://github.com/aws/aws-cdk/issues/20706
export class QueueAndMessageBasedScalingOnlyFargateService extends QueueProcessingFargateService {
  /**
   * Configure autoscaling based only on the number of messages visible in the SQS queue.
   *
   * @param service the ECS/Fargate service to which the autoscaling rules are applied
   */
  protected configureAutoscalingForService(service: BaseService) {
    const scalingTarget = service.autoScaleTaskCount({
      maxCapacity: this.maxCapacity,
      minCapacity: this.minCapacity,
    });
    scalingTarget.scaleOnMetric("QueueMessagesVisibleScaling", {
      metric: this.sqsQueue.metricApproximateNumberOfMessagesVisible(),
      scalingSteps: this.scalingSteps,
    });
  }
}
```

Then I can simply use the derived class (QueueAndMessageBasedScalingOnlyFargateService) anywhere I would otherwise use QueueProcessingFargateService.
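For illustration, a minimal instantiation might look like this (the scope, image name, and scaling numbers are hypothetical):

```ts
import * as ecs from "aws-cdk-lib/aws-ecs";

// Hypothetical usage sketch: a drop-in replacement for QueueProcessingFargateService.
// "this" is the surrounding Stack or Construct scope.
new QueueAndMessageBasedScalingOnlyFargateService(this, "Worker", {
  image: ecs.ContainerImage.fromRegistry("amazon/amazon-ecs-sample"), // placeholder image
  memoryLimitMiB: 2048,
  cpu: 1024,
  minScalingCapacity: 1,
  maxScalingCapacity: 10,
  scalingSteps: [
    { upper: 0, change: -1 },   // queue empty: remove a task
    { lower: 100, change: +1 }, // backlog building: add a task
    { lower: 500, change: +5 }, // large backlog: add several tasks
  ],
});
```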
For others who are running into this problem while we work on the linked PR, there's an alternative. ECS now offers task-level scale-in protection via the newly released task scale-in protection endpoint. Individual tasks are now able to protect themselves during long-running compute work. If your worker detects that it's going to be processing a long-running job, it can call the ECS agent URI (automatically injected into all containers as the ECS_AGENT_URI environment variable) to enable, say, 60 minutes of scale-in protection for itself.
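A minimal sketch of that call from a Node 18+ TypeScript worker (the endpoint path follows the documented ECS task-protection API; the helper name is mine):

```ts
// Enable scale-in protection for the current task via the ECS agent's
// task-protection endpoint. ECS_AGENT_URI is injected into every container
// by the ECS agent; the global fetch requires Node 18+.
async function protectTask(expiresInMinutes = 60): Promise<void> {
  const res = await fetch(`${process.env.ECS_AGENT_URI}/task-protection/v1/state`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ ProtectionEnabled: true, ExpiresInMinutes: expiresInMinutes }),
  });
  if (!res.ok) {
    throw new Error(`task-protection request failed with status ${res.status}`);
  }
}
```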
The ExpiresInMinutes parameter is optional and defaults to 2 hours of protection. The downside is that this requires updates to your service code, but it does give you control over which tasks the scheduler is allowed to kill.
Running into this scaler thrashing in production. If you look at the ECS events, one alarm is increasing the desired count because there are a lot of objects on the queue, and one is decreasing the count because the task is using little CPU. This repeats every couple of minutes and no work gets done. Our task makes sequential HTTP requests and is mainly IO- and memory-bound, so it's hard to saturate the CPU. I'm not sure what the original intention of sneaking the CPU scaler into this construct was, but it seems like a bug that the two scalers conflict like this. Sad that the fix in #23310 died a slow death (it seems there was some uncertainty about whether this was legitimate behavior or not).
@keenangraham I was just looking at this yesterday, thinking no one had resolved this yet. I added a new PR; hopefully it will get reviewed and merged.
…rget utilization (#28315) Added an optional parameter, defaulting to false, for disabling the CPU-based scaling policy that conflicts with the queue-visible-messages-based policy. When the CPU policy is disabled, this stops the race-condition issue mentioned in #20706 by scaling only on the number of messages on the queue, similar to the SQS-Lambda pattern. Note: if CPU-based scaling remains enabled (the default), this bug will crop up again and the user has to handle container termination manually. Updated integration tests and unit tests are passing. Closes #20706. ---- *By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
Nice. Thanks @AnuragMohapatra!
Describe the bug
Current Scenario
For the scaling policy of the queue-processing Fargate service, two parts are added: a target-tracking policy on CPU utilization and a step-scaling policy on the approximate number of visible messages in the queue.
This can be found here -
aws-cdk/packages/@aws-cdk/aws-ecs-patterns/lib/base/queue-processing-service-base.ts
Line 344 in fd5808f
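The method in question looks roughly like the sketch below (paraphrased, not the verbatim source; the 50% CPU target is an assumed default):

```ts
// Paraphrased sketch of QueueProcessingServiceBase.configureAutoscalingForService:
// both policies are attached unconditionally, with no way to opt out.
protected configureAutoscalingForService(service: BaseService) {
  const scalingTarget = service.autoScaleTaskCount({
    maxCapacity: this.maxCapacity,
    minCapacity: this.minCapacity,
  });
  // Target tracking on CPU: scales in whenever CPU is low, even mid-job.
  scalingTarget.scaleOnCpuUtilization("CpuScaling", {
    targetUtilizationPercent: 50, // assumed default
  });
  // Step scaling on queue depth: scales out as messages accumulate.
  scalingTarget.scaleOnMetric("QueueMessagesVisibleScaling", {
    metric: this.sqsQueue.metricApproximateNumberOfMessagesVisible(),
    scalingSteps: this.scalingSteps,
  });
}
```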
Issue
The CPU-based scaling does not seem appropriate in a queue-processing Fargate service; the service should only scale out or in depending on the number of messages in the queue, not on the CPU utilization of the tasks.
Because of the CPU-based scaling, the auto-scaling group may start a new task that will process the same message again, if the message triggers a CPU-intensive process that does not complete before the scaling alarm fires.
Also, if the process is memory-intensive rather than CPU-intensive, the CPU-based scaling will always be in alarm, causing the auto-scaling group to remove tasks until it reaches the desired capacity.
The same scenarios apply to the memory utilization metric when the running task is actually CPU-intensive.
Since there is no task-level termination protection, and a disable-scale-in option is missing from the patterns, this can cause the ASG to terminate a task that is mid-execution.
Expected Behavior
When a queue-processing Fargate service has been set up to scale out only on the approximate number of messages in the queue, and scale-in has been disabled, it should not terminate tasks.
Current Behavior
The ASG on the queue-processing Fargate service starts terminating tasks when a task is memory-intensive and has a long processing time, because a CloudWatch scale-in alarm is triggered by the CPUUtilization scaling policy, thus terminating a random task mid-execution.
Reproduction Steps
CDK code along the following lines -
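A reconstructed minimal example (the stack name, sizes, and sample image are illustrative; imports assume CDK v2):

```ts
import * as cdk from "aws-cdk-lib";
import * as ecs from "aws-cdk-lib/aws-ecs";
import { QueueProcessingFargateService } from "aws-cdk-lib/aws-ecs-patterns";

const app = new cdk.App();
const stack = new cdk.Stack(app, "QueueWorkerStack"); // hypothetical stack name

// Minimal usage: the construct creates the queue, the cluster, and
// both scaling policies on its own.
new QueueProcessingFargateService(stack, "Worker", {
  image: ecs.ContainerImage.fromRegistry("amazon/amazon-ecs-sample"), // placeholder image
  memoryLimitMiB: 2048,
  cpu: 1024,
  maxScalingCapacity: 10,
});
```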
This will create a new QueueProcessingFargateService with two scaling policies: a TargetTrackingScaling policy on CPU utilization and a StepScaling policy on ApproximateNumberOfMessagesVisible,
which causes the two alarms to conflict and fire continuously: the queue-depth alarm scales the service out while the low-CPU alarm immediately scales it back in.
Possible Solution
The issue is with this method in the queue-processing Fargate service pattern base -
aws-cdk/packages/@aws-cdk/aws-ecs-patterns/lib/base/queue-processing-service-base.ts
Line 344 in fd5808f
It adds a default CPUUtilizationScalingPolicy that cannot be removed, edited, or disabled.
Solution 1
Remove the CPU utilization scaling policy if it is not strictly required.
Solution 2
Add optional properties that let the user disable scale-in on the CPU utilization metric, or modify the policy's values as per the user's needs.
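This is roughly the shape the fix in PR #28315 later took. A hypothetical usage sketch (prop name per that PR; verify it against your CDK version):

```ts
// Same stack shape as the reproduction sketch above, with the CPU-based
// policy opted out so only queue depth drives scaling.
new QueueProcessingFargateService(stack, "Worker", {
  image: ecs.ContainerImage.fromRegistry("amazon/amazon-ecs-sample"), // placeholder image
  memoryLimitMiB: 2048,
  cpu: 1024,
  maxScalingCapacity: 10,
  disableCpuBasedScaling: true, // the opt-out added by #28315
});
```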
Additional Information/Context
No response
CDK CLI Version
2.27
Framework Version
No response
Node.js Version
16.14.2
OS
Linux
Language
Typescript
Language Version
3.9.7
Other information
No response