AWS SQS Queue Scaler fails to account for in-flight messages, premature scale down #1663

Closed
TyBrown opened this issue Mar 11, 2021 · 0 comments · Fixed by #1664
Labels
bug Something isn't working

Comments


TyBrown commented Mar 11, 2021

Report

The KEDA AWS SQS Queue scaler fails to account for in-flight (i.e. NotVisible) messages when deciding whether a ScaledObject should scale in, and prematurely scales the target down even while a message is still being processed.

Expected Behavior

KEDA should wait to scale down until both ApproximateNumberOfMessages and ApproximateNumberOfMessagesNotVisible reach 0, allowing any in-flight messages to complete processing.
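
For illustration, a minimal sketch of computing the queue length as the sum of both attributes (equivalently, only scaling in once both are 0) using the AWS SDK for Go v1. This is not KEDA's actual scaler code; the function name and setup are placeholders:

```go
package main

import (
	"fmt"
	"strconv"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// approximateQueueLength returns visible + in-flight messages, so a scaler
// using it would only scale in once both counts have reached 0.
// (Illustrative helper, not the actual KEDA implementation.)
func approximateQueueLength(svc *sqs.SQS, queueURL string) (int64, error) {
	out, err := svc.GetQueueAttributes(&sqs.GetQueueAttributesInput{
		QueueUrl: aws.String(queueURL),
		AttributeNames: []*string{
			aws.String(sqs.QueueAttributeNameApproximateNumberOfMessages),
			aws.String(sqs.QueueAttributeNameApproximateNumberOfMessagesNotVisible),
		},
	})
	if err != nil {
		return 0, err
	}
	var total int64
	for _, v := range out.Attributes {
		n, err := strconv.ParseInt(aws.StringValue(v), 10, 64)
		if err != nil {
			return 0, err
		}
		total += n
	}
	return total, nil
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	length, err := approximateQueueLength(sqs.New(sess), "https://sqs.us-east-1.amazonaws.com/1234567890/testing")
	if err != nil {
		panic(err)
	}
	fmt.Println("queue length (visible + in-flight):", length)
}
```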

Actual Behavior

When there are N messages on the SQS queue, KEDA scales up accordingly. Once the last message starts processing, the ApproximateNumberOfMessages metric drops to 0; KEDA then waits for the cooldown period and scales the target back down to 0, interrupting the in-flight processing of that last message.

Note: We renew SQS message leases (visibility timeouts) during long processing, so messages can stay in flight (NotVisible) for long periods, depending on how long a message takes to process.

We temporarily worked around this by increasing the cooldown period so that we have enough time to process our messages. However, this is not a good long-term solution: it frequently leaves pods running much longer than necessary, and if the cooldown is not long enough, we can still fail to process the last message on the queue.
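
For context, "renewing the lease" here means periodically calling ChangeMessageVisibility on the in-flight message. A rough sketch with the AWS SDK for Go v1 (illustrative only; the helper name, interval, and timeout are placeholders, not our production code):

```go
// Package worker: illustrative lease-renewal helper.
package worker

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// renewLease keeps a received message in flight (NotVisible) by periodically
// extending its visibility timeout until stop is closed.
func renewLease(svc *sqs.SQS, queueURL, receiptHandle string, stop <-chan struct{}) {
	ticker := time.NewTicker(60 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			// Push the message's visibility timeout out another 2 minutes.
			_, err := svc.ChangeMessageVisibility(&sqs.ChangeMessageVisibilityInput{
				QueueUrl:          aws.String(queueURL),
				ReceiptHandle:     aws.String(receiptHandle),
				VisibilityTimeout: aws.Int64(120),
			})
			if err != nil {
				return // in real code: log and retry
			}
		}
	}
}
```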

Steps to Reproduce the Problem

  1. Use a ScaledObject such as:
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-queue-testing
spec:
  scaleTargetRef:
    name: testing
  pollingInterval: 30
  cooldownPeriod:  300
  minReplicaCount: 0
  maxReplicaCount: 1
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/1234567890/testing
      awsRegion: "us-east-1"
      queueLength: "1"
      identityOwner: operator
  2. Send a message to the queue that the app will pick up but that takes more than 5 minutes to process. (A sleep in the app is enough; as long as you keep renewing the SQS message lease, the message stays in flight. See the worker sketch after this list.) ApproximateNumberOfMessages == 1, ApproximateNumberOfMessagesNotVisible == 0.
  3. KEDA scales up the pod and the app starts processing the message: ApproximateNumberOfMessages == 0, ApproximateNumberOfMessagesNotVisible == 1.
  4. On its next polling interval, KEDA sees ApproximateNumberOfMessages == 0 and begins the cooldown period.
  5. At the end of the cooldown period, KEDA scales the deployment down to the minimum (0 in this case) even though the app has not finished processing. SQS times out the message (the app can no longer renew the lease) and the message becomes visible again.
  6. KEDA scales up again, and the cycle repeats until the message reaches its maximum receive count and lands in the dead-letter queue.
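
For completeness, a rough sketch of a reproduction worker (queue URL, timings, and structure are placeholders, not our real app). It receives one message, keeps renewing its lease while "processing" (a long sleep), then deletes the message; with the ScaledObject above, KEDA scales the pod down after the cooldown period while the worker is still mid-sleep:

```go
package main

import (
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

const queueURL = "https://sqs.us-east-1.amazonaws.com/1234567890/testing"

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := sqs.New(sess)

	// Long-poll for a single message.
	out, err := svc.ReceiveMessage(&sqs.ReceiveMessageInput{
		QueueUrl:            aws.String(queueURL),
		MaxNumberOfMessages: aws.Int64(1),
		WaitTimeSeconds:     aws.Int64(20),
		VisibilityTimeout:   aws.Int64(120),
	})
	if err != nil || len(out.Messages) == 0 {
		log.Fatalf("no message received: %v", err)
	}
	msg := out.Messages[0]

	// Renew the lease every minute so the message stays NotVisible.
	done := make(chan struct{})
	go func() {
		ticker := time.NewTicker(60 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-done:
				return
			case <-ticker.C:
				svc.ChangeMessageVisibility(&sqs.ChangeMessageVisibilityInput{
					QueueUrl:          aws.String(queueURL),
					ReceiptHandle:     msg.ReceiptHandle,
					VisibilityTimeout: aws.Int64(120),
				})
			}
		}
	}()

	// Simulate processing that takes longer than the 300s cooldownPeriod.
	time.Sleep(10 * time.Minute)
	close(done)

	// Only reached if the pod survives long enough; with the bug it usually isn't.
	svc.DeleteMessage(&sqs.DeleteMessageInput{
		QueueUrl:      aws.String(queueURL),
		ReceiptHandle: msg.ReceiptHandle,
	})
	log.Println("message processed and deleted")
}
```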

Logs from KEDA operator

No response

KEDA Version

2.1.0

Kubernetes Version

1.18

Platform

Amazon Web Services

Scaler Details

AWS SQS Queue
