AWS SQS Queue Scaler fails to account for in-flight messages, premature scale down #1663

Closed
TyBrown opened this issue Mar 11, 2021 · 0 comments · Fixed by #1664
Labels
bug Something isn't working

Comments


TyBrown commented Mar 11, 2021

Report

The KEDA AWS SQS Queue scaler fails to account for in-flight (i.e. NotVisible) messages when deciding whether a ScaledObject should scale in, and prematurely scales the target down even while a message is still being processed.

Expected Behavior

KEDA should wait to scale down until both ApproximateNumberOfMessages and ApproximateNumberOfMessagesNotVisible reach 0, allowing any in-flight messages to complete processing.
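
For illustration, a minimal sketch of computing the queue length as the sum of both attributes (equivalently, only scaling in once both are 0) using the AWS SDK for Go v1. This is not KEDA's actual scaler code; the function name and setup are placeholders:

```go
package main

import (
	"fmt"
	"strconv"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// approximateQueueLength returns visible + in-flight messages, so a scaler
// using it would only scale in once both counts have reached 0.
// (Illustrative helper, not the actual KEDA implementation.)
func approximateQueueLength(svc *sqs.SQS, queueURL string) (int64, error) {
	out, err := svc.GetQueueAttributes(&sqs.GetQueueAttributesInput{
		QueueUrl: aws.String(queueURL),
		AttributeNames: []*string{
			aws.String(sqs.QueueAttributeNameApproximateNumberOfMessages),
			aws.String(sqs.QueueAttributeNameApproximateNumberOfMessagesNotVisible),
		},
	})
	if err != nil {
		return 0, err
	}
	var total int64
	for _, v := range out.Attributes {
		n, err := strconv.ParseInt(aws.StringValue(v), 10, 64)
		if err != nil {
			return 0, err
		}
		total += n
	}
	return total, nil
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	length, err := approximateQueueLength(sqs.New(sess), "https://sqs.us-east-1.amazonaws.com/1234567890/testing")
	if err != nil {
		panic(err)
	}
	fmt.Println("queue length (visible + in-flight):", length)
}
```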

Actual Behavior

When there are N messages on the SQS queue, KEDA scales up accordingly. Once the last message starts processing, the ApproximateNumberOfMessages metric drops to 0; KEDA then waits for the cooldown period and scales the target back down to 0, interrupting the in-flight processing of that last message.

Note: We renew SQS message leases (visibility timeouts) during long processing, so messages can stay in flight (NotVisible) for long periods, depending on how long a message takes to process.

We temporarily worked around this by increasing the cooldown period so that we have enough time to process our messages. However, this is not a good long-term solution: it frequently leaves pods running much longer than necessary, and if the cooldown is not long enough, we can still fail to process the last message on the queue.
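
For context, "renewing the lease" here means periodically calling ChangeMessageVisibility on the in-flight message. A rough sketch with the AWS SDK for Go v1 (illustrative only; the helper name, interval, and timeout are placeholders, not our production code):

```go
// Package worker: illustrative lease-renewal helper.
package worker

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// renewLease keeps a received message in flight (NotVisible) by periodically
// extending its visibility timeout until stop is closed.
func renewLease(svc *sqs.SQS, queueURL, receiptHandle string, stop <-chan struct{}) {
	ticker := time.NewTicker(60 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			// Push the message's visibility timeout out another 2 minutes.
			_, err := svc.ChangeMessageVisibility(&sqs.ChangeMessageVisibilityInput{
				QueueUrl:          aws.String(queueURL),
				ReceiptHandle:     aws.String(receiptHandle),
				VisibilityTimeout: aws.Int64(120),
			})
			if err != nil {
				return // in real code: log and retry
			}
		}
	}
}
```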

Steps to Reproduce the Problem

  1. Use a ScaledObject such as:
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-queue-testing
spec:
  scaleTargetRef:
    name: testing
  pollingInterval: 30
  cooldownPeriod:  300
  minReplicaCount: 0
  maxReplicaCount: 1
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/1234567890/testing
      awsRegion: "us-east-1"
      queueLength: "1"
      identityOwner: operator
  2. Send a message to the queue that the app will pick up but that takes more than 5 minutes to process. (A sleep in the app is enough; as long as you keep renewing the SQS message lease, the message stays in flight. See the worker sketch after this list.) ApproximateNumberOfMessages == 1, ApproximateNumberOfMessagesNotVisible == 0.
  3. KEDA scales up the pod and the app starts processing the message: ApproximateNumberOfMessages == 0, ApproximateNumberOfMessagesNotVisible == 1.
  4. On its next polling interval, KEDA sees ApproximateNumberOfMessages == 0 and begins the cooldown period.
  5. At the end of the cooldown period, KEDA scales the deployment down to the minimum (0 in this case) even though the app has not finished processing. SQS times out the message (the app can no longer renew the lease) and the message becomes visible again.
  6. KEDA scales up again, and the cycle repeats until the message reaches its maximum receive count and lands in the dead-letter queue.
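
For completeness, a rough sketch of a reproduction worker (queue URL, timings, and structure are placeholders, not our real app). It receives one message, keeps renewing its lease while "processing" (a long sleep), then deletes the message; with the ScaledObject above, KEDA scales the pod down after the cooldown period while the worker is still mid-sleep:

```go
package main

import (
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

const queueURL = "https://sqs.us-east-1.amazonaws.com/1234567890/testing"

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := sqs.New(sess)

	// Long-poll for a single message.
	out, err := svc.ReceiveMessage(&sqs.ReceiveMessageInput{
		QueueUrl:            aws.String(queueURL),
		MaxNumberOfMessages: aws.Int64(1),
		WaitTimeSeconds:     aws.Int64(20),
		VisibilityTimeout:   aws.Int64(120),
	})
	if err != nil || len(out.Messages) == 0 {
		log.Fatalf("no message received: %v", err)
	}
	msg := out.Messages[0]

	// Renew the lease every minute so the message stays NotVisible.
	done := make(chan struct{})
	go func() {
		ticker := time.NewTicker(60 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-done:
				return
			case <-ticker.C:
				svc.ChangeMessageVisibility(&sqs.ChangeMessageVisibilityInput{
					QueueUrl:          aws.String(queueURL),
					ReceiptHandle:     msg.ReceiptHandle,
					VisibilityTimeout: aws.Int64(120),
				})
			}
		}
	}()

	// Simulate processing that takes longer than the 300s cooldownPeriod.
	time.Sleep(10 * time.Minute)
	close(done)

	// Only reached if the pod survives long enough; with the bug it usually isn't.
	svc.DeleteMessage(&sqs.DeleteMessageInput{
		QueueUrl:      aws.String(queueURL),
		ReceiptHandle: msg.ReceiptHandle,
	})
	log.Println("message processed and deleted")
}
```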

Logs from KEDA operator

No response

KEDA Version

2.1.0

Kubernetes Version

1.18

Platform

Amazon Web Services

Scaler Details

AWS SQS Queue
