
Azure Storage Queue scaler is reporting wrong number of messages in queue #2385

Closed
jungopro opened this issue Dec 5, 2021 · 15 comments
Labels
bug (Something isn't working), stale (All issues that are marked as stale due to inactivity)

Comments

@jungopro

jungopro commented Dec 5, 2021

Report

Hello,
I think the azure_queue scaler is not reading the number of messages in the queue as it should.
Looking at the code here, I think that when KEDA calculates the "visible messages" as 0 it returns 0, even though the actual number of messages in the queue is much higher.

For example:
We see "showing 0 of 128503 messages in queue" in the portal, and KEDA is also reporting 0 for the metric (even though the actual count is ~128k); see the right side of the screenshot.
When we work with the C# SDK for Azure Storage we get the right numbers; see the left side of the screenshot (C# code below):

[screenshot: C# SDK output showing the full message count (left) vs. the portal and KEDA reporting 0 (right)]

public async Task Run()
{
    m_tracer.TraceInformation($"{nameof(StorageAccountMetricsLogger)} - Starting work.");
    CloudStorageAccount storageAccount = m_aadStorageProvider.GetStorageAccount(m_queueStorageName);
    var queueClient = storageAccount.CreateCloudQueueClient();
    var queues = queueClient.ListQueues();
    foreach (var cloudQueue in queues)
    {
        var dimension = new Dimensions();
        dimension.SetDimension("Name", cloudQueue.Name);
        var instrumentor = new Instrumentor(dimension);

        // FetchAttributesAsync populates ApproximateMessageCount (visible and invisible messages).
        await cloudQueue.FetchAttributesAsync();
        if (!cloudQueue.ApproximateMessageCount.HasValue)
        {
            m_tracer.TraceWarning(0, $"{nameof(StorageAccountMetricsLogger)} - Failed to retrieve message count for storage queue {cloudQueue.Name}.");
            continue;
        }

        instrumentor.LogMetric(s_monitoringStorageQueueActiveMessageCountMetricName, cloudQueue.ApproximateMessageCount.Value);
        m_tracer.TraceInformation($"{nameof(StorageAccountMetricsLogger)} - Metrics for storage queue name \"{cloudQueue.Name}\" messages = {cloudQueue.ApproximateMessageCount}");
    }
    m_tracer.TraceInformation($"{nameof(StorageAccountMetricsLogger)} - Done.");
}

Expected Behavior

KEDA should report the actual number of messages in the queue.

Actual Behavior

KEDA is reporting 0 messages in the queue.

Steps to Reproduce the Problem

  1. Add messages to an Azure Storage queue.
  2. Query the queue with KEDA.

Logs from KEDA operator

No response

KEDA Version

2.4.0

Kubernetes Version

1.19

Platform

Other

Scaler Details

Azure storage queue

Anything else?

No response

@jungopro jungopro added the bug label Dec 5, 2021
@JorTurFer
Member

Hi @jungopro
Could you share KEDA's logs? (from operator and metric server)

@RamCohen
Contributor

RamCohen commented Feb 9, 2022

Hi @jungopro

I was unable to reproduce the issue. Note that the scaler code also peeks at the queue to try to get an accurate number, in order not to use the approximation if a definite number can be obtained.
What is the size of the messages in the queue? If they are large, it may be a result of a timeout during the peek phase.
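For reference, the decision logic being described is roughly the following. This is a minimal C# sketch using the Azure.Storage.Queues SDK (not KEDA's actual Go implementation), and GetQueueLengthAsync is just an illustrative name:

    using System.Threading.Tasks;
    using Azure.Storage.Queues;

    public static class QueueLengthEstimator
    {
        // Hypothetical helper mirroring the behaviour discussed in this thread:
        // peek up to 32 messages; if the peek is not saturated, use that exact
        // visible count, otherwise fall back to the approximate count.
        public static async Task<int> GetQueueLengthAsync(QueueClient queue)
        {
            const int maxPeek = 32; // Azure Storage Queues cannot peek more than 32 messages at once

            var peeked = await queue.PeekMessagesAsync(maxMessages: maxPeek);
            int visibleCount = peeked.Value.Length;

            if (visibleCount < maxPeek)
            {
                // Peek returned a definite number, so use it as-is.
                // Note: hidden (in-flight) messages are NOT included here.
                return visibleCount;
            }

            // Peek is saturated; use the approximate count from Get Queue Metadata,
            // which can exceed 32 and also includes hidden messages.
            var props = await queue.GetPropertiesAsync();
            return props.Value.ApproximateMessagesCount;
        }
    }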

@tomkerkhove tomkerkhove moved this to Backlog in Roadmap - KEDA Core Feb 10, 2022
@tomkerkhove tomkerkhove changed the title from "Keda Azure Storage Queue is reporting wrong number of messages in queue" to "Azure Storage Queue scaler is reporting wrong number of messages in queue" Feb 14, 2022
@antonio-policastro

antonio-policastro commented Mar 1, 2022

I've just run into the same problem, and it's a big one for my use case.
It happens when the queue messages are not deleted within seconds after being popped from the queue.

The scaler peeks messages, and Peek returns only the visible messages in the queue.
When a client gets a message from the queue, it doesn't delete it immediately: it only hides it for a configurable period of time (the visibility timeout), which defaults to 30 seconds.
The message must be deleted with a separate Delete operation.

Let's say I enqueue 10 messages.
KEDA peeks 10 messages and creates 10 pods.
Each pod gets a message from the queue, processes it (e.g., calls a web service, sends an email, etc.) and then deletes the message.

If the time needed to process a message is high (e.g., a few hours), KEDA will peek 0 messages and will try to scale in to 0 pods while the pods are still working.

Proposed solution:
The scaler should call Get Queue Metadata instead of Peek, and then use the x-ms-approximate-messages-count response header to scale in/out.
x-ms-approximate-messages-count includes hidden messages too.
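To illustrate the difference, here is a rough C# snippet using the Azure.Storage.Queues SDK; the connection string and queue name are placeholders:

    using System;
    using System.Threading.Tasks;
    using Azure.Storage.Queues;

    class PeekVsApproximateCount
    {
        static async Task Main()
        {
            // Placeholder connection string / queue name (Azurite / storage emulator).
            var queue = new QueueClient("UseDevelopmentStorage=true", "demo-queue");
            await queue.CreateIfNotExistsAsync();

            await queue.SendMessageAsync("work item");

            // Receiving the message hides it for the visibility timeout
            // (30 seconds by default) instead of deleting it.
            var received = await queue.ReceiveMessageAsync(visibilityTimeout: TimeSpan.FromMinutes(5));

            // Peek only sees *visible* messages, so it now reports 0.
            var peeked = await queue.PeekMessagesAsync(maxMessages: 32);
            Console.WriteLine($"Peeked (visible) messages: {peeked.Value.Length}");

            // Get Queue Metadata (x-ms-approximate-messages-count) still counts the hidden message.
            var props = await queue.GetPropertiesAsync();
            Console.WriteLine($"Approximate message count: {props.Value.ApproximateMessagesCount}");

            // The message still has to be deleted explicitly once processing is done.
            await queue.DeleteMessageAsync(received.Value.MessageId, received.Value.PopReceipt);
        }
    }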

@zroubalik
Member

I am a total Azure (Storage) noob, but in general, if it takes a few hours to process a message, then a ScaledObject might not be the right option; you should try to use a ScaledJob instead.

@antonio-policastro

Hi @zroubalik, thanks for your answer.
I assume you are referring to the documentation.

One important consideration to make is how this pattern can work with long running executions. Imagine a deployment triggers on a RabbitMQ queue message. Each message takes 3 hours to process. It’s possible that if many queue messages arrive, KEDA will help drive scaling out to many replicas - let’s say 4. Now the HPA makes a decision to scale down from 4 replicas to 2. There is no way to control which of the 2 replicas get terminated to scale down. That means the HPA may attempt to terminate a replica that is 2.9 hours into processing a 3 hour queue message.

There are two main ways to handle this scenario.

Those two ways are suggested for a scenario in which the queue service has no clue about when the message processing has been completed. Azure Storage Queue does have that information, as I explained earlier.

I am reluctant to port my code from Deployments to Jobs, because

  1. the porting has many drawbacks for my use case (which is very hard to explain in detail here).
  2. getting the x-ms-approximate-messages-count is faster than peeking 32 messages from the queue, especially when the messages are large.
  3. since Azure Storage Queues can't peek more than 32 messages at once, I think it's impossible to scale out to more than 32 replicas.

Are you interested in a pull request?

@JorTurFer
Member

JorTurFer commented Mar 1, 2022

Hey @antonio-policastro
I think that the implementation can already handle the case of more than 32 messages.
I'm not an expert in the Azure SDK, but I think we check whether there are fewer than 32 visible messages, and if not, we use the estimated count, so we can report more than 32.
In fact, ApproximateMessagesCount uses x-ms-approximate-messages-count.

@antonio-policastro

antonio-policastro commented Mar 1, 2022

Oh, I see. Thanks, I completely missed that you were already using ApproximateMessagesCount when the queue has more than 32 messages.
During my tests I always had fewer than 32 messages.

I believe that the scaler should always use ApproximateMessagesCount and never Peek, because Peek can't see hidden messages.
What do you think?

From the docs:

If you need to dynamically determine whether to adjust workloads to handle message volume, you can query approximate message count on each queue and then respond with the appropriate action. Use the Get Queue Metadata REST operation or use any of the supported Blob storage SDKs to get the approximate message count.
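For reference, that header can also be read directly from the Get Queue Metadata REST operation. A rough C# sketch, assuming you already have a queue URL with a SAS token that permits reading metadata:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class GetQueueMetadata
    {
        static async Task Main()
        {
            // Placeholder: a queue URL with a SAS token (the URL already contains '?').
            var queueSasUrl = "https://myaccount.queue.core.windows.net/myqueue?sv=...&sig=...";

            using var http = new HttpClient();
            // Get Queue Metadata: GET {queue-url}?comp=metadata
            var response = await http.GetAsync($"{queueSasUrl}&comp=metadata");
            response.EnsureSuccessStatusCode();

            // The approximate count (visible + hidden messages) comes back as a response header.
            if (response.Headers.TryGetValues("x-ms-approximate-messages-count", out var values))
            {
                Console.WriteLine($"x-ms-approximate-messages-count: {string.Join(",", values)}");
            }
        }
    }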

@JorTurFer
Member

I'm not sure about that; Peek returns the value with accuracy, while ApproximateMessagesCount (at least in the .NET SDK) could return more messages than the actual count.

@RamCohen
Contributor

RamCohen commented Mar 1, 2022 via email

@antonio-policastro

A scaler parameter would solve my problem (and jungopro's too), and it wouldn't be a breaking change. Upvote!
I didn't know that ApproximateMessageCount could return a wrong (higher) number. Can you point to some documentation about it? I can't find anything.

Thanks everyone for your prompt answers.

@JorTurFer
Member

Hey @antonio-policastro
These are the .NET docs for that method.

@stale

stale bot commented Apr 30, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 30, 2022
@stale

stale bot commented May 7, 2022

This issue has been automatically closed due to inactivity.

@stale stale bot closed this as completed May 7, 2022
Repository owner moved this from To Do to Ready To Ship in Roadmap - KEDA Core May 7, 2022
@tomkerkhove tomkerkhove moved this from Ready To Ship to Done in Roadmap - KEDA Core Aug 3, 2022
@wallflower762

Did anything happen with this?

I am in exactly the same situation as jungopro, both with the scaling issue around visible messages and with the viability of scaled jobs.

Not including invisible messages doesn't seem right. If I have KEDA set to scale at 5 messages per pod and I have 5 messages in progress, it would take another 6 visible messages to make KEDA request a second pod. It also results in the race condition observed here, where, if all of the in-progress messages are not completed within the cooldown time frame, KEDA will scale in while they are still processing.

Having the option to use approximate message count instead seems ideal.

On a side note: when using Azure Functions on an App Service Plan, you can scale based on storage queue length. The implementation Azure uses for this is the approximate message count, and it does not offer any other method of scaling based on queue length.

A screenshot from the app service plan scale rules GUI:
[screenshot of the App Service Plan scale rules GUI]

@wallflower762

I am not sure if my earlier comment will trigger this issue to be reopened, so I raised a feature request as well.
