
Azure Storage Queue scaler is reporting wrong number of messages in queue #2385

Closed
jungopro opened this issue Dec 5, 2021 · 15 comments
Labels
bug (Something isn't working), stale (All issues that are marked as stale due to inactivity)

Comments

@jungopro

jungopro commented Dec 5, 2021

Report

Hello,
I think the azure_queue scaler is not reading the number of messages in the queue as it should.
Looking at the code here, I think that when KEDA calculates the "visible messages" as 0 it returns 0, even though the actual number of messages in the queue is much higher.

For example:
We see "showing 0 of 128503 messages in queue" in the portal, and KEDA is also reporting 0 for the metric (even though the actual count is ~128k); see the right side of the screenshot.
When we work with the C# SDK for Azure Storage we get the right numbers; see the left side of the screenshot (C# code below):

[screenshot: C# SDK output showing the full message count (left) vs. the portal and KEDA reporting 0 (right)]

public async Task Run()
{
    m_tracer.TraceInformation($"{nameof(StorageAccountMetricsLogger)} - Starting work.");
    CloudStorageAccount storageAccount = m_aadStorageProvider.GetStorageAccount(m_queueStorageName);
    var queueClient = storageAccount.CreateCloudQueueClient();
    var queues = queueClient.ListQueues();
    foreach (var cloudQueue in queues)
    {
        var dimension = new Dimensions();
        dimension.SetDimension("Name", cloudQueue.Name);
        var instrumentor = new Instrumentor(dimension);

        // FetchAttributesAsync populates ApproximateMessageCount (visible and invisible messages).
        await cloudQueue.FetchAttributesAsync();
        if (!cloudQueue.ApproximateMessageCount.HasValue)
        {
            m_tracer.TraceWarning(0, $"{nameof(StorageAccountMetricsLogger)} - Failed to retrieve message count for storage queue {cloudQueue.Name}.");
            continue;
        }

        instrumentor.LogMetric(s_monitoringStorageQueueActiveMessageCountMetricName, cloudQueue.ApproximateMessageCount.Value);
        m_tracer.TraceInformation($"{nameof(StorageAccountMetricsLogger)} - Metrics for storage queue name \"{cloudQueue.Name}\" messages = {cloudQueue.ApproximateMessageCount}");
    }
    m_tracer.TraceInformation($"{nameof(StorageAccountMetricsLogger)} - Done.");
}

Expected Behavior

KEDA should report the actual number of messages in the queue.

Actual Behavior

KEDA is reporting 0 messages in the queue.

Steps to Reproduce the Problem

  1. Add messages to an Azure Storage queue.
  2. Query the queue with KEDA.

Logs from KEDA operator

No response

KEDA Version

2.4.0

Kubernetes Version

1.19

Platform

Other

Scaler Details

Azure storage queue

Anything else?

No response

@jungopro jungopro added the bug label Dec 5, 2021
@JorTurFer
Member

Hi @jungopro
Could you share KEDA's logs? (from operator and metric server)

@RamCohen
Contributor

RamCohen commented Feb 9, 2022

Hi @jungopro

I was unable to reproduce the issue. Note that the scaler code also peeks at the queue to try to get an accurate number, in order not to use the approximation if a definite number can be obtained.
What is the size of the messages in the queue? If they are large, it may be a result of a timeout during the peek phase.
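For reference, the decision logic being described is roughly the following. This is a minimal C# sketch using the Azure.Storage.Queues SDK (not KEDA's actual Go implementation), and GetQueueLengthAsync is just an illustrative name:

    using System.Threading.Tasks;
    using Azure.Storage.Queues;

    public static class QueueLengthEstimator
    {
        // Hypothetical helper mirroring the behaviour discussed in this thread:
        // peek up to 32 messages; if the peek is not saturated, use that exact
        // visible count, otherwise fall back to the approximate count.
        public static async Task<int> GetQueueLengthAsync(QueueClient queue)
        {
            const int maxPeek = 32; // Azure Storage Queues cannot peek more than 32 messages at once

            var peeked = await queue.PeekMessagesAsync(maxMessages: maxPeek);
            int visibleCount = peeked.Value.Length;

            if (visibleCount < maxPeek)
            {
                // Peek returned a definite number, so use it as-is.
                // Note: hidden (in-flight) messages are NOT included here.
                return visibleCount;
            }

            // Peek is saturated; use the approximate count from Get Queue Metadata,
            // which can exceed 32 and also includes hidden messages.
            var props = await queue.GetPropertiesAsync();
            return props.Value.ApproximateMessagesCount;
        }
    }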

@tomkerkhove tomkerkhove moved this to Backlog in Roadmap - KEDA Core Feb 10, 2022
@tomkerkhove tomkerkhove changed the title from "Keda Azure Storage Queue is reporting wrong number of messages in queue" to "Azure Storage Queue scaler is reporting wrong number of messages in queue" Feb 14, 2022
@antonio-policastro

antonio-policastro commented Mar 1, 2022

I've just run into the same problem, and it's a big one for my use case.
It happens when the queue messages are not deleted within seconds after being popped from the queue.

The scaler peeks messages, and Peek returns only the visible messages in the queue.
When a client gets a message from the queue, it doesn't delete it immediately: it only hides it for a configurable period of time (the visibility timeout), which defaults to 30 seconds.
The message must be deleted with a separate Delete operation.

Let's say I enqueue 10 messages.
KEDA peeks 10 messages and creates 10 pods.
Each pod gets a message from the queue, processes it (e.g., calls a web service, sends an email, etc.) and then deletes the message.

If the time needed to process a message is high (e.g., a few hours), KEDA will peek 0 messages and will try to scale in to 0 pods while the pods are still working.

Proposed solution:
The scaler should call Get Queue Metadata instead of Peek, and then use the x-ms-approximate-messages-count response header to scale in/out.
x-ms-approximate-messages-count includes hidden messages too.
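To illustrate the difference, here is a rough C# snippet using the Azure.Storage.Queues SDK; the connection string and queue name are placeholders:

    using System;
    using System.Threading.Tasks;
    using Azure.Storage.Queues;

    class PeekVsApproximateCount
    {
        static async Task Main()
        {
            // Placeholder connection string / queue name (Azurite / storage emulator).
            var queue = new QueueClient("UseDevelopmentStorage=true", "demo-queue");
            await queue.CreateIfNotExistsAsync();

            await queue.SendMessageAsync("work item");

            // Receiving the message hides it for the visibility timeout
            // (30 seconds by default) instead of deleting it.
            var received = await queue.ReceiveMessageAsync(visibilityTimeout: TimeSpan.FromMinutes(5));

            // Peek only sees *visible* messages, so it now reports 0.
            var peeked = await queue.PeekMessagesAsync(maxMessages: 32);
            Console.WriteLine($"Peeked (visible) messages: {peeked.Value.Length}");

            // Get Queue Metadata (x-ms-approximate-messages-count) still counts the hidden message.
            var props = await queue.GetPropertiesAsync();
            Console.WriteLine($"Approximate message count: {props.Value.ApproximateMessagesCount}");

            // The message still has to be deleted explicitly once processing is done.
            await queue.DeleteMessageAsync(received.Value.MessageId, received.Value.PopReceipt);
        }
    }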

@zroubalik
Member

I am a total Azure (Storage) noob, but in general, if it takes a few hours to process a message, then a ScaledObject might not be the right option; you should try to use a ScaledJob instead.

@antonio-policastro

Hi @zroubalik, thanks for your answer.
I assume you are referring to the documentation.

One important consideration to make is how this pattern can work with long running executions. Imagine a deployment triggers on a RabbitMQ queue message. Each message takes 3 hours to process. It’s possible that if many queue messages arrive, KEDA will help drive scaling out to many replicas - let’s say 4. Now the HPA makes a decision to scale down from 4 replicas to 2. There is no way to control which of the 2 replicas get terminated to scale down. That means the HPA may attempt to terminate a replica that is 2.9 hours into processing a 3 hour queue message.

There are two main ways to handle this scenario.

Those two ways are suggested for a scenario in which the queue service has no clue about when the message processing has been completed. Azure Storage Queue does have that information, as I explained earlier.

I am reluctant to port my code from Deployments to Jobs, because

  1. the porting has many drawbacks for my use case (which is very hard to explain in detail here).
  2. getting the x-ms-approximate-messages-count is faster than peeking 32 messages from the queue, especially when the messages are large.
  3. since Azure Storage Queues can't peek more than 32 messages at once, I think it's impossible to scale out to more than 32 replicas.

Are you interested in a pull request?

@JorTurFer
Member

JorTurFer commented Mar 1, 2022

Hey @antonio-policastro
I think that the implementation can already handle the case of more than 32 messages.
I'm not an expert in the Azure SDK, but I think we check whether there are fewer than 32 visible messages, and if not, we use the estimated count, so we can report more than 32.
In fact, ApproximateMessagesCount uses x-ms-approximate-messages-count.

@antonio-policastro

antonio-policastro commented Mar 1, 2022

Oh, I see. Thanks, I completely missed that you were already using ApproximateMessagesCount when the queue has more than 32 messages.
During my tests I always had fewer than 32 messages.

I believe that the scaler should always use ApproximateMessagesCount and never Peek, because Peek can't see hidden messages.
What do you think?

From the docs:

If you need to dynamically determine whether to adjust workloads to handle message volume, you can query approximate message count on each queue and then respond with the appropriate action. Use the Get Queue Metadata REST operation or use any of the supported Blob storage SDKs to get the approximate message count.
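For reference, that header can also be read directly from the Get Queue Metadata REST operation. A rough C# sketch, assuming you already have a queue URL with a SAS token that permits reading metadata:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class GetQueueMetadata
    {
        static async Task Main()
        {
            // Placeholder: a queue URL with a SAS token (the URL already contains '?').
            var queueSasUrl = "https://myaccount.queue.core.windows.net/myqueue?sv=...&sig=...";

            using var http = new HttpClient();
            // Get Queue Metadata: GET {queue-url}?comp=metadata
            var response = await http.GetAsync($"{queueSasUrl}&comp=metadata");
            response.EnsureSuccessStatusCode();

            // The approximate count (visible + hidden messages) comes back as a response header.
            if (response.Headers.TryGetValues("x-ms-approximate-messages-count", out var values))
            {
                Console.WriteLine($"x-ms-approximate-messages-count: {string.Join(",", values)}");
            }
        }
    }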

@JorTurFer
Member

I'm not sure about that; Peek returns the value with accuracy, while ApproximateMessagesCount (at least in the .NET SDK) could return more messages than the actual count.

@RamCohen
Contributor

RamCohen commented Mar 1, 2022 via email

@antonio-policastro

A scaler parameter would solve my problem (and jungopro's too), and it wouldn't be a breaking change. Upvote!
I didn't know that ApproximateMessageCount could return a wrong (higher) number. Can you point to some documentation about it? I can't find anything.

Thanks everyone for your prompt answers.

@JorTurFer
Member

Hey @antonio-policastro
These are the .NET docs for that method.

@stale

stale bot commented Apr 30, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 30, 2022
@stale

stale bot commented May 7, 2022

This issue has been automatically closed due to inactivity.

@stale stale bot closed this as completed May 7, 2022
Repository owner moved this from To Do to Ready To Ship in Roadmap - KEDA Core May 7, 2022
@tomkerkhove tomkerkhove moved this from Ready To Ship to Done in Roadmap - KEDA Core Aug 3, 2022
@wallflower762

Did anything happen with this?

I am in exactly the same situation as jungopro, both with the scaling issue around visible messages and with the viability of scaled jobs.

Not including invisible messages doesn't seem right. If I have KEDA set to scale at 5 messages per pod and I have 5 messages in progress, it would take another 6 visible messages to make KEDA request a second pod. It also results in the race condition observed here, where, if all of the in-progress messages are not completed within the cooldown time frame, KEDA will scale in while they are still processing.

Having the option to use approximate message count instead seems ideal.

On a side note: when using Azure Functions on an App Service Plan, you can scale based on storage queue length. The implementation Azure uses for this is the approximate message count, and it does not offer any other method of scaling based on queue length.

A screenshot from the app service plan scale rules GUI:
[screenshot of the App Service Plan scale rules GUI]

@wallflower762

I am not sure if my earlier comment will trigger this issue to be reopened, so I raised a feature request as well.
