[KEDA][AzureEventHub] App not scaling down #972

Open
1 of 3 tasks
ttvrdon opened this issue Nov 2, 2023 · 12 comments
Labels
Needs: triage 🔍 Pending a first pass to read, tag, and assign

Comments

@ttvrdon

ttvrdon commented Nov 2, 2023

This issue is a: (mark with an x)

  • bug report -> please search issues before submitting
  • documentation issue or request
  • regression (a behavior that used to work and stopped in a new release)

Issue description

An Azure Container App processes data from an Event Hub and is configured as follows:

  • MinReplicas = 0
  • MaxReplicas = 10

and uses a KEDA scale rule of type azure-eventhub with the following settings:

- type: azure-eventhub
  metadata:
      eventHubName: ...
      consumerGroup: ...
      blobContainer: ...
      checkpointStrategy: blobMetadata
      unprocessedEventThreshold: 64

When the unprocessed message count was high, the app scaled out to its defined maximum of 10 replicas.
However, even though the unprocessed count has now been low for a long time, the app does not scale down and stays at 10 replicas.

I wrote test code to measure the unprocessed count in the Event Hub and ran it approximately every 30 seconds (similar to the scale rule evaluation interval), with the following results:

10:16:15 - Unprocessed: 2
10:17:00 - Unprocessed: 1
10:17:47 - Unprocessed: 0
10:18:35 - Unprocessed: 0
10:19:21 - Unprocessed: 5
10:20:08 - Unprocessed: 1
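
As a rough cross-check of those numbers, the sketch below models how a threshold of 64 would translate into a replica count. It assumes the usual HPA-style calculation (total metric divided by the per-replica target, rounded up) and an activation threshold of 0. The function name and defaults are illustrative, and cooldown and stabilization windows are ignored, so this is not KEDA's actual implementation.

```python
import math


def desired_replicas(unprocessed, threshold=64, min_replicas=0,
                     max_replicas=10, activation=0):
    """Simplified model of the expected replica count (assumption, not KEDA code)."""
    # At or below the activation threshold the scaler is treated as inactive,
    # so the app may fall back to its minimum replica count.
    if unprocessed <= activation:
        return min_replicas
    # HPA-style target: one replica per `threshold` unprocessed events, rounded up.
    wanted = math.ceil(unprocessed / threshold)
    # An active scaler needs at least one replica; clamp to the configured range.
    return max(1, min_replicas, min(max_replicas, wanted))


# Unprocessed counts taken from the measurements above.
observed = [2, 1, 0, 0, 5, 1]
for count in observed:
    print(f"unprocessed={count} -> expected replicas={desired_replicas(count)}")
```

Under these assumptions none of the measured values would justify more than one replica, which is why staying at 10 looks wrong.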

In the KEDA source code I found that the actual metrics are logged: keda/pkg/scalers/azure_eventhub_scaler.go, lines 389 and 396. These logs are not available in Azure Log Analytics. How can verbose logging be enabled for KEDA?

Steps to reproduce

  1. Set up the KEDA azure-eventhub scale rule as described above
  2. Let the app scale to the maximum replica count with a high Event Hub message rate
  3. Stop the excessive ingress, keep it low, and monitor the app - it will not scale back down

Expected behavior
When the unprocessed count is lower than the defined threshold, the app should scale back down (gradually, down to the minimum replica count).

Actual behavior
The replica count stays at the maximum.

@microsoft-github-policy-service microsoft-github-policy-service bot added the Needs: triage 🔍 Pending a first pass to read, tag, and assign label Nov 2, 2023
@serpentfabric

can you share testing code?

@ttvrdon
Author

ttvrdon commented Nov 6, 2023

TestingApp.zip

using Azure.Identity;
using Azure.Storage.Blobs.Models;
using Azure.Storage.Blobs;
using Azure.Messaging.EventHubs.Consumer;

var ehNamespace = "<EventHub NamespaceName>";
var ehSharedKeyName = "<KeyName>";
var ehSharedKey = "<KeyValue>";
var ehName = "<EventHubName>";

var consumerGroup = "<ConsumerGroupName>";

var storageAccountUrl = "<StorageAccountUrl>";
var storageContainerName = "<ContainerName>";

// EH Client
var ehConsumerClient = new EventHubConsumerClient(consumerGroup, $"Endpoint=sb://{ehNamespace}.servicebus.windows.net/;SharedAccessKeyName={ehSharedKeyName};SharedAccessKey={ehSharedKey};EntityPath={ehName}");
var partitionIds = await ehConsumerClient.GetPartitionIdsAsync();

// Checkpoint blobs from SA
var checkpointBlobs = await GetCheckpointBlobClients(new Uri(storageAccountUrl), storageContainerName, ehNamespace, ehName, consumerGroup);

// Get Unprocessed events count - each 30sec
while (true)
{
    var checkpoints = await GetCheckpoints(checkpointBlobs);

    long unprocessed = 0;
    foreach (var partitionId in partitionIds)
    {
        var props = await ehConsumerClient.GetPartitionPropertiesAsync(partitionId);
        unprocessed += props.LastEnqueuedSequenceNumber - checkpoints[partitionId].sequencenumber;
    }

    Console.WriteLine($"{DateTime.UtcNow} - Unprocessed: {unprocessed}");

    await Task.Delay(TimeSpan.FromSeconds(30));
}

static async Task<Dictionary<string, (long offset, long sequencenumber)>> GetCheckpoints(IList<(string partitionId, BlobClient blobClient)> checkpointBlobClients)
{
    var checkpoints = new Dictionary<string, (long offset, long sequencenumber)>();

    foreach (var checkpoint in checkpointBlobClients)
    {
        var props = await checkpoint.blobClient.GetPropertiesAsync();

        var offset = long.Parse(props.Value.Metadata["offset"]);
        var sequenceNumber = long.Parse(props.Value.Metadata["sequencenumber"]);

        checkpoints[checkpoint.partitionId] = (offset, sequenceNumber);
    }

    return checkpoints;
}

static async Task<IList<(string partitionId, BlobClient blobClient)>> GetCheckpointBlobClients(Uri storageAccountUrl, string containerName, string ehNamespace, string ehName, string consumerGroup)
{
    var blobServiceClient = new BlobServiceClient(storageAccountUrl, new DefaultAzureCredential());
    var containerClient = blobServiceClient.GetBlobContainerClient(containerName);

    var checkpointBlobs = new List<(string partitionId, BlobClient blobClient)>();

    await foreach (BlobItem blobItem in containerClient.GetBlobsAsync(prefix: $"{ehNamespace}.servicebus.windows.net/{ehName}/{consumerGroup}/checkpoint"))
    {
        var partitionId = blobItem.Name.Substring(blobItem.Name.LastIndexOf('/') + 1);
        var blobClient = containerClient.GetBlobClient(blobItem.Name);

        checkpointBlobs.Add((partitionId, blobClient));
    }

    return checkpointBlobs;
}

@joeklin

joeklin commented Dec 19, 2023

We are seeing the same issue with Redis Streams. The app successfully scaled to 10 replicas but didn't scale down after all messages had been ack'd.

@serpentfabric

in our case we screwed up one of the secrets. but without that verbose logging, we had no idea keda was rejecting its inputs and scaling out because of that. so i suspect that'll be your issue too; it's just kinda hard to tell what/why without visibility if you're not really careful to inspect every value/secret given to keda via ACA.
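
One cheap check along those lines is to confirm that the connection string handed to the scaler at least has the expected parts. A minimal sketch, assuming the shared-access-key format used in the test code above. The helper names are hypothetical, and this only checks the shape of the string, not whether the key is actually valid.

```python
def parse_connection_string(conn):
    """Split 'Key=Value;Key=Value' pairs into a dict."""
    parts = {}
    for segment in conn.strip().split(";"):
        if not segment:
            continue
        # Split on the first '=' only: base64 keys often end in '=' padding.
        key, _, value = segment.partition("=")
        parts[key.strip()] = value.strip()
    return parts


def missing_keys(conn, required=("Endpoint", "SharedAccessKeyName", "SharedAccessKey")):
    """Return the required keys that are absent or empty in the connection string."""
    parts = parse_connection_string(conn)
    return [key for key in required if not parts.get(key)]


# Placeholder values only; substitute the secret actually configured on the app.
example = ("Endpoint=sb://example.servicebus.windows.net/;"
           "SharedAccessKeyName=listen;SharedAccessKey=abc123=;EntityPath=myhub")
print(missing_keys(example))
```

An empty list means the string is at least well formed; a typo in a key name or an empty secret shows up as a missing entry.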

@shibayan

shibayan commented Apr 16, 2024

I am encountering the same issue. I created the same azure-eventhub scale rule and used Dapr to process all Event Hubs messages, and the app did not scale down. I suspect the checkpoints are not being shared correctly, because the scale-down took place only once the TTL of the messages had passed. (The checkpoint settings should be correct for KEDA / Dapr.)

@patelriki13

Any updates on this?

I am also facing the same issue.

@goncalo-oliveira

goncalo-oliveira commented Jul 8, 2024

I'm facing a similar issue. I'm running 4 container apps with the KEDA azure-eventhub scale rule, and two of them were always maxed out even though the number of incoming messages doesn't justify it; even when nothing is coming in, the apps stay scaled at max.

I've reviewed the configuration and found that the two apps that seemed fine were in fact not properly configured. Once their configuration was corrected, they started suffering from the same issue.

activationUnprocessedEventThreshold: 10
blobContainer: <container_name>
connectionFromEnv: <connection_env>
consumerGroup: <consumer_group>
eventHubNameFromEnv: <hub_name_env>
storageConnectionFromEnv: <storage_connection_env>
unprocessedEventThreshold: 64

Are there any updates on this?

@goncalo-oliveira

Alright... sorted out my own issue. Looking at the latest version of the scaler (2.14), I've found this new parameter (or at least, I don't remember seeing it before).

checkpointStrategy - configure the checkpoint behaviour of different Event Hub SDKs. (Values: azureFunction, blobMetadata, goSdk, default: "", Optional)

And a bit further it says

When no checkpoint strategy is specified, the Event Hub scaler will use backwards compatibility and able to scale older implementations of C#, Python or Java Event Hub SDKs. (see “Legacy checkpointing”). If this behaviour should be used, blobContainer is also required.

It came as a surprise that the default would be legacy checkpointing; to be honest, I expected the other way around. Nonetheless, after setting this to blobMetadata to suit my case, the auto-scaler started working.

Hopefully this will help someone in a similar situation.
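
To illustrate why the strategy matters: each checkpointStrategy makes the scaler look for checkpoints in a different place. The sketch below is an assumption based on the KEDA documentation for this scaler and on the blob prefix used in the test code earlier in this thread, not a copy of KEDA's implementation. The function name is hypothetical, only two strategies are covered, and the exact layout may differ between KEDA versions.

```python
def checkpoint_blob_path(strategy, namespace_fqdn, event_hub, consumer_group, partition_id):
    """Return the blob path (inside blobContainer) assumed for a checkpoint strategy."""
    if strategy == "blobMetadata":
        # Newer SDKs: one blob per partition, with the position stored in blob
        # metadata (the 'offset' and 'sequencenumber' keys read by the test code).
        return f"{namespace_fqdn}/{event_hub}/{consumer_group}/checkpoint/{partition_id}"
    if strategy == "":
        # Legacy default: a much shorter path, with the checkpoint assumed to be
        # stored in the blob content rather than in its metadata.
        return f"{consumer_group}/{partition_id}"
    raise ValueError(f"strategy not covered by this sketch: {strategy!r}")


# Placeholder names; compare the two locations for the same partition.
for strategy in ("", "blobMetadata"):
    path = checkpoint_blob_path(strategy, "example.servicebus.windows.net",
                                "myhub", "mygroup", "0")
    print(f"{strategy or '(default)'}: {path}")
```

If the consumer writes checkpoints to the blobMetadata location while the scaler reads the legacy one, the scaler would presumably never see progress, which matches the always-maxed-out behaviour described here.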

@Nhattd97

Nhattd97 commented Nov 7, 2024

This is related to this KEDA issue kedacore/keda#6084. The KEDA team fixed it and released it in v2.16 kedacore/keda#6260. Can we upgrade the ACA to use this version of KEDA? @tomkerkhove , could you please help take a look? Thanks

@tomkerkhove
Member

I don't work on Azure Container Apps so can't help - Sorry.

This usually takes some time though, KEDA releases need to mature first before building an SLA-based service on top of it.

@Nhattd97

Nhattd97 commented Nov 8, 2024

Thanks @tomkerkhove for your information.

@MarkOwen-CoditMT

Any updates on this? What version of KEDA is used on ACA? I have an issue with my Azure Event Hub scaling, and the lack of documentation makes it hard to debug through the system logs, because the error message is not documented. (Only the KEDA documentation is provided, whereas KEDA scaling for Service Bus is explained in great detail!)
