Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EventHub plugin reports strange values periodically #4250

Closed
ilya-scale opened this issue Feb 17, 2023 · 24 comments · Fixed by #5125
Closed

EventHub plugin reports strange values periodically #4250

ilya-scale opened this issue Feb 17, 2023 · 24 comments · Fixed by #5125
Labels
bug Something isn't working stale-bot-ignore All issues that should not be automatically closed by our stale bot

Comments

@ilya-scale
Copy link

Report

I have connected KEDA to event hub

      spec:
        minReplicaCount: 2
        maxReplicaCount: 8
      triggers:
      - type: azure-eventhub
        metadata:
          storageAccountName: myaccount
          blobContainer: checkpoints
          eventHubNamespace: mynamespace
          eventHubName: myhub
          consumerGroup: myconsumer
          checkpointStrategy: blobMetadata
          unprocessedEventThreshold: "5000"

There seem to be a number of issues with the connection:

  1. The connection seems to work but I often get very strange values. I monitor them via grafana and prometheus monitor.
    The value is negative and extremely low:

image

  1. I also get these logs every 10 seconds or even more often:

2023-02-17T11:58:41Z INFO Reconciling ScaledObject {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"xxx","namespace":"xxx"}, "namespace": "xxx", "name": "data-ingest-iot", "reconcileID": "e272cdaa-8474-421e-9ca4-9fae7bda42f4"}

Which is probably strange since the default check period is every 30 seconds. I have a service bus trigger on another ScaledObject and neither of these things happen (no logs for reconciling, no "off" values either)

  1. It seems that the value of the metric is capped at unprocessedEventThreshold * maxReplicaCount as very often I get value 40000 which is quite strange. So I think the value should not be capped. The value itself seems to be off as well, somewhat like in the point 1. but I cannot see it since it is capped.

Expected Behavior

  1. There should not be any values that are "off"
  2. There probably should not be reconciling every 10 seconds in the log (not sure about that)
  3. There should not be any "cap" on the metric value

Actual Behavior

  1. Strange values appear periodically
  2. Logs indicate that probably something is wrong
  3. The value of the metric is capped

Steps to Reproduce the Problem

  1. Create an eventhub with some messages coming through
  2. Register trigger with the details posted above
  3. Run for some time and observe this behaviour

P.S. The negative value does not seem to happen in every environment where I run k8s.

Logs from KEDA operator

example

KEDA Version

2.9.3

Kubernetes Version

1.24

Platform

Microsoft Azure

Scaler Details

EventHubs

Anything else?

No response

@ilya-scale ilya-scale added the bug Something isn't working label Feb 17, 2023
@stale
Copy link

stale bot commented Apr 18, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Apr 18, 2023
@JorTurFer
Copy link
Member

Does this behavior still happen in v2.10?

@stale stale bot removed the stale All issues that are marked as stale due to inactivity label Apr 23, 2023
@ilya-scale
Copy link
Author

I am running in a development environment the upgraded version (2.10.1) and it actually does not seem to happen anymore. I need some more time to verify that this is indeed the case and upgrade production before I can tell for sure.

Was there something that was fixed in 2.10 to make that happen?

@JorTurFer
Copy link
Member

If I'm not wrong (and I can be), we upgraded the dep azure-amqp-common-go from v3 to v4

@ilya-scale
Copy link
Author

ilya-scale commented Apr 25, 2023

It seems it did not help after all:

image

The 0 on the chart is not the 0, it is some real value (around 200), but as you see sometimes it goes to a completely weird value (-9223372036854776)

@JorTurFer
Copy link
Member

If you check the values using prometheus directly, do you see the same? I have had some renderization issues with grafana some times

@ilya-scale
Copy link
Author

ilya-scale commented Apr 25, 2023

I am not using a dashboard, but a raw Metric from explore:

image

Would that be enough, or you think it can still render wrong? (The picture here does not have the error as it does not happen every time)

@JorTurFer
Copy link
Member

I still think that it can be a render error, I have faced with them sometimes in grafana :/
The problem is that value doesn't make sense :( maybe you could try with different time windows when you see the error to ensure that the negative peaks are at the same time. I mean, if you have the last 3 hours and you see an error 20 minutes ago, try again with the last hour and check if the error is at the same time (if the peak is at the same time with 2 different visualizations, looks like metric problem)

@ilya-scale
Copy link
Author

I have some issues with persistence, so I cannot go back in time to check unfortunately, but let me wait until it pops up again, and I can check raw prometheus data from prometheus pod to double check

@ilya-scale
Copy link
Author

Here is the result from raw prometheus data (grafana is not involved):

image

So I guess that proves the issue is with the raw data?

@JorTurFer
Copy link
Member

So, definitively KEDA is who is exposing that value :/
Do you have any idea about why this could be happening @v-shenoy @tomkerkhove ?

@ilya-scale
Copy link
Author

ilya-scale commented Apr 25, 2023

I can also add that my setup is 2 - 8 replicas, since I have 8 partitions in a hub. Usually it runs with 2 pods, each listening to 4 partitions each.

Azure.Messaging.EventHubs.Processor 5.7.5, Azure Premium Storage Account.

@stale
Copy link

stale bot commented Jun 24, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Jun 24, 2023
@stale
Copy link

stale bot commented Jul 1, 2023

This issue has been automatically closed due to inactivity.

@stale stale bot closed this as completed Jul 1, 2023
@ilya-scale
Copy link
Author

ilya-scale commented Jul 3, 2023

I suppose the issue was not really fixed, just closed? Can we reopen it?

@JorTurFer JorTurFer reopened this Jul 3, 2023
@stale stale bot removed the stale All issues that are marked as stale due to inactivity label Jul 3, 2023
@JorTurFer
Copy link
Member

sure!
One question, how often does it happen? we have recently released a product that uses EventHub, it's just to check it looking for this weird values

@ilya-scale
Copy link
Author

I have now turned the whole setup off since it is not possible to use with this issue, but as far as I remember it happened relatively often, more than once a day.

@JorTurFer
Copy link
Member

I'll check it tomorrow morning

@JorTurFer
Copy link
Member

In KEDA v2.10.1 still happens:
image

Do you have any idea about the root cause @tomkerkhove @v-shenoy . I guess that it's related with the circular buffer of the EventHub, but I'm not sure

@stale
Copy link

stale bot commented Sep 3, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Sep 3, 2023
@ilya-scale
Copy link
Author

I think this is still an active issue that should not be closed

@stale stale bot removed the stale All issues that are marked as stale due to inactivity label Sep 5, 2023
@JorTurFer JorTurFer added the stale-bot-ignore All issues that should not be automatically closed by our stale bot label Sep 5, 2023
@JorTurFer
Copy link
Member

JorTurFer commented Sep 5, 2023

@tomkerkhove , do you think that we could get a revision from anyone with knowledge about the eventhub SDK?
Maybe we are doing something wrong, or it's something wrong at some point. I do confirm that this happens (I see it also in my own products)

@tomkerkhove
Copy link
Member

Can you summarize the issue please?

@JorTurFer
Copy link
Member

JorTurFer commented Sep 10, 2023

The OP is using evenhub scaler and they monitor the exposed value during the time through KEDA's prometheus metrics. The problem is that periodically, KEDA is reporting a weird value:
image

I have seen the same weird behavior in our own clusters (SCRM) and I have the suspicious that it's because we are doing something wrong calculating the lag.
I'm not an expert in how event hub works, so maybe the issue is in any other place, but if someone who knows about event hub and the golang SDK could take a look and give us some feedback, it'd be nice :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working stale-bot-ignore All issues that should not be automatically closed by our stale bot
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants