
feat: Add a Prometheus metric for measuring the scalable object loop processing deviation #4703

Merged
merged 7 commits into from
Jun 22, 2023

Conversation

JorTurFer
Member

@JorTurFer JorTurFer commented Jun 16, 2023

Checklist

Fixes #4702

Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>
@JorTurFer JorTurFer requested a review from a team as a code owner June 16, 2023 19:29
@github-actions

Thank you for your contribution! 🙏 We will review your PR as soon as possible.


@JorTurFer JorTurFer changed the title feat: Add a Promethean metric for measuring the scalable object loop processing deviation feat: Add a Prometheus metric for measuring the scalable object loop processing deviation Jun 16, 2023
Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>
@JorTurFer
Member Author

JorTurFer commented Jun 16, 2023

/run-e2e sequential
Update: You can check the progress here

Signed-off-by: Jorge Turrado <jorge_turrado@hotmail.es>
Member

@zroubalik zroubalik left a comment


I wonder, do you see any actual latency that could be measured and exposed here? I mean, one in the loop itself? We are already reporting latency per scaler. I am not sure whether this kind of metric won't be confusing for users?

@JorTurFer
Member Author

I wonder, do you see any actual latency that could be measured and exposed here? I mean, one in the loop itself? We are already reporting latency per scaler. I am not sure whether this kind of metric won't be confusing for users?

The difference here is what we are measuring. In one case we measure the trigger latency; in the other, the deviation between loops. They are similar, but not the same. For example:
If I have a single ScaledObject with pollingInterval: 1 and 12 triggers, each with 100 ms of latency, the per-scaler latency looks fine, but the ScaledObject will run with a deviation of 200 ms all the time.

These new metrics measure the difference between the expected execution time and the real execution time, so we get a real picture of the overload we have. If the loop should have executed at time X and we execute it later, something is slowing us down; it could be the triggers' latency, but also an overload.

If we have throttling, both metrics will increase, but an increase of keda_scaler_metrics_latency doesn't necessarily mean an overload in KEDA (because upstream can simply be responding slowly), whereas an increase of these new metrics does.

@JorTurFer
Member Author

For example, this is how it looks when there is an overload (I have forced it using 1000 ScaledObjects with pollingInterval: 1 and cpu: 120m) and upstream without any problem:
image

This is how it looks when KEDA isn't overloaded (cpu: 1) and it's the upstream that responds slowly:
image

Member

@zroubalik zroubalik left a comment


The numbers are interesting, great stuff.

Let's proceed with keda_internal_scale_loop_latency as we discussed offline.

JorTurFer and others added 2 commits June 21, 2023 12:39
Signed-off-by: Jorge Turrado <jorge.turrado@scrm.lidl>
Signed-off-by: Jorge Turrado Ferrero <Jorge_turrado@hotmail.es>
@JorTurFer
Member Author

JorTurFer commented Jun 21, 2023

/run-e2e sequential
Update: You can check the progress here

Signed-off-by: Jorge Turrado <jorge.turrado@scrm.lidl>
Member

@zroubalik zroubalik left a comment


LGTM, great stuff!

Signed-off-by: Jorge Turrado <jorge.turrado@scrm.lidl>
@zroubalik
Member

zroubalik commented Jun 21, 2023

/run-e2e sequential
Update: You can check the progress here

@tomkerkhove
Member

Let's make sure we also update our documentation to ensure everyone knows what it represents and how to interpret it

@JorTurFer
Member Author

Let's make sure we also update our documentation to ensure everyone knows what it represents and how to interpret it

Yeah, I plan to do it later on. I opened this PR first to make sure it gets included in the release (in case the release had been cut today).
Docs update incoming

@zroubalik zroubalik merged commit a634b66 into kedacore:main Jun 22, 2023
@JorTurFer JorTurFer deleted the cycle-delay branch July 27, 2023 07:16
Successfully merging this pull request may close these issues.

Add a Prometheus metric for measuring the scalable object loop processing deviation
3 participants