Polling interval causes massive CPU use #799

Closed
jocelynthode opened this issue Sep 28, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@jocelynthode
Contributor

Report

Currently, the ScaledObject polling interval is hard-coded to 1 second. This seems to cause massive CPU usage on our end.

We currently have 120+ HTTPScaledObjects, meaning we have 120+ ScaledObjects.

Our KEDA operator is hovering around 6000m of CPU usage, and its logs are flooded every second with many lines like:

2023-09-28T05:45:25Z	INFO	Reconciling ScaledObject	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"ldap-ui","namespace":"ldap-ui"}, "namespace": "ldap-ui", "name": "ldap-ui", "reconcileID": "07460f4d-23d2-4157-b011-bdab317e09cc"}
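To check how often each ScaledObject is actually reconciled, the log lines can be counted per object. The following is a minimal sketch, assuming the operator log is tab-separated (timestamp, level, message, JSON payload) as in the line above; the function name `count_reconciles` and the shortened sample payload are illustrative only.

```python
import json
from collections import Counter


def count_reconciles(lines):
    """Count 'Reconciling ScaledObject' log entries per namespace/name."""
    counts = Counter()
    for line in lines:
        # Assumed layout: timestamp \t level \t message \t JSON payload
        parts = line.rstrip("\n").split("\t", 3)
        if len(parts) != 4 or parts[2] != "Reconciling ScaledObject":
            continue
        try:
            fields = json.loads(parts[3])
        except json.JSONDecodeError:
            continue  # skip lines whose payload is not valid JSON
        counts[f'{fields.get("namespace")}/{fields.get("name")}'] += 1
    return counts


# Shortened version of the log line quoted in this issue.
sample = (
    "2023-09-28T05:45:25Z\tINFO\tReconciling ScaledObject\t"
    '{"controller": "scaledobject", "namespace": "ldap-ui", "name": "ldap-ui"}'
)
print(count_reconciles([sample, sample, "unrelated line"]))
```

Feeding it a fixed time window of operator logs and dividing each count by the window length gives an approximate reconcile rate per object.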

As a workaround, I would love to be able to choose the pollingInterval myself. However, increasing the pollingInterval would reduce CPU usage at the cost of slower scaling, so I wonder whether this could be handled with a push method rather than a pull method.

Expected Behavior

I would expect http-add-on not to cause such massive CPU usage.

Actual Behavior

CPU usage increases linearly with the number of ScaledObjects, requiring ~6 CPUs for 120 HTTPScaledObjects.
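As a rough back-of-the-envelope check of the figures reported here, the sketch below assumes each ScaledObject is reconciled exactly once per polling interval and that CPU cost is strictly linear in the number of objects. Both are simplifying assumptions, not measured behavior, and the function names are illustrative.

```python
def reconciles_per_second(num_objects, polling_interval_s):
    """Reconcile rate if every object is evaluated once per polling interval."""
    return num_objects / polling_interval_s


def cpu_per_object_millicores(total_millicores, num_objects):
    """Average CPU per object, assuming cost is linear in the object count."""
    return total_millicores / num_objects


# Figures from this report: 120 objects, 1 s interval, ~6000m total CPU.
print(reconciles_per_second(120, 1))         # 120.0 reconciles per second
print(reconciles_per_second(120, 15))        # 8.0 reconciles per second at 15 s
print(cpu_per_object_millicores(6000, 120))  # 50.0 millicores per object
```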

Steps to Reproduce the Problem

  1. Create a lot of HTTPScaledObjects
  2. Check CPU Usage of the keda-operator pod

Logs from KEDA HTTP operator

example

HTTP Add-on Version

Other

Kubernetes Version

1.25

Platform

Other

Anything else?

No response

@JorTurFer
Member

JorTurFer commented Oct 5, 2023

Hello
Do you have any ScaledObjects other than those generated from HTTPScaledObjects? I'm not sure this is the root cause (I'm not saying it isn't, just that I'm not sure), because we are implementing load tests for the operator and we handle 1K ScaledObjects (also with pollingInterval: 1) using just a single CPU in ideal conditions.
I know that ideal conditions are not realistic, but the difference is huge. Could the scaler component be throttled, slowing down the operator?

@JorTurFer
Member

Even so, I think we can increase the polling interval to 15 seconds in general, because the current approach is already push-based: we use the external-push scaler, not external. The scaler itself actively pushes when it is activated, so we don't need to evaluate it every second to scale up; the external-push scaler does that implicitly.
For scaling itself, the HPA controller requests metrics every 15 seconds, so a 15-second polling interval should be enough to provide fresh metrics on each request.
WDYT @tomkerkhove @t0rr3sp3dr0?

Are you willing to contribute the fix, @jocelynthode (once they have shared their thoughts)?
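The push-versus-pull argument in this comment can be illustrated with a toy model. It is only a sketch under stated assumptions: poll ticks happen at exact multiples of the interval, network and processing latency are ignored, and the function names are hypothetical. It does not reflect how KEDA or the external-push scaler are implemented.

```python
import math


def polling_detection_delay(event_time_s, polling_interval_s):
    """Seconds until the next poll tick at or after the event.

    Assumes ticks at 0, T, 2T, ... where T is the polling interval.
    """
    next_tick = math.ceil(event_time_s / polling_interval_s) * polling_interval_s
    return next_tick - event_time_s


def push_detection_delay(event_time_s):
    """With a push model the scaler notifies immediately (latency ignored)."""
    return 0.0


# A request arriving 20 s after start:
print(polling_detection_delay(20, 15))  # 10: waits for the tick at 30 s
print(polling_detection_delay(20, 1))   # 0: a tick happens at 20 s
print(push_detection_delay(20))         # 0.0: no dependence on the interval
```

In this model the activation delay under push does not depend on the polling interval at all, which is why raising the interval to 15 seconds would not slow down scale-up in a push-based setup.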

@jocelynthode
Contributor Author

I would be willing to contribute the fix. I'll be on holiday for two weeks but could take this on after my return :).

All our ScaledObjects are generated from HTTPScaledObjects, as we only use KEDA in conjunction with http-add-on. (I should probably open a PR to add my company to the list of http-add-on users, as we're currently using it in prod as well.)

I had no idea there were load tests. My analysis of the CPU usage might be wrong. I guessed this was the issue because CPU usage has increased almost linearly with our growing number of HTTPScaledObjects, and the reconciling lines are the only lines I can see in the logs, where they are being spammed heavily.

It might also be a misconfiguration on my end. We're virtually only doing scale-to-zero. Our goal with this add-on is to reduce our footprint by not running unused workloads when no one is accessing them, so all our HTTPScaledObjects have a minimum replica count of 0.

If you have some pointers for me, I would also be willing to investigate the issue further to make sure it's not caused by some configuration on my end.

The http-external-scaler seems to consume some CPU as well. Checking the usage for the past 12 hours, it's a bit lower than last time, but we're at around 3.5 CPUs for the operator and 2 CPUs for the external-scaler:
image

@jocelynthode
Contributor Author

Since upgrading to 0.6.0, the problem seems to have disappeared. I'll still submit a PR to align the interval to 15 seconds, but I'll go ahead and close this issue.

@JorTurFer
Member

I think this is the real fix, introduced in v0.6.0, that has reduced the CPU usage: 8ea0896

@jocelynthode
Contributor Author

Ah, interesting, thanks :)

Projects
Status: Done
Development

No branches or pull requests

2 participants