Scaling is slow with many scalers #1449
Comments
Aside from the /scale requests handled in #1453, I think the main issue is the Get() at keda/pkg/scaling/scale_handler.go line 378 (as of fbfa738). I think that Get() call is not hitting the informer cache somehow and so is making it all the way to the apiserver. Will have to dig a bit into how the unstructured cache works.
So from muddling around in the code I've confirmed there are three places that trigger requests that are getting throttled and generally gumming up the works. The first is the isScalable get, which is fixed in #1453. The next is the unstructured object get mentioned above, which should be hitting the controller-runtime informer cache; I'll dig into that more later. And the third is the scaling get in scaling/executor/scale_scaledobjects.go. That one I think can be improved in two ways: the first is just straight up bumping the QPS on the rest client passed in to the scale client; the second is adding some better lazy evaluation in that RequestScale method, since the majority of the time the current scale isn't actually needed. More updates as I have them.
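For reference, bumping the rate limit would look roughly like this; client-go's defaults are about QPS 5 / Burst 10, and the numbers below are only illustrative, not a recommendation:

```go
import (
	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
)

// restConfigWithHigherLimits loads the usual kubeconfig/in-cluster config and
// raises the client-side rate limit above client-go's defaults (roughly QPS 5,
// Burst 10), which are what produce the "Throttling request took ..." log lines.
func restConfigWithHigherLimits() (*rest.Config, error) {
	cfg, err := ctrl.GetConfig()
	if err != nil {
		return nil, err
	}
	// Illustrative values only; suitable numbers depend on apiserver headroom.
	cfg.QPS = 30
	cfg.Burst = 60
	return cfg, nil
}
```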
Hmm, actually I think lazy eval isn't going to help there. How would you all feel about making some special case implementation for the common core types (basically just Deployment, I think?) that uses the plain object for the Get so it can use the informer cache and keep API overhead much lower? Would still need to fall back to /scale for custom resources and for setting things, but making a little abstraction over the scale client to short-circuit things for getting the current number of replicas seems like it would be okay. It's annoying to have special cases, but given how often Deployments are used, a small improvement there has a big impact.
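To make that concrete, here is a rough sketch of the kind of abstraction I mean; the names are invented for illustration and this is not KEDA's actual code:

```go
import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/scale"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// replicaReader short-circuits current-replica lookups for Deployments via the
// cached controller-runtime client, and uses the scale client (an apiserver
// round trip against the /scale subresource) for any other scalable kind.
type replicaReader struct {
	client      client.Client      // cached client from the manager
	scaleClient scale.ScalesGetter // fallback for non-Deployment targets
}

func (r *replicaReader) CurrentReplicas(ctx context.Context, gvr schema.GroupVersionResource, key types.NamespacedName) (int32, error) {
	if gvr.Group == "apps" && gvr.Resource == "deployments" {
		var d appsv1.Deployment
		if err := r.client.Get(ctx, key, &d); err != nil {
			return 0, err
		}
		if d.Spec.Replicas == nil {
			return 1, nil // Kubernetes defaults an unset replicas field to 1.
		}
		return *d.Spec.Replicas, nil
	}
	s, err := r.scaleClient.Scales(key.Namespace).Get(ctx, gvr.GroupResource(), key.Name, metav1.GetOptions{})
	if err != nil {
		return 0, err
	}
	return s.Spec.Replicas, nil
}
```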
Updates on my performance adventures! For the calls in scale_handler: I just hallucinated that Unstructureds support caching at all; they do not. So I've reworked that code to use a typed get where possible, which does hit the cache. For the executor, I put in the Deployment special case and it works very smoothly. It's slightly annoying to have the special case, but it's well contained and has a solid fallback. I'll get this all polished into PRs tomorrow morning, but with all these changes, the time from start to reconciling all ~700 ScaledObjects at least once goes from 3 hours to 90 seconds, so I think that should help performance a lot.
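For the curious, the "typed get if possible" rework in scale_handler looks roughly like this; it is a sketch under assumptions about the surrounding code, not the exact code in the PR:

```go
import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getScaleTarget reads the scale target as a typed object when it is an
// apps/v1 Deployment (served from the informer cache) and falls back to an
// unstructured Get, which goes straight to the apiserver, for any other kind.
func getScaleTarget(ctx context.Context, c client.Client, gvk schema.GroupVersionKind, key client.ObjectKey) (client.Object, error) {
	if gvk.Group == "apps" && gvk.Kind == "Deployment" {
		d := &appsv1.Deployment{}
		return d, c.Get(ctx, key, d)
	}
	u := &unstructured.Unstructured{}
	u.SetGroupVersionKind(gvk)
	return u, c.Get(ctx, key, u)
}
```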
Okay, all three issues have PRs open. I split them up for easier review, but all three are important for different aspects of performance :)
Running a build with the 3 PRs listed here greatly improves scaling with 1000 ScaledObjects using the Kafka scaler. The creation of HPA objects still takes a long time once there are around 500+ ScaledObjects, but that would probably be another PR. Thanks to @coderanger for creating these patches. I hope we can see them merged soon.
Thanks for the thorough analysis! I'll leave it up to @zroubalik
@coderanger that's really awesome. Thanks for the deep research and investigation. Unfortunately, we still don't have a way to test KEDA behavior with that many scalers to spot this problem, so this will help a lot! When implementing the relevant parts of the code, I was initially thinking about special cases for some resources such as Deployment. I'll take a look at your PRs (and thanks for splitting them, it definitely helps with the review) 🎉
@coderanger would you be open to testing more tweaks in your large setup? For a long time I've wanted to get some data on raising the number of concurrent reconciles: https://sdk.operatorframework.io/docs/building-operators/golang/tutorial/#controller-configurations
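Something along these lines, a minimal sketch assuming a standard controller-runtime setup; the reconciler, import path, and the value 5 are placeholders rather than the exact KEDA identifiers:

```go
import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	kedav1alpha1 "github.com/kedacore/keda/v2/apis/keda/v1alpha1" // assumed import path
)

type ScaledObjectReconciler struct {
	client.Client
}

func (r *ScaledObjectReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Real reconcile logic lives in KEDA; elided in this sketch.
	return ctrl.Result{}, nil
}

// SetupWithManager registers the reconciler and raises the number of
// ScaledObjects that can be reconciled in parallel (the default is 1).
func (r *ScaledObjectReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&kedav1alpha1.ScaledObject{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 5}).
		Complete(r)
}
```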
@zroubalik I can try that but I don't think it will help. The limiting factor at this point isn't the reconcile time, it's the client-side API call rate limit. Doing more parallel reconciles will just leave more things blocked in a workqueue wait :) |
Are you sure that the bottleneck is the client-side API call rate limit and not the calls to external services to get the metrics (in your case the RabbitMQ)? Just to clarify, I am not trying to say you are wrong, just wanted to double check ;-) |
We tested the PRs listed in this issue with 1000 Kafka ScaledObjects. Running all three patches improved the response time for scaling from unusable to roughly equivalent to running 10 operators that each handle 100 Kafka ScaledObjects, which is good. There are still performance issues with the initial setup of HPAs as the number of ScaledObjects increases; however, that does not affect scaling and shouldn't block the patches. For example, it takes about 2 minutes for the 1001st ScaledObject to set up its HPA.
@mariusgrigoriu thanks for the info!
You mean that you had 1000 ScaledObjects deployed and ready (HPAs were created), then you deployed ScaledObject number 1001, and the operation for this SO took ~2 minutes?
The patches mentioned in this issue address the issue of scaling responsiveness. However, they don't address how long it takes for the HPAs to be created. In fairness, that should probably be logged as a separate issue. The creation of HPAs starts slowing down massively after around 500 ScaledObjects. We had 1000 ScaledObjects, added one more, and waited a couple of minutes for that one to become ready. Actual scaling of pods works well with the patches above.
Can you give some more information on this or has this been resolved in a newer version? |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed due to inactivity. |
I finally got around to doing the 2.0 upgrade, and I'm a little confused by some performance issues.
I have around 700 RabbitMQ ScaledObjects. I wrote a script to convert all of them from the v1 to v2 format, and then did the upgrade. Just to see it reconcile every object once (to add the finalizers) took ~3 hours. And the logs are full of
Throttling request took 29.426205883s, request: GET:https://10.0.0.1:443/apis/apps/v1/namespaces/[...]/deployments/[...]
so I'm guessing it's making an excessive number of API requests, getting throttled, and taking a long time. The part I'm confused about is that a Get request like that should be in the controller-runtime cache already and so shouldn't even hit the apiserver. It wasn't clear to me where in the controller they are even coming from.
Expected Behavior
2.0 upgrade should take very little time.
Actual Behavior
2.0 upgrade took many hours during which scaling was unavailable.
Steps to Reproduce the Problem
Logs from KEDA operator
Note that every one of those throttled requests is for a different deployment, with no obvious order.
Specifications