Scaling is slow with many scalers #1449
Comments
Aside from the /scale requests handled in #1453, I think the main issue is the Get() at keda/pkg/scaling/scale_handler.go line 378 (as of fbfa738). I think that Get() call is not hitting the informer cache somehow and so is making it all the way to the apiserver. Will have to dig a bit into how the unstructured cache works.
So from muddling around in the code I've confirmed there are three places that trigger requests that are getting throttled and generally gumming up the works. The first is the isScalable get, which is fixed in #1453. The next is the unstructured object get mentioned above, which should be hitting the controller-runtime informer cache; I'll dig into that more later. And the third is the scaling get in scaling/executor/scale_scaledobjects.go. That one I think can be improved in two ways: the first is just straight up bumping the QPS on the rest client passed in to the scale client; the second is adding some better lazy evaluation in that RequestScale method, since the majority of the time the current scale isn't actually needed. More updates as I have them.
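For reference, bumping the rate limit would look roughly like this; client-go's defaults are about QPS 5 / Burst 10, and the numbers below are only illustrative, not a recommendation:

```go
import (
	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
)

// restConfigWithHigherLimits loads the usual kubeconfig/in-cluster config and
// raises the client-side rate limit above client-go's defaults (roughly QPS 5,
// Burst 10), which are what produce the "Throttling request took ..." log lines.
func restConfigWithHigherLimits() (*rest.Config, error) {
	cfg, err := ctrl.GetConfig()
	if err != nil {
		return nil, err
	}
	// Illustrative values only; suitable numbers depend on apiserver headroom.
	cfg.QPS = 30
	cfg.Burst = 60
	return cfg, nil
}
```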
Hmm, actually I think lazy eval isn't going to help there. How would you all feel about making some special case implementation for the common core types (basically just Deployment, I think?) that uses the plain object for the Get so it can use the informer cache and keep API overhead much lower? Would still need to fall back to /scale for custom resources and for setting things, but making a little abstraction over the scale client to short-circuit things for getting the current number of replicas seems like it would be okay. It's annoying to have special cases, but given how often Deployments are used, a small improvement there has a big impact.
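To make that concrete, here is a rough sketch of the kind of abstraction I mean; the names are invented for illustration and this is not KEDA's actual code:

```go
import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/scale"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// replicaReader short-circuits current-replica lookups for Deployments via the
// cached controller-runtime client, and uses the scale client (an apiserver
// round trip against the /scale subresource) for any other scalable kind.
type replicaReader struct {
	client      client.Client      // cached client from the manager
	scaleClient scale.ScalesGetter // fallback for non-Deployment targets
}

func (r *replicaReader) CurrentReplicas(ctx context.Context, gvr schema.GroupVersionResource, key types.NamespacedName) (int32, error) {
	if gvr.Group == "apps" && gvr.Resource == "deployments" {
		var d appsv1.Deployment
		if err := r.client.Get(ctx, key, &d); err != nil {
			return 0, err
		}
		if d.Spec.Replicas == nil {
			return 1, nil // Kubernetes defaults an unset replicas field to 1.
		}
		return *d.Spec.Replicas, nil
	}
	s, err := r.scaleClient.Scales(key.Namespace).Get(ctx, gvr.GroupResource(), key.Name, metav1.GetOptions{})
	if err != nil {
		return 0, err
	}
	return s.Spec.Replicas, nil
}
```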
Updates on my performance adventures! For the calls in scale_handler: I just hallucinated that Unstructureds support caching at all; they do not. So I've reworked that code to use a typed get where possible, which does hit the cache. For the executor, I put in the Deployment special case and it works very smoothly. It's slightly annoying to have the special case, but it's well contained and has a solid fallback. I'll get this all polished into PRs tomorrow morning, but with all these changes, the time from start to reconciling all ~700 ScaledObjects at least once goes from 3 hours to 90 seconds, so I think that should help performance a lot.
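For the curious, the "typed get if possible" rework in scale_handler looks roughly like this; it is a sketch under assumptions about the surrounding code, not the exact code in the PR:

```go
import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getScaleTarget reads the scale target as a typed object when it is an
// apps/v1 Deployment (served from the informer cache) and falls back to an
// unstructured Get, which goes straight to the apiserver, for any other kind.
func getScaleTarget(ctx context.Context, c client.Client, gvk schema.GroupVersionKind, key client.ObjectKey) (client.Object, error) {
	if gvk.Group == "apps" && gvk.Kind == "Deployment" {
		d := &appsv1.Deployment{}
		return d, c.Get(ctx, key, d)
	}
	u := &unstructured.Unstructured{}
	u.SetGroupVersionKind(gvk)
	return u, c.Get(ctx, key, u)
}
```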
Okay, all three issues have PRs open. I split them up for easier review, but all three are important for different aspects of performance :)
Running a build with the 3 PRs listed here greatly improves scaling with 1000 ScaledObjects using the Kafka scaler. The creation of HPA objects still takes a long time once there are around 500+ ScaledObjects, but that would probably be another PR. Thanks to @coderanger for creating these patches. I hope we can see them merged soon.
Thanks for the thorough analysis! I'll leave it up to @zroubalik
@coderanger that's really awesome. Thanks for the deep research and investigation. Unfortunately, we still don't have a way to test KEDA behavior with that many scalers to spot this problem, so this will help a lot! When implementing the relevant parts of the code, I was initially thinking about special cases for some resources such as Deployment. I'll take a look at your PRs (and thanks for splitting them, it definitely helps with the review) 🎉
@coderanger would you be open to testing more tweaks in your large setup? For a long time I've wanted to get some data on raising the number of concurrent reconciles: https://sdk.operatorframework.io/docs/building-operators/golang/tutorial/#controller-configurations
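Something along these lines, a minimal sketch assuming a standard controller-runtime setup; the reconciler, import path, and the value 5 are placeholders rather than the exact KEDA identifiers:

```go
import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	kedav1alpha1 "github.com/kedacore/keda/v2/apis/keda/v1alpha1" // assumed import path
)

type ScaledObjectReconciler struct {
	client.Client
}

func (r *ScaledObjectReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Real reconcile logic lives in KEDA; elided in this sketch.
	return ctrl.Result{}, nil
}

// SetupWithManager registers the reconciler and raises the number of
// ScaledObjects that can be reconciled in parallel (the default is 1).
func (r *ScaledObjectReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&kedav1alpha1.ScaledObject{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 5}).
		Complete(r)
}
```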
@zroubalik I can try that but I don't think it will help. The limiting factor at this point isn't the reconcile time, it's the client-side API call rate limit. Doing more parallel reconciles will just leave more things blocked in a workqueue wait :) |
Are you sure that the bottleneck is the client-side API call rate limit and not the calls to external services to get the metrics (in your case the RabbitMQ)? Just to clarify, I am not trying to say you are wrong, just wanted to double check ;-) |
We tested the PRs listed in this issue with 1000 Kafka ScaledObjects. Running all three patches improved the response time for scaling from unusable to roughly equivalent to running 10 operators that each handle 100 Kafka ScaledObjects, which is good. There are still performance issues with the initial setup of HPAs as the number of ScaledObjects increases; however, that does not affect scaling and shouldn't block the patches. For example, it takes about 2 minutes for the 1001st ScaledObject to set up its HPA.
@mariusgrigoriu thanks for the info!
You mean that you had 1000 ScaledObjects deployed and ready (HPAs were created), then you deployed ScaledObject number 1001, and the operation for this SO took ~2 minutes?
The patches mentioned in this issue address the issue of scaling responsiveness. However, they don't address how long it takes for the HPAs to be created. In fairness, that should probably be logged as a separate issue. The creation of HPAs starts slowing down massively after around 500 ScaledObjects. We had 1000 ScaledObjects, added one more, and waited a couple of minutes for that one to become ready. Actual scaling of pods works well with the patches above.
Can you give some more information on this or has this been resolved in a newer version? |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed due to inactivity. |
I finally got around to doing the 2.0 upgrade, and I'm a little confused by some performance issues.
I have around 700 RabbitMQ ScaledObjects. I wrote a script to convert all of them from the v1 to v2 format, and then did the upgrade. Just to see it reconcile every object once (to add the finalizers) took ~3 hours. And the logs are full of
Throttling request took 29.426205883s, request: GET:https://10.0.0.1:443/apis/apps/v1/namespaces/[...]/deployments/[...]
so I'm guessing it's making an excessive number of API requests, getting throttled, and taking a long time. The part I'm confused about is that a Get request like that should be in the controller-runtime cache already and so shouldn't even hit the apiserver. It wasn't clear to me where in the controller they are even coming from.
Expected Behavior
2.0 upgrade should take very little time.
Actual Behavior
2.0 upgrade took many hours during which scaling was unavailable.
Steps to Reproduce the Problem
Logs from KEDA operator
Note that every one of those throttled requests is for a different deployment, with no obvious order.
Specifications