KEDA capacity is very limited with Kafka scaler #911

Closed
alexandery opened this issue Jul 3, 2020 · 12 comments
Labels: bug (Something isn't working), stale (All issues that are marked as stale due to inactivity)

Comments

@alexandery

alexandery commented Jul 3, 2020

Issue description

Attempting to test how many ScaledObjects (SOs) can be supported by KEDA. Seeing issues/errors as the number of SOs approaches 200. Running v2 alpha as suggested by Zbynek.

Details:

  1. Have a Kafka cluster with 200 topics, 10 partitions each.
  2. Created 200 deployments, each serving a single topic.
  3. Created SOs in batches: 50, then 50 more, then 20 at a time.
    • The first 50 SOs resulted in pretty snappy behavior, with a few errors in the keda-operator logs related to connectivity to the Kafka brokers (see examples below). Published messages to all 50 topics.
    • The next 50 SOs were still reasonable, with a few more errors observed in keda-operator. Published messages to topics 51-100.
    • Each subsequent batch of 20 SOs showed more and more signs of degradation: with each batch it takes much longer for KEDA to start processing the newly added SOs (topics) and scale the deployments; errors become far more frequent, pretty much constant for most SOs; newly added SOs seem to be processed one by one - KEDA scales one or two deployments, then does nothing for minutes (many minutes at times), then a few more SOs/topics, and so on. At this point I started pushing messages only for the newly created SOs (20 topics at a time).
    • At 160 and 180 SOs/deployments I'm seeing a wall of errors and no scaling of the most recently added deployments (161-180); after about 70-90 minutes the messages were finally noticed and the deployments scaled to consume them.
  4. Observed keda-operator crashing, resulting in many 'evicted' instances.
    • Once, the running instance reported an existing lock and was unable to operate (resolved by scaling keda-operator down and back up).
    • Once, the running instance recovered on its own and resumed scaling operations.
  5. SOs are configured with: pollingInterval=5, cooldownPeriod=10

Errors observed on keda-operator:

{"level":"error","ts":1593791461.0258286,"logger":"scalehandler","msg":"Error getting scalers","object":{"apiVersion":"keda.sh/v1alpha1","kind":"ScaledObject","namespace":"keda-group-1","name":"keda-scaleobject-143"},"error":"error getting scaler for trigger #0: error creating kafka client: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tkeda/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kedacore/keda/pkg/scaling.(*scaleHandler).checkScalers\n\tkeda/pkg/scaling/scale_handler.go:183\ngithub.com/kedacore/keda/pkg/scaling.(*scaleHandler).startScaleLoop\n\tkeda/pkg/scaling/scale_handler.go:133"}


{"level":"error","ts":1593791461.5068305,"logger":"scalehandler","msg":"Error getting scalers","object":{"apiVersion":"keda.sh/v1alpha1","kind":"ScaledObject","namespace":"keda-group-1","name":"keda-scaleobject-70"},"error":"error getting scaler for trigger #0: error creating kafka client: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tkeda/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kedacore/keda/pkg/scaling.(*scaleHandler).checkScalers\n\tkeda/pkg/scaling/scale_handler.go:183\ngithub.com/kedacore/keda/pkg/scaling.(*scaleHandler).startScaleLoop\n\tkeda/pkg/scaling/scale_handler.go:133"}	

Resource utilization:

NAME CPU(cores) MEMORY(bytes)
keda-operator-7859ccf8b6-72dsq 208m 178Mi
keda-metrics-apiserver-8599957799-56zb8 5m 17Mi

Proposed actions:

  1. The Sarama client needs to be tested outside of KEDA, but with a similar code design, to see whether the issue lies with Sarama. We discussed this with Zbynek. (A rough standalone sketch appears after this list.)
    • I have no experience with Go, so my input here would either be low quality or take a long time.
  2. Review the design of the Kafka scaler.
    • Each SO gets its own instance of the Sarama client. I wonder how "heavy" that is and whether there are any alternatives to that approach.
  3. KEDA's design of a single metrics service and a single keda-operator needs to be questioned / discussed.
    • With high capacity demand (thousands of SOs) it would make a lot of sense to allow installing KEDA on a per-namespace basis, with metrics / SOs bound to whatever runs / configures them in that namespace.
    • I know this is not how it works now, and Zbynek mentioned it's possible to manually deploy keda-operator into different namespaces, but ideally the whole of KEDA should be deployable as multiple instances?
  4. I have reached out to Confluent and requested information about the capacity of my cluster and whether there are any operational details they can provide from their end. My current understanding is that it's a pretty capable offering, with thousands of clusters running and meeting decent SLAs.
    • Will report if anything of interest comes out of that.
  5. I will change the SO definitions and set pollingInterval to 30 seconds, but in my opinion the current level of load (200 topics to monitor) is too low to warrant this degradation in performance / responsiveness.
    • Will report on findings once that is done.
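
For illustration, here is a rough sketch of what such a standalone test could look like, assuming Go and the Shopify/sarama library that KEDA's Kafka scaler is built on; the broker address, topic names, and loop count are placeholders for this test setup, and this is not KEDA's actual scaler code. It creates a single Sarama client and queries offsets for all topics through it, which touches both action 1 (exercising Sarama outside KEDA) and the shared-client alternative from action 2:

```go
// Minimal standalone Sarama test (sketch). Broker address and topic names are
// placeholders; Confluent Cloud would additionally need TLS/SASL settings on
// the config, omitted here.
package main

import (
	"fmt"
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_0_0_0

	// One shared client for all 200 topics, instead of one client per
	// ScaledObject as described in action 2.
	client, err := sarama.NewClient([]string{"broker-1:9092"}, cfg)
	if err != nil {
		log.Fatalf("error creating kafka client: %v", err)
	}
	defer client.Close()

	for i := 1; i <= 200; i++ {
		topic := fmt.Sprintf("topic-%d", i)
		partitions, err := client.Partitions(topic)
		if err != nil {
			log.Printf("topic %s: %v", topic, err)
			continue
		}
		var headSum int64
		for _, p := range partitions {
			// Head (newest) offset per partition; a full lag check would also
			// fetch the consumer group's committed offset and subtract it.
			newest, err := client.GetOffset(topic, p, sarama.OffsetNewest)
			if err != nil {
				log.Printf("topic %s partition %d: %v", topic, p, err)
				continue
			}
			headSum += newest
		}
		fmt.Printf("%s: %d partitions, head offsets sum = %d\n", topic, len(partitions), headSum)
	}
}
```

Timing this loop (and a variant that re-creates the client per topic) against the same Confluent Cloud cluster should show whether client creation or offset fetching is the bottleneck, independently of KEDA.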

Specifications

  • KEDA Version: v2 alpha
  • Platform & Version: Azure AKS, Kafka is powered by Confluent Cloud (SaaS offering managed Kafka)
  • Kubernetes Version: 1.16.9 with 3 Azure DS2_V2 nodes (2 vCPUs, 7GB RAM)
  • Scaler(s): Kafka
alexandery added the bug label on Jul 3, 2020
@alexandery
Author

@zroubalik FYI

@zroubalik
Member

@alexandery thanks for the investigation and report!

3. KEDA's design of a single metrics service and a single keda-operator needs to be questioned / discussed.

   * With high capacity demand (thousands of SOs) it would make a lot of sense to allow installing KEDA on a per-namespace basis, with metrics / SOs bound to whatever runs / configures them in that namespace.
   * I know this is not how it works now, and Zbynek mentioned it's possible to manually deploy keda-operator into different namespaces, but ideally the whole of KEDA should be deployable as multiple instances?

Unfortunately, the current Kubernetes metrics server adapter implementation doesn't allow multiple instances in the cluster; that's not something that can be bypassed by KEDA. But this area definitely needs improvement.

So let's try multiple KEDA operator deployments, each watching a single namespace, and spread the deployments across those namespaces.

@chinnasamyb

Were you able to deploy multiple KEDA operators, each watching a single namespace, to handle the scalability issue of the scaler?

@stale

stale bot commented Oct 13, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Oct 13, 2021
@q26646

q26646 commented Oct 14, 2021

Hi there - we are adopting KEDA and came across this issue. Were you able to implement multiple KEDA operators?

The stale bot removed the stale label on Oct 14, 2021
@alexandery
Author

We ended up changing technologies, and I haven't focused on this particular issue since.

@stale

stale bot commented Dec 17, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Dec 17, 2021
@zroubalik
Member

zroubalik commented Dec 20, 2021

We have made performance improvements in recent KEDA versions (2.4.0 & 2.5.0).
We have identified some bottlenecks in the Kafka scaler, which should be fixed in 2.6.0: #2377

And there is still a problem in the upstream k8s HPA controller implementation: #2382

The stale bot removed the stale label on Dec 20, 2021
tomkerkhove moved this to Backlog in Roadmap - KEDA Core on Feb 10, 2022
@tomkerkhove
Member

@alexandery Would you be able to test with KEDA v2.6.0 please?

tomkerkhove moved this from To Do to Pending End-User Feedback in Roadmap - KEDA Core on Feb 11, 2022
@alexandery
Author

@tomkerkhove Unfortunately Kafka hasn't been a priority for us for quite some time, so I don't have the infrastructure now to test all of this again.

@stale

stale bot commented Apr 12, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale

stale bot commented Apr 19, 2022

This issue has been automatically closed due to inactivity.

The stale bot closed this as completed on Apr 19, 2022
Repository owner moved this from Pending End-User Feedback to Ready To Ship in Roadmap - KEDA Core on Apr 19, 2022
tomkerkhove moved this from Ready To Ship to Done in Roadmap - KEDA Core on Aug 3, 2022