KEDA capacity is very limited with Kafka scaler #911

Closed
alexandery opened this issue Jul 3, 2020 · 12 comments
Labels: bug (Something isn't working), stale (All issues that are marked as stale due to inactivity)

Comments

@alexandery

alexandery commented Jul 3, 2020

Issue description

Attempting to test how many ScaledObjects (SOs) can be supported by KEDA. Seeing issues/errors as the number of SOs approaches 200. Running v2 alpha as suggested by Zbynek.

Details:

  1. Have a Kafka cluster with 200 topics, 10 partitions each.
  2. Created 200 deployments, each serving a single topic.
  3. Created SOs in batches: 50, then 50 more, then 20 at a time.
    • The first 50 SOs resulted in pretty snappy behavior, with a few errors in the keda-operator logs related to connectivity to the Kafka brokers (see examples below). Published messages to all 50 topics.
    • The next 50 SOs were still reasonable, with a few more errors observed in keda-operator. Published messages to topics 51-100.
    • Each subsequent batch of 20 SOs showed more and more signs of degradation: with each batch it takes much longer for KEDA to start processing the newly added SOs (topics) and scale the deployments; errors become far more frequent, pretty much constant for most SOs; newly added SOs seem to be processed one by one - KEDA scales one or two deployments, then does nothing for minutes (many minutes at times), then a few more SOs/topics, and so on. At this point I started pushing messages only for the newly created SOs (20 topics at a time).
    • At 160 and 180 SOs/deployments I'm seeing a wall of errors and no scaling of the most recently added deployments (161-180); after about 70-90 minutes the messages were finally noticed and the deployments scaled to consume them.
  4. Observed keda-operator crashing, resulting in many 'evicted' instances.
    • Once, the running instance reported an existing lock and was unable to operate (resolved by scaling keda-operator down and back up).
    • Once, the running instance recovered on its own and resumed scaling operations.
  5. SOs are configured with: pollingInterval=5, cooldownPeriod=10

Errors observed on keda-operator:

{"level":"error","ts":1593791461.0258286,"logger":"scalehandler","msg":"Error getting scalers","object":{"apiVersion":"keda.sh/v1alpha1","kind":"ScaledObject","namespace":"keda-group-1","name":"keda-scaleobject-143"},"error":"error getting scaler for trigger #0: error creating kafka client: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tkeda/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kedacore/keda/pkg/scaling.(*scaleHandler).checkScalers\n\tkeda/pkg/scaling/scale_handler.go:183\ngithub.com/kedacore/keda/pkg/scaling.(*scaleHandler).startScaleLoop\n\tkeda/pkg/scaling/scale_handler.go:133"}


{"level":"error","ts":1593791461.5068305,"logger":"scalehandler","msg":"Error getting scalers","object":{"apiVersion":"keda.sh/v1alpha1","kind":"ScaledObject","namespace":"keda-group-1","name":"keda-scaleobject-70"},"error":"error getting scaler for trigger #0: error creating kafka client: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\tkeda/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kedacore/keda/pkg/scaling.(*scaleHandler).checkScalers\n\tkeda/pkg/scaling/scale_handler.go:183\ngithub.com/kedacore/keda/pkg/scaling.(*scaleHandler).startScaleLoop\n\tkeda/pkg/scaling/scale_handler.go:133"}	

Resource utilization:

NAME CPU(cores) MEMORY(bytes)
keda-operator-7859ccf8b6-72dsq 208m 178Mi
keda-metrics-apiserver-8599957799-56zb8 5m 17Mi

Proposed actions:

  1. The Sarama client needs to be tested outside of KEDA, but with a similar code design, to see whether the issue lies with Sarama. We discussed this with Zbynek. (A rough standalone sketch appears after this list.)
    • I have no experience with Go, so my input here would either be low quality or take a long time.
  2. Review the design of the Kafka scaler.
    • Each SO gets its own instance of the Sarama client. I wonder how "heavy" that is and whether there are any alternatives to that approach.
  3. KEDA's design of a single metrics service and a single keda-operator needs to be questioned / discussed.
    • With high capacity demand (thousands of SOs) it would make a lot of sense to allow installing KEDA on a per-namespace basis, with metrics / SOs bound to whatever runs / configures them in that namespace.
    • I know this is not how it works now, and Zbynek mentioned it's possible to manually deploy keda-operator into different namespaces, but ideally the whole of KEDA should be deployable as multiple instances?
  4. I have reached out to Confluent and requested information about the capacity of my cluster and whether there are any operational details they can provide from their end. My current understanding is that it's a pretty capable offering, with thousands of clusters running and meeting decent SLAs.
    • Will report if anything of interest comes out of that.
  5. I will change the SO definitions and set pollingInterval to 30 seconds, but in my opinion the current level of load (200 topics to monitor) is too low to warrant this degradation in performance / responsiveness.
    • Will report on findings once that is done.
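
For illustration, here is a rough sketch of what such a standalone test could look like, assuming Go and the Shopify/sarama library that KEDA's Kafka scaler is built on; the broker address, topic names, and loop count are placeholders for this test setup, and this is not KEDA's actual scaler code. It creates a single Sarama client and queries offsets for all topics through it, which touches both action 1 (exercising Sarama outside KEDA) and the shared-client alternative from action 2:

```go
// Minimal standalone Sarama test (sketch). Broker address and topic names are
// placeholders; Confluent Cloud would additionally need TLS/SASL settings on
// the config, omitted here.
package main

import (
	"fmt"
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_0_0_0

	// One shared client for all 200 topics, instead of one client per
	// ScaledObject as described in action 2.
	client, err := sarama.NewClient([]string{"broker-1:9092"}, cfg)
	if err != nil {
		log.Fatalf("error creating kafka client: %v", err)
	}
	defer client.Close()

	for i := 1; i <= 200; i++ {
		topic := fmt.Sprintf("topic-%d", i)
		partitions, err := client.Partitions(topic)
		if err != nil {
			log.Printf("topic %s: %v", topic, err)
			continue
		}
		var headSum int64
		for _, p := range partitions {
			// Head (newest) offset per partition; a full lag check would also
			// fetch the consumer group's committed offset and subtract it.
			newest, err := client.GetOffset(topic, p, sarama.OffsetNewest)
			if err != nil {
				log.Printf("topic %s partition %d: %v", topic, p, err)
				continue
			}
			headSum += newest
		}
		fmt.Printf("%s: %d partitions, head offsets sum = %d\n", topic, len(partitions), headSum)
	}
}
```

Timing this loop (and a variant that re-creates the client per topic) against the same Confluent Cloud cluster should show whether client creation or offset fetching is the bottleneck, independently of KEDA.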

Specifications

  • KEDA Version: v2 alpha
  • Platform & Version: Azure AKS, Kafka is powered by Confluent Cloud (SaaS offering managed Kafka)
  • Kubernetes Version: 1.16.9 with 3 Azure DS2_V2 nodes (2 vCPUs, 7GB RAM)
  • Scaler(s): Kafka
alexandery added the bug label on Jul 3, 2020
@alexandery
Author

@zroubalik FYI

@zroubalik
Member

@alexandery thanks for the investigation and report!

3. KEDA's design of a single metrics service and a single keda-operator needs to be questioned / discussed.

   * With high capacity demand (thousands of SOs) it would make a lot of sense to allow installing KEDA on a per-namespace basis, with metrics / SOs bound to whatever runs / configures them in that namespace.
   * I know this is not how it works now, and Zbynek mentioned it's possible to manually deploy keda-operator into different namespaces, but ideally the whole of KEDA should be deployable as multiple instances?

Unfortunately, the current Kubernetes metrics server adapter implementation doesn't allow multiple instances in the cluster; that's not something that can be bypassed by KEDA. But this area definitely needs improvement.

So let's try multiple KEDA operator deployments, each watching a single namespace, and spread the deployments across those namespaces.

@chinnasamyb

Were you able to deploy multiple KEDA operators, each watching a single namespace, to handle the scalability issue of the scaler?

@stale

stale bot commented Oct 13, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Oct 13, 2021
@q26646

q26646 commented Oct 14, 2021

Hi there - we are adopting KEDA and came across this issue. Were you able to implement multiple KEDA operators?

The stale bot removed the stale label on Oct 14, 2021
@alexandery
Author

We ended up changing technologies, and I haven't focused on this particular issue since.

@stale

stale bot commented Dec 17, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Dec 17, 2021
@zroubalik
Member

zroubalik commented Dec 20, 2021

We have made performance improvements in recent KEDA versions (2.4.0 & 2.5.0).
We have identified some bottlenecks in the Kafka scaler, which should be fixed in 2.6.0: #2377

And there is still a problem in the upstream k8s HPA controller implementation: #2382

The stale bot removed the stale label on Dec 20, 2021
tomkerkhove moved this to Backlog in Roadmap - KEDA Core on Feb 10, 2022
@tomkerkhove
Member

@alexandery Would you be able to test with KEDA v2.6.0 please?

tomkerkhove moved this from To Do to Pending End-User Feedback in Roadmap - KEDA Core on Feb 11, 2022
@alexandery
Author

@tomkerkhove Unfortunately Kafka hasn't been a priority for us for quite some time, so I don't have the infrastructure now to test all of this again.

@stale

stale bot commented Apr 12, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale

stale bot commented Apr 19, 2022

This issue has been automatically closed due to inactivity.

The stale bot closed this as completed on Apr 19, 2022
Repository owner moved this from Pending End-User Feedback to Ready To Ship in Roadmap - KEDA Core on Apr 19, 2022
tomkerkhove moved this from Ready To Ship to Done in Roadmap - KEDA Core on Aug 3, 2022