Kafka Consumers Scheduling and Scaling #1537

Closed
pierDipi opened this issue Nov 24, 2021 · 20 comments
Labels
area/api, area/broker (Kafka Broker related issues), area/channel (Kafka Channel related issues), area/control-plane, area/source (Kafka Source related issues), kind/feature-request, priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release), triage/accepted (Issues which should be fixed, post-triage)

Comments

@pierDipi
Member

pierDipi commented Nov 24, 2021

Problem

Currently, each Dispatcher replica instantiates a Consumer for each Trigger, all using the same consumer group. As the number of Triggers grows, the Dispatcher's resources must be increased as well, and there is no way to configure how many consumers to run for a specific Trigger.
This proposal allows partitioning consumers across Dispatcher replicas and gives fine-grained control over the parallelism of each Trigger.

Design doc: https://docs.google.com/document/d/1UktwiDyqq07MtA7pUlahEpux5CCdAsyI6k3nkQeeqXw/edit
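
To make the idea concrete, here is a rough, illustrative sketch of the kind of intermediate resource the design doc describes: a ConsumerGroup object that owns a set of Consumer objects and schedules them across Dispatcher pods. The field and object names below are placeholders for illustration only, not the final API.

  # Illustrative sketch only: names and fields are placeholders, not the final API.
  apiVersion: internal.kafka.eventing.knative.dev/v1alpha1
  kind: ConsumerGroup
  metadata:
    name: my-trigger-consumers          # hypothetical, one ConsumerGroup per Trigger
    namespace: default
  spec:
    replicas: 3                         # how many consumers to run for this Trigger
    template:                           # template for the Consumer objects to create
      spec:
        topics:
          - knative-broker-default-my-broker   # hypothetical Broker topic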

Persona:
Which persona is this feature for?

  • Event Consumer

Time Estimate (optional):
How many developer-days do you think this may take to resolve?

30

Additional context (optional)

/priority important-soon
/area broker
/area channel
/area source
/kind feature-request

@knative-prow-robot added the priority/important-soon and area/broker labels Nov 24, 2021
@knative-prow-robot
Contributor

@pierDipi: The label(s) area/source cannot be applied, because the repository doesn't have them.


@pierDipi
Member Author

/triage accepted

@devguyio
Contributor

devguyio commented Dec 9, 2021

I know I'm coming late to the party, so please excuse me. I am worried about the consequences of choosing the "ConsumerGroup" name.

  1. It has a direct connotation with Kafka's consumer groups and their semantics.
  2. If some Kafka operator (say Strimzi) decides to create a ConsumerGroup CRD for whatever reason, it will collide with ours and will be confusing for users.

I'm thinking of a more generic term, maybe "DispatcherPartition" and "DispatcherPartitionGroup" or something along those lines.

@aavarghese
Contributor

aavarghese commented Mar 18, 2022

New error preventing ConfigMaps from being created, which leaves dispatcher pod replicas stuck in "ContainerCreating". RESOLVED IN: #2024

To reproduce: on OpenShift, install a new KafkaSource in the default namespace/project.

{"level":"info","ts":"2022-03-18T18:47:09.385Z","logger":"kafka-broker-controller.event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:\"Consumer\", Namespace:\"default\", Name:\"56ac1b53-cd59-456a-a19c-9666da173e63-bcc2r\", UID:\"e55fae12-73dc-451e-b76f-66c583d23c2e\", APIVersion:\"internal.kafka.eventing.knative.dev/v1alpha1\", ResourceVersion:\"24002544\", FieldPath:\"\"}): type: 'Warning' reason: 'InternalError' failed to bind resource to pod: failed to get or create data plane ConfigMap knative-eventing/kafka-source-dispatcher-2: configmaps \"kafka-source-dispatcher-2\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>","knative.dev/pod":"kafka-controller-bfcdc9d66-mfkrh"}

The controller's ClusterRole is missing some additional permissions... cc: @pierDipi
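
For context, this error comes from Kubernetes' OwnerReferencesPermissionEnforcement admission plugin (typically enabled on OpenShift): to create an object whose ownerReference sets blockOwnerDeletion and points at a Pod, the requesting service account needs update permission on the pods/finalizers subresource of the owner. A minimal sketch of the kind of ClusterRole rule that addresses it (the role name here is hypothetical, and the exact resource depends on what the controller actually sets owner references to):

  # Sketch only: role name is hypothetical; exact resources depend on the owner kind.
  apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRole
  metadata:
    name: kafka-controller-pod-finalizers
  rules:
    - apiGroups: [""]
      resources: ["pods/finalizers"]
      verbs: ["update"]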

@pierDipi
Member Author

@aavarghese do you have this commit? cc53b41

@aavarghese
Contributor

I do have that commit. Still seeing the error.

@aavarghese
Contributor

For tracking here: knative/eventing#6285

@aslom
Member

aslom commented Apr 7, 2022

I am seeing occasional restarts of the dispatcher pod - I do not have an easy way to reproduce it (it happens after minutes of large amounts of events being sent through multiple Kafka sources). What I see for the dispatcher pod is exitCode: 143 (that is, 128 + SIGTERM, so the container was terminated externally), which by itself is not very useful

 containerStatuses:
  - containerID: cri-o://bde2f2466a34fad9b08c60a2811d0ce766517d31ac2ced8c524908b8cb92a118
    image: docker.io/aslom/knative-kafka-broker-dispatcher:e61457c06d40786eebe0cc95301d08b7d0a7dcc4738064c597de3ab9d20560f2
    imageID: docker.io/aslom/knative-kafka-broker-dispatcher@sha256:f001609cdf15be945c5107bf2075c7e111890a3aa48385aaf5eb22c6613374d6
    lastState:
      terminated:
        containerID: cri-o://f3bb256d7b3f5ff006e2f4e724db30c0cd9834b7dca3f963a786fd8545899a9f
        exitCode: 143
        finishedAt: "2022-04-07T20:49:07Z"
        message: |
          read-0","level":"INFO","level_value":20000,"egress.uid":"6633ea57-10f1-41e6-b8fd-df541f110b8c","resource.uid":"6633ea57-10f1-41e6-b8fd-df541f110b8c"}
          {"@timestamp":"2022-04-07T20:49:06.841Z","@version":"1","message":"Removed egress egress.uid=3b5ed361-1498-4f42-97a3-174289b7dd79 resource.uid=3b5ed361-1498-4f42-97a3-174289b7dd79","logger_name":"dev.knative.eventing.kafka.broker.dispatcher.main.ConsumerDeployerVerticle","thread_name":"vert.x-eventloop-thread-0","level":"INFO","level_value":20000,"egress.uid":"3b5ed361-1498-4f42-97a3-174289b7dd79","resource.uid":"3b5ed361-1498-4f42-97a3-174289b7dd79"}
          {"@timestamp":"2022-04-07T20:49:06.853Z","@version":"1","message":"Removed egress egress.uid=4899712e-1bfe-45d3-908c-3fda45260220 resource.uid=4899712e-1bfe-45d3-908c-3fda45260220","logger_name":"dev.knative.eventing.kafka.broker.dispatcher.main.ConsumerDeployerVerticle","thread_name":"vert.x-eventloop-thread-0","level":"INFO","level_value":20000,"egress.uid":"4899712e-1bfe-45d3-908c-3fda45260220","resource.uid":"4899712e-1bfe-45d3-908c-3fda45260220"}
          {"@timestamp":"2022-04-07T20:49:06.867Z","@version":"1","message":"Removed egress egress.uid=99472900-f4b6-4c5d-8a7a-27246751f8d2 resource.uid=99472900-f4b6-4c5d-8a7a-27246751f8d2","logger_name":"dev.knative.eventing.kafka.broker.dispatcher.main.ConsumerDeployerVerticle","thread_name":"vert.x-eventloop-thread-0","level":"INFO","level_value":20000,"egress.uid":"99472900-f4b6-4c5d-8a7a-27246751f8d2","resource.uid":"99472900-f4b6-4c5d-8a7a-27246751f8d2"}
          {"@timestamp":"2022-04-07T20:49:06.868Z","@version":"1","message":"Freed 2 thread-local buffer(s) from thread: vert.x-eventloop-thread-0","logger_name":"io.netty.buffer.PoolThreadCache","thread_name":"vert.x-eventloop-thread-0","level":"DEBUG","level_value":10000}
          {"@timestamp":"2022-04-07T20:49:06.868Z","@version":"1","message":"Freed 2 thread-local buffer(s) from thread: vert.x-eventloop-thread-0","logger_name":"io.netty.buffer.PoolThreadCache","thread_name":"vert.x-eventloop-thread-0","level":"DEBUG","level_value":10000}
        reason: Error
        startedAt: "2022-04-07T20:42:32Z"
    name: kafka-source-dispatcher
    ready: true
    restartCount: 1
    started: true

@pierDipi
Member Author

#2068 might be related to the restart issue.

@nikt-lsq

When this feature becomes available, will it cover the kafka-broker-dispatcher as well as the kafka-source-dispatcher? In the comments above I only see references to the kafka-source-dispatcher.

@aavarghese
Contributor

@nikt-lsq (Auto)scaling for the KafkaBroker kafka-broker-dispatcher is currently in the works - there are open PRs for it and it is getting close to full support. KafkaSource scaling and scheduling is complete and has been available since release 1.6.0.
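
For the KafkaSource side mentioned above, the consumer count is set directly on the KafkaSource spec. A minimal sketch, assuming the consumers field that shipped with this work (the source, server, topic and sink names below are placeholders):

  apiVersion: sources.knative.dev/v1beta1
  kind: KafkaSource
  metadata:
    name: my-kafka-source               # placeholder name
  spec:
    consumers: 4                        # number of consumers to run for this source
    bootstrapServers:
      - my-cluster-kafka-bootstrap.kafka:9092   # placeholder bootstrap server
    topics:
      - my-topic                        # placeholder topic
    sink:
      ref:
        apiVersion: v1
        kind: Service
        name: event-display             # placeholder sink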
