Upgrading from 1.13 to 1.14 leaves triggers in broken state #4091

Closed
mdwhitley opened this issue Sep 3, 2024 · 51 comments
Labels: kind/bug

Comments
@mdwhitley

Describe the bug
Upgrading from knative=1.13.8 and kafka-extensions=1.13.11 to the 1.14 releases.

The knative-eventing resources update and restart as expected.
The eventing-kafka extension resource upgrade fails:

  1. The kafka-broker-dispatcher statefulset starts at 3 replicas and eventually scales up to 15 replicas in our dev environment. Only 3/15 replicas end up with data populated in the corresponding kafka-broker-dispatcher-X ConfigMap. The pods repeatedly cycle through ContainerCreating while waiting for the ConfigMap to be created, and then throw an error when they finally start up:
{"@timestamp":"2024-08-30T21:12:19.847Z","@version":"1","message":"failed to parse from JSON","logger_name":"dev.knative.eventing.kafka.broker.core.eventbus.ContractPublisher","thread_name":"contract-file-watcher","level":"ERROR","level_value":40000,"stack_trace":"com.google.protobuf.InvalidProtocolBufferException: Expect message object but got: null\n\tat com.google.protobuf.util.JsonFormat$ParserImpl.mergeMessage(JsonFormat.java:1481)\n\tat com.google.protobuf.util.JsonFormat$ParserImpl.merge(JsonFormat.java:1458)\n\tat com.google.protobuf.util.JsonFormat$ParserImpl.merge(JsonFormat.java:1322)\n\tat com.google.protobuf.util.JsonFormat$Parser.merge(JsonFormat.java:486)\n\tat dev.knative.eventing.kafka.broker.core.eventbus.ContractPublisher.parseFromJson(ContractPublisher.java:98)\n\tat dev.knative.eventing.kafka.broker.core.eventbus.ContractPublisher.updateContract(ContractPublisher.java:74)\n\tat dev.knative.eventing.kafka.broker.dispatcher.main.Main.lambda$start$0(Main.java:137)\n\tat dev.knative.eventing.kafka.broker.core.file.FileWatcher.run(FileWatcher.java:124)\n\tat java.base/java.lang.Thread.run(Thread.java:1595)\n"}
  2. The kafka-controller-post-install job completes and outputs no logs (both symptoms can be checked with the commands sketched after this list).
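For reference, both symptoms can be checked with plain kubectl; the commands below are illustrative (the resource names are the ones from our cluster):

# List the per-pod contract ConfigMaps the dispatcher pods mount; in our case only 3 of the 15 had data populated.
kubectl get configmaps -n knative-eventing | grep kafka-broker-dispatcher
# Show the contract data for one dispatcher replica (empty/missing data matches the "Expect message object but got: null" error above).
kubectl get configmap kafka-broker-dispatcher-0 -n knative-eventing -o jsonpath='{.data}'
# Check the post-install job output (empty in our case).
kubectl logs -n knative-eventing job/kafka-controller-post-install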

If (2) is replaced with a manual step to clean up the old dispatcher deployment, all event traffic ceases and the triggers go into a failed state:

dev020-ke.a.a.v1.training-complete               knative-wa-a-broker                     107d   False   Schedule
dev020-ke.a.a.v1.training-failed                 knative-wa-a-broker                     107d   False   Schedule
dev020-ke.a.a.v2.training-start                  knative-wa-a-broker                     462d   False   Schedule
dev020-ke.a.c-c.v1.training-complete          knative-wa-c-broker                           386d   False   Schedule
dev020-ke.a.c-c.v1.training-failed            knative-wa-c-broker                           386d   False   PropagateSubscriberURI

Recreating any trigger does not help. Existing broker resources continue to report ready, but if any are recreated they also go into a failed state. A describe on any trigger shows a failure referencing the old, no-longer-existing dispatcher deployment:

  - lastTransitionTime: "2024-08-30T20:23:29Z"
    message: 'failed to schedule consumers: failed to get data plane volume kafka-resources
      in pod knative-eventing/kafka-broker-dispatcher-6b6855d6c5-7nhtw'
    reason: Schedule
    status: "False"
    type: Ready

Expected behavior
The upgrade to 1.14 completes without leaving triggers in a broken state: the old dispatcher deployment is cleaned up by the post-install job and event delivery continues uninterrupted.

To Reproduce
Upgrade an existing 1.13 install (knative=1.13.8, kafka-extensions=1.13.11) to 1.14 by applying the 1.14 eventing and eventing-kafka manifests followed by their post-install jobs.

Knative release version
knative=1.13.8 and kafka-extensions=1.13.11


mdwhitley added the kind/bug label on Sep 3, 2024
@pierDipi
Member

pierDipi commented Sep 10, 2024

Hi @mdwhitley, thanks for reporting. Can you describe what is present in your knative-eventing namespace (deployments, pods, statefulsets)?

How many Kafka brokers and triggers do you have in your cluster?

@mdwhitley
Author

mdwhitley commented Sep 10, 2024

Our 1.13 environment's knative-eventing namespace:

pods:

NAME                                          READY   STATUS    RESTARTS   AGE
eventing-controller-785f6cfbcf-9jzb5          1/1     Running   0          9d
eventing-controller-785f6cfbcf-nnn8j          1/1     Running   0          10d
eventing-controller-785f6cfbcf-nxb4c          1/1     Running   0          19h
eventing-webhook-7bc98f9577-6v6z2             1/1     Running   0          5d22h
eventing-webhook-7bc98f9577-w9xkp             1/1     Running   0          9d
kafka-broker-dispatcher-6b6855d6c5-ccxdx      1/1     Running   0          10d
kafka-broker-dispatcher-6b6855d6c5-lxhhd      1/1     Running   0          3d14h
kafka-broker-receiver-5bccbdbdcf-2zggd        1/1     Running   0          10d
kafka-broker-receiver-5bccbdbdcf-r8g4c        1/1     Running   0          9d
kafka-controller-5674695d4-bw9b8              1/1     Running   0          10d
kafka-controller-5674695d4-jrgkr              1/1     Running   0          5d8h
kafka-controller-5674695d4-ww2zx              1/1     Running   0          10d
kafka-webhook-eventing-7544bd5884-8gncc       1/1     Running   0          10d
kafka-webhook-eventing-7544bd5884-hg4ll       1/1     Running   0          10h
kafka-webhook-eventing-7544bd5884-t92st       1/1     Running   0          10d
ke-sample-api-service-9d5f7c7d9-drpnd         1/1     Running   0          13d
ke-sample-training-service-5bcc4bc485-9wgrx   1/1     Running   0          13d
knative-sandbox-dlq-6fb6bc87f8-qnl2k          1/1     Running   0          13d

deployments:

NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
event-display-mock-clu-training   0/0     0            0           601d
eventing-controller               3/3     3            3           653d
eventing-webhook                  2/2     2            2           653d
kafka-broker-dispatcher           2/2     2            2           10d
kafka-broker-receiver             2/2     2            2           653d
kafka-controller                  3/3     3            3           653d
kafka-webhook-eventing            3/3     3            3           653d
ke-sample-api-service             1/1     1            1           652d
ke-sample-training-service        1/1     1            1           652d
knative-sandbox-dlq               1/1     1            1           652d
pingsource-mt-adapter             0/0     0            0           586d

No statefulsets

3 brokers, 266 triggers (prod will have 2-3x as many triggers in large clusters).

Post-install jobs for both knative and the kafka extensions complete, except that eventing-kafka-broker/control-plane/cmd/post-install, as described above, runs and completes successfully but with no logging and no changes to the cluster.

@pierDipi
Member

pierDipi commented Sep 10, 2024

I meant during/after the upgrade process. How are you upgrading the system?

With something like this?

kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.14.10/eventing-kafka-controller.yaml 
kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.14.10/eventing-kafka-broker.yaml 
kubectl apply -f https://github.com/knative-extensions/eventing-kafka-broker/releases/download/knative-v1.14.10/eventing-kafka-post-install.yaml

@mdwhitley
Author

mdwhitley commented Sep 10, 2024

Environment after 1.14 upgrade. This runs all knative templates+post-install followed by eventing-kafka templates+post-install.

pods:

NAME                                           READY   STATUS      RESTARTS   AGE
eventing-controller-575bc94c49-75rx5           1/1     Running     0          3m34s
eventing-controller-575bc94c49-c7x5q           1/1     Running     0          4m36s
eventing-controller-575bc94c49-ckdlx           1/1     Running     0          4m5s
eventing-webhook-65b4948885-4xrgf              1/1     Running     0          4m29s
eventing-webhook-65b4948885-6wx8q              1/1     Running     0          4m36s
kafka-broker-dispatcher-0                      1/1     Running     0          2m47s
kafka-broker-dispatcher-1                      1/1     Running     0          2m47s
kafka-broker-dispatcher-6b6855d6c5-ccxdx       1/1     Running     0          10d
kafka-broker-dispatcher-6b6855d6c5-lxhhd       1/1     Running     0          3d14h
kafka-broker-receiver-54d9dcc8cd-9qv78         1/1     Running     0          2m13s
kafka-broker-receiver-54d9dcc8cd-m9bdl         1/1     Running     0          2m47s
kafka-controller-6f889fcc6b-crjrh              1/1     Running     0          2m38s
kafka-controller-6f889fcc6b-lf55k              1/1     Running     0          2m43s
kafka-controller-6f889fcc6b-r62wv              1/1     Running     0          2m48s
kafka-controller-post-install-fdxkc            0/1     Completed   0          2m47s
kafka-webhook-eventing-8686d549db-78m6x        1/1     Running     0          2m36s
kafka-webhook-eventing-8686d549db-9ztlk        1/1     Running     0          2m41s
kafka-webhook-eventing-8686d549db-zhn6x        1/1     Running     0          2m47s
ke-sample-api-service-9d5f7c7d9-drpnd          1/1     Running     0          13d
ke-sample-training-service-5bcc4bc485-9wgrx    1/1     Running     0          13d
knative-kafka-storage-version-migrator-62cxr   0/1     Completed   0          2m47s
knative-sandbox-dlq-6fb6bc87f8-qnl2k           1/1     Running     0          13d
storage-version-migration-eventing-9t9pv       0/1     Completed   0          4m34s

Note the control-plane post-install job completed successfully before the 2 new dispatcher pods became available.

Deployments:

NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
event-display-mock-clu-training   0/0     0            0           601d
eventing-controller               3/3     3            3           653d
eventing-webhook                  2/2     2            2           653d
kafka-broker-dispatcher           2/2     2            2           10d
kafka-broker-receiver             2/2     2            2           653d
kafka-controller                  3/3     3            3           653d
kafka-webhook-eventing            3/3     3            3           653d
ke-sample-api-service             1/1     1            1           652d
ke-sample-training-service        1/1     1            1           652d
knative-sandbox-dlq               1/1     1            1           652d
pingsource-mt-adapter             0/0     0            0           586d

statefulsets:

NAME                      READY   AGE
kafka-broker-dispatcher   2/2     4m30s

After completion, all 266 triggers have gone unavailable and the eventing cluster is down.

@pierDipi
Member

pierDipi commented Sep 10, 2024

The post-install job is supposed to delete the old dispatcher deployment; a manual deletion should help:

kubectl delete deployments -n knative-eventing kafka-broker-dispatcher

@mdwhitley
Author

mdwhitley commented Sep 10, 2024

I've been manually deleting the deployments in previous attempts as well.

After deleting deployment:

NAME                                           READY   STATUS      RESTARTS   AGE
eventing-controller-76f978475-ctb67            1/1     Running     0          100s
eventing-controller-76f978475-l8p98            1/1     Running     0          39s
eventing-controller-76f978475-wlpcv            1/1     Running     0          70s
eventing-webhook-79f898797-ltzwh               1/1     Running     0          100s
eventing-webhook-79f898797-rtw4r               1/1     Running     0          90s
kafka-broker-dispatcher-0                      1/1     Running     0          5m54s
kafka-broker-dispatcher-1                      1/1     Running     0          5m54s
kafka-broker-receiver-5c6f978479-brnmj         1/1     Running     0          100s
kafka-broker-receiver-5c6f978479-kk7gj         1/1     Running     0          60s
kafka-controller-77b6997bc8-29wnn              1/1     Running     0          88s
kafka-controller-77b6997bc8-m26tt              1/1     Running     0          91s
kafka-controller-77b6997bc8-pgwxx              1/1     Running     0          100s
kafka-controller-post-install-6dtnw            0/1     Completed   0          5m53s
kafka-webhook-eventing-5f55558b96-jlllw        1/1     Running     0          87s
kafka-webhook-eventing-5f55558b96-rfhwg        1/1     Running     0          91s
kafka-webhook-eventing-5f55558b96-rmtpf        1/1     Running     0          100s
ke-sample-api-service-ffd8dc566-7lgxh          1/1     Running     0          100s
ke-sample-training-service-67fc74d7bd-r5pxk    1/1     Running     0          100s
knative-kafka-storage-version-migrator-9d6mn   0/1     Completed   0          5m53s
knative-sandbox-dlq-cc6fd94c5-7sfg7            1/1     Running     0          99s
storage-version-migration-eventing-nj2tn       0/1     Completed   0          7m34s

The eventing cluster is still completely down. This is the 2nd attempt today, and now 263/266 triggers have failed with:

  - lastTransitionTime: "2024-09-10T16:33:09Z"
    message: "failed to initialize consumer group offset: cannot obtain Kafka cluster
      admin, kafka: client has run out of available brokers to talk to: 3 errors occurred:\n\t*
      EOF\n\t* EOF\n\t* EOF\n"
    reason: InitializeOffset
    status: "False"
    type: DependencyReady

I also observe on the Kafka side that a large chunk of consumers (presumably Knative's) have gone inactive.

A couple of triggers are erroring with:

  - lastTransitionTime: "2024-09-10T16:40:06Z"
    message: 'failed to schedule consumers: insufficient running pods replicas for
      StatefulSet knative-eventing/kafka-broker-dispatcher to schedule resource replicas
      (left: 1): retry requeue after: 5s'
    reason: Schedule
    status: "False"
    type: DependencyReady

@pierDipi
Member

pierDipi commented Sep 10, 2024

\n\t* EOF\n\t* EOF\n\t* EOF\n

That is an error we get from the Kafka cluster.


    message: 'failed to schedule consumers: insufficient running pods replicas for
      StatefulSet knative-eventing/kafka-broker-dispatcher to schedule resource replicas
      (left: 1): retry requeue after: 5s'

The autoscaler should scale up the statefulset replicas. Are you setting a specific replica count for the kafka-broker-dispatcher statefulset?

@mdwhitley
Author

While pulling additional logs there was a burst of activity scaling up the statefulset. I had observed this happen previously as well:

NAME                                          READY   STATUS    RESTARTS   AGE
eventing-controller-76f978475-ctb67           1/1     Running   0          10m
eventing-controller-76f978475-l8p98           1/1     Running   0          9m39s
eventing-controller-76f978475-wlpcv           1/1     Running   0          10m
eventing-webhook-79f898797-ltzwh              1/1     Running   0          10m
eventing-webhook-79f898797-rtw4r              1/1     Running   0          10m
kafka-broker-dispatcher-0                     1/1     Running   0          14m
kafka-broker-dispatcher-1                     1/1     Running   0          14m
kafka-broker-dispatcher-10                    1/1     Running   0          2m8s
kafka-broker-dispatcher-11                    1/1     Running   0          2m8s
kafka-broker-dispatcher-12                    1/1     Running   0          2m8s
kafka-broker-dispatcher-13                    1/1     Running   0          2m8s
kafka-broker-dispatcher-14                    1/1     Running   0          2m8s
kafka-broker-dispatcher-2                     1/1     Running   0          2m8s
kafka-broker-dispatcher-3                     1/1     Running   0          2m8s
kafka-broker-dispatcher-4                     1/1     Running   0          2m8s
kafka-broker-dispatcher-5                     1/1     Running   0          2m8s
kafka-broker-dispatcher-6                     1/1     Running   0          2m8s
kafka-broker-dispatcher-7                     1/1     Running   0          2m8s
kafka-broker-dispatcher-8                     1/1     Running   0          2m8s
kafka-broker-dispatcher-9                     1/1     Running   0          2m8s
kafka-broker-receiver-5c6f978479-brnmj        1/1     Running   0          10m
kafka-broker-receiver-5c6f978479-kk7gj        1/1     Running   0          10m
kafka-controller-77b6997bc8-29wnn             1/1     Running   0          10m
kafka-controller-77b6997bc8-m26tt             1/1     Running   0          10m
kafka-controller-77b6997bc8-pgwxx             1/1     Running   0          10m
kafka-webhook-eventing-5f55558b96-jlllw       1/1     Running   0          10m
kafka-webhook-eventing-5f55558b96-rfhwg       1/1     Running   0          10m
kafka-webhook-eventing-5f55558b96-rmtpf       1/1     Running   0          10m
ke-sample-api-service-ffd8dc566-7lgxh         1/1     Running   0          10m
ke-sample-training-service-67fc74d7bd-r5pxk   1/1     Running   0          10m
knative-sandbox-dlq-cc6fd94c5-7sfg7           1/1     Running   0          10m

The primary difference is that now I see our triggers becoming ready again. The new dispatcher pods actually came up completely this time, versus the error reported in my original issue above.

This is the result of:

  1. Deploy 1.14 -> failed eventing cluster
  2. Redeploy 1.13, delete 1.14 statefulset -> back working
  3. Deploy 1.14 -> failed eventing cluster
  4. Manually delete dispatcher deployment -> still not working
  5. After ~10 minutes, additional dispatcher pods were created and everything came back online.

That leaves two questions: why does the post-install job that should clean up the old dispatcher deployment silently fail instead of waiting until the statefulset is completely available, and why, after 1.14 is installed and the old dispatcher deployment is removed, is there a period of time where our eventing cluster is down?

@mdwhitley
Author

Are you setting a specific replica count for the kafka-broker-dispatcher statefulset?

Our default defines 2 replicas for the statefulset, but it appears to auto-scale itself up to 15. When this happened earlier it blew up our k8s cluster quota, so we had to increase resources. I am interested in why it decided to scale up, because a 7x increase in dispatcher replicas is steep and we likely need a way to control that for production.

@pierDipi
Member

pierDipi commented Sep 11, 2024

The number of replicas depends on the configured POD_CAPACITY; we default it to 20 but it can be increased:

# The number of virtual replicas each adapter pod can handle.
- name: POD_CAPACITY
value: '20'

Each trigger represents a single "virtual replica" so you get <num triggers> / <POD_CAPACITY> replicas
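As a rough back-of-the-envelope check against the numbers in this thread (an approximation, since triggers are not the only source of virtual replicas):

# expected dispatcher pods ≈ ceil(virtual replicas / POD_CAPACITY)
triggers=266; pod_capacity=20
echo $(( (triggers + pod_capacity - 1) / pod_capacity ))   # -> 14, in line with the ~15 replicas observed above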

@mdwhitley
Author

mdwhitley commented Sep 11, 2024

Overnight our 1.14 dev cluster went back to being unavailable. All triggers flipped back to not ready with:

  - lastTransitionTime: "2024-09-10T22:14:41Z"
    message: "failed to initialize consumer group offset: cannot obtain Kafka cluster
      admin, kafka: client has run out of available brokers to talk to: 3 errors occurred:\n\t*
      EOF\n\t* EOF\n\t* EOF\n"
    reason: InitializeOffset
    status: "False"
    type: DependencyReady

Nothing has changed with our kafka cluster and we have several hundred other kafka consumers running off this same cluster without issue. Is there anything I can look into further to see any error logs from a running pod?

@pierDipi
Member

Can you share / look at the kafka-controller-* pod logs?

@mdwhitley
Author

Controller logs are filled with:

{"level":"error","ts":"2024-09-11T13:07:05.344Z","logger":"kafka-broker-controller","caller":"controller/controller.go:566","msg":"Reconcile error","commit":"ff0b991-dirty","knative.dev/pod":"kafka-controller-77b6997bc8-2f6cd","knative.dev/controller":"knative.dev.eventing-kafka-broker.control-plane.pkg.reconciler.consumergroup.Reconciler","knative.dev/kind":"internal.kafka.eventing.knative.dev.ConsumerGroup","knative.dev/traceid":"ecfa7a7e-3e3a-493c-a596-f4e693c1283a","knative.dev/key":"conversation/knative-trigger-conversation-dev007-ke.assistant.clu-controller.v1.training-failed","duration":0.842944322,"error":"failed to initialize consumer group offset: cannot obtain Kafka cluster admin, kafka: client has run out of available brokers to talk to: 3 errors occurred:\n\t* EOF\n\t* EOF\n\t* EOF\n","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20240416145024-0f34a8815650/controller/controller.go:566\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20240416145024-0f34a8815650/controller/controller.go:543\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20240416145024-0f34a8815650/controller/controller.go:491"}
{"level":"info","ts":"2024-09-11T13:07:05.345Z","logger":"kafka-broker-controller.event-broadcaster","caller":"record/event.go:364","msg":"Event(v1.ObjectReference{Kind:\"ConsumerGroup\", Namespace:\"conversation\", Name:\"knative-trigger-conversation-dev007-ke.assistant.clu-controller.v1.training-failed\", UID:\"f8acb1bb-8025-433c-9002-4d0c7c4f0443\", APIVersion:\"internal.kafka.eventing.knative.dev/v1alpha1\", ResourceVersion:\"5909367411\", FieldPath:\"\"}): type: 'Warning' reason: 'InternalError' failed to initialize consumer group offset: cannot obtain Kafka cluster admin, kafka: client has run out of available brokers to talk to: 3 errors occurred:\n\t* EOF\n\t* EOF\n\t* EOF\n","commit":"ff0b991-dirty","knative.dev/pod":"kafka-controller-77b6997bc8-2f6cd"}
{"level":"error","ts":"2024-09-11T13:07:05.411Z","logger":"kafka-broker-controller","caller":"consumergroup/reconciler.go:303","msg":"Returned an error","commit":"ff0b991-dirty","knative.dev/pod":"kafka-controller-77b6997bc8-2f6cd","knative.dev/controller":"knative.dev.eventing-kafka-broker.control-plane.pkg.reconciler.consumergroup.Reconciler","knative.dev/kind":"internal.kafka.eventing.knative.dev.ConsumerGroup","knative.dev/traceid":"dbda47c0-28db-4c19-bdbf-ce9b7472ace6","knative.dev/key":"conversation/knative-trigger-conversation-dev016-ke.assistant.clu-controller.v1.training-start","targetMethod":"ReconcileKind","error":"failed to initialize consumer group offset: cannot obtain Kafka cluster admin, kafka: client has run out of available brokers to talk to: 3 errors occurred:\n\t* EOF\n\t* EOF\n\t* EOF\n","stacktrace":"knative.dev/eventing-kafka-broker/control-plane/pkg/client/internals/kafka/injection/reconciler/eventing/v1alpha1/consumergroup.(*reconcilerImpl).Reconcile\n\tknative.dev/eventing-kafka-broker/control-plane/pkg/client/internals/kafka/injection/reconciler/eventing/v1alpha1/consumergroup/reconciler.go:303\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20240416145024-0f34a8815650/controller/controller.go:542\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20240416145024-0f34a8815650/controller/controller.go:491"}
{"level":"error","ts":"2024-09-11T13:07:05.411Z","logger":"kafka-broker-controller","caller":"controller/controller.go:566","msg":"Reconcile error","commit":"ff0b991-dirty","knative.dev/pod":"kafka-controller-77b6997bc8-2f6cd","knative.dev/controller":"knative.dev.eventing-kafka-broker.control-plane.pkg.reconciler.consumergroup.Reconciler","knative.dev/kind":"internal.kafka.eventing.knative.dev.ConsumerGroup","knative.dev/traceid":"dbda47c0-28db-4c19-bdbf-ce9b7472ace6","knative.dev/key":"conversation/knative-trigger-conversation-dev016-ke.assistant.clu-controller.v1.training-start","duration":0.912392363,"error":"failed to initialize consumer group offset: cannot obtain Kafka cluster admin, kafka: client has run out of available brokers to talk to: 3 errors occurred:\n\t* EOF\n\t* EOF\n\t* EOF\n","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20240416145024-0f34a8815650/controller/controller.go:566\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20240416145024-0f34a8815650/controller/controller.go:543\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20240416145024-0f34a8815650/controller/controller.go:491"}
{"level":"info","ts":"2024-09-11T13:07:05.412Z","logger":"kafka-broker-controller.event-broadcaster","caller":"record/event.go:364","msg":"Event(v1.ObjectReference{Kind:\"ConsumerGroup\", Namespace:\"conversation\", Name:\"knative-trigger-conversation-dev016-ke.assistant.clu-controller.v1.training-start\", UID:\"c8ca0592-a9d8-4cbd-9494-fd2f79bc5632\", APIVersion:\"internal.kafka.eventing.knative.dev/v1alpha1\", ResourceVersion:\"5909367245\", FieldPath:\"\"}): type: 'Warning' reason: 'InternalError' failed to initialize consumer group offset: cannot obtain Kafka cluster admin, kafka: client has run out of available brokers to talk to: 3 errors occurred:\n\t* EOF\n\t* EOF\n\t* EOF\n","commit":"ff0b991-dirty","knative.dev/pod":"kafka-controller-77b6997bc8-2f6cd"}

I've tried setting the kafka-controller POD_CAPACITY back to 2 (the replica count we ran with in 1.13) and now:

$ k get statefulsets -n knative-eventing
NAME                      READY    AGE
kafka-broker-dispatcher   15/141   20h

I am not sure what is trying to scale dispatchers, but it is out of control.

@pierDipi
Member

pierDipi commented Sep 11, 2024

You need to increase POD_CAPACITY if you want fewer dispatcher replicas; here is the document describing how that works: https://docs.google.com/document/d/1UktwiDyqq07MtA7pUlahEpux5CCdAsyI6k3nkQeeqXw/edit

How many replicas you get for a given number of triggers and pod capacity is also described here: #4091 (comment)

@mdwhitley
Author

Yep, that did it. I increased POD_CAPACITY to 40. I had to manually scale down the statefulset to be able to edit the resources, and for now it looks like the system is sticking with 15 replicas.
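For reference, a manual scale-down is just the standard kubectl command (the replica count here is only an example; the autoscaler may scale it back up based on the formula above):

kubectl scale statefulset/kafka-broker-dispatcher -n knative-eventing --replicas=2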

@pierDipi
Member

do you still see the issue?

@mdwhitley
Author

No, the cluster is not stable. Over the course of the past hour we've had a handful of triggers become unready with BindInProgress; currently 5/266 triggers are unready.

  - lastTransitionTime: "2024-09-11T14:22:07Z"
    reason: BindInProgress
    status: "False"
    type: DependencyReady

@mdwhitley
Author

mdwhitley commented Sep 11, 2024

Another 10 minutes later and ~18 additional triggers have also become unready with InitializeOffset and the same EOF errors as before.

View from the Kafka side:
[screenshot]
~9:30am is when POD_CAPACITY was set to 40. ~11:30am is when I downgraded back to 1.13.

@mdwhitley
Author

I spent some more time trying to track down those EOF errors and have discovered they have been happening since 8/31 in our cluster. 8/31 is when I attempted our most recent 1.14 upgrade, which is what prompted opening this issue. Since that time, even after 1.13 downgrades, the kafka-controller pods appear to keep throwing these errors. On 1.13 during this time all triggers register as available and all events I have tested continue to flow.

I've engaged our Kafka service provider and their initial analysis of our cluster shows no issues. At this point the issue is directly linked to Knative and began as part of the 1.14 upgrade attempt, so it is either a Knative 1.14 bug, an error in our configuration of this upgrade, or a combination of both. I've handled most of our upgrades since 1.8.x, so this is very much an outlier in that we can't get anything to work or stay stable for more than a few hours.

@pierDipi
Member

pierDipi commented Sep 12, 2024

Does it ever recover from the EOF errors? How long does it take to recover?

@mdwhitley
Author

As far as I can tell, it does not automatically recover. I believe what I saw yesterday morning, with things initially working after deploying only to slowly have trigger readiness fail, is what happened over an extended period of time the previous night. Eventually no triggers are available and there are no events being received/sent by brokers.

I am in the process of setting up a new separate dev cluster to install 1.13 on and attempt the 1.14 upgrade there for a baseline comparison.

@mdwhitley
Author

mdwhitley commented Sep 12, 2024

I have reproduced the EOF error on a new k8s cluster. I initially installed our prod 1.13 configuration, with our tests passing, and then attempted the first 1.14 upgrade on that cluster. It immediately began failing with the same errors:

$ kgp
NAME                                          READY   STATUS    RESTARTS   AGE
eventing-controller-6cf6b45d64-59tbj          1/1     Running   0          9m29s
eventing-controller-6cf6b45d64-bqf6t          1/1     Running   0          8m58s
eventing-controller-6cf6b45d64-ngcng          1/1     Running   0          8m28s
eventing-webhook-cb66dcdd6-6f4v8              1/1     Running   0          9m24s
eventing-webhook-cb66dcdd6-ts2lv              1/1     Running   0          9m29s
kafka-broker-dispatcher-0                     1/1     Running   0          17m
kafka-broker-dispatcher-1                     1/1     Running   0          17m
kafka-broker-receiver-d95586c4d-gtnd9         1/1     Running   0          9m13s
kafka-broker-receiver-d95586c4d-vzpkr         1/1     Running   0          9m29s
kafka-controller-677b4fb4f4-68xzs             1/1     Running   0          9m29s
kafka-controller-677b4fb4f4-6qn2w             1/1     Running   0          9m24s
kafka-controller-677b4fb4f4-s9z4r             1/1     Running   0          9m19s
kafka-webhook-eventing-79b748cf55-cwmg5       1/1     Running   0          9m22s
kafka-webhook-eventing-79b748cf55-nmvs6       1/1     Running   0          9m29s
kafka-webhook-eventing-79b748cf55-v7grr       1/1     Running   0          9m18s
ke-sample-api-service-76f8c7d46f-7mp75        1/1     Running   0          9m28s
ke-sample-training-service-56f5c67c98-vbxhc   1/1     Running   0          9m28s
knative-sandbox-dlq-5dbbcb8595-ftsk7          1/1     Running   0          9m28s

The post-install job that is supposed to clean up old dispatcher deployments still no-ops without doing anything. I am still manually running:

 k delete deployment kafka-broker-dispatcher -n knative-eventing
$ k get brokers
NAME                     URL                                                                                                      AGE   READY   REASON
knative-sandbox-broker   http://kafka-broker-ingress.knative-eventing.svc.cluster.local/knative-eventing/knative-sandbox-broker   87m   True 
$ k get triggers
NAME                                     BROKER                   SUBSCRIBER_URI                                                    AGE   READY   REASON
sample-api-service-training-complete     knative-sandbox-broker                                                                     76m   False   InitializeOffset
sample-api-service-training-failed       knative-sandbox-broker   http://ke-sample-api-service.knative-eventing.svc.cluster.local   76m   False   InitializeOffset
sample-training-service-training-start   knative-sandbox-broker                                                                     76m   False   InitializeOffset
  - lastTransitionTime: "2024-09-12T19:59:44Z"
    message: "failed to initialize consumer group offset: cannot obtain Kafka cluster
      admin, kafka: client has run out of available brokers to talk to: 3 errors occurred:\n\t*
      EOF\n\t* EOF\n\t* EOF\n"
    reason: InitializeOffset
    status: "False"
    type: DependencyReady

I do still observe events being delivered to both the sample-training-service-training-start and sample-api-service-training-complete triggers despite them not being in a Ready state. I can leave it on this cluster overnight to see what happens, but we still need to resolve the latent errors and the issue with trigger readiness.

If it is helpful I can bundle up our entire config (eventing-core + kafka templates) and provide it internally for review.

@mdwhitley
Author

mdwhitley commented Sep 13, 2024

After 12 hours the triggers have not changed and are still marked as not ready due to the InitializeOffset / EOF error. The kafka instance used for this deployment is the same as our primary dev cluster's. During the 12 hours we had around 15-20,000 non-knative events successfully flowing through kafka in addition to ~200 knative events via our main dev 1.13 cluster.

I have also experimented with pointing this isolated cluster at our staging kafka instance. The same error shows up immediately in the kafka-controller and eventing is down. The dispatcher statefulsets, after switching kafka, will not restart (I gave up waiting after 30 minutes) and are just hung in ContainerCreating because of:

MountVolume.SetUp failed for volume "contract-resources" : configmap "kafka-broker-dispatcher-1" not found

Our staging kafka instance has fewer total topics/partitions, but is much larger and drives throughput (around 20,000 non-knative events/minute) that is closer to our production clusters.

@pierDipi
Member

Backport to speed up volume mounting: #4101. I think that will help with the latest issue, MountVolume.SetUp failed for volume "contract-resources" : configmap "kafka-broker-dispatcher-1" not found

@pierDipi
Member

With #4103, once backported to 1.14, we can try to get more debug logs about the Kafka client connections or disable the new client pooling:

ENABLE_SARAMA_LOGGER -> "true"
ENABLE_SARAMA_DEBUG_LOGGER -> "true"
ENABLE_SARAMA_CLIENT_POOL -> "false"
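A sketch of one way to set these on the kafka-controller deployment, once you are on a release that actually reads them (kubectl set env also triggers a rolling restart):

kubectl set env deployment/kafka-controller -n knative-eventing \
  ENABLE_SARAMA_LOGGER=true \
  ENABLE_SARAMA_DEBUG_LOGGER=true \
  ENABLE_SARAMA_CLIENT_POOL=false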

@pierDipi
Member

A new version with #4091 (comment) is out: https://github.com/knative-extensions/eventing-kafka-broker/releases/tag/knative-v1.14.11

@mdwhitley
Author

I tried out 1.14.11 on my test cluster and have promising results. I pulled resource updates into our k8s templates from the release and things immediately began failing with EOF again, but I noticed that the YAML resources didn't have any of the new env vars. I added them into the controller deployment:

            - name: ENABLE_SARAMA_LOGGER
              value: "false"
            - name: ENABLE_SARAMA_DEBUG_LOGGER
              value: "false"
            - name: ENABLE_SARAMA_CLIENT_POOL
              value: "true"

which has resulted in a stable cluster so far. Our test broker/trigger/services all came up and eventing is flowing with none of the previous EOF errors. I do see a few of:

E0918 15:56:38.288958       1 leaderelection.go:308] Failed to release lock: Put "https://172.21.0.1:443/apis/coordination.k8s.io/v1/namespaces/knative-eventing/leases/kafka-broker-controller.knative.dev.eventing-kafka-broker.control-plane.pkg.reconciler.broker.reconciler.00-of-01": unexpected EOF

but thus far the triggers and eventing have remained available. I will continue to let this sit for a day in this test environment before I wipe it out and start back with our base 1.13 and try an upgrade path. Is there anything I could get/try for you in the meantime?

@mdwhitley
Author

After around 1-2 hours triggers became not ready. Same old EOF errors spewing in kafka-controller pods.

I modified kafka-controller to set:

ENABLE_SARAMA_LOGGER -> "true"
ENABLE_SARAMA_DEBUG_LOGGER -> "true"
ENABLE_SARAMA_CLIENT_POOL -> "false"

When everything came back up, triggers went back to ready state. Another hour or so later and triggers are again not ready. I've captured the latest logs and sent them along via email.

@pierDipi
Member

pierDipi commented Sep 19, 2024

The EOF on the leader election is a little odd.


I modified kafka-controller to set:

ENABLE_SARAMA_LOGGER -> "true"
ENABLE_SARAMA_DEBUG_LOGGER -> "true"
ENABLE_SARAMA_CLIENT_POOL -> "false"

This is not yet merged and released; in the previous reply I was referring to #4091 (comment):

Backport to speed up mount volume #4101, I think that will help on the latest issue MountVolume.SetUp failed for volume "contract-resources" : configmap "kafka-broker-dispatcher-1" not found

@elebiodaslingshot

The same problem is happening in 1.15.

@elebiodaslingshot

I'd like to point out that my problem stemmed from running knative kafka across two different clusters. Each knative instance resulted in duplicate consumer groups for two different topics, which then resulted in consumers never committing offsets or polling messages, eventually leading to all triggers being taken down.

To fix this I changed the template from knative-trigger-{{ .Namespace }}-{{ .Name }} to knative-trigger-{{ .Namespace }}-{{ .Name }}-{{ .Environment }}, which results in unique consumer groups.

I then changed the offset to earliest to make sure we didn't lose any messages, and then re-ran our CI/CD pipelines to re-create the triggers after I deleted them.

This has resulted in no longer seeing:

failed to initialize consumer group offset: cannot obtain Kafka cluster admin, kafka: client has run out of available brokers to talk to: EOF

What led me down this path was this log line:

{"@timestamp":"2024-09-20T16:33:34.693Z","@version":"1","message":"[Consumer clientId=REDACTED, groupId=REDACTED] Request joining group due to: encountered UNKNOWN_MEMBER_ID from JOIN_GROUP response","logger_name":"org.apache.kafka.clients.consumer.internals.ConsumerCoordinator","thread_name":"Thread-21","level":"INFO","level_value":20000}

@mdwhitley
Author

Our dev configurations are similar to what @elebiodaslingshot described, but we do not have any JOIN_GROUP errors. The closest I have seen is:

Request joining group due to: encountered UNKNOWN_MEMBER_ID from HEARTBEAT response

which showed up once over the past week and does not appear to correlate with our trigger readiness.

That said, the scenario makes a lot of sense. We have, starting very recently, begun running multiple dev knative clusters off the same Kafka instance. Those were on 1.13 and seem to tolerate this issue as there is no evidence of it impacting eventing. While some of our triggers are ancient and still use GUID consumer groups, the majority would use the triggers-consumergroup-template which could impact us. We encode service/tenant information into trigger names, so the majority of trigger consumer groups are unique by default, but there are a handful per cluster that overlap and would logically hit weird issues like this.

I've made similar adjustments to include environment details in triggers-consumergroup-template and I managed to get things up and running. Over the past hour I did see a batch of EOF errors but no indication that triggers went down. I plan to let this bake over the weekend and if things go well try it in our primary dev cluster.

@pierDipi
Member

pierDipi commented Sep 23, 2024

I've been trying to reproduce this for a while with varying levels of "scale". The only thing I found was a rather slow time to recover in the scheduler / autoscaler (see here for more details: knative/eventing#8200), but I could not reproduce the EOF errors.

@mdwhitley
Author

After updating my 1.14 test cluster with a guaranteed-unique consumer group template, I can still see my original issue and triggers continue to flip from ready to not ready over time. I noticed triggers reporting not ready over the weekend, and we still see EOF errors being thrown at regular intervals:
[screenshot]

@mdwhitley
Author

I've tried an as-vanilla-as-possible install of the 1.14 release and still see the issue. Today the EOF errors are occurring at 1-hour intervals with just 1 broker / 3 triggers created:
[screenshot]

While the 1.14 Knative install hits this issue and the triggers go unavailable, our 1.13 Knative cluster (on the same Kafka instance) continues to work fine along with a handful of other service stacks (altogether creating a couple hundred consumer groups). I've spoken with some of those teams, and at least one runs Go/Sarama as well. During the times where Knative is throwing EOF errors, their consumers are all up with no sign of the errors in their stack.

Looking through sarama, that library has had several issues opened related to this type of error with no resolution by that team.

@mdwhitley
Author

@pierDipi is there an eta on new 1.14 release containing #4103?

I've continued to test with 1.14 without much change. Initially I did find some different errors when running kafka-controller with debug logging: instead of EOF, I received connect: no route to host errors. I triaged that down to a subnet missing on some worker nodes in the new k8s cluster I am using, but once that was fixed I was back to just the EOF errors, with nothing else jumping out from the debug logging. I double-verified that while the EOF errors are being thrown, connectivity to the same Kafka IPs from the kafka-controller pod itself still works fine.

I've attempted to jump to 1.15 to get access to the sarama logging env vars, but that fares even worse. I am able to get services, brokers and triggers all up, but no events are flowing through the KE stack. I haven't spent much time digging through that release, but it looks like the changes are significant enough that our existing configs are broken.

I've also had some discussions with other internal teams. We have a component, written in Go and using the Sarama kafka client, that also interfaces with Kafka. They have taken a look at the periods where KE is receiving EOF errors and have been able to confirm that there were no interruptions within their stack during the same period. This component handles ~200x the kafka throughput that the KE stack sees, so it is a good canary in this case.

@pierDipi
Member

Note also that the topic and consumer group name templates are based solely on namespace and name:

triggers-consumergroup-template: "knative-trigger-{{ .Namespace }}-{{ .Name }}"
brokers-topic-template: "knative-broker-{{ .Namespace }}-{{ .Name }}"
channels-topic-template: "knative-messaging-kafka.{{ .Namespace }}.{{ .Name }}"
so maybe using the same Kafka cluster for 2 different Knative installations could cause issues.
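If you do make the consumer group template environment-specific (as @elebiodaslingshot did above), the sketch below assumes these templates live in the config-kafka-features ConfigMap of your install; adjust the ConfigMap name if yours differs:

# Edit the template entry quoted above, then recreate the affected triggers so
# new, uniquely named consumer groups are used.
kubectl edit configmap config-kafka-features -n knative-eventing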

@pierDipi
Member

@pierDipi is there an eta on new 1.14 release containing #4103?

It will be released shortly once one of the jobs here succeeds https://prow.knative.dev/?type=periodic&job=release_eventing-kafka-broker_release-1.14_periodic

@mdwhitley
Author

I deployed 1.14.12 with sarama debug logging turned on and observed the error when triggers go down after working for a while:

[sarama][debug] 2024/09/26 18:09:47 github.com/IBM/sarama@v1.43.1/client.go:1027: client/metadata fetching metadata for all topics from broker kafka-0:9093
[sarama][debug] 2024/09/26 18:19:47 github.com/IBM/sarama@v1.43.1/client.go:1027: client/metadata fetching metadata for all topics from broker kafka-2:9093
[sarama] 2024/09/26 18:19:47 github.com/IBM/sarama@v1.43.1/client.go:1073: client/metadata got error from broker 2 while fetching metadata: write tcp x.x.x.x:52230->x.x.x.x:9093: write: broken pipe
[sarama] 2024/09/26 18:19:47 github.com/IBM/sarama@v1.43.1/broker.go:344: Error while closing connection to broker kafka-2:9093: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp x.x.x.x:52230->x.x.x.x:9093: write: broken pipe
[sarama] 2024/09/26 18:19:47 github.com/IBM/sarama@v1.43.1/client.go:774: client/brokers deregistered broker #2 at kafka-2:9093
[sarama][debug] 2024/09/26 18:19:47 github.com/IBM/sarama@v1.43.1/client.go:1027: client/metadata fetching metadata for all topics from broker kafka-0:9093
[sarama] 2024/09/26 18:19:47 github.com/IBM/sarama@v1.43.1/client.go:1073: client/metadata got error from broker 0 while fetching metadata: EOF
[sarama][debug] 2024/09/26 18:19:47 github.com/IBM/sarama@v1.43.1/broker.go:342: Closed connection to broker kafka-0:9093
[sarama] 2024/09/26 18:19:47 github.com/IBM/sarama@v1.43.1/client.go:774: client/brokers deregistered broker #0 at kafka-0:9093
[sarama][debug] 2024/09/26 18:19:47 github.com/IBM/sarama@v1.43.1/client.go:1027: client/metadata fetching metadata for all topics from broker kafka-1:9093
[sarama] 2024/09/26 18:19:47 github.com/IBM/sarama@v1.43.1/client.go:1073: client/metadata got error from broker 1 while fetching metadata: write tcp x.x.x.x:51990->x.x.x.x:9093: write: broken pipe
[sarama] 2024/09/26 18:19:47 github.com/IBM/sarama@v1.43.1/broker.go:344: Error while closing connection to broker kafka-1:9093: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp x.x.x.x:51990->x.x.x.x:9093: write: broken pipe
[sarama] 2024/09/26 18:19:47 github.com/IBM/sarama@v1.43.1/client.go:774: client/brokers deregistered broker #1 at kafka-1:9093
[sarama][debug] 2024/09/26 18:19:47 github.com/IBM/sarama@v1.43.1/client.go:1027: client/metadata fetching metadata for all topics from broker kafka-0:9093
[sarama] 2024/09/26 18:19:47 github.com/IBM/sarama@v1.43.1/client.go:1073: client/metadata got error from broker -1 while fetching metadata: write tcp x.x.x.x:41544->x.x.x.x:9093: write: broken pipe
[sarama] 2024/09/26 18:19:47 github.com/IBM/sarama@v1.43.1/broker.go:344: Error while closing connection to broker kafka-0:9093: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp x.x.x.x:41544->x.x.x.x:9093: write: broken pipe

which seems to indicate the Knative/Sarama client is not properly closing/refreshing connections.

I have also observed the same EOF error on startup of the kafka-controller pods. We run with 3 replicas, and during this error, when the pods come up, at least 1 out of the 3 pods will immediately hit the EOF errors. This brings the triggers down, and as far as I have seen the pod never recovers. The other 2 pods are up and communicate with kafka fine. The impacted pod has full network connectivity to the kafka instance. Most importantly, all 3 pods have the same kafka config... so 2/3 connecting with one failing the SASL handshake is suspicious.

[sarama] 2024/09/26 18:58:52 github.com/IBM/sarama@v1.43.1/config.go:618: ClientID is the default of 'sarama', you should consider setting it to something application-specific.
[sarama][debug] 2024/09/26 18:58:52 github.com/IBM/sarama@v1.43.1/client.go:1027: client/metadata fetching metadata for all topics from broker kafka-1:9093
[sarama] 2024/09/26 18:58:52 github.com/IBM/sarama@v1.43.1/broker.go:1250: Error while performing SASL handshake kafka-1:9093
[sarama][debug] 2024/09/26 18:58:52 github.com/IBM/sarama@v1.43.1/broker.go:267: Closed connection to broker kafka-1:9093
[sarama] 2024/09/26 18:58:52 github.com/IBM/sarama@v1.43.1/client.go:1073: client/metadata got error from broker -1 while fetching metadata: EOF
[sarama] 2024/09/26 18:58:52 github.com/IBM/sarama@v1.43.1/client.go:1086: client/metadata no available broker to send metadata request to
[sarama] 2024/09/26 18:58:52 github.com/IBM/sarama@v1.43.1/client.go:788: client/brokers resurrecting 3 dead seed brokers

I also tried with

      ENABLE_SARAMA_LOGGER:                       true
      ENABLE_SARAMA_DEBUG_LOGGER:                 true
      ENABLE_SARAMA_CLIENT_POOL:                  false

which resulted in the latter behavior: pods coming up with immediate EOF errors and eventing down.

@pierDipi
Member

I have also observed the same EOF error on startup of the kafka-controller pods. We run with 3 replicas, and during this error, when the pods come up, at least 1 out of the 3 pods will immediately hit the EOF errors. This brings the triggers down, and as far as I have seen the pod never recovers. The other 2 pods are up and communicate with kafka fine.

Can you share the leases in the knative-eventing namespace to identify the leader pod we need to focus on?

kubectl get leases -n knative-eventing -oyaml
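Or, to list just which pod holds each lease (the same information, shorter output):

kubectl get leases -n knative-eventing \
  -o custom-columns=NAME:.metadata.name,HOLDER:.spec.holderIdentity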

@pierDipi
Member

A few other people are encountering EOF with sarama 1.43 and Go 1.22 here: IBM/sarama#2938

@pierDipi
Member

The issue I'm having is that sarama is never logging the actual error associated with "Error while performing SASL handshake kafka-1:9093"

https://github.com/IBM/sarama/blob/893978c87fe7af13a2b6849ba62b003493f97f25/broker.go#L1243-L1252
https://github.com/IBM/sarama/blob/893978c87fe7af13a2b6849ba62b003493f97f25/broker.go#L260-L275

I also tried with

      ENABLE_SARAMA_LOGGER:                       true
      ENABLE_SARAMA_DEBUG_LOGGER:                 true
      ENABLE_SARAMA_CLIENT_POOL:                  false

which resulted in the latter behavior: pods coming up with immediate EOF errors and eventing down.

For now, let's keep this configuration and continue gathering logs and insights, as it helps reduce the moving parts.

@pierDipi
Member

Can you share the Kafka cluster version you're connecting to?

@mdwhitley
Author

Can you share the leases in the knative-eventing namespace to identify the leader pod we need to focus on?

Snapshot of kubectl get leases -n knative-eventing -oyaml: leases.txt

Can you share the Kafka cluster version you're connecting to?

Our Kafka clusters are version 3.6.

I've tried deploying 1.14 with increased req/lim CPU resources for kafka-controller, but I do not believe that is realistically the issue. Our monitoring shows the kafka-controller pods using < 0.1m CPU during these EOF issues with no CPU throttling.

The EOF error is regularly associated with client/metadata Sarama calls, and it looks like when the issue is hit no attempt at reconnecting is made (or at least none is logged), e.g.:

[sarama] 2024/09/27 15:46:51 github.com/IBM/sarama@v1.43.1/client.go:1073: client/metadata got error from broker -1 while fetching metadata: EOF
[sarama] 2024/09/27 15:46:51 github.com/IBM/sarama@v1.43.1/client.go:1073: client/metadata got error from broker -1 while fetching metadata: EOF
[sarama] 2024/09/27 15:46:51 github.com/IBM/sarama@v1.43.1/client.go:1073: client/metadata got error from broker -1 while fetching metadata: EOF
[sarama] 2024/09/27 15:46:51 github.com/IBM/sarama@v1.43.1/client.go:1073: client/metadata got error from broker -1 while fetching metadata: EOF
[sarama] 2024/09/27 15:46:51 github.com/IBM/sarama@v1.43.1/client.go:1073: client/metadata got error from broker -1 while fetching metadata: EOF
[sarama] 2024/09/27 15:46:51 github.com/IBM/sarama@v1.43.1/client.go:1073: client/metadata got error from broker -1 while fetching metadata: EOF
{"level":"error","ts":"2024-09-27T15:46:51.412Z","logger":"kafka-broker-controller","caller":"consumergroup/reconciler.go:303","msg":"Returned an error","commit":"1754676-dirty","knative.dev/pod":"kafka-controller-775b4d995b-xg8vj","knative.dev/controller":"knative.dev.eventing-kafka-broker.control-plane.pkg.reconciler.consumergroup.Reconciler","knative.dev/kind":"internal.kafka.eventing.knative.dev.ConsumerGroup","knative.dev/traceid":"4fc6748b-0928-4dec-9e60-9545a79fafcf","knative.dev/key":"knative-eventing/knative-trigger-some-k8-cluster-knative-eventing-sample-api-service-training-complete","targetMethod":"ReconcileKind","error":"failed to initialize consumer group offset: cannot obtain Kafka cluster admin, kafka: client has run out of available brokers to talk to: 3 errors occurred:\n\t* EOF\n\t* EOF\n\t* EOF\n","stacktrace":"knative.dev/eventing-kafka-broker/control-plane/pkg/client/internals/kafka/injection/reconciler/eventing/v1alpha1/consumergroup.(*reconcilerImpl).Reconcile\n\tknative.dev/eventing-kafka-broker/control-plane/pkg/client/internals/kafka/injection/reconciler/eventing/v1alpha1/consumergroup/reconciler.go:303\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20240416145024-0f34a8815650/controller/controller.go:542\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20240416145024-0f34a8815650/controller/controller.go:491"}
{"level":"error","ts":"2024-09-27T15:46:51.412Z","logger":"kafka-broker-controller","caller":"controller/controller.go:566","msg":"Reconcile error","commit":"1754676-dirty","knative.dev/pod":"kafka-controller-775b4d995b-xg8vj","knative.dev/controller":"knative.dev.eventing-kafka-broker.control-plane.pkg.reconciler.consumergroup.Reconciler","knative.dev/kind":"internal.kafka.eventing.knative.dev.ConsumerGroup","knative.dev/traceid":"4fc6748b-0928-4dec-9e60-9545a79fafcf","knative.dev/key":"knative-eventing/knative-trigger-some-k8-cluster-knative-eventing-sample-api-service-training-complete","duration":0.850037078,"error":"failed to initialize consumer group offset: cannot obtain Kafka cluster admin, kafka: client has run out of available brokers to talk to: 3 errors occurred:\n\t* EOF\n\t* EOF\n\t* EOF\n","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20240416145024-0f34a8815650/controller/controller.go:566\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20240416145024-0f34a8815650/controller/controller.go:543\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20240416145024-0f34a8815650/controller/controller.go:491"}
{"level":"info","ts":"2024-09-27T15:46:51.412Z","logger":"kafka-broker-controller.event-broadcaster","caller":"record/event.go:364","msg":"Event(v1.ObjectReference{Kind:\"ConsumerGroup\", Namespace:\"knative-eventing\", Name:\"knative-trigger-some-k8-cluster-knative-eventing-sample-api-service-training-complete\", UID:\"13af28dd-68a3-4259-af1c-477f01688b5c\", APIVersion:\"internal.kafka.eventing.knative.dev/v1alpha1\", ResourceVersion:\"1412480808\", FieldPath:\"\"}): type: 'Warning' reason: 'InternalError' failed to initialize consumer group offset: cannot obtain Kafka cluster admin, kafka: client has run out of available brokers to talk to: 3 errors occurred:\n\t* EOF\n\t* EOF\n\t* EOF\n","commit":"1754676-dirty","knative.dev/pod":"kafka-controller-775b4d995b-xg8vj"}

In cases where the connection has failed, the client should be destroyed and recreated; in this case it looks like it just continues trying to reuse a dead connection. My understanding of our internal components using Go/Sarama is that they required additional resiliency added to support production level scale. Additionally, there may need to be tweaks to Sarama client configs (timeouts and whatnot) to handle large scale clusters with hundreds of topics/thousands of partitions like what we run.

On the subject of large kafka clusters, we provisioned a brand new Kafka 3.6 cluster today and deployed 1.14.12 using it. EOF errors continue to be thrown despite using a kafka cluster with 1 topic and 2 partitions and only this one KE cluster using it.

@pierDipi
Member

@mdwhitley did you disable the client pool #4091 (comment)?

@mdwhitley
Author

Yes. I have tried with ENABLE_SARAMA_CLIENT_POOL=false, and I see the same problems.

Over the weekend I left the test cluster on 1.14 with the client pool disabled, and all triggers are currently marked as Not Ready/InitializeOffset. 1 of the 3 kafka-controller pods is throwing constant EOF errors while the other 2 pods have no errors. Despite the triggers reporting not ready, I am able to send events to them and they get routed to our receivers. Example error logs:

[sarama] 2024/09/30 16:37:29 github.com/IBM/sarama@v1.43.1/client.go:1012: client/metadata retrying after 250ms... (0 attempts remaining)
[sarama] 2024/09/30 16:37:29 github.com/IBM/sarama@v1.43.1/config.go:618: ClientID is the default of 'sarama', you should consider setting it to something application-specific.
[sarama][debug] 2024/09/30 16:37:29 github.com/IBM/sarama@v1.43.1/client.go:1027: client/metadata fetching metadata for all topics from broker kafka-2:9093
[sarama] 2024/09/30 16:37:29 github.com/IBM/sarama@v1.43.1/broker.go:1250: Error while performing SASL handshake kafka-2:9093
[sarama][debug] 2024/09/30 16:37:29 github.com/IBM/sarama@v1.43.1/broker.go:267: Closed connection to broker kafka-2:9093
[sarama] 2024/09/30 16:37:29 github.com/IBM/sarama@v1.43.1/client.go:1073: client/metadata got error from broker -1 while fetching metadata: EOF
[sarama] 2024/09/30 16:37:29 github.com/IBM/sarama@v1.43.1/config.go:618: ClientID is the default of 'sarama', you should consider setting it to something application-specific.
[sarama][debug] 2024/09/30 16:37:29 github.com/IBM/sarama@v1.43.1/client.go:1027: client/metadata fetching metadata for all topics from broker kafka-0:9093
[sarama] 2024/09/30 16:37:29 github.com/IBM/sarama@v1.43.1/broker.go:1250: Error while performing SASL handshake kafka-0:9093
[sarama][debug] 2024/09/30 16:37:29 github.com/IBM/sarama@v1.43.1/broker.go:267: Closed connection to broker kafka-0:9093
[sarama] 2024/09/30 16:37:29 github.com/IBM/sarama@v1.43.1/client.go:1073: client/metadata got error from broker -1 while fetching metadata: EOF
[sarama] 2024/09/30 16:37:29 github.com/IBM/sarama@v1.43.1/config.go:618: ClientID is the default of 'sarama', you should consider setting it to something application-specific.
[sarama][debug] 2024/09/30 16:37:29 github.com/IBM/sarama@v1.43.1/client.go:1027: client/metadata fetching metadata for all topics from broker kafka-1:9093
[sarama] 2024/09/30 16:37:30 github.com/IBM/sarama@v1.43.1/broker.go:1250: Error while performing SASL handshake kafka-1:9093
[sarama][debug] 2024/09/30 16:37:30 github.com/IBM/sarama@v1.43.1/broker.go:267: Closed connection to broker kafka-1:9093
[sarama] 2024/09/30 16:37:30 github.com/IBM/sarama@v1.43.1/client.go:1073: client/metadata got error from broker -1 while fetching metadata: EOF
[sarama] 2024/09/30 16:37:30 github.com/IBM/sarama@v1.43.1/client.go:1086: client/metadata no available broker to send metadata request to
[sarama] 2024/09/30 16:37:30 github.com/IBM/sarama@v1.43.1/client.go:788: client/brokers resurrecting 3 dead seed brokers
[sarama][debug] 2024/09/30 16:37:30 github.com/IBM/sarama@v1.43.1/client.go:309: Closing Client

@pierDipi
Member

pierDipi commented Oct 1, 2024

Sarama error logging issue: IBM/sarama#2994

My understanding of our internal components using Go/Sarama is that they required additional resiliency added to support production level scale. Additionally, there may need to be tweaks to Sarama client configs (timeouts and whatnot) to handle large scale clusters with hundreds of topics/thousands of partitions like what we run.

Thanks for the hints; contributions or specific suggestions are really welcome in this area.

@pierDipi
Member

pierDipi commented Oct 1, 2024

In cases where the connection has failed, the client should be destroyed and recreated; in this case it looks like it just continues trying to reuse a dead connection

If you disable the client pool, we create a new client each time we reconcile an object.

@pierDipi
Member

pierDipi commented Oct 1, 2024

From IBM/sarama#2981, it seems that EOF is the error you get when there is a protocol mismatch.

@mdwhitley
Author

From IBM/sarama#2981, it seems that EOF is the error you get when there is a protocol mismatch.

That, and a SASL configuration error. If there were a misconfiguration, I would expect eventing to never work, which is simply not the case in our environments. The same config is used everywhere and works everywhere except Knative 1.14; the exact same config works on 1.13 without any issues with triggers going unready.

Beyond the Sarama pool/logging env vars, are there any ways to influence the Kafka properties used for the Sarama client?

@mdwhitley
Author

Release 1.14.13, which includes #4125, resolves the issues we have been experiencing. All existing triggers continue to be available/ready, and new triggers can be created successfully.
