
Broker remains ready when Kafka is gone #1152

Open
matzew opened this issue Aug 20, 2021 · 7 comments
Labels
area/control-plane kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Issues which should be fixed (post-triage)

Comments

@matzew
Contributor

matzew commented Aug 20, 2021

Describe the bug

I have a running, working broker, but when I remove my Kafka installation, it remains READY:

oc get brokers --all-namespaces 
NAMESPACE   NAME          URL                                                                                AGE    READY   REASON
default     my-broker     http://kafka-broker-ingress.knative-eventing.svc.cluster.local/default/my-broker   15m    True    

The controller has this in logs:

{"level":"error","ts":"2021-08-19T14:55:21.503Z","logger":"kafka-broker-controller","caller":"controller/controller.go:565","msg":"Reconcile error","knative.dev/pod":"kafka-controller-549b8ff7c6-jml9s","knative.dev/controller":"knative.dev.eventing-kafka-broker.control-plane.pkg.reconciler.broker.Reconciler","knative.dev/kind":"eventing.knative.dev.Broker","duration":0.816548862,"error":"failed to create topic: knative-broker-default-my-brokerrr: failed to create cluster admin: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20210818135208-7b5ecbc0e477/controller/controller.go:565\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20210818135208-7b5ecbc0e477/controller/controller.go:542\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20210818135208-7b5ecbc0e477/controller/controller.go:477"}
...

Expected behavior
The broker should not report READY.

Additional context
See also: knative-extensions/eventing-kafka#760

@matzew matzew added the kind/bug Categorizes issue or PR as related to a bug. label Aug 20, 2021
@pierDipi pierDipi self-assigned this Aug 20, 2021
@github-actions
Contributor

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 19, 2021
@pierDipi pierDipi added triage/accepted Issues which should be fixed (post-triage) and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 19, 2021
@pierDipi pierDipi removed their assignment Jan 18, 2022
@matzew
Contributor Author

matzew commented Mar 7, 2022

Also, a curl to the endpoint does eventually return:

< HTTP/1.1 503 Service Unavailable

@pierDipi
Member

pierDipi commented Mar 7, 2022

This is the general problem of reconciling external systems.

The only simple solution is to decrease the resync period, which shrinks the window in which the broker remains ready. However, that is highly discouraged since it increases API server load.

Data plane metrics help increase visibility into this.

@pierDipi
Member

pierDipi commented Mar 7, 2022

/priority important-longterm

@knative-prow-robot knative-prow-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Mar 7, 2022
@matzew
Contributor Author

matzew commented Mar 7, 2022

What is good IMO is that you do eventually receive a 503, so external systems talking to the broker can tell that it is not ready.

@matzew
Contributor Author

matzew commented Mar 7, 2022

{"@timestamp":"2022-03-07T08:51:43.935Z","@version":"1","message":"[Producer clientId=producer-1] Error connecting to node my-cluster-kafka-1.my-cluster-kafka-brokers.kafka.svc:9092 (id: 1 rack: null)","logger_name":"org.apache.kafka.clients.NetworkClient","thread_name":"kafka-producer-network-thread | producer-1","level":"WARN","level_value":30000,"stack_trace":"java.net.UnknownHostException: my-cluster-kafka-1.my-cluster-kafka-brokers.kafka.svc\n\tat java.base/java.net.InetAddress$CachedAddresses.get(Unknown Source)\n\tat java.base/java.net.InetAddress.getAllByName0(Unknown Source)\n\tat java.base/java.net.InetAddress.getAllByName(Unknown Source)\n\tat java.base/java.net.InetAddress.getAllByName(Unknown Source)\n\tat org.apache.kafka.clients.DefaultHostResolver.resolve(DefaultHostResolver.java:27)\n\tat org.apache.kafka.clients.ClientUtils.resolve(ClientUtils.java:111)\n\tat org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.currentAddress(ClusterConnectionStates.java:513)\n\tat org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.access$200(ClusterConnectionStates.java:467)\n\tat org.apache.kafka.clients.ClusterConnectionStates.currentAddress(ClusterConnectionStates.java:172)\n\tat org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:985)\n\tat org.apache.kafka.clients.NetworkClient.access$600(NetworkClient.java:73)\n\tat org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1158)\n\tat org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1046)\n\tat org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:559)\n\tat org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:327)\n\tat org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:242)\n\tat java.base/java.lang.Thread.run(Unknown Source)\n"}

This is logged on the receiver pod.

@pierDipi
Member

pierDipi commented Mar 7, 2022

This is what happens when I delete the Strimzi cluster during a test (10:43):

connection count drops to 0 and closed connections count increases


mgencur pushed a commit to mgencur/eventing-kafka-broker that referenced this issue Aug 19, 2024
Signed-off-by: red-hat-konflux <126015336+red-hat-konflux[bot]@users.noreply.github.com>
Co-authored-by: red-hat-konflux[bot] <126015336+red-hat-konflux[bot]@users.noreply.github.com>