Kafka event bus reconnect attempt logic has some issue , its never succeeding and ultimately Sensor gets halted #2711

Closed
gyanprakash48 opened this issue Jul 13, 2023 · 10 comments
Labels: bug, stale

Comments

@gyanprakash48

The Sensor logs show the following. The reconnect attempts never succeed, although terminating and restarting the pod fixes the issue immediately. This suggests there is a problem in the reconnect logic.

2023-07-13T07:39:00.541Z INFO argo-events.sensor sensors/listener.go:302 EventBus connection lost, reconnecting... {"sensorName": "staging.dyper-recompute.sensor", "triggerName": "http-trigger"}
2023-07-13T07:39:00.541Z INFO argo-events.sensor sensors/listener.go:308 reconnected to EventBus. {"sensorName": "staging.dyper-recompute.sensor", "triggerName": "http-trigger", "connection": "KafkaTriggerConnection{Sensor:staging.dyper-recompute.sensor,Trigger:http-trigger}"}
2023-07-13T07:39:00.541Z DEBUG argo-events.sensor sensors/listener.go:316 sublock not acquired {"sensorName": "staging.dyper-recompute.sensor", "triggerName": "http-trigger"}
2023-07-13T07:39:00.542Z INFO argo-events.sensor sensors/listener.go:277 started subscribing to events for trigger http-trigger with client connection KafkaTriggerConnection{Sensor:staging.dyper-recompute.sensor,Trigger:http-trigger} {"sensorName": "staging.dyper-recompute.sensor", "triggerName": "http-trigger"}
2023-07-13T07:39:00.542Z INFO argo-events.sensor sensor/kafka_sensor.go:203 Consuming {"sensorName": "staging.dyper-recompute.sensor", "topics": ["staging.dyper-recompute.eventbus", "staging.dyper-recompute.eventbus-staging.dyper-recompute.sensor-trigger", "staging.dyper-recompute.eventbus-staging.dyper-recompute.sensor-action"], "group": "staging.dyper-recompute.eventbus.listner"}
2023-07-13T07:39:00.558Z INFO argo-events.sensor sensor/kafka_handler.go:75 Kafka setup {"sensorName": "staging.dyper-recompute.sensor", "claims": {"staging.dyper-recompute.eventbus":[0,1,2,3,4,5],"staging.dyper-recompute.eventbus-staging.dyper-recompute.sensor-action":[0,1,2],"staging.dyper-recompute.eventbus-staging.dyper-recompute.sensor-trigger":[0,1,2]}}
2023-07-13T07:39:00.564Z INFO argo-events.sensor sensor/kafka_handler.go:124 Kafka cleanup {"sensorName": "staging.dyper-recompute.sensor", "claims": {"staging.dyper-recompute.eventbus":[0,1,2,3,4,5],"staging.dyper-recompute.eventbus-staging.dyper-recompute.sensor-action":[0,1,2],"staging.dyper-recompute.eventbus-staging.dyper-recompute.sensor-trigger":[0,1,2]}}
2023-07-13T07:39:00.564Z ERROR argo-events.sensor sensor/kafka_sensor.go:215 Failed to consume {"sensorName": "staging.dyper-recompute.sensor", "error": "kafka: response did not contain all the expected topic/partition blocks"}
github.com/argoproj/argo-events/eventbus/kafka/sensor.(*KafkaSensor).Listen
/home/runner/work/argo-events/argo-events/eventbus/kafka/sensor/kafka_sensor.go:215

2023-07-13T07:39:05.541Z INFO argo-events.sensor sensors/listener.go:302 EventBus connection lost, reconnecting... {"sensorName": "staging.dyper-recompute.sensor", "triggerName": "http-trigger"}
2023-07-13T07:39:05.541Z INFO argo-events.sensor sensors/listener.go:308 reconnected to EventBus. {"sensorName": "staging.dyper-recompute.sensor", "triggerName": "http-trigger", "connection": "KafkaTriggerConnection{Sensor:staging.dyper-recompute.sensor,Trigger:http-trigger}"}
2023-07-13T07:39:05.541Z DEBUG argo-events.sensor sensors/listener.go:311 acquired sublock, instructing trigger to shutdown subscription {"sensorName": "staging.dyper-recompute.sensor", "triggerName": "http-trigger"}
2023-07-13T07:39:05.541Z DEBUG argo-events.sensor sensors/listener.go:285 exiting subscribe goroutine, conn=KafkaTriggerConnection{Sensor:staging.dyper-recompute.sensor,Trigger:http-trigger} {"sensorName": "staging.dyper-recompute.sensor", "triggerName": "http-trigger"}
2023-07-13T07:39:05.541Z INFO argo-events.sensor sensor/kafka_sensor.go:203 Consuming {"sensorName": "staging.dyper-recompute.sensor", "topics": ["staging.dyper-recompute.eventbus", "staging.dyper-recompute.eventbus-staging.dyper-recompute.sensor-trigger", "staging.dyper-recompute.eventbus-staging.dyper-recompute.sensor-action"], "group": "staging.dyper-recompute.eventbus.listner"}
2023-07-13T07:39:05.555Z INFO argo-events.sensor sensor/kafka_handler.go:75 Kafka setup {"sensorName": "staging.dyper-recompute.sensor", "claims": {"staging.dyper-recompute.eventbus":[0,1,2,3,4,5],"staging.dyper-recompute.eventbus-staging.dyper-recompute.sensor-action":[0,1,2],"staging.dyper-recompute.eventbus-staging.dyper-recompute.sensor-trigger":[0,1,2]}}
2023-07-13T07:39:05.556Z INFO argo-events.sensor sensor/kafka_handler.go:124 Kafka cleanup {"sensorName": "staging.dyper-recompute.sensor", "claims": {"staging.dyper-recompute.eventbus":[0,1,2,3,4,5],"staging.dyper-recompute.eventbus-staging.dyper-recompute.sensor-action":[0,1,2],"staging.dyper-recompute.eventbus-staging.dyper-recompute.sensor-trigger":[0,1,2]}}
2023-07-13T07:39:05.557Z ERROR argo-events.sensor sensor/kafka_sensor.go:215 Failed to consume {"sensorName": "staging.dyper-recompute.sensor", "error": "kafka: response did not contain all the expected topic/partition blocks"}
github.com/argoproj/argo-events/eventbus/kafka/sensor.(*KafkaSensor).Listen

@gyanprakash48 added the bug label Jul 13, 2023
@gyanprakash48
Author

It could also be that the issue is rooted in "github.com/Shopify/sarama", as I see many discussions of the error "kafka: response did not contain all the expected topic/partition blocks".

@gyanprakash48
Author

@bilalba @dfarr any quick hints?

@dfarr
Member

dfarr commented Jul 31, 2023

If the connection to Kafka fails for whatever reason, the Sensor will continue to attempt to reconnect in a loop forever.

I'm wondering if perhaps we should give up on the connection at some point and allow the process to fail. This would trigger a pod restart, which sounds like it should resolve the issue. I'm also curious about the underlying problem. Do you have any idea why you are seeing this error?

kafka: response did not contain all the expected topic/partition blocks
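
A minimal sketch of the "give up at some point and allow the process to fail" idea, written as a standalone Go program. It is not the argo-events implementation: `consumeWithBudget`, its parameters, and the stub consumer are hypothetical names introduced here for illustration, and the sketch assumes the consume call returns an error each time the connection attempt fails.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errPartitionBlocks mirrors the error text from the logs in this thread.
var errPartitionBlocks = errors.New("kafka: response did not contain all the expected topic/partition blocks")

// consumeWithBudget (hypothetical helper) calls consume until it returns nil
// or has failed maxFailures times in a row, sleeping backoff between attempts.
func consumeWithBudget(consume func() error, maxFailures int, backoff time.Duration) error {
	for failures := 1; ; failures++ {
		err := consume()
		if err == nil {
			return nil
		}
		if failures >= maxFailures {
			return fmt.Errorf("giving up after %d failed attempts: %w", failures, err)
		}
		time.Sleep(backoff)
	}
}

func main() {
	// Stub consumer that always fails, standing in for a broken connection.
	alwaysFail := func() error { return errPartitionBlocks }

	if err := consumeWithBudget(alwaysFail, 3, 10*time.Millisecond); err != nil {
		// A real process would exit non-zero here (for example os.Exit(1))
		// so that Kubernetes restarts the container.
		fmt.Println(err)
	}
}
```

The budget here counts consecutive failures only. A production version would probably also need to decide when a connection has been healthy long enough to reset the count.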

@dfarr
Member

dfarr commented Jul 31, 2023

This thread is interesting. I'm wondering if we need to filter out some messages if it's possible for our client to receive messages out-of-order. Do you use log compaction @gyanprakash48?

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had
any activity in the last 60 days. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions bot added the stale label Sep 30, 2023
@github-actions bot closed this as not planned Oct 7, 2023
@pdellarciprete

Hey @dfarr, I am having the same issue here. Did you find a way to trigger a restart when the Sensor keeps attempting to reconnect in a loop forever?

@pdellarciprete

@gyanprakash48 did you find a solution for it?

@Regulus-Regulus

Hi @pdellarciprete, have you found a solution for this problem in the meantime?

@pavan02

pavan02 commented Nov 26, 2024

Has anyone found any workaround for this yet?

@Regulus-Regulus

@pavan02 The workaround we currently use is a cronjob that regularly restarts the sensor pods. It's far from perfect, but the issue hasn't come up again. We were considering implementing a way to add a sidecar to the Sensor-Pods that fails the pod when the messages occur, but that requires additional tooling to inject the sidecar and we haven't had the time/resources to implement that.
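
A rough sketch of what the detection part of such a sidecar could look like, in Go. It assumes the Sensor's log lines are somehow piped to the program's standard input (how to do that is the extra tooling mentioned above and is not shown), and that a run of `Failed to consume` errors spaced a few seconds apart, as in the logs at the top of this issue, means the Sensor is stuck. `failureTracker`, `newFailureTracker`, the threshold of 5, and the 30-second gap are all hypothetical choices for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// failureMarker is the message logged at kafka_sensor.go:215 in the logs above.
const failureMarker = "Failed to consume"

// failureTracker counts a streak of consume failures whose timestamps are
// at most maxGap apart.
type failureTracker struct {
	threshold int
	maxGap    time.Duration
	streak    int
	last      time.Time
}

func newFailureTracker(threshold, maxGapSeconds int) *failureTracker {
	return &failureTracker{
		threshold: threshold,
		maxGap:    time.Duration(maxGapSeconds) * time.Second,
	}
}

// observe feeds one log line to the tracker and reports whether the streak
// has reached the threshold.
func (t *failureTracker) observe(line string) bool {
	if !strings.Contains(line, failureMarker) {
		return false
	}
	// The log lines start with an RFC 3339 timestamp, e.g. 2023-07-13T07:39:00.564Z.
	stamp, _, _ := strings.Cut(line, " ")
	ts, err := time.Parse(time.RFC3339, stamp)
	if err != nil {
		return false // not a timestamped log line
	}
	if t.streak > 0 && ts.Sub(t.last) > t.maxGap {
		t.streak = 0 // the previous failure was too long ago to count
	}
	t.streak++
	t.last = ts
	return t.streak >= t.threshold
}

func main() {
	tracker := newFailureTracker(5, 30) // assumed values, tune as needed
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		if tracker.observe(scanner.Text()) {
			fmt.Fprintln(os.Stderr, "repeated consume failures, exiting so the pod can be restarted")
			os.Exit(1)
		}
	}
}
```

Whether a non-zero exit of the sidecar actually restarts the Sensor container depends on how the pod and its probes are set up, so treat this only as the detection half of the idea.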
