kafka sink thread stuck in infinite loop #1818
Comments
Thanks @linshaoyong, we'll take a look and see what's going on. Appreciate the detailed issue.
@linshaoyong #1830 has been merged. This should resolve your issue. That PR adds the
@binarylogic Thanks. I will try it today. The problem can't be reproduced quickly, so it will take a few days before I can respond.
@linshaoyong no problem. My hope is that the
@binarylogic Hi, maybe I should provide more information. Vector configuration:
vector.log:
top -n 1 -H -p [pid]
The agent (9.9.9.9) keeps a connection with each Kafka broker.
On the Kafka brokers, broker4369 and broker4370 have no connection with the agent (9.9.9.9):
broker4369:
broker4370:
broker4371:
The Kafka broker closed the connection, but the client didn't know. In this case, two CPUs run at 100%. When we have thousands of agents, this happens occasionally. Can Vector reconnect or make other improvements for this situation? Thank you
Sorry, my initial guess was misleading; maybe this issue should help.
@linshaoyong Could you please try adding the following?

[sinks.log_sink]
type = "kafka"
inputs = ["log_source"]
bootstrap_servers = ["1.1.1.1:9093","1.1.1.2:9093","1.1.1.3:9093","1.1.1.4:9093"]

[sinks.log_sink.librdkafka_options]
"reconnect.backoff.ms" = "2000"

The default reconnect backoff timeout in rdkafka is 100 ms, and such a small timeout does create CPU load. On the other hand, it guarantees that the reconnect happens as fast as possible. Setting a larger value such as 2000 ms should reduce the CPU load at the cost of slower reconnects.
@a-rodin I have deployed the latest nightly on one node; about 30 agents will be deployed today.
So just to confirm, is it still observed when
@a-rodin yes,
confluentinc/librdkafka#2363 (comment) suggests that the problem could have been fixed in
@a-rodin Today I deployed 23 agents using vector 0.9.0-nightly (g560fd10 x86_64-unknown-linux-musl 2020-03-02); 3 of them set the environment variable LOG = "vector = debug, rdkafka = trace". If there is new progress, I'll come back.
Update: an agent uses almost 400% CPU; below is the rdkafka log.
I think I know the reason: your Kafka version is 0.11.0.0, while our integration tests currently use Kafka 2.12; earlier versions are known to be supported by
Closing since we have upgraded rdkafka, which should resolve this issue.
Hey folks, I still get this issue occasionally with the errors below. Can someone help me understand the root cause?
Here is my Vector config; I am using the kafka sink:
msk:
  type: kafka
  inputs:
    - k8s_log
  topic: k8s_log_snap
  bootstrap_servers: b-3.stg-msk-cluster-c.*****.c2.kafka.ap-southeast-1.amazonaws.com:9094
  compression: snappy
  tls:
    enabled: true
  encoding:
    codec: "json"
  batch:
    max_bytes: 600000
    timeout_secs: 30
It sounds like you may need to increase the message timeout (
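For illustration, here is a hedged sketch (in TOML; equivalent keys apply to the YAML config above) of raising the message timeout. The message_timeout_ms name is an assumption based on librdkafka's message.timeout.ms setting, whose 300000 ms default matches the five minutes discussed in the following comments; check the kafka sink reference for the exact option name.

[sinks.msk]
type = "kafka"
inputs = ["k8s_log"]
topic = "k8s_log_snap"
# bootstrap_servers, tls, encoding, and batch as in the YAML config above
# assumed option: give slow brokers more time before a delivery error is raised
message_timeout_ms = 600000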
Thanks @jszwedko for the quick response. I already checked the default value of
Ah, you are right, five minutes does feel quite long. Googling around a bit seems to indicate you should check the cluster/broker health to identify why it is taking so long to receive the message.
Thanks @jszwedko, could I know if Vector supports that mechanism? Does tweaking the config
Ah, no, I mean look at the health of the cluster itself. There are some suggestions on https://www.redpanda.com/guides/kafka-performance-kafka-monitoring, for example. It's been a while since I've administered a Kafka cluster, so I don't remember offhand the things to look at. The error you are seeing may mean that the cluster is overloaded or unhealthy.
You could try, but I don't immediately see a relationship.
Thanks for your help @jszwedko, will try it 🙇
After running for about 20 days, some Vector threads use almost 100% CPU.
We have 1000+ agents, with about 10 events per second per agent.
Only 10+ agents have this problem, most likely caused by a high-latency network.
strace -p 2731697
This log is stuck in an infinite loop.
I ended up restarting the process, but the problem reoccurred a few days later.
It's similar to this issue: librdkafka 1397. However, that issue has been fixed by the librdkafka team.
Could it be that librdkafka is not configured correctly in Vector, for example socket.timeout.ms?
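If socket.timeout.ms does turn out to matter, here is a hedged sketch of overriding it through the librdkafka_options pass-through shown in the comments above. The value is illustrative only; 60000 ms is librdkafka's documented default for this setting.

[sinks.log_sink.librdkafka_options]
# assumed override: fail blocked broker requests sooner than the 60 s librdkafka default
"socket.timeout.ms" = "30000"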