
kafka sink thread stuck in infinite loop #1818

Closed
linshaoyong opened this issue Feb 17, 2020 · 22 comments
Labels: domain: observability · sink: kafka · type: bug

Comments

@linshaoyong

linshaoyong commented Feb 17, 2020

After running for about 20 days, some Vector threads use almost 100% CPU.
We have 1000+ agents, each sending about 10 events per second.
Only 10+ agents have this problem, most likely because of a high-latency network.

  • Operating system: CentOS 7
  • Vector: 0.6.0, 1000+ agents
  • Kafka: 0.11.0.0, 4 brokers
PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND 
2731697 root      20   0  113164   8592   1476 R 93.3  0.0  15973:09 rdk:broker4370 
2731680 root      20   0  113164   8592   1476 S  0.0  0.0   0:00.40 vector 
2731687 root      20   0  113164   8592   1476 S  0.0  0.0   4:45.09 vector
2731692 root      20   0  113164   8592   1476 S  0.0  0.0   0:00.00 tokio-runtime-w
2731693 root      20   0  113164   8592   1476 S  0.0  0.0  14:55.40 rdk:main
2731694 root      20   0  113164   8592   1476 S  0.0  0.0   0:37.29 rdk:broker-1
2731695 root      20   0  113164   8592   1476 S  0.0  0.0  30:48.26 rdk:broker4369
2731696 root      20   0  113164   8592   1476 S  0.0  0.0  47:36.97 rdk:broker4368
2731698 root      20   0  113164   8592   1476 S  0.0  0.0  33:59.28 rdk:broker4371
2731699 root      20   0  113164   8592   1476 S  0.0  0.0   5:29.46 producer pollin
2731706 root      20   0  113164   8592   1476 S  0.0  0.0   0:51.94 tokio-runtime-w
2731707 root      20   0  113164   8592   1476 S  0.0  0.0   0:52.46 tokio-runtime-w
2731708 root      20   0  113164   8592   1476 S  0.0  0.0   0:52.78 tokio-runtime-w
2731709 root      20   0  113164   8592   1476 S  0.0  0.0   0:52.14 tokio-runtime-w
2731710 root      20   0  113164   8592   1476 S  0.0  0.0   0:00.00 vector
2731711 root      20   0  113164   8592   1476 S  0.0  0.0 773:32.73 tokio-runtime-w

strace -p 2731697

poll([{fd=29, events=POLLIN|POLLOUT}, {fd=24, events=POLLIN}], 2, 172) = 1 ([{fd=29, revents=POLLOUT}])
read(29, 0x281e783, 5)                  = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=29, events=POLLIN|POLLOUT}, {fd=24, events=POLLIN}], 2, 172) = 1 ([{fd=29, revents=POLLOUT}])
read(29, 0x281e783, 5)                  = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=29, events=POLLIN|POLLOUT}, {fd=24, events=POLLIN}], 2, 172) = 1 ([{fd=29, revents=POLLOUT}])
read(29, 0x281e783, 5)                  = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=29, events=POLLIN|POLLOUT}, {fd=24, events=POLLIN}], 2, 172) = 1 ([{fd=29, revents=POLLOUT}])
read(29, 0x281e783, 5)                  = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=29, events=POLLIN|POLLOUT}, {fd=24, events=POLLIN}], 2, 172) = 1 ([{fd=29, revents=POLLOUT}])
read(29, 0x281e783, 5)                  = -1 EAGAIN (Resource temporarily unavailable)

This strace output repeats in an infinite loop.

I ended up restarting the process, but the problem recurred a few days later.

This looks similar to librdkafka issue 1397, but that issue was fixed by the librdkafka team.
Could it be that librdkafka is not configured correctly in Vector, for example socket.timeout.ms?

@binarylogic (Contributor)

Thanks @linshaoyong, we'll take a look and see what's going on. Appreciate the detailed issue.

@binarylogic added the `source: kafka` and `type: bug` labels Feb 17, 2020
@ghost self-assigned this Feb 17, 2020
@binarylogic changed the title to "kafka sink thread stuck in infinite loop" Feb 17, 2020
@binarylogic (Contributor)

@linshaoyong #1830 has been merged. This should resolve your issue. That PR adds the socket_timeout_ms, fetch_wait_max_ms, and librdkafka_options options. This should allow you to adjust the relevant settings appropriately. Could you let me know if this resolves your issue? And what are your thoughts on adjusting the default values for these options?
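
For illustration, a sink config along these lines could exercise the new options (a minimal sketch; the numeric value and the extra librdkafka property are placeholders to tune for your environment, not recommendations):

[sinks.log_sink]
  type = "kafka"
  inputs = ["log_source"]
  bootstrap_servers = ["1.1.1.1:9093","1.1.1.2:9093","1.1.1.3:9093","1.1.1.4:9093"]
  # Placeholder value; raise it for high-latency links.
  socket_timeout_ms = 60000
  [sinks.log_sink.librdkafka_options]
    # Any librdkafka property can be passed through here, e.g.:
    "socket.keepalive.enable" = "true"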

@linshaoyong (Author)

@binarylogic Thanks. I will try it today. The problem can't be reproduced quickly, so it will take a few days before I can respond.

@binarylogic (Contributor)

@linshaoyong no problem. My hope is that the socket_timeout_ms and fetch_wait_max_ms options will be sufficient, but if not you should be able to fiddle with all of the librdkafka options. Just let us know and we're happy to help you debug.

@linshaoyong (Author)

linshaoyong commented Feb 18, 2020

@binarylogic hi, maybe I should provide more information.

vector configuration:

[sources.log_source]
  type = "file"
  include = ["/var/log/a.log", "/var/log/b.log", "/var/log/c.log"]
  ...
[sinks.log_sink]
  type = "kafka"
  inputs = ["log_source"]
  bootstrap_servers = ["1.1.1.1:9093","1.1.1.2:9093","1.1.1.3:9093","1.1.1.4:9093"]
  ...
  [sinks.log_sink.tls]
    enabled = true
    ca_path = "/path/to/ca.pem"

vector.log:

Feb 18 14:21:02.538  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.538  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.538  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.538  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.538  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.538  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.538  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.538  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.538  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.538  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.538  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.538  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.709  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.709  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.709  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.709  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.709  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.709  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.709  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.709  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.709  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.709  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.709  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.709  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:02.709  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)
Feb 18 14:21:04.709  ERROR sink{name=log_sink type=kafka}: vector::sinks::kafka: kafka error: Message production error: MessageTimedOut (Local: Message timed out)

top -n 1 -H -p [pid]

       PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
1103728 root      20   0  122256   9876   2060 R 99.9  0.0  38841:13 rdk:broker4370
1103726 root      20   0  122256   9876   2060 R 93.3  0.0  38841:30 rdk:broker4369
1103721 root      20   0  122256   9876   2060 S  0.0  0.0   0:00.24 vector
1103722 root      20   0  122256   9876   2060 S  0.0  0.0   4:39.52 vector
1103724 root      20   0  122256   9876   2060 S  0.0  0.0  19:32.06 rdk:main
1103725 root      20   0  122256   9876   2060 S  0.0  0.0   0:38.39 rdk:broker-1
1103727 root      20   0  122256   9876   2060 S  0.0  0.0   7:44.00 rdk:broker4368
1103729 root      20   0  122256   9876   2060 S  0.0  0.0   4:29.23 rdk:broker4371

The agent (9.9.9.9) keeps a connection to each Kafka broker:
netstat -natp | grep 9093

tcp        0      0 9.9.9.9:48768     1.1.1.1:9093     ESTABLISHED 1103721/vector
tcp        0      0 9.9.9.9:32180     1.1.1.2:9093     ESTABLISHED 1103721/vector
tcp        0      0 9.9.9.9:32038     1.1.1.3:9093     ESTABLISHED 1103721/vector
tcp        0      0 9.9.9.9:23624     1.1.1.4:9093     ESTABLISHED 1103721/vector

On the Kafka brokers, broker4369 and broker4370 have no connection with the agent (9.9.9.9):
broker4368:
netstat -natp | grep 9.9.9.9
tcp6 0 0 1.1.1.1:9093 9.9.9.9:32180 ESTABLISHED 3156/java

broker4369:
netstat -natp | grep 9.9.9.9

broker4370:
netstat -natp | grep 9.9.9.9

broker4371:
netstat -natp | grep 9.9.9.9
tcp6 0 0 1.1.1.4:9093 9.9.9.9:23624 ESTABLISHED 3156/java

The Kafka broker closed the connection, but the client didn't know; in this case two CPUs run at 100%.
After restarting the process the problem disappeared, but it reappears after a few days.
Changing the configuration alone probably won't solve this problem.

When we have thousands of agents, this happens occasionally. Can Vector reconnect or make other improvements for this situation?

Thank you

@linshaoyong (Author)

linshaoyong commented Feb 19, 2020

Sorry, my initial guess was misleading; maybe this issue should help.

@ghost

ghost commented Feb 19, 2020

@linshaoyong Could you please try adding librdkafka_options."reconnect.backoff.ms" = "2000" to the list of Kafka options using the latest nightly? Like this:

[sinks.log_sink]
  type = "kafka"
  inputs = ["log_source"]
  bootstrap_servers = ["1.1.1.1:9093","1.1.1.2:9093","1.1.1.3:9093","1.1.1.4:9093"]
  [sinks.log_sink.librdkafka_options]
    "reconnect.backoff.ms" = "2000"

The default reconnect backoff timeout in rdkafka is 100 ms, and such a small timeout does create CPU load. On the other hand, it guarantees that the reconnect happens as fast as possible.

Setting reconnect.backoff.ms to larger values greatly reduced the CPU load in my tests, although it increased the wait time, which might not be suitable for all users.

@linshaoyong (Author)

linshaoyong commented Feb 20, 2020

@a-rodin I have deployed the latest nightly on one node; about 30 agents will be deployed today.
But I still don't understand the cause of this problem. The CPU time is spent in this read loop:

poll([{fd=29, events=POLLIN|POLLOUT}, {fd=24, events=POLLIN}], 2, 172) = 1 ([{fd=29, revents=POLLOUT}])
read(29, 0x281e783, 5) = -1 EAGAIN (Resource temporarily unavailable)

@binarylogic added the `domain: observability` label Feb 20, 2020
@ghost

ghost commented Feb 20, 2020

@a-rodin I have deployed the latest nightly on one node; about 30 agents will be deployed today.
But I still don't understand the cause of this problem. The CPU time is spent in this read loop.

So just to ensure, is it still observed when "reconnect.backoff.ms" is set to "2000" or a larger value?

@linshaoyong (Author)

@a-rodin Yes, with reconnect.backoff.ms = "2000" the problem still appears.

@ghost

ghost commented Mar 2, 2020

confluentinc/librdkafka#2363 (comment) suggests that the problem could have been fixed in librdkafka 1.3.x. The librdkafka dependency was updated in #1928, so recent nightly versions of Vector use it.

@linshaoyong (Author)

linshaoyong commented Mar 4, 2020

@a-rodin Today I deployed 23 agents using vector 0.9.0-nightly (g560fd10 x86_64-unknown-linux-musl 2020-03-02). Three of them set the environment variable LOG = "vector = debug, rdkafka = trace". If there is new progress, I'll come back.

update:

One agent uses almost 400% CPU; below is the rdkafka log.

Mar 04 16:46:44.004 ERROR rdkafka::client: librdkafka: Global error: OperationTimedOut (Local: Timed out): ssl://1.1.1.1:9093/4368: 2 request(s) timed out: disconnect (after 21527892ms in state UP)
Mar 04 16:46:54.107 ERROR rdkafka::client: librdkafka: Global error: OperationTimedOut (Local: Timed out): ssl://1.1.1.2:9093/4369: 1 request(s) timed out: disconnect (after 20978046ms in state UP)
Mar 04 16:47:02.662 ERROR rdkafka::client: librdkafka: Global error: OperationTimedOut (Local: Timed out): ssl://1.1.1.3:9093/4371: 2 request(s) timed out: disconnect (after 21546652ms in state UP)
Mar 04 16:47:03.113 ERROR rdkafka::client: librdkafka: Global error: OperationTimedOut (Local: Timed out): ssl://1.1.1.4:9093/4370: 6 request(s) timed out: disconnect (after 21548755ms in state UP)
Mar 04 16:47:03.113 ERROR rdkafka::client: librdkafka: Global error: AllBrokersDown (Local: All broker connections are down): 4/4 brokers are down
Mar 04 16:50:28.792 ERROR rdkafka::client: librdkafka: Global error: BrokerTransportFailure (Local: Broker transport failure): ssl://1.1.1.4:9093/4370: Connect to ipv4#1.1.1.4:9093 failed: Operation timed out (after 132006ms in state CONNECT)
Mar 04 16:50:28.792 ERROR rdkafka::client: librdkafka: Global error: AllBrokersDown (Local: All broker connections are down): 4/4 brokers are down
Mar 04 17:02:29.704 ERROR rdkafka::client: librdkafka: Global error: SSL (Local: SSL error): ssl://1.1.1.1:9093/4368: SSL handshake failed: SSL transport error: Operation timed out (after 945742ms in state CONNECT)
Mar 04 17:09:56.152 ERROR rdkafka::client: librdkafka: Global error: SSL (Local: SSL error): ssl://1.1.1.3:9093/4371: SSL handshake failed: SSL transport error: Operation timed out (after 982040ms in state CONNECT)

@ghost added the `sink: kafka` label and removed the `source: kafka` label Mar 4, 2020
@ghost

ghost commented Mar 4, 2020

I think I know the reason: your Kafka version is 0.11.0.0, while our integration tests currently use Kafka 2.12. Earlier versions are known to be supported by librdkafka, but we haven't tested Vector with them. I tried to run the tests with Kafka 0.11.0.0 and got a similar issue: the tests timed out. We should support all versions of Kafka since 0.9, but it turns out that there are some issues with versions earlier than 2.x. I've opened issue #1984 about this.

@binarylogic unassigned ghost Apr 15, 2020
@jszwedko (Member)

jszwedko commented Aug 1, 2022

Closing since we have upgraded rdkafka which should resolve this issue.

@jszwedko closed this as completed Aug 1, 2022
@eddy1o2

eddy1o2 commented Oct 28, 2024

Hey folks, I still hit this issue occasionally with the errors below. Can someone help me understand the root cause?

2024-10-25T15:24:11.398757Z ERROR sink{component_kind="sink" component_id=msk component_type=kafka component_name=msk}:request{request_id=3104}: vector_common::internal_event::service: Service call failed. No retries or retries exhausted. error=Some(KafkaError (Message production error: MessageTimedOut (Local: Message timed out))) request_id=3104 error_type="request_failed" stage="sending" internal_log_rate_limit=true
2024-10-25T15:24:11.398804Z ERROR sink{component_kind="sink" component_id=msk component_type=kafka component_name=msk}:request{request_id=3104}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1 reason="Service call failed. No retries or retries exhausted." internal_log_rate_limit=true
2024-10-25T16:04:13.385631Z ERROR sink{component_kind="sink" component_id=msk component_type=kafka component_name=msk}:request{request_id=3112}: vector_common::internal_event::service: Service call failed. No retries or retries exhausted. error=Some(KafkaError (Message production error: MessageTimedOut (Local: Message timed out))) request_id=3112 error_type="request_failed" stage="sending" internal_log_rate_limit=true
2024-10-25T16:04:13.385673Z ERROR sink{component_kind="sink" component_id=msk component_type=kafka component_name=msk}:request{request_id=3112}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1 reason="Service call failed. No retries or retries exhausted." internal_log_rate_limit=true
2024-10-25T16:24:11.393541Z ERROR sink{component_kind="sink" component_id=msk component_type=kafka component_name=msk}:request{request_id=3116}: vector_common::internal_event::service: Service call failed. No retries or retries exhausted. error=Some(KafkaError (Message production error: MessageTimedOut (Local: Message timed out))) request_id=3116 error_type="request_failed" stage="sending" internal_log_rate_limit=true
2024-10-25T16:24:11.393585Z ERROR sink{component_kind="sink" component_id=msk component_type=kafka component_name=msk}:request{request_id=3116}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1 reason="Service call failed. No retries or retries exhausted." internal_log_rate_limit=true

Here is my Vector config. I am using Vector 0.33.0 with the kafka sink to AWS MSK with TLS enabled:

sinks:
  msk:
    type: kafka
    inputs:
      - k8s_log
    topic: k8s_log_snap
    bootstrap_servers: b-3.stg-msk-cluster-c.*****.c2.kafka.ap-southeast-1.amazonaws.com:9094
    compression: snappy
    tls:
      enabled: true
    encoding:
      codec: "json"
    batch:
      max_bytes: 600000
      timeout_secs: 30

@jszwedko (Member)


It sounds like you may need to increase the message timeout (message.timeout.ms). See https://github.com/confluentinc/librdkafka/blob/master/CONFIGURATION.md. These options are exposed via https://vector.dev/docs/reference/configuration/sinks/kafka/#librdkafka_options
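
For example, a pass-through along these lines might raise that timeout (a minimal sketch based on the config shared above; the value is illustrative, not a recommendation):

sinks:
  msk:
    type: kafka
    # ...existing settings as above...
    librdkafka_options:
      # Raise the librdkafka message timeout above the 5-minute default (300000 ms).
      "message.timeout.ms": "600000"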

@eddy1o2

eddy1o2 commented Oct 29, 2024

Thanks @jszwedko for the quick response. I already checked that the default value of message.timeout.ms is 300000 ms, about 5 minutes. I think that's long enough to send a message, but I can't understand why the timeout happened. Is there any way to trace the root cause from librdkafka in Vector?

@jszwedko (Member)

Thanks @jszwedko for the quick response. I already checked that the default value of message.timeout.ms is 300000 ms, about 5 minutes. I think that's long enough to send a message, but I can't understand why the timeout happened. Is there any way to trace the root cause from librdkafka in Vector?

Ah, you are right, five minutes does feel quite long. Googling around a bit seems to indicate you should check the cluster/broker health to identify why it is taking so long to receive the message.

@eddy1o2

eddy1o2 commented Oct 30, 2024

check the cluster/broker health

Thanks @jszwedko, could I know whether Vector supports that mechanism? Does tweaking the reconnect.backoff.ms config help?

@jszwedko (Member)

jszwedko commented Oct 30, 2024

check the cluster/broker health

Thanks @jszwedko, could I know whether Vector supports that mechanism?

Ah, no, I mean look at the health of the cluster itself. There are some suggestions on https://www.redpanda.com/guides/kafka-performance-kafka-monitoring, for example. It's been a while since I've administered a Kafka cluster and so I don't remember offhand the things to look at. The error you are seeing may mean that the cluster is overloaded or unhealthy.

@jszwedko (Member)

Does tweaking the config reconnect.backoff.ms help?

You could try, but I don't immediately see a relationship.

@eddy1o2

eddy1o2 commented Oct 31, 2024

Thanks for your help @jszwedko , will try it 🙇
