[LI-HOTFIX] Reduce exception log spam in producer IO thread #525

Merged
merged 3 commits into linkedin:2.4-li on Jan 24, 2025

Conversation

@dtwitty commented Jan 16, 2025

Fixes LIKAFKA-62226

Certain issues, such as DNS resolution failures, can cause the producer IO thread to emit log messages rapidly. This can fill disks and hide other problems. This commit addresses the issue by adding a producer configuration that throttles exception log messages from the top of the producer IO thread. The throttling is a simple cooldown-based rate limiter and is applied per exception. Exceptions are differentiated based on their printStackTrace result, which covers the exception type, construction point, causes, etc.
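
For illustration only, the throttling idea is roughly the following simplified sketch; the names here are hypothetical, while the actual patch uses a FixedRateLimiter plus a per-exception map:

import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of per-exception cooldown throttling; not the actual patch.
public class ExceptionLogThrottler {
    private final long cooldownNs;                        // minimum gap between logs of the same exception
    private final Map<String, Long> lastLoggedNs = new HashMap<>();

    public ExceptionLogThrottler(long cooldownNs) {
        this.cooldownNs = cooldownNs;
    }

    // Returns true if this exception should be logged now.
    public boolean shouldLog(Throwable t, long nowNs) {
        if (cooldownNs <= 0) {
            return true;                                  // throttling disabled: log everything
        }
        String key = keyFor(t);
        Long last = lastLoggedNs.get(key);
        if (last == null || nowNs - last >= cooldownNs) {
            lastLoggedNs.put(key, nowNs);
            return true;
        }
        return false;
    }

    // Key on the printStackTrace output, which distinguishes exception type,
    // construction point, and causes.
    private static String keyFor(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw));
        return sw.toString();
    }
}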

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@dtwitty requested review from kehuum and Yellow-Rice on January 16, 2025 at 23:37
return false;
}

private long delayBetweenPermitsNs() {

The value returned by this method is a constant for each instance. It should be safe to calculate it once at construction time.
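
A minimal sketch of that suggestion, with hypothetical constructor and field names (the actual FixedRateLimiter in this PR may differ):

// Sketch only: the per-permit delay is constant for a given instance, so it
// can be computed once in the constructor instead of on every call.
public class FixedRateLimiter {
    private final long delayBetweenPermitsNs;   // precomputed at construction
    private long nextPermitNs = 0;

    public FixedRateLimiter(double permitsPerSecond) {
        this.delayBetweenPermitsNs = permitsPerSecond <= 0
                ? 0
                : (long) (1_000_000_000L / permitsPerSecond);
    }

    public boolean tryAcquire() {
        long nowNs = System.nanoTime();          // assumption: time source is nanoTime
        if (delayBetweenPermitsNs <= 0 || nowNs >= nextPermitNs) {
            nextPermitNs = nowNs + delayBetweenPermitsNs;
            return true;
        }
        return false;
    }
}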

*/
package org.apache.kafka.common.utils;

public class FixedRateLimiter implements RateLimiter {

Can you mark it as @NotThreadSafe? There might be other use cases later.

@dtwitty (Author):

I can't find @NotThreadSafe. I could make it thread safe by making tryAcquire() synchronized.

I may not remember the name correctly. Making a comment should be enough.
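
For example, the constraint could simply be documented in the class javadoc, roughly along these lines (illustrative wording, not the exact text in this PR):

/**
 * A fixed-rate limiter based on a cooldown between permits.
 *
 * <p>Note: this class is not thread-safe; it is intended to be used from a
 * single thread (the producer IO thread).
 */
public class FixedRateLimiter {
    // rate-limiting state and tryAcquire() as in the diff above
}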


import static org.junit.Assert.assertEquals;

public class ExceptionMapTest {

Nice!

@@ -277,6 +276,10 @@ public class ProducerConfig extends AbstractConfig {

public static final String LI_UPDATE_METADATA_LAST_REFRESH_TIME_UPON_NODE_DISCONNECT_CONFIG = CommonClientConfigs.LI_UPDATE_METADATA_LAST_REFRESH_TIME_UPON_NODE_DISCONNECT_CONFIG;

public static final String IO_THREAD_EXCEPTION_LOG_FREQUENCY_CONFIG = "io.thread.exception.log.frequency";

When searching for frequency in the Apache Kafka docs, people tend to use *.interval.[ms, seconds, ...] over hertz. Can we follow the convention?

Second this. Frequency is not as straightforward for users/readers as intervals.

@dtwitty (Author):

Done!

return "";
}

return Utils.stackTrace(e);

That is expensive... it is good to protect the logs, but we may still pay a high CPU price for all these exceptions getting muted. Could it be good enough if the map's key was just the exception's message, or else the exception's message concatenated with the cause's message (only one cause deep, and only if there is one)?

@dtwitty (Author):

That's a good point about CPU usage, especially during a downstream outage. Currently we use the stack trace because it is extremely specific and covers all edge cases. However, if the log messages and exception types are diverse enough, I think the following may be sufficient:

  • Exception type
  • Exception message
  • Cause type
  • Cause message

This would allow deduplication of the following scenarios:

  • A specific cause wrapped in a catch-all Exception
  • Conversion of checked exceptions into unchecked exceptions
  • Nonspecific error messages

This is also still cheap compared to stacktrace computation, as all fields are already computed.
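
For illustration, a key built from those fields could look roughly like this (hypothetical helper; not the exact code in this PR):

// Sketch: cheap deduplication key from already-computed fields.
private static String dedupKey(Throwable t) {
    StringBuilder sb = new StringBuilder();
    sb.append(t.getClass().getName()).append(':').append(t.getMessage());
    Throwable cause = t.getCause();
    if (cause != null) {
        // Only one cause deep, and only if there is one.
        sb.append('|')
          .append(cause.getClass().getName())
          .append(':')
          .append(cause.getMessage());
    }
    return sb.toString();
}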

WDYT?


SGTM. Thanks!

public static final String IO_THREAD_EXCEPTION_LOG_FREQUENCY_DOC = "The frequency of logging uncaught exceptions from the producer I/O thread. The value is in hertz (hz). If the value is less than or equal to 0, all exceptions will be logged.";
public static final double DEFAULT_IO_THREAD_EXCEPTION_LOG_FREQUENCY = 0;
public static final String IO_THREAD_EXCEPTION_LOG_INTERVAL_MS_CONFIG = "io.thread.exception.log.interval.ms";
public static final String IO_THREAD_EXCEPTION_LOG_INTERVAL_MS_DOC = "The minimum time is milliseconds between logging identical uncaught exceptions from the producer I/O thread. If the value is less than or equal to 0, all exceptions will be logged.";

time is -> time in

2. fix typo
@Q1Liu merged commit e15c196 into linkedin:2.4-li Jan 24, 2025
21 checks passed