-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[LI-HOTFIX] Reduce exception log spam in producer IO thread #525
Conversation
return false; | ||
} | ||
|
||
private long delayBetweenPermitsNs() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value returned by the method is a constant for each instance. It should be safe to calculate once at the construction time.
*/ | ||
package org.apache.kafka.common.utils; | ||
|
||
public class FixedRateLimiter implements RateLimiter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you mark it as @NotThreadSafe
? There might be other use cases later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find @NotThreadSafe
. I could make it thread safe by making tryAcquire()
synchronized
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may not remember the name correctly. Making a comment should be enough.
|
||
import static org.junit.Assert.assertEquals; | ||
|
||
public class ExceptionMapTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
@@ -277,6 +276,10 @@ public class ProducerConfig extends AbstractConfig { | |||
|
|||
public static final String LI_UPDATE_METADATA_LAST_REFRESH_TIME_UPON_NODE_DISCONNECT_CONFIG = CommonClientConfigs.LI_UPDATE_METADATA_LAST_REFRESH_TIME_UPON_NODE_DISCONNECT_CONFIG; | |||
|
|||
public static final String IO_THREAD_EXCEPTION_LOG_FREQUENCY_CONFIG = "io.thread.exception.log.frequency"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When searching for frequency
on the Apache Kafka doc, people tends to use *.interval.[ms,secods,...]>
over hertz
. Can we follow the convention?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
second this. Frequency is not as straightforward for users/readers as intervals
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
return ""; | ||
} | ||
|
||
return Utils.stackTrace(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is expensive... it is good to protect the logs, but we may still pay a high CPU price for all these exceptions getting muted. Could it be good enough if the map's key was just the exception's message, or else the exception's message concatenated with the cause's message (only one cause deep, and only if there is one)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point about CPU usage, especially during a downstream outage. Currently we use stacktrace because it is extremely specific and covers all edge cases. However, if the log messages and exception types are diverse enough, I think the following may be sufficient:
- Exception type
- Exception message
- Cause type
- Cause message
This would allow deduplication of the following scenarios:
- A specific cause wrapped in a catch-all Exception
- Conversion of checked exceptions into unchecked exceptions
- Nonspecific error messages
This is also still cheap compared to stacktrace computation, as all fields are already computed.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM. Thanks!
public static final String IO_THREAD_EXCEPTION_LOG_FREQUENCY_DOC = "The frequency of logging uncaught exceptions from the producer I/O thread. The value is in hertz (hz). If the value is less than or equal to 0, all exceptions will be logged."; | ||
public static final double DEFAULT_IO_THREAD_EXCEPTION_LOG_FREQUENCY = 0; | ||
public static final String IO_THREAD_EXCEPTION_LOG_INTERVAL_MS_CONFIG = "io.thread.exception.log.interval.ms"; | ||
public static final String IO_THREAD_EXCEPTION_LOG_INTERVAL_MS_DOC = "The minimum time is milliseconds between logging identical uncaught exceptions from the producer I/O thread. If the value is less than or equal to 0, all exceptions will be logged."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
time is -> time in
2. fix typo
Fixes LIKAFKA-62226
Certain issues, such as DNS resolution failures, can cause the producer IO thread to spew log messages rapidly. This can cause disks to fill and hide other issues. This commit addresses this issue by adding a producer configuration that throttles exception log messages from the top of the producer IO thread. The throttling is a simple cooldown-based rate limiter, and is applied per-exception. Exceptions are differentiated based on their
printStackTrace
result, which covers exception type, construction point, causes, etc.Committer Checklist (excluded from commit message)