[LI-HOTFIX] Reduce exception log spam in producer IO thread #525

Merged
merged 3 commits into linkedin:2.4-li on Jan 24, 2025

Conversation

@dtwitty commented Jan 16, 2025

Fixes LIKAFKA-62226

Certain issues, such as DNS resolution failures, can cause the producer IO thread to emit log messages rapidly. This can fill disks and hide other problems. This commit addresses the issue by adding a producer configuration that throttles exception log messages from the top of the producer IO thread. The throttling is a simple cooldown-based rate limiter and is applied per exception. Exceptions are differentiated based on their printStackTrace result, which covers the exception type, construction point, causes, etc.
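
For illustration only, the throttling idea is roughly the following simplified sketch; the names here are hypothetical, while the actual patch uses a FixedRateLimiter plus a per-exception map:

import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of per-exception cooldown throttling; not the actual patch.
public class ExceptionLogThrottler {
    private final long cooldownNs;                        // minimum gap between logs of the same exception
    private final Map<String, Long> lastLoggedNs = new HashMap<>();

    public ExceptionLogThrottler(long cooldownNs) {
        this.cooldownNs = cooldownNs;
    }

    // Returns true if this exception should be logged now.
    public boolean shouldLog(Throwable t, long nowNs) {
        if (cooldownNs <= 0) {
            return true;                                  // throttling disabled: log everything
        }
        String key = keyFor(t);
        Long last = lastLoggedNs.get(key);
        if (last == null || nowNs - last >= cooldownNs) {
            lastLoggedNs.put(key, nowNs);
            return true;
        }
        return false;
    }

    // Key on the printStackTrace output, which distinguishes exception type,
    // construction point, and causes.
    private static String keyFor(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw));
        return sw.toString();
    }
}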

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@dtwitty requested review from kehuum and Yellow-Rice on January 16, 2025 at 23:37
return false;
}

private long delayBetweenPermitsNs() {

The value returned by this method is a constant for each instance. It should be safe to calculate it once at construction time.
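
A minimal sketch of that suggestion, with hypothetical constructor and field names (the actual FixedRateLimiter in this PR may differ):

// Sketch only: the per-permit delay is constant for a given instance, so it
// can be computed once in the constructor instead of on every call.
public class FixedRateLimiter {
    private final long delayBetweenPermitsNs;   // precomputed at construction
    private long nextPermitNs = 0;

    public FixedRateLimiter(double permitsPerSecond) {
        this.delayBetweenPermitsNs = permitsPerSecond <= 0
                ? 0
                : (long) (1_000_000_000L / permitsPerSecond);
    }

    public boolean tryAcquire() {
        long nowNs = System.nanoTime();          // assumption: time source is nanoTime
        if (delayBetweenPermitsNs <= 0 || nowNs >= nextPermitNs) {
            nextPermitNs = nowNs + delayBetweenPermitsNs;
            return true;
        }
        return false;
    }
}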

*/
package org.apache.kafka.common.utils;

public class FixedRateLimiter implements RateLimiter {

Can you mark it as @NotThreadSafe? There might be other use cases later.

@dtwitty (Author):

I can't find @NotThreadSafe. I could make it thread safe by making tryAcquire() synchronized.

I may not remember the name correctly. Making a comment should be enough.
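
For example, the constraint could simply be documented in the class javadoc, roughly along these lines (illustrative wording, not the exact text in this PR):

/**
 * A fixed-rate limiter based on a cooldown between permits.
 *
 * <p>Note: this class is not thread-safe; it is intended to be used from a
 * single thread (the producer IO thread).
 */
public class FixedRateLimiter {
    // rate-limiting state and tryAcquire() as in the diff above
}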


import static org.junit.Assert.assertEquals;

public class ExceptionMapTest {

Nice!

@@ -277,6 +276,10 @@ public class ProducerConfig extends AbstractConfig {

public static final String LI_UPDATE_METADATA_LAST_REFRESH_TIME_UPON_NODE_DISCONNECT_CONFIG = CommonClientConfigs.LI_UPDATE_METADATA_LAST_REFRESH_TIME_UPON_NODE_DISCONNECT_CONFIG;

public static final String IO_THREAD_EXCEPTION_LOG_FREQUENCY_CONFIG = "io.thread.exception.log.frequency";

When searching for frequency in the Apache Kafka docs, people tend to use *.interval.[ms, seconds, ...] over hertz. Can we follow the convention?

Second this. Frequency is not as straightforward for users/readers as intervals.

@dtwitty (Author):

Done!

return "";
}

return Utils.stackTrace(e);

That is expensive... it is good to protect the logs, but we may still pay a high CPU price for all these exceptions getting muted. Could it be good enough if the map's key was just the exception's message, or else the exception's message concatenated with the cause's message (only one cause deep, and only if there is one)?

@dtwitty (Author):

That's a good point about CPU usage, especially during a downstream outage. Currently we use the stack trace because it is extremely specific and covers all edge cases. However, if the log messages and exception types are diverse enough, I think the following may be sufficient:

  • Exception type
  • Exception message
  • Cause type
  • Cause message

This would allow deduplication of the following scenarios:

  • A specific cause wrapped in a catch-all Exception
  • Conversion of checked exceptions into unchecked exceptions
  • Nonspecific error messages

This is also still cheap compared to stacktrace computation, as all fields are already computed.
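
For illustration, a key built from those fields could look roughly like this (hypothetical helper; not the exact code in this PR):

// Sketch: cheap deduplication key from already-computed fields.
private static String dedupKey(Throwable t) {
    StringBuilder sb = new StringBuilder();
    sb.append(t.getClass().getName()).append(':').append(t.getMessage());
    Throwable cause = t.getCause();
    if (cause != null) {
        // Only one cause deep, and only if there is one.
        sb.append('|')
          .append(cause.getClass().getName())
          .append(':')
          .append(cause.getMessage());
    }
    return sb.toString();
}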

WDYT?


SGTM. Thanks!

public static final String IO_THREAD_EXCEPTION_LOG_FREQUENCY_DOC = "The frequency of logging uncaught exceptions from the producer I/O thread. The value is in hertz (hz). If the value is less than or equal to 0, all exceptions will be logged.";
public static final double DEFAULT_IO_THREAD_EXCEPTION_LOG_FREQUENCY = 0;
public static final String IO_THREAD_EXCEPTION_LOG_INTERVAL_MS_CONFIG = "io.thread.exception.log.interval.ms";
public static final String IO_THREAD_EXCEPTION_LOG_INTERVAL_MS_DOC = "The minimum time is milliseconds between logging identical uncaught exceptions from the producer I/O thread. If the value is less than or equal to 0, all exceptions will be logged.";

time is -> time in

2. fix typo
@Q1Liu merged commit e15c196 into linkedin:2.4-li Jan 24, 2025
21 checks passed