-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add more specific metrics and logging for DNS lookup failures #2179
Conversation
Generate changelog in
|
log.warn( | ||
"Unknown host '{}'", | ||
SafeArg.of("gaiErrorMessage", gaiError.getErrorMessage()), | ||
SafeArg.of("gaiErrorType", gaiError.getErrorType()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on adding a meter
metric tagged with the errorType
so we can keep an eye on things more broadly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i think that makes sense, will add
|
||
private static ExtractedGaiError extractGaiErrorString(UnknownHostException exception) { | ||
try { | ||
for (Map.Entry<String, String> entry : EXPECTED_GAI_ERROR_STRINGS.entrySet()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we account for cached failures with an initial check that the message doesn't exactly match the input hostname value? (Perhaps return ImmutableExtractedGaiError.of("cached", "cached");
in that case)
StackTraceElement[] trace = exception.getStackTrace(); | ||
if (trace.length > 0) { | ||
StackTraceElement top = trace[0]; | ||
if ("java.net.InetAddress$CachedLookup".equals(top.getClassName())) { | ||
return ImmutableExtractedGaiError.of("cached", "cached"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the cached case failures can be checked most easily using Objects.equals(requestedHostname, exception.getMessage())
-- either way we should add a test for this.
return ImmutableExtractedGaiError.of("cached", "cached"); | ||
} | ||
|
||
for (Map.Entry<String, String> entry : EXPECTED_GAI_ERROR_STRINGS.entrySet()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this loop needs to be predicated on if (trace.length > 0) {
if (exception == null) { | ||
return ImmutableExtractedGaiError.of("null", "null"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably don't need to handle null here since we're calling from our own code on catch (UnknownHostException
@Value.Immutable | ||
interface ExtractedGaiError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can combine the map and type here into a single enum
e.g.
enum DNS_ERROR {
EAI_ADDRFAMILY("Address family for hostname not supported"),
EAI_AGAIN("Temporary failure in name resolution"),
/*etc*/
CACHED(), // explicitly avoid setting a substring matcher for 'cached' and 'unknown' special cases
UNKNOWN()
}
292e6a9
to
ca9e483
Compare
dialogue-clients/metrics.md
Outdated
@@ -68,6 +68,8 @@ Dialogue DNS metrics. | |||
- `success`: DNS resolution succeeded using `InetAddress.getAllByName`. | |||
- `fallback`: DNS resolution using the primary mechanism failed, however addresses were available in the fallback cache. | |||
- `failure`: No addresses could be resolved for the given hostname. | |||
- `client.dns.lookupError` (meter): DNS resolver query failures. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's match the failure
wording above, perhaps naming this client.dns.failure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
} | ||
} | ||
|
||
private static GaiError extractGaiErrorString(UnknownHostException exception) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps extractGaiError
since this returns GaiError
, not String
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
TaggedMetricRegistry registry = new DefaultTaggedMetricRegistry(); | ||
DialogueDnsResolver resolver = new DefaultDialogueDnsResolver(registry); | ||
|
||
String badHost = "alksdjflajsdlkfjalksjflkadjsf.com"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyone could pay ~$10 to break this test.
Let's use something like:
String badHost = UUID.randomUUID() + ".palantir.com";
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that only went into one of the two tests using this pattern, we need to update both
…ached negative lookup
ImmutableSet<InetAddress> result2 = resolver.resolve(badHost); | ||
assertThat(result2).isEmpty(); | ||
ClientDnsMetrics metrics = ClientDnsMetrics.of(registry); | ||
assertThat(metrics.lookupError("CACHED").getCount()).isGreaterThan(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isEqualTo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
String badHost = "alksdjflajsdlkfjalksjflkadjsf.com"; | ||
ImmutableSet<InetAddress> result = resolver.resolve(badHost); | ||
|
||
assertThat(result).isEmpty(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add assertThat(metrics.lookupError("EAI_NONAME").getCount()).isEqualTo(1);
in addition to this assert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
// should resolve from cache | ||
ImmutableSet<InetAddress> result2 = resolver.resolve(badHost); | ||
assertThat(result2).isEmpty(); | ||
assertThat(metrics.failure("CACHED").getCount()).isGreaterThan(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.isEqualTo(1);
log.warn("Unknown host '{}'", UnsafeArg.of("hostname", hostname), e); | ||
GaiError gaiError = extractGaiError(e, hostname); | ||
log.warn( | ||
"Unknown host '{}'", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Unknown host '{}'", | |
"Unknown host '{}'. {}: {}", |
Released 3.117.0 |
Before this PR
DNS lookups that result in an UnknownHostException are logged, but the exception message is logged unsafely which can make it hard to determine what the actual failure was. Under the hood, the error returned by
getaddrinfo()
is converted to a string viagai_strerror
and used as the message for the generatedUnknownHostException
, which can be useful in determining what type of failure actually triggered the exception.After this PR
Try to match strings generated by
gai_strerror
against the exception message, and log only theEAI_*
error types and corresponding messages from the operating system safely. Additionally we report a meter metric with the error type (e.g.EAI_NONAME
for when a DNS lookup fails because the name is not actually known to the nameserver).==COMMIT_MSG==
add more specific metrics and logging for DNS lookup failures
==COMMIT_MSG==
Possible downsides?