add more specific metrics and logging for DNS lookup failures #2179

Merged
merged 14 commits into from
Feb 28, 2024

@bjlaub bjlaub commented Feb 27, 2024

Before this PR

DNS lookups that result in an UnknownHostException are logged, but the exception message is logged as an unsafe argument, which can make it hard to determine what the actual failure was. Under the hood, the error code returned by getaddrinfo() is converted to a string via gai_strerror and used as the message of the generated UnknownHostException, so the message can be useful in determining what type of failure actually triggered the exception.

After this PR

Try to match strings generated by gai_strerror against the exception message, and safely log only the EAI_* error types and the corresponding messages from the operating system. Additionally, we report a meter metric tagged with the error type (e.g. EAI_NONAME when a DNS lookup fails because the name is not actually known to the nameserver).

==COMMIT_MSG==
add more specific metrics and logging for DNS lookup failures
==COMMIT_MSG==

Possible downsides?

@bjlaub bjlaub requested a review from carterkozak February 27, 2024 21:56
changelog-app bot commented Feb 27, 2024

Generate changelog in changelog/@unreleased

What do the change types mean?
  • feature: A new feature of the service.
  • improvement: An incremental improvement in the functionality or operation of the service.
  • fix: Remedies the incorrect behaviour of a component of the service in a backwards-compatible way.
  • break: Has the potential to break consumers of this service's API, inclusive of both Palantir services
    and external consumers of the service's API (e.g. customer-written software or integrations).
  • deprecation: Advertises the intention to remove service functionality without any change to the
    operation of the service itself.
  • manualTask: Requires the possibility of manual intervention (running a script, eyeballing configuration,
    performing database surgery, ...) at the time of upgrade for it to succeed.
  • migration: A fully automatic upgrade migration task with no engineer input required.

Note: only one type should be chosen.

How are new versions calculated?
  • ❗The break and manual task changelog types will result in a major release!
  • 🐛 The fix changelog type will result in a minor release in most cases, and a patch release version for patch branches. This behaviour is configurable in autorelease.
  • ✨ All others will result in a minor version release.

Description

try to log error messages when getaddrinfo fails

log.warn(
        "Unknown host '{}'",
        SafeArg.of("gaiErrorMessage", gaiError.getErrorMessage()),
        SafeArg.of("gaiErrorType", gaiError.getErrorType()),
Contributor

Thoughts on adding a meter metric tagged with the errorType so we can keep an eye on things more broadly?

Contributor Author

yeah i think that makes sense, will add
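
For readers following the thread, here is a minimal sketch of what "a meter tagged with the errorType" amounts to. It is only an illustration: `DnsFailureMeters` is a hypothetical stand-in built on plain JDK classes, not the tagged metric registry API the PR actually uses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stand-in for a tagged meter; not the metrics API used in the PR. */
final class DnsFailureMeters {
    // One counter per error type tag, e.g. "EAI_NONAME" or "CACHED".
    private final Map<String, LongAdder> countsByErrorType = new ConcurrentHashMap<>();

    /** Records one DNS lookup failure under the given error type tag. */
    void mark(String errorType) {
        countsByErrorType.computeIfAbsent(errorType, key -> new LongAdder()).increment();
    }

    /** Returns how many failures were recorded for the tag, or 0 if none were. */
    long count(String errorType) {
        LongAdder adder = countsByErrorType.get(errorType);
        return adder == null ? 0L : adder.sum();
    }
}
```

Keeping one counter per tag makes it possible to tell a spike in temporary resolver failures apart from a steady trickle of unknown names.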


private static ExtractedGaiError extractGaiErrorString(UnknownHostException exception) {
    try {
        for (Map.Entry<String, String> entry : EXPECTED_GAI_ERROR_STRINGS.entrySet()) {
Contributor

@carterkozak carterkozak Feb 27, 2024

Should we account for cached failures with an initial check that the message doesn't exactly match the input hostname value? (Perhaps return ImmutableExtractedGaiError.of("cached", "cached"); in that case)

Comment on lines 98 to 116
StackTraceElement[] trace = exception.getStackTrace();
if (trace.length > 0) {
    StackTraceElement top = trace[0];
    if ("java.net.InetAddress$CachedLookup".equals(top.getClassName())) {
        return ImmutableExtractedGaiError.of("cached", "cached");
    }
Contributor

I think the cached case failures can be checked most easily using Objects.equals(requestedHostname, exception.getMessage()) -- either way we should add a test for this.
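
A minimal sketch of the check suggested here, assuming (as the comment implies) that a cached failure carries only the requested hostname as its message. The class and method names are hypothetical, not the ones in the PR.

```java
import java.net.UnknownHostException;
import java.util.Objects;

/** Hypothetical helper illustrating the suggested cached-failure check. */
final class CachedFailureCheck {
    private CachedFailureCheck() {}

    /**
     * Treats the failure as cached when the exception message is exactly the
     * requested hostname, i.e. it carries no gai_strerror text after it.
     */
    static boolean isCachedFailure(UnknownHostException exception, String requestedHostname) {
        // Objects.equals is null-safe: a null message never matches a non-null hostname.
        return Objects.equals(requestedHostname, exception.getMessage());
    }
}
```

Compared with inspecting the top stack frame, this does not depend on the name of a JDK-internal class.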

return ImmutableExtractedGaiError.of("cached", "cached");
}

for (Map.Entry<String, String> entry : EXPECTED_GAI_ERROR_STRINGS.entrySet()) {
Contributor

I don't think this loop needs to be predicated on if (trace.length > 0) {

Comment on lines 93 to 108
if (exception == null) {
    return ImmutableExtractedGaiError.of("null", "null");
}
Contributor

We probably don't need to handle null here since we're calling from our own code on catch (UnknownHostException

Comment on lines 83 to 84
@Value.Immutable
interface ExtractedGaiError {
Contributor

I think we can combine the map and type here into a single enum e.g.

enum DNS_ERROR {
    EAI_ADDRFAMILY("Address family for hostname not supported"),
    EAI_AGAIN("Temporary failure in name resolution"),
    /*etc*/
    CACHED(), // explicitly avoid setting a substring matcher for 'cached' and 'unknown' special cases
    UNKNOWN()
}
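
The suggestion above could be fleshed out roughly as follows. This is a sketch under stated assumptions rather than the code that was merged: the name `DnsError`, the `fromMessage` helper, and the `EAI_NONAME` string (glibc's wording; other platforms word it differently) are illustrative additions, while the first two strings come from the comment itself.

```java
/** Illustrative enum pairing each EAI_* type with the message substring to match. */
enum DnsError {
    EAI_ADDRFAMILY("Address family for hostname not supported"),
    EAI_AGAIN("Temporary failure in name resolution"),
    // Assumption: glibc's gai_strerror wording for EAI_NONAME.
    EAI_NONAME("Name or service not known"),
    // Special cases deliberately have no substring matcher.
    CACHED(),
    UNKNOWN();

    // Null for the special cases that are never matched by substring.
    private final String matcher;

    DnsError(String matcher) {
        this.matcher = matcher;
    }

    DnsError() {
        this.matcher = null;
    }

    /** Returns the first constant whose matcher occurs in the message, else UNKNOWN. */
    static DnsError fromMessage(String message) {
        if (message == null) {
            return UNKNOWN;
        }
        for (DnsError error : values()) {
            if (error.matcher != null && message.contains(error.matcher)) {
                return error;
            }
        }
        return UNKNOWN;
    }
}
```

For example, `DnsError.fromMessage("bad.example: Name or service not known")` yields `EAI_NONAME`, while a message with no recognized substring falls through to `UNKNOWN`. Deciding when to report `CACHED` is left to the caller, since it has no matcher of its own.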

@bjlaub bjlaub force-pushed the blaub/log-getaddrinfo-error-strings branch from 292e6a9 to ca9e483 February 28, 2024 17:43
@@ -68,6 +68,8 @@ Dialogue DNS metrics.
- `success`: DNS resolution succeeded using `InetAddress.getAllByName`.
- `fallback`: DNS resolution using the primary mechanism failed, however addresses were available in the fallback cache.
- `failure`: No addresses could be resolved for the given hostname.
- `client.dns.lookupError` (meter): DNS resolver query failures.
Contributor

let's match the failure wording above, perhaps naming this client.dns.failure

Contributor Author

fixed

}
}

private static GaiError extractGaiErrorString(UnknownHostException exception) {
Contributor

perhaps extractGaiError since this returns GaiError, not String

Contributor Author

done

TaggedMetricRegistry registry = new DefaultTaggedMetricRegistry();
DialogueDnsResolver resolver = new DefaultDialogueDnsResolver(registry);

String badHost = "alksdjflajsdlkfjalksjflkadjsf.com";
Contributor

Anyone could pay ~$10 to break this test.

Let's use something like:

String badHost = UUID.randomUUID() + ".palantir.com";

Contributor Author

fixed

Contributor

I think that only went into one of the two tests using this pattern, we need to update both

ImmutableSet<InetAddress> result2 = resolver.resolve(badHost);
assertThat(result2).isEmpty();
ClientDnsMetrics metrics = ClientDnsMetrics.of(registry);
assertThat(metrics.lookupError("CACHED").getCount()).isGreaterThan(0);
Contributor

isEqualTo?

Contributor Author

done

String badHost = "alksdjflajsdlkfjalksjflkadjsf.com";
ImmutableSet<InetAddress> result = resolver.resolve(badHost);

assertThat(result).isEmpty();
Contributor

let's add assertThat(metrics.lookupError("EAI_NONAME").getCount()).isEqualTo(1); in addition to this assert

Contributor Author

added

@bjlaub bjlaub changed the title from "try to log error messages when getaddrinfo fails" to "add metrics and logging for DNS lookup failures" Feb 28, 2024
@bjlaub bjlaub changed the title from "add metrics and logging for DNS lookup failures" to "add more specific metrics and logging for DNS lookup failures" Feb 28, 2024
// should resolve from cache
ImmutableSet<InetAddress> result2 = resolver.resolve(badHost);
assertThat(result2).isEmpty();
assertThat(metrics.failure("CACHED").getCount()).isGreaterThan(0);
Contributor

.isEqualTo(1);

log.warn("Unknown host '{}'", UnsafeArg.of("hostname", hostname), e);
GaiError gaiError = extractGaiError(e, hostname);
log.warn(
"Unknown host '{}'",
Contributor

Suggested change
- "Unknown host '{}'",
+ "Unknown host '{}'. {}: {}",

@bulldozer-bot bulldozer-bot bot merged commit 409307a into develop Feb 28, 2024
6 checks passed
@bulldozer-bot bulldozer-bot bot deleted the blaub/log-getaddrinfo-error-strings branch February 28, 2024 18:48
@svc-autorelease
Collaborator

Released 3.117.0

4 participants