[Ruler][DNS] Don't propagate no such host error if using default resolver #3257
Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
Should there be a debug log entry when this happens? 🤔
Thanks!
The ruler is not the only one using this; the Querier, other components, and other projects, including Cortex, use it too.
I wonder if this makes sense; maybe putting it behind some option would do 🤔
"No host found" is what you get when you cannot reach the DNS host, and this is definitely a configuration, server, or plain access failure. Imagine you are rolling out to a new cluster and somehow a pod cannot reach the DNS server. By masking this error, we will never know about it, right?
What exactly do you want to achieve, @OGKevin? What's the use case? (:
AFAIK, "host not found" is also returned when the DNS server is reachable. Have a look at this article, for example.
The DNS server is working fine; the DNS name just does not exist, hence the NXDOMAIN, which in Go results in a "no such host" error. The issue linked in #3186 has more context on the use case that we want to avoid. Though, this use case is ruler-specific 🤔; not sure how other components handle this.
OK, I checked a bit, and it looks like IsNotFound is indeed set when EAI_NONAME is returned by the Linux resolver (getaddrinfo), so you might be right.
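To make the IsNotFound point concrete, here is a minimal sketch of how a caller could tell a "no such host" answer apart from other DNS failures. It is not the code from this PR: the helper names isNotFound and exampleErrors are hypothetical, the errors are built by hand so that nothing touches the network, and it uses errors.As from the standard library rather than a direct type assertion.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// isNotFound reports whether err, or any error it wraps, is a *net.DNSError
// with IsNotFound set, meaning the resolver answered that the name does not exist.
func isNotFound(err error) bool {
	var dnsErr *net.DNSError
	return errors.As(err, &dnsErr) && dnsErr.IsNotFound
}

// exampleErrors builds illustrative errors by hand (hypothetical host name),
// so the sketch runs without performing any real lookup.
func exampleErrors() (notFound, timeout, wrapped error) {
	notFound = &net.DNSError{Err: "no such host", Name: "alertmanager.invalid", IsNotFound: true}
	timeout = &net.DNSError{Err: "i/o timeout", Name: "alertmanager.invalid", IsTimeout: true}
	wrapped = fmt.Errorf("lookup IP addresses: %w", notFound)
	return notFound, timeout, wrapped
}

func main() {
	notFound, timeout, wrapped := exampleErrors()
	fmt.Println(isNotFound(notFound)) // name does not exist
	fmt.Println(isNotFound(timeout))  // a different kind of DNS failure
	fmt.Println(isNotFound(wrapped))  // still detected through the %w wrapper
}
```

A timeout or an unreachable DNS server does not set IsNotFound, so a check like this would only relax the "name does not exist" case and leave other failures visible.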
Still, I would consider this an error, potentially a configuration error (imagine you made a typo when crafting a DNS target for Alertmanager).
I think the main question regarding your issue is really whether the ruler should just report the error and continue running or not. This decision might be orthogonal to this change and in fact depends on what it resolves for (key functionality or not) 🤔
Some more opinions are welcome, but I would rather stop crashing the ruler on a failure like this at startup, while still logging and instrumenting it as a failure overall.
The problem might not be a config issue only. Suppose the Alertmanagers were up and running fine, and the DNS name then becomes invalid because all pods are down: the ruler will crash-loop without any config change having been made to it. So IMO the ruler should definitely not crash because the Alertmanagers can't be resolved, since the ruler is not only responsible for evaluating alerting rules. I also agree that there should be a log entry saying that DNS resolution failed. As for the other issue, i.e. how this impacts the other components, I can't make a clear assessment. What we could indeed do is add a flag and set it to "true" by default for the ruler and "false" for the other components, so that the behaviour only changes for the ruler.
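As a rough illustration of the flag idea, the sketch below wraps a lookup function and, on a not-found answer, either returns the error or logs it and returns an empty result. Everything here is an assumption made for illustration: tolerantResolver, lookupFunc, fakeNotFound, and resolveMissing are hypothetical names, the returnErrOnNotFound field name is borrowed from the diff under review, the standard log package stands in for the project's logger, and the lookup is faked so no DNS query is made.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"net"
)

// lookupFunc has the same shape as (*net.Resolver).LookupIPAddr.
type lookupFunc func(ctx context.Context, host string) ([]net.IPAddr, error)

// tolerantResolver is a hypothetical wrapper, not the type used in the PR.
type tolerantResolver struct {
	lookup              lookupFunc
	returnErrOnNotFound bool // when false, "not found" is logged instead of returned
}

// LookupIPAddr returns every error except a tolerated "not found" answer.
func (r tolerantResolver) LookupIPAddr(ctx context.Context, host string) ([]net.IPAddr, error) {
	addrs, err := r.lookup(ctx, host)
	if err == nil {
		return addrs, nil
	}
	var dnsErr *net.DNSError
	if !errors.As(err, &dnsErr) || !dnsErr.IsNotFound || r.returnErrOnNotFound {
		return nil, fmt.Errorf("lookup IP addresses %q: %w", host, err)
	}
	// Tolerated case: keep the failure visible in the logs, but do not fail the caller.
	log.Printf("failed to lookup IP addresses host=%s err=%v", host, err)
	return nil, nil
}

// fakeNotFound simulates an NXDOMAIN answer without touching the network.
func fakeNotFound(_ context.Context, host string) ([]net.IPAddr, error) {
	return nil, &net.DNSError{Err: "no such host", Name: host, IsNotFound: true}
}

// resolveMissing resolves a made-up host name with the flag set to strict.
func resolveMissing(strict bool) ([]net.IPAddr, error) {
	r := tolerantResolver{lookup: fakeNotFound, returnErrOnNotFound: strict}
	return r.LookupIPAddr(context.Background(), "alertmanager.invalid")
}

func main() {
	for _, strict := range []bool{false, true} {
		addrs, err := resolveMissing(strict)
		fmt.Printf("strict=%v addrs=%d err=%v\n", strict, len(addrs), err)
	}
}
```

With the flag off, the caller sees an empty address list and can keep running while the log line records the failure; with the flag on, the behaviour matches a plain error return.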
Hm, a crash loop will actually apply all config changes, right?
Let me try to rephrase: the ruler will crash-loop not because of a config change made by a human to the ruler. The ruler will go into a crash loop because a dependency, Alertmanager, is down: the k8s service FQDN resolution returns an NXDOMAIN (or equivalent) and the ruler dies. Subsequently, the ruler won't start up because of the DNS resolution failure and eventually ends up in a crash loop.
Line 418 in e4941a5
Lines 792 to 798 in e4941a5
The link that is being reported as dead is not actually dead. I don't quite understand why e2e failed 🤔
The change to the dns package looks good to me.
Awesome, thanks for the offline chat. I think this makes sense, but let's make it the normal behaviour.
Some tests would be nice as well (:
pkg/discovery/dns/resolver.go
Outdated
    if dnsErr, ok := err.(*net.DNSError); !ok || !dnsErr.IsNotFound || s.returnErrOnNotFound {
        return nil, errors.Wrapf(err, "lookup IP addresses %q", host)
    }
    level.Error(s.logger).Log("msg", "failed to lookup IP addresses", "host", host, "err", err)
Let's delete this and, if anything, have one for both miekg and Go DNS.
Hmm, I don't quite understand this comment 🤔. Can you elaborate a little bit more?
I think I am fine with this consistency check, thanks!
Can you rebase? 🤗
@bwplotka I could not rebase, as I merged master into this branch before. Rebasing now would cause a headache. Do you want me to squash this into one commit?
Nah, no need, GH squashes this. Thanks, and sorry for the long discussion (:
Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
Fixes #3186
Changes
Verification