DNS queries with unknown datacenters can cause excessive load on consul servers and force agents to run out of file descriptors #807

Closed
primal-github opened this issue Mar 20, 2015 · 6 comments
Labels
type/bug Feature does not function as expected

@primal-github

If a consul agent receives DNS queries of the form someservice.service.falsedc.domain.consul, these queries cause excessive load on the consul servers, along with log lines of the form [WARN] consul.rpc: RPC request for DC 'falsedc', no path found. At a glance it seems like the server should fail early when it cannot find the datacenter, but instead it recurses until the request's TTL is reached and the request is dropped.

The agent that received the query also shows log lines of the form [ERR] dns: rpc error: rpc error: No path to datacenter. If the agent receives these queries at a moderate rate, it eventually runs out of file descriptors. I suspect that a new socket is opened for each pending query. That is not necessarily bad, since responses should be fast, but the slow failure described in the first part of this issue means consul keeps opening sockets until it cannot open any more. The errors from this scenario also cause the consul agent to write gigabytes of logs within minutes.

The issue can be replicated on a Linux system which has the consul agent set as its nameserver (e.g. via binding to port 53 or via dnsmasq) by adding domain.consul to the search domains in /etc/resolv.conf (e.g. search domain.consul) and running queries of the format someservice.service.domain.consul, which get expanded by the resolver to someservice.service.domain.consul.domain.consul. However, I'm fairly certain that this is just a special case, and that the issue should be reproducible with any nonexistent datacenter and any consul domain.
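
The search-domain expansion in the reproduction steps can be sketched roughly as follows. This is a simplified model of how a stub resolver builds its list of candidate names, not the actual glibc logic; the function name candidate_names and the ndots default of 1 are assumptions made for illustration.

```python
def candidate_names(name, search_domains, ndots=1):
    """Return the names a stub resolver would try, in order (simplified model)."""
    # A trailing dot marks the name as fully qualified: no search list is applied.
    if name.endswith("."):
        return [name.rstrip(".")]

    expanded = [f"{name}.{domain}" for domain in search_domains]
    # Names with at least `ndots` dots are tried as-is first; otherwise the
    # search list is tried first and the bare name last.
    if name.count(".") >= ndots:
        return [name] + expanded
    return expanded + [name]


# Hypothetical setup matching the report: "search domain.consul" in resolv.conf.
for candidate in candidate_names("someservice.service.domain.consul", ["domain.consul"]):
    print(candidate)
```

Under this model, when the as-is lookup does not produce an answer, the resolver moves on to the expanded name, so the agent sees extra labels between service and the consul domain. Querying with a trailing dot would skip the search list entirely.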

@armon
Member

armon commented Mar 24, 2015

Hmm interesting. All of this is mostly expected behavior with the exception of running out of file descriptors. I'm going to tag this as a bug to investigate that issue.

@armon armon added the type/bug Feature does not function as expected label Mar 24, 2015
@armon armon added this to the 0.5.1 milestone Apr 9, 2015
@armon
Member

armon commented May 5, 2015

@primal-github I think this was actually caused by an unrelated issue in the connection pooling between servers. If an RPC returned an error, the connection would not be reused. In this case, an invalid domain would always cause an error, so each query would start a new internal connection. This looks to be resolved in master!
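
The pooling behavior described in this comment can be illustrated with a small sketch. It is a toy model, not Consul's actual pool: FakeConn, Pool, the reuse_on_error flag, and the "dc1" datacenter name are all hypothetical names introduced here, and a discarded connection is simply dropped rather than holding a real socket.

```python
class FakeConn:
    """Stand-in for a server-to-server connection (hypothetical)."""

    opened = 0  # counts how many connections were ever created

    def __init__(self):
        FakeConn.opened += 1

    def call(self, datacenter):
        # Model an application-level RPC error on an otherwise healthy connection.
        if datacenter != "dc1":
            raise LookupError("No path to datacenter")
        return "ok"


class Pool:
    def __init__(self, reuse_on_error):
        self.idle = []
        self.reuse_on_error = reuse_on_error

    def rpc(self, datacenter):
        # Reuse an idle connection if one exists, otherwise open a new one.
        conn = self.idle.pop() if self.idle else FakeConn()
        try:
            result = conn.call(datacenter)
        except LookupError:
            # The RPC failed, but the connection itself is still usable.
            if self.reuse_on_error:
                self.idle.append(conn)
            raise
        self.idle.append(conn)
        return result


def connections_opened(reuse_on_error, queries=100):
    """Run `queries` failing lookups and report how many connections were opened."""
    FakeConn.opened = 0
    pool = Pool(reuse_on_error)
    for _ in range(queries):
        try:
            pool.rpc("falsedc")
        except LookupError:
            pass
    return FakeConn.opened


print(connections_opened(reuse_on_error=False))
print(connections_opened(reuse_on_error=True))
```

With the assumed 100 failing queries, the pool that discards connections on error opens 100 of them, while the pool that returns them after an error opens only 1. In a real agent each extra connection would hold a file descriptor, which fits the symptom reported above.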

@armon
Member

armon commented May 5, 2015

I'm just closing for now, but please comment / re-open if you see this again!

@armon armon closed this as completed May 5, 2015
@frankfarmer

Might this be related to #688 ?

@primal-github
Author

@frankfarmer In this case they were both co-occurring. As @armon mentioned, this was likely caused by the lack of connection reuse, which may in turn have led to excessive file descriptor usage. We addressed the cause (fixed our DNS lookups), so we haven't had the urge to replicate it again.

@igoratencompass

Interesting, because I'm seeing the same warnings arriving at tens per second, except in my case the DC has the correct name. This is on 0.9.3.
