Description
If a consul agent receives DNS queries of the form someservice.service.falsedc.domain.consul, these queries cause excessive load on the consul servers, along with log lines of the form [WARN] consul.rpc: RPC request for DC 'falsedc', no path found. At a glance it seems like the server should fail early when it cannot find the datacenter, but instead it recurses until the request's TTL is reached and the request is dropped.
The agent that received the query will show log lines of the form [ERR] dns: rpc error: rpc error: No path to datacenter. Furthermore, if the agent receives these queries at a moderate rate it will eventually run out of file descriptors. I suspect that a new socket is opened for each pending query. This is not necessarily bad, as responses should be fast, but the first part of this issue causes consul to open more and more sockets until it can't open any more. The errors from this scenario also cause the consul agent to write gigabytes of logs within minutes.
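To check the file descriptor symptom described above, something like the following could be used on Linux. It is only a rough sketch: the helper names are made up for illustration, it relies on the /proc filesystem, and fd_soft_limit() reports the limit of the process running the script, not of the consul agent.

```python
import os
import resource


def open_fd_count(pid="self"):
    """Count open file descriptors of a process by listing /proc/<pid>/fd.

    Returns None where /proc is not available (non-Linux systems).
    Reading another process's fd directory may require extra privileges.
    """
    fd_dir = f"/proc/{pid}/fd"
    if not os.path.isdir(fd_dir):
        return None
    return len(os.listdir(fd_dir))


def fd_soft_limit():
    """Soft limit on open files for the *current* process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    print("open fds:", open_fd_count(), "soft limit:", fd_soft_limit())
```

Passing the agent's PID to open_fd_count() while the bad queries are arriving should show whether the count keeps climbing toward the agent's limit.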
The issue can be replicated on a Linux system which has the consul agent set as its nameserver (e.g. via binding to port 53 or via dnsmasq) by adding domain.consul to the search domains in /etc/resolv.conf (e.g. search domain.consul) and running queries of the format someservice.service.domain.consul, which get expanded by the resolver to someservice.service.domain.consul.domain.consul. However I'm fairly certain that this is just a special case, and that the issue should be reproducible with any nonexisting datacenter and any consul domain.
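The search-domain expansion in the reproduction steps can be sketched roughly as follows. This is a simplified model of the search-list rules from resolv.conf(5), not the actual resolver code, and labels_after_service is a hypothetical helper that only illustrates which labels end up in the position where a datacenter name would normally go. It is not Consul's real query parser.

```python
def resolver_candidates(name, search_domains, ndots=1):
    """Return the names a stub resolver would try, in order (simplified)."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    expanded = [f"{name}.{domain}" for domain in search_domains]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [name] + expanded
    return expanded + [name]


def labels_after_service(name, consul_domain):
    """Hypothetical helper: return the labels between '.service.' and the
    Consul domain suffix, or None if the name is outside that domain."""
    name = name.rstrip(".")
    suffix = "." + consul_domain
    if not name.endswith(suffix):
        return None
    head = name[: -len(suffix)]
    _, sep, rest = head.partition(".service.")
    return rest if sep else ""


if __name__ == "__main__":
    for candidate in resolver_candidates(
        "someservice.service.domain.consul", ["domain.consul"]
    ):
        print(candidate, "->", repr(labels_after_service(candidate, "domain.consul")))
```

In this model the search-list candidate is only tried after the plain name, so the expanded query shows up when the first lookup does not return an answer. Writing the name with a trailing dot avoids the expansion altogether.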
Activity
armon commented on Mar 24, 2015
Hmm interesting. All of this is mostly expected behavior with the exception of running out of file descriptors. I'm going to tag this as a bug to investigate that issue.
armon commented on May 5, 2015
@primal-github I think this was actually caused by an unrelated issue in the connection pooling between servers. If an RPC returned an error, the connection would not be reused. In this case, an invalid domain would always cause an error, so each query would start a new internal connection. This looks to be resolved in master!
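The effect described here can be illustrated with a toy model. This is not Consul's pooling code: ToyPool and its reuse_after_error flag are invented for the sketch, and connections are plain strings rather than sockets. It only shows why a stream of always-failing requests opens one connection per query when errored connections are not returned to the pool.

```python
class ToyPool:
    """Toy connection pool that counts how many connections it opens."""

    def __init__(self, reuse_after_error):
        self.reuse_after_error = reuse_after_error
        self.idle = []    # connections available for reuse
        self.opened = 0   # total connections ever opened

    def _acquire(self):
        if self.idle:
            return self.idle.pop()
        self.opened += 1
        return f"conn-{self.opened}"

    def request(self, fails):
        conn = self._acquire()
        if fails and not self.reuse_after_error:
            # Modeled old behavior: a connection that saw an error is
            # discarded instead of being returned to the pool.
            return
        self.idle.append(conn)


def connections_opened(reuse_after_error, n_queries=100):
    """Run n_queries requests that all fail and report connections opened."""
    pool = ToyPool(reuse_after_error)
    for _ in range(n_queries):
        pool.request(fails=True)
    return pool.opened


if __name__ == "__main__":
    print("without reuse after error:", connections_opened(False))
    print("with reuse after error:", connections_opened(True))
```

With reuse disabled the count grows with the number of failing queries, while with reuse enabled a single connection serves all of them.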
armon commented on May 5, 2015
I'm just closing for now, but please comment / re-open if you see this again!
frankfarmer commented on May 29, 2015
Might this be related to #688 ?
primal-github commented on May 29, 2015
@frankfarmer In this case they were both co-occurring. As @armon mentioned, this was likely caused by the lack of connection reuse, which may in turn have triggered the excessive file descriptor usage. We addressed the cause (fixed our DNS lookups), so we haven't had the urge to replicate it again.
igoratencompass commented on Jul 6, 2018
Interesting, because I'm seeing the same warnings coming in tens per second except in my case the DC has the correct name. This is on 0.9.3