DNS queries with unknown datacenters can cause excessive load on consul servers and force agents to run out of file descriptors #807

Closed
@primal-github

Description

If a consul agent receives DNS queries of the form someservice.service.falsedc.domain.consul, these queries cause excessive load on the consul servers, along with log lines of the form [WARN] consul.rpc: RPC request for DC 'falsedc', no path found. At a glance, it seems the server should fail early when it cannot find the datacenter, but instead it recurses until the request's TTL is reached and the request is dropped.

The agent that received the query also shows log lines of the form [ERR] dns: rpc error: rpc error: No path to datacenter. If the agent receives these queries at a moderate rate, it eventually runs out of file descriptors. I suspect that a new socket is opened for each pending query. This is not necessarily bad, as responses should be fast, but the first part of this issue causes consul to open more and more sockets until it can't open any more. The errors from this scenario also cause the consul agent to write gigabytes of logs within minutes.

The issue can be replicated on a Linux system that uses the consul agent as its nameserver (e.g. by binding to port 53 or via dnsmasq) by adding domain.consul to the search domains in /etc/resolv.conf (e.g. search domain.consul) and running queries of the form someservice.service.domain.consul, which the resolver expands to someservice.service.domain.consul.domain.consul. However, I'm fairly certain that this is just a special case and that the issue should be reproducible with any nonexistent datacenter and any consul domain.
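To make the reproduction concrete, here is a minimal sketch of how a stub resolver builds its list of candidate names from the search list. It is a simplified model of the usual resolv.conf search/ndots behavior, not the actual resolver code; the function name candidate_names and its parameters are hypothetical and chosen only for illustration.

```python
def candidate_names(name, search_domains, ndots=1):
    """Return the fully qualified names a stub resolver would try, in order.

    Simplified model (an assumption, not real resolver code):
    - a trailing dot marks the name as absolute, so no search domain is added;
    - a name with at least `ndots` dots is tried as-is first, then with each
      search domain appended;
    - a name with fewer dots is tried with the search domains first.
    """
    if name.endswith("."):
        return [name]

    as_is = name + "."
    expanded = [f"{name}.{domain.strip('.')}." for domain in search_domains]

    if name.count(".") >= ndots:
        return [as_is] + expanded
    return expanded + [as_is]


if __name__ == "__main__":
    # The query from the report, with "search domain.consul" in resolv.conf.
    for candidate in candidate_names(
        "someservice.service.domain.consul", ["domain.consul"]
    ):
        print(candidate)
```

Under this model the doubled name someservice.service.domain.consul.domain.consul is the second candidate, so it is only sent when the first lookup does not produce an answer. Writing the query with a trailing dot avoids the expansion entirely.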

Activity

armon (Member) commented on Mar 24, 2015

Hmm interesting. All of this is mostly expected behavior with the exception of running out of file descriptors. I'm going to tag this as a bug to investigate that issue.

added the type/bug label (Feature does not function as expected) on Mar 24, 2015
added this to the 0.5.1 milestone on Apr 9, 2015

armon (Member) commented on May 5, 2015

@primal-github I think this was actually caused by an unrelated issue in the connection pooling between servers. If an RPC returned an error, the connection would not be reused. In this case, an invalid domain would always cause an error, so each query would start a new internal connection. This looks to be resolved in master!
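The effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Consul's pooling code: ToyPool, RPCError, no_path_handler and connections_opened are hypothetical names, and a "connection" is just a placeholder object rather than a real socket.

```python
class RPCError(Exception):
    """An application-level RPC error; the underlying connection is still healthy."""


class ToyPool:
    """Toy connection pool that counts how many connections it has opened."""

    def __init__(self, reuse_after_rpc_error):
        self.reuse_after_rpc_error = reuse_after_rpc_error
        self.idle = []    # connections available for reuse
        self.opened = 0   # total connections ever opened

    def _acquire(self):
        if self.idle:
            return self.idle.pop()
        self.opened += 1
        return object()   # stands in for a real socket

    def call(self, handler, request):
        conn = self._acquire()
        try:
            result = handler(request)
        except RPCError:
            # Modelled bug: the connection is dropped after an RPC error
            # instead of being returned to the pool.
            if self.reuse_after_rpc_error:
                self.idle.append(conn)
            raise
        self.idle.append(conn)
        return result


def no_path_handler(request):
    # Every query for an unknown datacenter fails the same way.
    raise RPCError("No path to datacenter")


def connections_opened(pool, queries):
    """Send `queries` failing requests and report how many connections were opened."""
    for _ in range(queries):
        try:
            pool.call(no_path_handler, "falsedc")
        except RPCError:
            pass
    return pool.opened


if __name__ == "__main__":
    print(connections_opened(ToyPool(reuse_after_rpc_error=False), 100))
    print(connections_opened(ToyPool(reuse_after_rpc_error=True), 100))
```

In this model, a pool that discards connections after an RPC error opens one connection per failing query, while a pool that returns them opens a single connection. With real sockets that are slow to close, the first pattern would plausibly account for the file descriptor exhaustion reported here.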

armon (Member) commented on May 5, 2015

I'm just closing for now, but please comment / re-open if you see this again!

frankfarmer commented on May 29, 2015

Might this be related to #688 ?

primal-github (Author) commented on May 29, 2015

@frankfarmer In this case they were both co-occurring. As @armon mentioned, this was likely caused by the lack of connection reuse, which may in turn have triggered the excessive file descriptor usage. We addressed the cause (fixed our DNS lookups), so we haven't had the urge to replicate it again.

igoratencompass commented on Jul 6, 2018

Interesting, because I'm seeing the same warnings at a rate of tens per second, except in my case the DC has the correct name. This is on 0.9.3.

added a commit that references this issue on Oct 24, 2021

          DNS queries with unknown datacenters can cause excessive load on consul servers and force agents to run out of file descriptors · Issue #807 · hashicorp/consul