Revisit how we're handling DNS resolution for API backend domains #131

Closed
GUI opened this issue Oct 2, 2014 · 2 comments

GUI commented Oct 2, 2014

In light of #129 and #130, I think the overall way we're handling DNS needs to be revisited at some point. I'm hoping we've nailed down most of these weird edge-cases, so our current system can keep humming along, but the way we're handling DNS lookups for API backends does seem over-complicated.

A quick summary of the current approach is that we have a process that resolves all the hostnames of API backend servers. When changes are detected, it writes a new nginx config file referencing the IP addresses everything currently resolves to and reloads nginx (so nginx only references IPs, not hostnames).
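
For context, a hypothetical sketch of what that generated config looks like in spirit (the upstream name, hostname, and IP are illustrative, not the actual generated output):

```nginx
# Hypothetical sketch of a generated config: the resolver process writes out
# only the IPs it last saw for a backend hostname, never the hostname itself.
upstream api_backend_example {
    server 203.0.113.10:443;   # resolved IP, pinned until we decide to switch
}

server {
    listen 80;

    location / {
        proxy_set_header Host api-backend.example.com;
        proxy_pass https://api_backend_example;
    }
}
```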

This approach has become more complicated over time, because we make an attempt to stick with the current IP for a given host, rather than switching to a new IP as soon as we see it. This is done predominantly to deal with backends like Akamai or ELBs, where the DNS resolves to a rotating collection of IPs. If we immediately acknowledged each new IP we saw in these cases, we would basically be reloading nginx every minute or two, due to the super-short TTLs on those domains and the rotation of IPs. So generally speaking, we respect DNS TTLs, except for some of these CDN services, where the DNS seems to resolve to a rotating (but stable) collection of active IPs.

All of this complexity is obviously bad, since it's led to these weird edge-case bugs recently, so I'd like to streamline it.

Here are some additional random thoughts, ideas, and reasons we ended up where we are:

  • The reason for all this DNS resolving junk is that most proxies don't actually support resolving DNS updates internally. Basically they only resolve domain names during startup, and then they will never change the IP the backend points to until the proxy is completely restarted. With more ELB type backends cropping up, it's a frequently requested feature, but until recently it's been mostly non-existent in the main open source proxy options.
  • nginx just recently started supporting live DNS resolution in their commercial nginx plus offering. This is tempting, but it would slightly complicate the fully open source nature of API Umbrella. Still, it's something we should consider, at least for api.data.gov's use case.
  • Our own work on this started when we were using haproxy for proxying. This feature is also on their radar, but still not implemented there. I do have some desire not to tie us too heavily to nginx-specific features (or any proxy's features), since I would like to consider switching back to haproxy for routing once they implement backend keep-alives (maybe in the next version).
  • nginx does support live DNS resolution without reloads in the open source version, but only if you don't use upstreams. This might be the best candidate for solving this (see the sketch after this list). The downside is that without upstreams we could only use this when the backend is a single domain (so we couldn't support API Umbrella load balancing between two servers in this case). This largely seems reasonable to me, since these super dynamic backend domains are only used when the domain is handling load balancing itself. However, we also lose out on some additional configuration opportunities without upstream blocks, so we'd need to make sure we're not dependent on those.
  • If we stick with our current approach, it could be greatly simplified if we just always took the latest IP we've seen and reloaded nginx whenever it changes. The main reason these frequent reloads are problematic now is that our Rails web app is served out of the same nginx instance. So reloading nginx causes lots of Rails reloads, which causes slow response times as things spin back up. With the big upgrade happening in #123 (API Umbrella upgrade for production site), the nginx instances have been split up, so this is no longer the case. This might make frequent reloads of nginx okay, but we'd need to do more testing to make sure this wouldn't negatively impact active connections.
  • If we fully embrace nginx as our proxy server, we could potentially do some interesting stuff to embed the DNS logic into the server with Lua and something like lua-resty-dns. However, even then, I'm not sure that would work, since the nginx lua upstream plugin doesn't currently support modifying the upstreams on the fly (but there's talk of that happening).
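
To illustrate the upstream-less option mentioned above: when the proxy_pass hostname comes from a variable, open source nginx resolves it at request time through the configured resolver instead of once at startup. A minimal sketch (the domain and resolver IP are placeholders):

```nginx
# Minimal sketch: per-request DNS resolution without an upstream block.
# Because the hostname lives in a variable, nginx re-resolves it via
# "resolver" and honors the TTL (optionally capped with valid=).
resolver 8.8.8.8 valid=30s;

server {
    listen 80;

    location / {
        set $api_backend "api-backend.example.com";
        proxy_pass https://$api_backend;
    }
}
```

The trade-off, as noted, is that there's no upstream block here, so anything configured on upstreams (multiple servers, keepalive, etc.) isn't available.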

GUI commented Nov 6, 2014

As an update on this front, I've been looking more into this recently. While our current DNS resolving does seem to be working, it's one of the more convoluted and complex pieces of code in the system, and the fact that we're still not respecting short-lived TTLs 100% of the time makes me nervous in cloud environments.

Not using nginx upstreams is still the simplest approach, but I remembered why we do want upstreams even in the case where we only have one server: backend keepalive support. With the simple dynamic proxy_pass approach, you lose the ability to configure the things that upstream blocks support, like keepalives. In some environments this might not be a big deal, but in our environment, where we're proxying to remote servers in other data centers, this is actually more significant. I should put together more recent benchmarks to quantify this, but when I last looked into this, it definitely seemed like something we want to support if possible.
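
For reference, here's a rough sketch of the keepalive configuration we'd be giving up without upstream blocks (names and values are illustrative):

```nginx
# Rough sketch: backend keepalive is configured on the upstream block,
# which is why the dynamic proxy_pass approach can't use it.
upstream api_backend_example {
    server 203.0.113.10:443;
    keepalive 10;                    # keep up to 10 idle connections per worker
}

server {
    listen 80;

    location / {
        proxy_http_version 1.1;      # backend keepalive requires HTTP/1.1
        proxy_set_header Connection "";
        proxy_pass https://api_backend_example;
    }
}
```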

One interesting new option I stumbled upon is this new-ish upstream_jdomain plugin for nginx. It seems to promise almost exactly what we need, with asynchronous DNS lookups baked into nginx.
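
Based on my reading of the module's docs, usage would look roughly like this (the exact directive parameters may vary by version, so treat this as a sketch):

```nginx
# Rough sketch of upstream_jdomain usage: the module periodically re-resolves
# the domain and updates the upstream's server list without a reload.
resolver 8.8.8.8;

upstream api_backend_example {
    jdomain api-backend.example.com port=443 interval=10;
}
```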

In my very early testing, the stumbling block I've run into is a broader issue with nginx: nginx won't start or reload if there are hostnames in the config that fail to resolve. This is problematic for us, since we don't want to prevent reloads or prevent the system from starting back up just because old hostnames are present in the config (which has happened, since hosts agencies were testing with have been shut down without being removed from our system). I can't really find a way around this without patching nginx. There are possibly some other creative solutions, but I'm afraid those would get too complicated quickly (temporarily modifying the /etc/hosts file during startup, defining a local DNS server to resolve them, or removing invalid hosts at startup and then polling to see if they eventually come back to life so we can add them back in). Or maybe we just remove any hosts that fail to resolve at startup and leave it at that (I'd just want to be super careful to ensure this doesn't happen on a temporary lookup failure for a real hostname). Or maybe we just keep our current IP-resolving, auto-reloading approach. Anyway, still pondering all this...

In conjunction, I've also been exploring using dnsmasq as a local DNS caching server. Those tests at least bode well, and I think it could be useful to provide some redundancy and local caching for DNS lookups.
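
As a sketch of how that would plug in (assuming dnsmasq is listening on 127.0.0.1 and forwarding to the normal upstream nameservers), nginx's runtime lookups would just point at the local cache:

```nginx
# Sketch: send nginx's runtime DNS lookups through a local dnsmasq cache
# (assumes dnsmasq is bound to 127.0.0.1:53).
resolver 127.0.0.1 valid=10s;
resolver_timeout 2s;
```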

GUI commented Jan 21, 2015

The theoretical possibility of api.data.gov not properly updating for a domain with a short TTL happened today (in this case, the underlying host was behind Akamai). So we definitely need to do something to fix this properly.

Back in November, I began work on addressing the underlying issue so we can always acknowledge short TTLs. That work is mostly complete and is on the dns-overhaul branch of api-umbrella-router. However, it needs more testing.

The implementation I settled on in that branch is pretty straightforward: we reload nginx anytime there's any IP address change. Since we've decoupled the web component from this nginx instance, reloading frequently shouldn't have much of an impact, but I need to do more verification on that. In addition, one feature I had gotten rid of on that branch was falling back to the last known IP if a DNS lookup fails. Based on some logs, I think we need to bring that functionality back to deal with the possibility of brief DNS outages.
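
To make that concrete, the generated config under the simplified approach would look something like the following sketch (the upstream name and IP are illustrative): the resolver process rewrites the upstream and reloads nginx whenever the IP changes, and on a failed lookup it just keeps the last IP it wrote.

```nginx
# Hypothetical output of the simplified approach: always write the latest
# resolved IP and reload nginx on any change. If a lookup fails, the
# previously written IP stays in place until resolution succeeds again.
upstream api_backend_example {
    server 203.0.113.25:443;   # latest IP seen for the Akamai-fronted backend
}
```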
