Revisit how we're handling DNS resolution for API backend domains #131
Comments
As an update on this front, I've been looking more into this recently. While our current DNS resolving does seem to be working, it's one of the more convoluted and complex pieces of code in the system, and the fact that we're still not completely respecting short-lived TTLs 100% of the time makes me nervous in cloud environments.

Not using nginx upstreams is still the simplest approach, but I remembered why we do want upstreams even in the case where we only have one server: backend keepalive support. With the simple dynamic proxy_pass approach, you lose the ability to configure things that upstream blocks support, like keepalives (see the sketch below). In some environments this might not be a big deal, but in our environment, where we're proxying to remote servers in other data centers, it's actually more significant. I should put together more recent benchmarks to quantify this, but when I last looked into it, keepalive definitely seemed like something we want to support if possible.

One interesting new option I stumbled upon is the new-ish upstream_jdomain plugin for nginx. It seems to promise almost exactly what we need, with asynchronous DNS lookups baked into nginx. In my very early testing, the stumbling block I've run into is a broader issue with nginx: it won't start or reload if invalid hosts are present in the config. This is problematic for us, since we don't want to prevent reloads, or prevent the system from starting back up, if we happen to have old hostnames in the config (which has happened, since hosts agencies were testing with have been shut down without being removed from our system). I can't really find a way around this without patching nginx. There are possibly some other creative solutions, but I'm afraid those would get too complicated quickly (temporarily modifying the /etc/hosts file during startup, defining a local DNS server to resolve them, or removing invalid hosts at startup and then polling to see if they eventually come back to life so we can add them back in). Or maybe we just remove any hosts that fail to resolve at startup and leave it at that (I'd just want to be super careful to ensure this doesn't happen on a temporary lookup failure for a real hostname). Or maybe we just keep our current IP-resolving, auto-reloading approach. Anyway, still pondering all this...

In conjunction, I've also been exploring using dnsmasq as a local DNS caching server. Those tests at least bode well, and I think dnsmasq could be useful for providing some redundancy and local caching for DNS lookups.
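To make the trade-off concrete, here's a rough, hypothetical sketch. The hostnames, ports, IPs, and keepalive values are invented, and the jdomain directive is written from memory of that module's README, so treat all of this as illustrative rather than exact:

```nginx
# Option A: dynamic proxy_pass with a resolver. nginx re-resolves the hostname
# per the DNS TTL (or the "valid" override), but with no upstream{} block there
# is no keepalive connection pooling to the backend.
server {
  listen 8080;
  resolver 127.0.0.1 valid=30s;  # e.g. a local dnsmasq cache
  location / {
    set $backend "api-backend.example.com";  # hypothetical backend host
    proxy_pass https://$backend;
  }
}

# Option B: a conventional upstream block. Backend keepalive works, but the
# hostname is only resolved at startup/reload, so short TTLs aren't respected.
upstream example_backend {
  server api-backend.example.com:443;
  keepalive 10;
}
server {
  listen 8081;
  location / {
    proxy_pass https://example_backend;
    proxy_http_version 1.1;          # required for backend keepalive
    proxy_set_header Connection "";  # don't forward "Connection: close"
  }
}

# Option C: the upstream_jdomain module, which re-resolves the hostname
# asynchronously inside the upstream block (directive shown roughly as the
# module documents it; unverified here).
upstream example_backend_jdomain {
  jdomain api-backend.example.com port=443 interval=10;
  keepalive 10;
}
```

The catch described above applies to Options B and C: with plain hostnames in the config, nginx refuses to start or reload if any of them fail to resolve.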
The theoretical possibility of api.data.gov not properly updating for a short-TTLed domain happened today (in this case, the underlying host was behind Akamai), so we definitely need to do something to better fix this. Back in November, I began work on addressing the underlying issue so we always acknowledge short TTLs. That work is mostly complete and is on the dns-overhaul branch of api-umbrella-router, but it needs more testing. The implementation I settled on in that branch is pretty straightforward: we reload nginx anytime there's any IP address change. Since we've decoupled the web component from this nginx instance, reloading frequently shouldn't have much of an impact, but I need to do more verification on that. In addition, one feature I had gotten rid of on that branch was falling back to the last known IP if a DNS lookup fails. Based on some logs, I think we need to bring that functionality back to deal with the possibility of brief DNS outages.
In light of #129 and #130, I think the overall way we're handling DNS needs to be revisited at some point. I'm hoping we've nailed down most of these weird edge-cases, so our current system can keep humming along, but the way we're handling DNS lookups for API backends does seem over-complicated.
A quick summary of the current approach is that we have a process that resolves all the hostnames of API backend servers. When changes are detected, it writes a new nginx config file referencing the IP addresses everything currently resolves to and reloads nginx (so nginx only references IPs, not hostnames).
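For illustration, the generated file ends up looking something like the following (the upstream name, address, and keepalive setting here are made up); nginx is then reloaded whenever this file changes:

```nginx
# Generated by the DNS-resolving process. Only IPs appear here, never
# hostnames; 203.0.113.10 is whatever the backend's hostname resolved to
# when the file was written.
upstream api_umbrella_example_backend {
  server 203.0.113.10:443;
  keepalive 10;
}
```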
This approach has become more complicated over time, because we make an attempt to stick with the current IP for a given host, rather than switching to a new IP as soon as we see it. This is done predominantly to deal with backends like Akamai or ELBs, where the DNS resolves to a rotating collection of IPs. If we were to immediately acknowledge the IPs seen in these cases, we would basically be reloading nginx every minute or two, due to the super-short TTLs on those domains and the rotation of IPs. So generally speaking, we respect DNS TTLs, except for some of these CDN services, where the DNS seems to resolve to a rotating (but stable) collection of active IPs.
All of this complexity is obviously bad, since it's led to these weird edge-case bugs recently, so I'd like to streamline it.
Here are some additional random thoughts, ideas, and reasons we ended up where we are: