Description
We are seeing an issue with hostdb in our fork (ATS 8.1.2 plus some cherry-picks from the 9.x branch).
Very rarely in a production environment, we see a hostname get "stuck" in hostdb such that it never resolves again. We see a fairly constant stream of "delaying force 0 answer for [timeout 0]" messages, and then after 30 seconds hostdb times out and the cache returns a 502 error to the client. At no point during those 30 seconds do we see a request go down to the (dns) level for actual resolution. All the nameservers are functioning properly and every other hostname resolves correctly. Running "dig" against the same FQDN from the command line on that server also resolves correctly, so it does not appear to be a DNS issue. We have seen this maybe 4 times over the past couple of months, on only 1 or 2 servers out of a deployment of several hundred caches, so it does not happen often. It is never the same hostname that gets "stuck".
An ATS restart is required to clear the condition.
Another user (Nir Finkel) reported the same symptoms in the Slack channel, using the 9.0.2 release of ATS.