Problem with overlay network and DNS resolution between containers #30487
Comments
Hi, any news about this issue?
@ggaugry Are you using swarm mode with the attachable option?
@sanimej: I use swarm mode with the attachable option.
I still have the problem. Results of a dig for "mysql":
root@24c350f685ef:/home/nightmare# dig mysql
; <<>> DiG 9.9.5-9+deb8u10-Debian <<>> mysql
;; QUESTION SECTION:
;; ANSWER SECTION:
;; Query time: 0 msec
I think I ran into this one myself. The overlay networks were managed by swarm mode and were attachable. We migrated containers from running via docker-compose and classic Swarm to deploying services in swarm mode. After this migration, one node resolved what I believe was the old address along with the new VIP address; the other node resolved only the new VIP address. Recreating the container that happened to have the invalid IP and the service with the DNS name we were trying to correct (using …). Our Docker version:
Our servers are running RHEL 7.2 with the 3.10.0-327 kernel. Unfortunately I don't believe we can reliably recreate this, and it happened in production during a limited outage window, so I didn't have time to gather logs. As best I can tell, there are situations where stopping a container on an attachable overlay network doesn't propagate the change to the other nodes in the swarm.
Hi,
@danielmrosa 17.06 has a lot of patches and should work much better all-around wrt service discovery and networking.
@cpuguy83
Hi all, still the same issue with Docker 17.06.01-ce. Example inside one container:
root@b617061009ee:/# getent hosts bazoocaster-sfr0092vu
When does it happen?
This is the scenario where we sometimes see these DNS problems (most of the time it works well, but sometimes the DNS resolution goes crazy).
The only way to fix this DNS problem is to remove the problematic node (the one which appears to have 2 IPs) from the Swarm and rejoin it.
Hi all,
@ggaugry I still have to go through the rename part; that can create issues. @danielmrosa can you share something more about it? Do you have a set of steps that consistently reproduces the problem?
@fcrisciani: thanks for the answer. I said earlier that removing and rejoining the Swarm was a workaround, but it actually doesn't work. Do you have any idea how I can clean up the wrong DNS entries?
FYI: the 2 IPs shown are the 2 IPs actually used by the containers running on the target machine:
@fcrisciani, thanks for your answer. When the problem occurs, one task resolves the name to 2 IPs. Even if we destroy the task related to the second IP, that IP does not leave the DNS database quickly. I can't figure out in what situation this problem occurs, sorry.
@fcrisciani: we removed the step that renames the container to _OLD, to test. We still have the problem on a clean install.
@ggaugry in this output: #30487 (comment) you did the rename of the container and the DNS did not get updated, correct? I can see the 2 names that are different. I'm trying to narrow down the set of steps to reproduce it easily so I can debug it. @danielmrosa in your case, is it a permanent failure or a transient one?
@fcrisciani: yes. Steps you could try to reproduce:
Now, if you get into the containers on the manager nodes (test_dns1, test_dns2, test_dns3) and run the command "getent hosts test", you will probably get 2 IPs.
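That check can be scripted. Here is a minimal sketch, assuming the container names from the steps above; the `count_ips` helper and the sample output are mine, not from the thread:

```shell
#!/bin/sh
# Count how many distinct IPs getent-style output returns for a name;
# a healthy service should resolve to exactly one VIP.
count_ips() {
  awk '{ print $1 }' | sort -u | wc -l | tr -d ' '
}

# Against a live swarm you would feed it real output, e.g.:
#   for c in test_dns1 test_dns2 test_dns3; do
#     echo "$c: $(docker exec "$c" getent hosts test | count_ips) IP(s)"
#   done

# Hypothetical sample reproducing the failure mode (2 IPs for one name):
printf '10.0.1.5\ttest\n10.0.1.9\ttest\n' | count_ips
# prints 2
```

Anything other than 1 per container points at the stale-entry problem discussed in this thread.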
@ggaugry ok, thanks for the steps. I will try to take a look and will update here if I find something.
hi, just encountered the same problem with docker version 18.02:
pi@raspberrypi:~ $ docker version
Server:
I tried with both a custom network and the default ingress one. All my Docker hosts are on the same LAN with no firewall between them, and each host has only one IPv4 and one IPv6 address. 2 containers are running, one nginx and one Nextcloud. Even when they are on the same host, they don't seem to see each other.
Hi @fcrisciani, I work with @danielmrosa, and today we encountered a similar problem in our Docker swarm cluster. First, to answer your last question to Daniel: this isn't a permanent issue. We noticed that 2 services on the swarm cluster are using the same VIP. Relevant pieces of docker inspect:
docker inspect service1:
.........................................
docker inspect service2:
.........................................
Docker version:
Client:
Server:
We found this problem because we were trying to reach service2 at port 80, but we were reaching service1 instead. We don't know yet how to reproduce this issue. @fcrisciani, any recommended action?
Edited: we found moby/swarmkit#2474 in the 17.12.0-ce release notes; is this the same problem?
@wrg02 We also recently found another issue in the IPAM that is fixed here: moby/libnetwork#2105. Unfortunately, with the current timeline I'm not sure the fix will make it into 18.03, but it will definitely be included in the next release.
@fcrisciani We are happy to know that this will be fixed soon. We will post here if we find another related issue.
Hi @fcrisciani This problem is blocking us from moving to production using swarm mode. Do you have any recommendation? Thanks in advance!
+1 |
I'm also having this issue when using docker swarm mode.
Service records (Docker DNS) sometimes end up with old IP addresses from the previous service.
@viniciusramosdefaria @2416ryan please, let's not create another issue where we just post information that is not useful for debugging. If you have steps or hints on how to reproduce the condition, please share them.
@danielmrosa do you have any way to reproduce? The first thing to check is the network inspect. If that is not the case, I would start taking a look at the network DB state:
netPeers should match the number of nodes that have containers on that network; entries is the number of entries in the database (it is not 1:1 with the containers); qLen should always be 0 when the system is stable and will spike only when there are changes in the cluster. If you use the diagnostic tool, you can also identify which node owned the extra entry and track back, with the last grep, whether that node left the cluster at some point and why the cleanup did not happen.
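As an illustration, the qLen check can be pulled out of the daemon logs with a small filter. This is a sketch only; the sample log line below is an assumption modeled on the counters named above, not copied from a real daemon:

```shell
#!/bin/sh
# Extract the qLen counter from NetworkDB stats log lines;
# on a stable cluster it should be 0.
qlen_of() {
  sed -n 's/.*qLen:\([0-9][0-9]*\).*/\1/p'
}

# On a systemd-based host you might run something like:
#   journalctl -u docker | grep 'NetworkDB stats' | qlen_of

# Hypothetical sample line (format assumed):
printf 'NetworkDB stats node1 - netID:xyz leaving:false netPeers:3 entries:12 Queue qLen:0 netMsg/s:0\n' | qlen_of
# prints 0
```

A qLen that stays above 0 while the cluster is idle would match the "changes not propagating" symptom described in this thread.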
Hi @fcrisciani Please tell me if there is a way to use Consul for DNS discovery, registering container IPs in the Consul KV store. AFAIK it seems that is not possible in swarm mode, but maybe I'm wrong.
@danielmrosa happy to help if you can share more info. As I was telling you before, we are not seeing users reporting the instability you are experiencing, so it may be something in your environment. Check that the TCP/UDP 7946 ports for NetworkDB are open on your nodes, and you can eventually try with a brand-new network to start from a 100% clean state. For what concerns Consul: you can, but you will have to handle it as a separate container on your side; there is no automatic integration to choose the backend.
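A quick way to probe those gossip ports from one node toward a peer; this is a sketch with a placeholder peer address, and UDP probes with nc are only best-effort:

```shell
#!/bin/sh
# Check NetworkDB (gossip) reachability toward a peer swarm node.
# PEER is a placeholder; set it to another node's address.
PEER=${PEER:-peer-node}

# TCP 7946: nc exits 0 if the port accepts a connection.
if nc -z -w 2 "$PEER" 7946; then
  echo "tcp/7946 reachable"
else
  echo "tcp/7946 blocked"
fi

# UDP 7946: nc can only detect an ICMP port-unreachable,
# so a "success" here is a guess, not a guarantee.
nc -zu -w 2 "$PEER" 7946 && echo "udp/7946 probably open"
```

Run it from every node toward every other node; a single blocked direction is enough to leave the database inconsistent.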
Hi @fcrisciani Maybe one way to reproduce this problem is to create many problematic containers and let them restart by themselves, do many service updates using some problematic tags, and watch a mess happen :-) Just for information: we use an overlay network with a /22. I saw in the documentation that it is not recommended to use an overlay network larger than /24 due to some overlay instability, and that if you need more than 256 IPs it is better to use DNSRR. Can you confirm, is that true?
hey @danielmrosa
we actually have a test internally that does exactly that. It spawns a bunch of services with a wrong command, so that they stay spinning with containers coming up and exiting immediately. I will take a look on Monday just to be sure. Regarding the overlay, the limitation is performance and the time it takes to spin up services.
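That internal test could be approximated like this: a service whose command exits immediately, so tasks churn and DNS entries are added and removed constantly. All names in this sketch are hypothetical:

```shell
#!/bin/sh
# Create an attachable overlay and a deliberately failing service, so
# swarm keeps restarting tasks and exercising DNS add/remove churn.
docker network create -d overlay --attachable churn-net
docker service create --name churn --replicas 5 \
  --network churn-net alpine /bin/false

# Watch the task churn, then look for stale entries
# from a container attached to the same network:
#   docker service ps churn
#   docker run --rm --network churn-net alpine nslookup tasks.churn
```

If stale task IPs keep showing up in `tasks.churn` long after the corresponding containers exited, that matches the behavior reported here.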
Some feedback that may be useful: I migrated a node to swarm and kept some other services managed by the old docker-compose, probably a conceptual error. This made those containers appear as a "null" service from the networking perspective (docker network inspect -v), so they were not being cleaned up accordingly. Hope this helps!
I'm also having this issue.
@bigfoot90 do you have a repro?
I have a VPN between the hosts (10.8.3.0/24). Steps to reproduce the issue:
|
@bigfoot90 most likely in your configuration the NetworkDB is not able to communicate and distribute the information across the cluster.
After about 3 hours of trying different Docker versions and deploying containers, can someone confirm whether this is a coincidence?
@bigfoot90 Grep for that string in the logs.
Logs are full of this:
Manager node:
Worker node:
|
After some days, I can confirm that Docker 17.12.1 is the last version that works correctly.
@bigfoot90 I confirm that I am able to reproduce the same bug (I found the same steps by myself, and then this issue on GitHub) on
Actually
In this example nginx was stopped. Starting it again results in the addition of a new, correct entry, while the broken old one stays in place. However, stopping all containers in the project (but not through
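The sequence described reads roughly like this sketch; the compose project and service names are hypothetical, and the overlay must be attachable for compose containers to join it:

```shell
#!/bin/sh
# Reproduce the stale-entry behavior described above (hypothetical project).
docker network create -d overlay --attachable app-net
docker-compose up -d            # compose file attaches nginx and web to app-net

docker-compose stop nginx       # stop one container only
# The old entry may still be served to peers on the network:
docker-compose exec web getent hosts nginx

docker-compose start nginx
# Now both the stale entry and the fresh one may be returned:
docker-compose exec web getent hosts nginx
```

Comparing the two `getent` outputs is enough to see whether the stopped container's IP ever left the DNS database.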
@Oleg-Arkhipov @bigfoot90
okay, let's close |
Description
I noticed some problems with DNS resolution inside an overlay network and Swarm.
DNS entries are not always updated automatically.
I have 10 containers over 4 hosts on Ubuntu 16.04 connected by Swarm. I created an overlay network for those containers.
When I redeploy one of those containers (I stop the current one, rename it to OLD, and create a new one with the same name), the container will not always have the same IP as before (which is not a problem). But it looks like the DNS entry is not always updated for the other containers in the network. The newly created container is then unreachable from the others.
My docker version is 1.13.0.
Steps to reproduce the issue:
Describe the results you received:
If the IP of this new container has changed, the DNS entry will not be updated automatically for the other containers. If you try to ping this new container's DNS name from other containers, you will sometimes notice that the resolved IP is actually the IP of the previously removed container.
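The redeploy cycle from the description, as a sketch; container, network, and image names are placeholders:

```shell
#!/bin/sh
# Redeploy a container on the overlay network, keeping its name.
docker stop myapp
docker rename myapp myapp_OLD
docker run -d --name myapp --network my-overlay myimage:latest

# From another container on the same overlay, the name may still resolve
# to the old container's IP:
docker exec other-container getent hosts myapp
docker exec other-container ping -c 1 myapp
```

If `getent hosts myapp` still returns `myapp_OLD`'s IP, the stale-entry bug has been reproduced.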
Describe the results you expected:
DNS entries should be updated for every container whenever a container's IP changes.
Additional information you deem important (e.g. issue happens only occasionally):
Output of docker version:
Output of docker info:
Additional environment details (AWS, VirtualBox, physical, etc.):