Consul should handle nodes changing IP addresses #1580
Comments
Awesome! Ideally it would just handle the IP address change, but again, I'd be totally fine with it just falling over and dying for now, and letting whoever started the process handle starting it back up again. Right now it's just broken: it advertises services incorrectly, which is a pretty big ouch. For people too lazy to follow the link to the Google group: my workaround for now is to have dhclient (as an exit hook) restart Consul.
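For reference, a minimal sketch of such an exit hook, assuming Debian/Ubuntu dhclient-script conventions (the hook path, the exposed variable names, and the `consul` service name may differ on other systems):

```sh
# Hypothetical /etc/dhcp/dhclient-exit-hooks.d/restart-consul
# dhclient-script sources this file and exposes $reason, $old_ip_address,
# and $new_ip_address; restart Consul only when the address actually changed.
case "$reason" in
  BOUND|RENEW|REBIND|REBOOT)
    if [ "$new_ip_address" != "$old_ip_address" ]; then
      systemctl restart consul
    fi
    ;;
esac
```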
That would fit my needs. We are running Consul agents via Docker (docker-machine) and all machines are retrieving their IPs via DHCP. Docker Machine uses the boot2docker image, where it is nearly impossible to use those hooks. I start the container with the preferred IP address (-advertise), but when the machine restarts it may have a new IP address. That would result in incorrect DNS responses.
The dhclient hook is a great workaround for Linux-based (non-Dockerized) environments, but I haven't been able to find an analogous workaround for Windows. Implementing a change within Consul (and Raft) would be incredible.
Does closing #457 in favor of this really move it from the 0.8.2 timeframe to 0.9.x, or are they two segments of the same backlog? Is there some sort of roadmap explanation that benefits from a single issue and thus won't have to be duplicated across the above 6 issues?
@sweeneyb I had actually meant to tag this to 0.8.2 (moved it back there), though given our backlog we may not be able to fully finish this off in time for that release. It seemed better to manage this as a single unit vs. a bunch of similar tickets - this will likely end up with a checklist of things to burn down, which'll be easier to keep track of.
Thanks. You guys iterate fast, so a slip of a few minor versions seems reasonable. I was just hoping it would be in the 0.8.x timeframe. And again, if there is an approach from any of the discussions that's being favored, that would be great to know. There have been a few fixes proposed, but I don't have as much context to figure out where raft & consul are aiming. -- Thanks for the response.
Yeah, now that we've got Raft using UUIDs for quorum management (if you are using Raft protocol 3), I think the remaining work is up at the higher level to make sure the Serf-driven parts can properly handle IP changes for a node of the same name. There might be some work to get the catalog to properly update as well (that also has the UUID knowledge, but still indexes by node name for everything). Honestly, it might take a few more iterations to iron out all the details, but we are moving in the right direction.
Hi, we have been running a script based on the solution stated in the disaster recovery docs, creating the peers.json with the changed IPs before starting the agent. I am wondering if this still works after UUIDs were introduced.
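For reference, a rough sketch of that kind of recovery step, assuming the Raft protocol 2 format where peers.json is a plain array of server addresses (with Raft protocol 3 the documented format changes to JSON objects that carry each server's node ID); the data-dir path and addresses here are made up:

```sh
# Hypothetical example: rewrite peers.json with the servers' new addresses
# on every server before starting the agents again.
cat > /opt/consul/data/raft/peers.json <<'EOF'
["10.0.1.10:8300", "10.0.1.11:8300", "10.0.1.12:8300"]
EOF
```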
Thanks @hehailong5, I think we missed that one, so I opened #3003 so we can get that fixed right away.
Also impacted by this issue. Restarting the Consul agent does not solve the situation, but leads to the agent being seen as failed, and the Consul servers see:
Also tried to (using Consul 0.7.3 though)
I ran an experiment and changed a node's IP address.

The IP address:

I am changing the IP address via:

The cluster members list reports the old info:

If I restart the agent, I see a good state of the cluster:
As @slackpad pointed out in his comment, this will work only as long as the majority of the servers stay alive to maintain quorum. Would it be possible to refer to Consul nodes by DNS name as well as by IP? This was raised and rejected in #1185, but couldn't it be a relatively painless solution? If all the nodes restarted and came back with a different IP but the same DNS name as they previously advertised, the nodes coming back could still connect to each other without having to update the configuration/catalog (wherever Consul stores this information). Or is there some alternative way where even the majority/the entire cluster could go offline, come back with changed IPs, and still be able to recover without manually having to perform outage recovery?
Bash script that checks for IP address changes and restarts Consul...
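The script itself wasn't captured above; a rough sketch of that approach might look like this (interface name, poll interval, and service name are assumptions):

```sh
#!/usr/bin/env bash
# Poll the interface address and restart Consul whenever it changes.
IFACE=eth0
LAST_IP=""
while true; do
  CURRENT_IP=$(ip -4 -o addr show dev "$IFACE" | awk '{print $4}' | cut -d/ -f1)
  if [ -n "$LAST_IP" ] && [ -n "$CURRENT_IP" ] && [ "$CURRENT_IP" != "$LAST_IP" ]; then
    systemctl restart consul
  fi
  LAST_IP="$CURRENT_IP"
  sleep 10
done
```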
Hey all, I'm trying to test this out but the initial cluster is not electing a leader. Here is my code:

Here is the log from the
Also, @preetapan, quoting the
@erkolson that was a typo, edited it to fix now. Can you try adding bootstrap-expect=3 when you start Consul? Here's my orchestration script that uses Docker, where I tested terminating all servers and starting them back up with new IPs
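The orchestration script itself isn't reproduced above; as a purely illustrative sketch (image tag, flags, and addresses are assumptions, not the actual script), a server container might be started along these lines:

```sh
# Start one of three servers under Docker host networking; the other two
# would be started the same way so the cluster forms once all three are up.
docker run -d --name consul-server-1 --net=host consul:0.9.3 \
  agent -server -bootstrap-expect=3 -data-dir=/consul/data \
  -retry-join=10.0.1.11 -retry-join=10.0.1.12
```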
@erkolson I think you also need to set
Thanks, I added it.

This is the initial cluster:
After getting the pods to start with new IPs, I see this:
The data is still there.

Logs from consul-test-0:
@erkolson Is the cluster operational otherwise, and are you able to use it for service registration/KV writes etc.? The wrong IP address issue you mentioned above might be a temporary sync issue that affects the output of
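For example, one rough way to check that (the key name and exact commands here are just an illustration, not specific instructions from the thread):

```sh
# Verify the cluster still serves writes/reads and inspect the Raft peer set.
consul kv put test/ip-change "hello"
consul kv get test/ip-change
consul operator raft list-peers
```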
Indeed, I'll leave it running for a bit longer to see if the peers list reconciles. So far, ~30 minutes, no change.
Although at the moment I have no logs to show for it, I had the exact same problem when running the master branch.
@erkolson Do you mind trying the same test with 5 instead of 3 servers? I have a fix in the works for making this work; the root cause is that autopilot will not do the config fix for the server with the wrong IP because that would cause it to lose quorum. Please let me know if you still see the problem with 5 servers.
@preetapan, I ran the test again with 5 servers and this time it worked.

Initial cluster:
Intermediate step after pods recreated:
And finally, ~40s after startup:
Looks good!
@erkolson Thanks for your help in testing this, we really appreciate it!
We definitely appreciate all the help testing this. We cut a build with the fix @preetapan added via #3450 in https://releases.hashicorp.com/consul/0.9.3-rc2/. If you can give that a look, please let us know if you see any remaining issues.
I tested again with 3 nodes and rc2. This time it took ~2 min after startup with new IPs for the peers list to reconcile, but all seems to be working. You're welcome for the help; I'm happy to see this functionality. I experienced firsthand all Consul pods being rescheduled simultaneously a couple of months ago :-)
On 0.9.3. Still seems to have a cluster leader problem.
@faheem-cliqz We will need more specific information before we can figure out what could have happened with your setup.
So, I have not found a way for Consul to handle an IP address change at "runtime". To test this, I run Consul as a Docker container via this docker-compose.yml file:
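The compose file itself wasn't captured above; a minimal hypothetical example of this kind of setup (image tag, flags, and service names are assumptions, not the original file) might be written and started like this:

```sh
# Write a hypothetical docker-compose.yml with three servers joining each
# other by service name, then bring the cluster up.
cat > docker-compose.yml <<'EOF'
version: "2"
services:
  consul1:
    image: consul:0.9.3
    command: agent -server -bootstrap-expect=3 -retry-join=consul2 -retry-join=consul3
  consul2:
    image: consul:0.9.3
    command: agent -server -bootstrap-expect=3 -retry-join=consul1 -retry-join=consul3
  consul3:
    image: consul:0.9.3
    command: agent -server -bootstrap-expect=3 -retry-join=consul1 -retry-join=consul2
EOF
docker-compose up -d
```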
@AlexeySofree to get it working I had to add '-raft-protocol 3' to my container run command. More info here: https://www.consul.io/docs/agent/options.html#_raft_protocol |
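Applied to a setup like the hypothetical compose sketch above, that would mean adding the flag to each server's agent command, e.g.:

```sh
# In the compose sketch above, each server's command would gain the flag:
#   command: agent -server -bootstrap-expect=3 -retry-join=consul2 -retry-join=consul3 -raft-protocol=3
# Or, outside Docker, directly on the agent invocation:
consul agent -server -bootstrap-expect=3 -raft-protocol=3 \
  -retry-join=consul2 -retry-join=consul3 -data-dir=/consul/data
```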
I upgraded a cluster today from 0.8.5 -> 0.9.3. Everything was already using Raft protocol 3. The rolling update happened much faster than I expected, and there was not enough time in between each node being killed/restarted on the new version for the cluster to elect a leader. Even still, the cluster was able to elect a new leader once everything settled down. I was also using:

No data was lost.
@preetapan Sorry for the delayed response. Followed your suggestion of shifting to Raft protocol 3. Leader election works properly now after a rolling deployment. There was no data loss either :) I was using a customized Helm chart for Consul from here. Currently on Consul 0.9.3.
Thought this was captured but couldn't find an existing issue for this. Here's a discussion - https://groups.google.com/d/msgid/consul-tool/623398ba-1dee-4851-85a2-221ff539c355%40googlegroups.com?utm_medium=email&utm_source=footer. For servers we'd also need to address #457.
We are going to close other IP-related issues against this one to keep everything together. The Raft side should support this once you get to Raft protocol version 3, but we need to do testing and will likely have to burn down some small issues to complete this.