Consul should handle nodes changing IP addresses #1580

Closed · slackpad opened this issue Jan 9, 2016 · 55 comments

Labels: theme/operator-usability, type/enhancement

@slackpad (Contributor) commented Jan 9, 2016

I thought this was captured somewhere but couldn't find an existing issue for it. Here's a discussion: https://groups.google.com/d/msgid/consul-tool/623398ba-1dee-4851-85a2-221ff539c355%40googlegroups.com?utm_medium=email&utm_source=footer. For servers we'd also need to address #457.

We are going to close other IP-related issues against this one to keep everything together. The Raft side should support this once you get to Raft protocol version 3, but we need to do testing and will likely have to burn down some small issues to complete this.

slackpad added the type/enhancement label on Jan 9, 2016
@csawyerYumaed (Contributor)

Awesome! Ideally it would just handle the IP address change, but again, I'd be totally fine with it just falling over and dying for now, and letting whoever started the process handle starting it back up again. Right now it's just broken: it advertises services incorrectly, which is a pretty big ouch.

For people too lazy to follow the link to the Google group: my workaround for now is to have dhclient (as an exit hook) restart Consul.
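
For context, here's a minimal sketch of that dhclient exit-hook workaround. It assumes a Debian-style /etc/dhcp/dhclient-exit-hooks.d/ directory and a systemd-managed consul unit; both are illustrative assumptions, not details from this thread.

    # Hypothetical /etc/dhcp/dhclient-exit-hooks.d/restart-consul
    # dhclient exposes $reason, $old_ip_address and $new_ip_address to exit hooks.
    case "$reason" in
      BOUND|RENEW|REBIND|REBOOT)
        # Only bounce Consul when the lease actually changed the address.
        if [ -n "$new_ip_address" ] && [ "$new_ip_address" != "$old_ip_address" ]; then
          systemctl restart consul
        fi
        ;;
    esac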

@r0bnet commented Jan 14, 2016

That would fit my needs. We are running Consul agents via Docker (docker-machine), and all machines retrieve their IPs via DHCP. Docker Machine uses the boot2docker image, where it is nearly impossible to use those hooks. I start the container with the preferred IP address (-advertise), but when the machine restarts it may have a new IP address, which results in incorrect DNS responses.
Currently I'm looking for a workaround, but I can't (yet) see a solution that will work without too much effort.
It would probably be necessary to tell Consul which network interface to use; Consul could then determine the correct IP address itself.
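
As an aside, later Consul releases added go-sockaddr template support for the bind and advertise addresses, which covers exactly this "pick an interface" case. A rough sketch, assuming a Consul version with that support; "eth0" is a placeholder interface name:

    # Sketch only: requires a Consul version that accepts go-sockaddr templates.
    consul agent -server \
      -data-dir=/consul/data \
      -bind '{{ GetInterfaceIP "eth0" }}' \
      -advertise '{{ GetInterfaceIP "eth0" }}'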

@jsullivan3

The dhclient hook is a great workaround for Linux-based (non-Dockerized) environments, but I haven't been able to find an analogous workaround for Windows. Implementing a change within Consul (and Raft) would be incredible.

@sweeneyb (Contributor) commented May 3, 2017

Does closing #457 in favor of this really move it from the 0.8.2 timeframe to 0.9.x, or are they two segments of the same backlog? Is there some sort of roadmap explanation that benefits from a single issue and thus won't have to be duplicated across the six issues above?

@slackpad (Contributor, Author) commented May 3, 2017

@sweeneyb I had actually meant to tag this to 0.8.2 (moved it back there), though given our backlog we may not be able to fully finish this off in time for that release. It seemed better to manage this as a single unit vs. a bunch of similar tickets - this will likely end up with a checklist of things to burn down, which'll be easier to keep track of.

@sweeneyb (Contributor) commented May 3, 2017

Thanks. You guys iterate fast, so a slip of a few minor versions seems reasonable. I was just hoping it would be in the 0.8.x timeframe.

And again, if there is an approach from any of the discussions that's being favored, that would be great to know. There have been a few fixes proposed, but I don't have as much context to figure out where raft & consul are aiming. -- Thanks for the response.

@slackpad (Contributor, Author) commented May 3, 2017

Yeah, now that we've got Raft using UUIDs for quorum management (if you are using Raft protocol 3), I think the remaining work is up at the higher level to make sure the Serf-driven parts can properly handle IP changes for a node of the same name. There might be some work to get the catalog to properly update as well (it also has the UUID knowledge, but still indexes by node name for everything). Honestly, it might take a few more iterations to iron out all the details, but we are moving in the right direction.

@hehailong5

Hi, we have been running a script based on the solution stated in the disaster recovery doc, creating the peers.json with the changed IPs before starting the agent. I am wondering if this still works now that UUIDs have been introduced.
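
For reference, the peers.json format used for manual outage recovery with Raft protocol 3 (as later documented) identifies each server by its node ID rather than only by its address. A hedged sketch with placeholder UUIDs, addresses, and data directory:

    # Write <data-dir>/raft/peers.json on every server before starting the agents.
    cat > /var/lib/consul/raft/peers.json <<'EOF'
    [
      { "id": "11111111-2222-3333-4444-555555555555", "address": "10.0.0.11:8300", "non_voter": false },
      { "id": "22222222-3333-4444-5555-666666666666", "address": "10.0.0.12:8300", "non_voter": false },
      { "id": "33333333-4444-5555-6666-777777777777", "address": "10.0.0.13:8300", "non_voter": false }
    ]
    EOF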

@slackpad (Contributor, Author) commented May 4, 2017

Thanks @hehailong5 I think we missed that one so I opened #3003 so we can get that fixed right away.

@kamaradclimber (Contributor) commented Jun 8, 2017

Also impacted by this issue.

Restarting the Consul agent does not resolve the situation; instead the agent is seen as failed and the Consul servers log:

Jun 08 17:07:57 consul01-par consul[20698]: 2017/06/08 17:07:57 [ERR] memberlist: Conflicting address for e4-1d-2d-1d-07-90.pa4.hpc.criteo.prod. Mine: 10.224.11.18:8301 Theirs: 10.224.11.73:8301
Jun 08 17:07:57 consul01-par consul[20698]: 2017/06/08 17:07:57 [WARN] serf: Name conflict for 'e4-1d-2d-1d-07-90.pa4.hpc.criteo.prod' both 10.224.11.18:8301 and 10.224.11.73:8301 are claiming

I also tried consul leave on the agent, without effect (the member is seen as left, but we still get the same error messages).

(using consul 0.7.3 though)

@Alexey-Tsarev

I ran a small experiment: changing a node's IP address. Let's walk through it.
Here is a working cluster with 3 servers:

/root/temp/consul/consul members
Node           Address              Status  Type    Build  Protocol  DC
li.home.local  x.29.111.207:8301   alive   server  0.8.4  3         dc1
rhost.local    y.201.41.69:8301    alive   server  0.8.4  3         dc1
thost.net      z.234.37.183:8301   alive   server  0.8.4  3         dc1

/root/temp/consul/consul operator raft list-peers
Node           ID                                    Address              State     Voter  RaftProtocol
thost.net      dd6fbbf2-bead-fe16-a37b-2208e8bd8234  z.234.37.183:8300   leader    true   3
li.home.local  510841b8-7d15-dc78-a2b0-5dfaf72c23b0  x.29.111.207:8300   follower  true   3
rhost.local    0296831c-aa74-b47e-ded5-75450aff8943  y.201.41.69:8300    follower  true   3

The li.home.local node runs Consul with this command:

/root/temp/consul/consul agent -data-dir=/root/temp/consul/data -server -raft-protocol=3 -protocol=3 -advertise=x.29.111.207

IP address:

ip a | grep inet | grep ppp0
    inet x.29.111.207 peer a.b.c.d/32 scope global ppp0

I change the IP address via:

ifdown ppp0; ifup ppp0

ip a | grep inet | grep ppp0
    inet o.29.106.101 peer a.b.c.d/32 scope global ppp0

The cluster members list still reports the old info:

/root/temp/consul/consul members
Node           Address              Status  Type    Build  Protocol  DC
li.home.local  x.29.111.207:8301   failed  server  0.8.4  3         dc1
rhost.local    y.201.41.69:8301    alive   server  0.8.4  3         dc1
thost.net      z.234.37.183:8301   alive   server  0.8.4  3         dc1

If I restart the li.home.local node with the new IP address advertised:

/root/temp/consul/consul agent -data-dir=/root/temp/consul/data -server -raft-protocol=3 -protocol=3 -advertise=o.29.106.101

I see a good cluster state:

/root/temp/consul/consul members
Node           Address              Status  Type    Build  Protocol  DC
li.home.local  o.29.106.101:8301   alive   server  0.8.4  3         dc1
rhost.local    y.201.41.69:8301    alive   server  0.8.4  3         dc1
thost.net      z.234.37.183:8301   alive   server  0.8.4  3         dc1

@mitom commented Jun 26, 2017

As @slackpad pointed out in his comment, this will only work as long as a majority of the servers stays alive to maintain quorum.

Would it be possible to refer to Consul nodes by DNS name as well as by IP? This was raised and refused in #1185, but couldn't it be a relatively painless solution? If all the nodes restarted and came back with different IPs but the same DNS names they previously advertised, the returning nodes could still connect to each other without having to update the configuration/catalog (wherever Consul stores this information).

Or is there some alternative where even a majority, or the entire cluster, could go offline and come back with changed IPs and still recover without manually performing outage recovery?

@Alexey-Tsarev commented Jun 26, 2017

... is there some alternative... without manually having to perform outage recovery?

A Bash script that checks for IP address changes and restarts Consul...
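
A minimal sketch of such a watcher, assuming the advertised address lives on eth0 and Consul runs under systemd (the interface and unit names are placeholders):

    #!/usr/bin/env bash
    # Poll the interface address and restart Consul whenever it changes.
    IFACE=eth0
    LAST_IP=""
    while sleep 10; do
      CUR_IP=$(ip -4 -o addr show dev "$IFACE" | awk '{print $4}' | cut -d/ -f1)
      if [ -n "$CUR_IP" ] && [ "$CUR_IP" != "$LAST_IP" ]; then
        # Skip the restart on the very first iteration.
        [ -n "$LAST_IP" ] && systemctl restart consul
        LAST_IP="$CUR_IP"
      fi
    done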

@erkolson commented Sep 6, 2017

Hey all, I'm trying to test this out but the initial cluster is not electing a leader.

Here is my code:
https://github.com/erkolson/consul-v0.9.3-rc1-test

Here is the log from the consul-test-0 pod:

==> Starting Consul agent...
==> Consul agent running!
           Version: 'v0.9.3-rc1-rc1 (d62743c)'
           Node ID: 'b84a7750-dcdb-c63a-1ae8-2ef036731c81'
         Node name: 'consul-test-0'
        Datacenter: 'dc1' (Segment: '<all>')
            Server: true (Bootstrap: false)
       Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: -1, DNS: 8600)
      Cluster Addr: 10.37.84.6 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2017/09/06 13:30:24 [INFO] raft: Initial configuration (index=0): []
    2017/09/06 13:30:24 [INFO] raft: Node at 10.37.84.6:8300 [Follower] entering Follower state (Leader: "")
    2017/09/06 13:30:24 [INFO] serf: EventMemberJoin: consul-test-0.dc1 10.37.84.6
    2017/09/06 13:30:24 [INFO] serf: EventMemberJoin: consul-test-0 10.37.84.6
    2017/09/06 13:30:24 [INFO] consul: Handled member-join event for server "consul-test-0.dc1" in area "wan"
    2017/09/06 13:30:24 [INFO] agent: Retry join LAN is supported for: aws azure gce softlayer
    2017/09/06 13:30:24 [INFO] agent: Joining LAN cluster...
    2017/09/06 13:30:24 [INFO] agent: (LAN) joining: [10.37.84.6 10.36.180.8 10.33.92.6]

    2017/09/06 13:30:24 [INFO] consul: Adding LAN server consul-test-0 (Addr: tcp/10.37.84.6:8300) (DC: dc1)
    2017/09/06 13:30:24 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
    2017/09/06 13:30:24 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
    2017/09/06 13:30:24 [INFO] agent: Started HTTP server on [::]:8500
    2017/09/06 13:30:24 [INFO] serf: EventMemberJoin: consul-test-2 10.33.92.6
    2017/09/06 13:30:24 [INFO] serf: EventMemberJoin: consul-test-1 10.36.180.8
    2017/09/06 13:30:24 [INFO] consul: Adding LAN server consul-test-2 (Addr: tcp/10.33.92.6:8300) (DC: dc1)
    2017/09/06 13:30:24 [INFO] consul: Adding LAN server consul-test-1 (Addr: tcp/10.36.180.8:8300) (DC: dc1)
    2017/09/06 13:30:24 [INFO] serf: EventMemberJoin: consul-test-2.dc1 10.33.92.6
    2017/09/06 13:30:24 [INFO] serf: EventMemberJoin: consul-test-1.dc1 10.36.180.8
    2017/09/06 13:30:24 [INFO] consul: Handled member-join event for server "consul-test-2.dc1" in area "wan"
    2017/09/06 13:30:24 [INFO] consul: Handled member-join event for server "consul-test-1.dc1" in area "wan"
    2017/09/06 13:30:24 [INFO] agent: (LAN) joined: 3 Err: <nil>
    2017/09/06 13:30:24 [INFO] agent: Join LAN completed. Synced with 3 initial agents
    2017/09/06 13:30:30 [WARN] raft: no known peers, aborting election
    2017/09/06 13:30:31 [ERR] agent: failed to sync remote state: No cluster leader

Also, @preetapan, quoting the 3 for raft_protocol in server.json causes an error:

[consul-test-0] * 'raft_protocol' expected type 'int', got unconvertible type 'string'
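
For anyone copying the config: raft_protocol has to be a bare integer, not a quoted string. A minimal sketch of the relevant stanza (the file path is only an example):

    # Hypothetical server config fragment; note the unquoted integer.
    cat > /consul/config/server.json <<'EOF'
    {
      "server": true,
      "raft_protocol": 3
    }
    EOF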

@preetapan (Contributor)

@erkolson that was a typo; I've edited it to fix it now.

Can you try adding bootstrap-expect=3 when you start Consul? Here's my orchestration script that uses Docker, where I tested terminating all the servers and starting them back up with new IPs.

@slackpad (Contributor, Author) commented Sep 6, 2017

@erkolson I think you also need to set bootstrap_expect in https://github.com/erkolson/consul-v0.9.3-rc1-test/blob/master/manifests/consul-test-config.yaml to the number of servers you are running to get the cluster to initially bootstrap.
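
As a rough sketch of what that amounts to for a three-server cluster (the join addresses are placeholders; the same value can also be set as bootstrap_expect in the JSON config):

    consul agent -server \
      -bootstrap-expect=3 \
      -raft-protocol=3 \
      -data-dir=/consul/data \
      -retry-join=10.0.0.11 -retry-join=10.0.0.12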

@erkolson commented Sep 6, 2017

Thanks, I added bootstrap-expect to the exec command and the cluster initializes. It took a while to figure out how to recreate the pods with new IP addresses...

This is the initial cluster:

Node           ID                                    Address           State     Voter  RaftProtocol
consul-test-1  7931eb2f-3e44-831e-acff-d8345ad345ae  10.36.180.8:8300  leader    true   3
consul-test-0  b84a7750-dcdb-c63a-1ae8-2ef036731c81  10.37.84.6:8300   follower  true   3
consul-test-2  29f263d1-e7b5-e905-13b1-931f7968cb3e  10.33.92.6:8300   follower  true   3

After getting the pods to start with new IPs, I see this:

Node           ID                                    Address           State     Voter  RaftProtocol
(unknown)      b84a7750-dcdb-c63a-1ae8-2ef036731c81  10.37.84.6:8300   follower  true   <=1
consul-test-2  29f263d1-e7b5-e905-13b1-931f7968cb3e  10.33.92.13:8300  follower  true   3
consul-test-1  7931eb2f-3e44-831e-acff-d8345ad345ae  10.37.92.8:8300   follower  true   3

The data is still there; consul kv get -recurse shows the keys I set prior to restarting, but the previous IP address of the consul-test-0 pod did not get updated. These are the current pod addresses:

NAME                        READY     STATUS    RESTARTS   AGE       IP
consul-test-0               1/1       Running   0          7m        10.36.204.14 
consul-test-1               1/1       Running   0          7m        10.37.92.8
consul-test-2               1/1       Running   0          7m        10.33.92.13
Logs from consul-test-0
[consul-test-0] ==> WARNING: Expect Mode enabled, expecting 3 servers
[consul-test-0] ==> Starting Consul agent...
[consul-test-0] ==> Consul agent running!
[consul-test-0] Version: 'v0.9.3-rc1-rc1 (d62743c)'
[consul-test-0] Node ID: 'b84a7750-dcdb-c63a-1ae8-2ef036731c81'
[consul-test-0] Node name: 'consul-test-0'
[consul-test-0] Datacenter: 'dc1' (Segment: '<all>')
[consul-test-0] Server: true (Bootstrap: false)
[consul-test-0] Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: -1, DNS: 8600)
[consul-test-0] Cluster Addr: 10.36.204.14 (LAN: 8301, WAN: 8302)
[consul-test-0] Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false
[consul-test-0]
[consul-test-0] ==> Log data will now stream in as it occurs:
[consul-test-0]
[consul-test-0] 2017/09/06 14:44:32 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:7931eb2f-3e44-831e-acff-d8345ad345ae Address:10.36.180.8:8300} {Suffrage:Voter ID:29f263d1-e7b5-e905-13b1-931f7968cb3e Address:10.33.92.6:8300} {Suffrage:Voter ID:b84a7750-dcdb-c63a-1ae8-2ef036731c81 Address:10.37.84.6:8300}]
[consul-test-0] 2017/09/06 14:44:32 [INFO] raft: Node at 10.36.204.14:8300 [Follower] entering Follower state (Leader: "")
[consul-test-0] 2017/09/06 14:44:32 [INFO] serf: EventMemberJoin: consul-test-0.dc1 10.36.204.14
[consul-test-0] 2017/09/06 14:44:32 [INFO] serf: Attempting re-join to previously known node: consul-test-2.dc1: 10.33.92.6:8302
[consul-test-0] 2017/09/06 14:44:32 [INFO] serf: Attempting re-join to previously known node: consul-test-1.dc1: 10.36.180.8:8302
[consul-test-0] 2017/09/06 14:44:32 [INFO] serf: EventMemberJoin: consul-test-0 10.36.204.14
[consul-test-0] 2017/09/06 14:44:32 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
[consul-test-0] 2017/09/06 14:44:32 [INFO] serf: Attempting re-join to previously known node: consul-test-1: 10.36.180.8:8301
[consul-test-0] 2017/09/06 14:44:32 [INFO] consul: Adding LAN server consul-test-0 (Addr: tcp/10.36.204.14:8300) (DC: dc1)
[consul-test-0] 2017/09/06 14:44:32 [INFO] consul: Raft data found, disabling bootstrap mode
[consul-test-0] 2017/09/06 14:44:32 [INFO] consul: Handled member-join event for server "consul-test-0.dc1" in area "wan"
[consul-test-0] 2017/09/06 14:44:32 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
[consul-test-0] 2017/09/06 14:44:32 [WARN] serf: Failed to re-join any previously known node
[consul-test-0] 2017/09/06 14:44:32 [WARN] serf: Failed to re-join any previously known node
[consul-test-0] 2017/09/06 14:44:32 [INFO] agent: Started HTTP server on [::]:8500
[consul-test-0] 2017/09/06 14:44:32 [INFO] agent: Retry join LAN is supported for: aws azure gce softlayer
[consul-test-0] 2017/09/06 14:44:32 [INFO] agent: Joining LAN cluster...
[consul-test-0] 2017/09/06 14:44:32 [INFO] agent: (LAN) joining: [10.36.204.14 10.37.92.8 10.33.92.13]
[consul-test-0] 2017/09/06 14:44:32 [INFO] serf: EventMemberJoin: consul-test-2 10.33.92.13
[consul-test-0] 2017/09/06 14:44:32 [INFO] serf: EventMemberJoin: consul-test-1 10.37.92.8
[consul-test-0] 2017/09/06 14:44:32 [INFO] consul: Adding LAN server consul-test-2 (Addr: tcp/10.33.92.13:8300) (DC: dc1)
[consul-test-0] 2017/09/06 14:44:32 [INFO] consul: Adding LAN server consul-test-1 (Addr: tcp/10.37.92.8:8300) (DC: dc1)
[consul-test-0] 2017/09/06 14:44:32 [INFO] serf: EventMemberJoin: consul-test-2.dc1 10.33.92.13
[consul-test-0] 2017/09/06 14:44:32 [INFO] serf: EventMemberJoin: consul-test-1.dc1 10.37.92.8
[consul-test-0] 2017/09/06 14:44:32 [INFO] consul: Handled member-join event for server "consul-test-2.dc1" in area "wan"
[consul-test-0] 2017/09/06 14:44:32 [INFO] consul: Handled member-join event for server "consul-test-1.dc1" in area "wan"
[consul-test-0] 2017/09/06 14:44:32 [INFO] agent: (LAN) joined: 3 Err: <nil>
[consul-test-0] 2017/09/06 14:44:32 [INFO] agent: Join LAN completed. Synced with 3 initial agents
[consul-test-0] 2017/09/06 14:44:38 [WARN] raft: Heartbeat timeout from "" reached, starting election
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: Node at 10.36.204.14:8300 [Candidate] entering Candidate state in term 8
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: Election won. Tally: 2
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: Node at 10.36.204.14:8300 [Leader] entering Leader state
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: Added peer 7931eb2f-3e44-831e-acff-d8345ad345ae, starting replication
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: Added peer 29f263d1-e7b5-e905-13b1-931f7968cb3e, starting replication
[consul-test-0] 2017/09/06 14:44:38 [INFO] consul: cluster leadership acquired
[consul-test-0] 2017/09/06 14:44:38 [INFO] consul: New leader elected: consul-test-0
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: pipelining replication to peer {Voter 7931eb2f-3e44-831e-acff-d8345ad345ae 10.36.180.8:8300}
[consul-test-0] 2017/09/06 14:44:38 [WARN] raft: AppendEntries to {Voter 29f263d1-e7b5-e905-13b1-931f7968cb3e 10.33.92.6:8300} rejected, sending older logs (next: 178)
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: pipelining replication to peer {Voter 29f263d1-e7b5-e905-13b1-931f7968cb3e 10.33.92.6:8300}
[consul-test-0] 2017/09/06 14:44:38 [INFO] consul: member 'consul-test-0' joined, marking health alive
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: Updating configuration with RemoveServer (29f263d1-e7b5-e905-13b1-931f7968cb3e, ) to [{Suffrage:Voter ID:7931eb2f-3e44-831e-acff-d8345ad345ae Address:10.36.180.8:8300} {Suffrage:Voter ID:b84a7750-dcdb-c63a-1ae8-2ef036731c81 Address:10.37.84.6:8300}]
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: Removed peer 29f263d1-e7b5-e905-13b1-931f7968cb3e, stopping replication after 184
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: aborting pipeline replication to peer {Voter 29f263d1-e7b5-e905-13b1-931f7968cb3e 10.33.92.6:8300}
[consul-test-0] 2017/09/06 14:44:38 [INFO] consul: removed server with duplicate ID: 29f263d1-e7b5-e905-13b1-931f7968cb3e
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: Updating configuration with AddNonvoter (29f263d1-e7b5-e905-13b1-931f7968cb3e, 10.33.92.13:8300) to [{Suffrage:Voter ID:7931eb2f-3e44-831e-acff-d8345ad345ae Address:10.36.180.8:8300} {Suffrage:Voter ID:b84a7750-dcdb-c63a-1ae8-2ef036731c81 Address:10.37.84.6:8300} {Suffrage:Nonvoter ID:29f263d1-e7b5-e905-13b1-931f7968cb3e Address:10.33.92.13:8300}]
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: Added peer 29f263d1-e7b5-e905-13b1-931f7968cb3e, starting replication
[consul-test-0] 2017/09/06 14:44:38 [WARN] raft: AppendEntries to {Nonvoter 29f263d1-e7b5-e905-13b1-931f7968cb3e 10.33.92.13:8300} rejected, sending older logs (next: 185)
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: Updating configuration with RemoveServer (7931eb2f-3e44-831e-acff-d8345ad345ae, ) to [{Suffrage:Voter ID:b84a7750-dcdb-c63a-1ae8-2ef036731c81 Address:10.37.84.6:8300} {Suffrage:Nonvoter ID:29f263d1-e7b5-e905-13b1-931f7968cb3e Address:10.33.92.13:8300}]
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: pipelining replication to peer {Nonvoter 29f263d1-e7b5-e905-13b1-931f7968cb3e 10.33.92.13:8300}
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: Removed peer 7931eb2f-3e44-831e-acff-d8345ad345ae, stopping replication after 186
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: aborting pipeline replication to peer {Voter 7931eb2f-3e44-831e-acff-d8345ad345ae 10.36.180.8:8300}
[consul-test-0] 2017/09/06 14:44:38 [INFO] consul: removed server with duplicate ID: 7931eb2f-3e44-831e-acff-d8345ad345ae
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: Updating configuration with AddNonvoter (7931eb2f-3e44-831e-acff-d8345ad345ae, 10.37.92.8:8300) to [{Suffrage:Voter ID:b84a7750-dcdb-c63a-1ae8-2ef036731c81 Address:10.37.84.6:8300} {Suffrage:Nonvoter ID:29f263d1-e7b5-e905-13b1-931f7968cb3e Address:10.33.92.13:8300} {Suffrage:Nonvoter ID:7931eb2f-3e44-831e-acff-d8345ad345ae Address:10.37.92.8:8300}]
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: Added peer 7931eb2f-3e44-831e-acff-d8345ad345ae, starting replication
[consul-test-0] 2017/09/06 14:44:38 [INFO] consul: member 'consul-test-1' joined, marking health alive
[consul-test-0] 2017/09/06 14:44:38 [WARN] raft: AppendEntries to {Nonvoter 7931eb2f-3e44-831e-acff-d8345ad345ae 10.37.92.8:8300} rejected, sending older logs (next: 187)
[consul-test-0] 2017/09/06 14:44:38 [INFO] raft: pipelining replication to peer {Nonvoter 7931eb2f-3e44-831e-acff-d8345ad345ae 10.37.92.8:8300}
[consul-test-0] 2017/09/06 14:44:38 [INFO] agent: Synced node info
[consul-test-0] 2017/09/06 14:44:58 [INFO] raft: Updating configuration with AddStaging (29f263d1-e7b5-e905-13b1-931f7968cb3e, 10.33.92.13:8300) to [{Suffrage:Voter ID:b84a7750-dcdb-c63a-1ae8-2ef036731c81 Address:10.37.84.6:8300} {Suffrage:Voter ID:29f263d1-e7b5-e905-13b1-931f7968cb3e Address:10.33.92.13:8300} {Suffrage:Nonvoter ID:7931eb2f-3e44-831e-acff-d8345ad345ae Address:10.37.92.8:8300}]
[consul-test-0] 2017/09/06 14:44:58 [INFO] raft: Updating configuration with AddStaging (7931eb2f-3e44-831e-acff-d8345ad345ae, 10.37.92.8:8300) to [{Suffrage:Voter ID:b84a7750-dcdb-c63a-1ae8-2ef036731c81 Address:10.37.84.6:8300} {Suffrage:Voter ID:29f263d1-e7b5-e905-13b1-931f7968cb3e Address:10.33.92.13:8300} {Suffrage:Voter ID:7931eb2f-3e44-831e-acff-d8345ad345ae Address:10.37.92.8:8300}]

@preetapan (Contributor)

@erkolson Is the cluster operational otherwise, and are you able to use it for service registration/KV writes, etc.?

The wrong-IP-address issue you mentioned above might be a temporary sync issue that affects the output of consul operator raft list-peers until the leader does a reconcile step that fixes what is displayed. I will have to test it out some more to confirm, though.

@erkolson commented Sep 6, 2017

Indeed, consul kv get and put are working.

I'll leave it running for a bit longer to see if the peers list reconciles. So far, ~30 minutes, no change. consul members does show the correct IP though.

@jcassee commented Sep 6, 2017

Although at the moment I have no logs to show for it, I had the exact same problem when running the master branch.

@preetapan (Contributor)

@erkolson Do you mind trying the same test with 5 servers instead of 3? I have a fix in the works; the root cause is that autopilot will not apply the config fix for the server with the wrong IP because that would cause it to lose quorum.

Please let me know if you still see the problem with 5 servers.
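
For anyone debugging similar behavior, autopilot's settings (such as dead-server cleanup) can be inspected and tuned via the operator CLI; a small sketch, assuming a local agent on the default ports:

    # Show the current autopilot configuration, including CleanupDeadServers.
    consul operator autopilot get-config
    # The same settings can be changed, e.g.:
    consul operator autopilot set-config -cleanup-dead-servers=true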

@erkolson commented Sep 6, 2017

@preetapan, I ran the test again with 5 servers and this time consul-node-0 was updated

Initial cluster:

Node           ID                                    Address            State     Voter  RaftProtocol
consul-test-1  964b92b9-0ac2-56af-9db9-d30771155c66  10.38.124.6:8300   leader    true   3
consul-test-4  22dd0f0a-a2e7-48d4-d4bb-33726cae71de  10.37.92.7:8300    follower  true   3
consul-test-0  3c1fc748-b5d5-684d-2b73-cc08ce72be6d  10.37.84.4:8300    follower  true   3
consul-test-3  0b9704ae-460a-762f-6c83-19c644899cf6  10.33.58.8:8300    follower  true   3
consul-test-2  1a1046e6-c627-e40d-8108-320fcd818a3e  10.45.124.12:8300  follower  true   3

Intermediate step after pods recreated:

Node           ID                                    Address            State     Voter  RaftProtocol
(unknown)      1a1046e6-c627-e40d-8108-320fcd818a3e  10.45.124.12:8300  follower  true   <=1
consul-test-4  22dd0f0a-a2e7-48d4-d4bb-33726cae71de  10.37.92.8:8300    follower  false  3
consul-test-3  0b9704ae-460a-762f-6c83-19c644899cf6  10.36.204.13:8300  follower  false  3
consul-test-0  3c1fc748-b5d5-684d-2b73-cc08ce72be6d  10.37.84.5:8300    follower  false  3
consul-test-1  964b92b9-0ac2-56af-9db9-d30771155c66  10.38.116.10:8300  follower  false  3

And finally, ~40s after startup:

Node           ID                                    Address            State     Voter  RaftProtocol
consul-test-4  22dd0f0a-a2e7-48d4-d4bb-33726cae71de  10.37.92.8:8300    follower  true   3
consul-test-3  0b9704ae-460a-762f-6c83-19c644899cf6  10.36.204.13:8300  follower  true   3
consul-test-0  3c1fc748-b5d5-684d-2b73-cc08ce72be6d  10.37.84.5:8300    leader    true   3
consul-test-1  964b92b9-0ac2-56af-9db9-d30771155c66  10.38.116.10:8300  follower  true   3
consul-test-2  1a1046e6-c627-e40d-8108-320fcd818a3e  10.36.168.4:8300   follower  false  3

Looks good!

@preetapan (Contributor)

@erkolson Thanks for your help in testing this, we really appreciate it!

@slackpad (Contributor, Author) commented Sep 6, 2017

We definitely appreciate all the help testing this. We cut a build with the fix @preetapan added via #3450: https://releases.hashicorp.com/consul/0.9.3-rc2/. If you can give that a look, please let us know if you see any remaining issues.

@erkolson commented Sep 7, 2017

I tested again with 3 nodes and rc2. This time it took ~2 minutes after startup with new IPs for the peers list to reconcile, but all seems to be working.

You're welcome for the help; I'm happy to see this functionality. I experienced firsthand all Consul pods being rescheduled simultaneously a couple of months ago :-)

@faheem-nadeem

On 0.9.3, this still seems to have the cluster leader problem.

@preetapan (Contributor)

@faheem-cliqz We will need more specific information before we can figure out what could have happened with your setup.

  • Did you set raft_protocol to 3 in your config?

  • The first upgrade still needs to be done in a rolling fashion so that servers can start using UUIDs for node IDs (Raft protocol 2 uses IP addresses as the ID). A quick way to verify the protocol version on each server is sketched below.
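
A quick sketch of that verification, assuming a local agent on the default ports:

    # Every server should report 3 in the RaftProtocol column before the
    # IP-change handling can be relied on; servers still on the old protocol
    # show a lower value (e.g. the "<=1" entries earlier in this thread).
    consul operator raft list-peers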

@Alexey-Tsarev commented Sep 30, 2017

So, I have not found a way for Consul to handle an IP address change at runtime.
I handle this via a bash script; I didn't find another solution...
The script is implemented as a Docker entrypoint file:
https://github.com/AlexeySofree/dockered/blob/master/images/consul-0.9.3/rootfs/docker-entrypoint.sh
Dockerfile:
https://github.com/AlexeySofree/dockered/blob/master/images/consul-0.9.3/Dockerfile

To test this, run Consul as a Docker container via the docker-compose.yml file:

version: '3'

services:
  # Consul
  consul:
    build:
      context: ../images/consul-0.9.3/
      args:
        - http_proxy
        - https_proxy

    image:          alexeysofree/consul:0.9.3
    container_name: consul
    network_mode:   host
    restart:        unless-stopped

    environment:
      - TERM=xterm
#      - CONSUL_ADVERTISE_PUBLIC_IP
      - CONSUL_ADVERTISE_PUBLIC_IP=30
#      - DEBUG=1

    command: -server -ui
#    command: -server -ui -bootstrap

    labels:
      - SERVICE_NAME=consul

    logging:
      driver: journald
      options:
        tag: consul

    volumes:
      - /etc/localtime:/etc/localtime:ro
      - ${DOCKER_ROOT}/consul/data:/etc/consul/data
      - ${DOCKER_ROOT}/consul/config:/etc/consul/config

    ports:
      - 8300:8300               # Server RPC    (Default 8300). This is used by servers to handle incoming requests from other agents. TCP only.
      - 8301:8301               # Serf LAN      (Default 8301). This is used to handle gossip in the LAN. Required by all agents. TCP and UDP.
      - 8301:8301/udp
      - 8302:8302               # Serf WAN      (Default 8302). This is used by servers to gossip over the WAN to other servers. TCP and UDP.
      - 8302:8302/udp
      - 127.0.0.1:8500:8500     # HTTP API      (Default 8500). This is used by clients to talk to the HTTP API. TCP only.
      - 127.0.0.1:8600:8600     # DNS Interface (Default 8600). Used to resolve DNS queries. TCP and UDP.
      - 127.0.0.1:8600:8600/udp
  # End Consul

@dev-rowbot

@AlexeySofree, to get it working I had to add '-raft-protocol 3' to my container run command.
With this enabled, I have restarted multiple times and have not seen the issue yet.

More info here: https://www.consul.io/docs/agent/options.html#_raft_protocol

@erkolson commented Oct 12, 2017

I upgraded a cluster today from 0.8.5 -> 0.9.3. Everything was already using Raft protocol 3. The rolling update happened much faster than I expected, and there was not enough time between each node being killed/restarted on the new version for the cluster to elect a leader. Even so, the cluster was able to elect a new leader once everything settled down.

I was also using "leave_on_terminate": true, which may or may not have sped up the recovery.

No data was lost.
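
For reference, a minimal sketch of that setting in an agent config file (the path is only an example):

    cat > /etc/consul.d/agent.json <<'EOF'
    {
      "leave_on_terminate": true
    }
    EOF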

@faheem-nadeem commented Oct 13, 2017

@preetapan Sorry for the delayed response. I followed your suggestion of shifting to Raft protocol 3. Leader election works properly now after a rolling deployment. There was no data loss either :)

I was using a customized Helm chart for Consul from here. Currently on Consul 0.9.3.
