
Feature Request: Transfer leadership #476

Closed
ctro opened this issue Nov 17, 2014 · 13 comments

@ctro

ctro commented Nov 17, 2014

I'm running an immutable infrastructure, of which our Consul cluster is a part. Sometimes we need to upgrade the Consul cluster, for example because the underlying OS needs updates.

It's super easy to spin up new nodes and join them to the existing cluster. It's also super easy to run consul leave on the old nodes that aren't the leader. Everything just works.

But running consul leave on the "old" cluster's leader leaves us with a "new", but leaderless, cluster.

It would be great if there were some sort of leadership transfer command that would allow me to transfer cluster leadership to one of the "new" nodes before running consul leave on the last remaining "old" node.
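
For concreteness, a rough sketch of the workflow described above, assuming the HTTP API on the default port 8500. The transfer command itself is hypothetical; it is the feature being requested here:

    # 1. Find the current leader so it can be drained last.
    curl -s http://127.0.0.1:8500/v1/status/leader
    # => "10.20.20.20:8300" (for example)

    # 2. Run a graceful leave on each old node that is NOT the leader.
    consul leave

    # 3. On the old leader, hand leadership to one of the new nodes first
    #    (hypothetical command -- this is the requested feature), then leave.
    # consul transfer-leadership <new-node>
    consul leave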

@armon
Member

armon commented Nov 17, 2014

This is odd. There should be a brief period of time without a leader (<1s) but then a new leader should automatically be elected. Is this not what you are experiencing?
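
For anyone checking this, assuming the HTTP API on the default port 8500, the status endpoints show whether an election has completed:

    # Returns the current leader's address, or an empty string if there is none.
    curl -s http://127.0.0.1:8500/v1/status/leader

    # Returns the current set of Raft peers participating in elections.
    curl -s http://127.0.0.1:8500/v1/status/peers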

@ctro
Author

ctro commented Nov 20, 2014

Correct. I'm not seeing a new leader be elected. I've reproduced this a couple of times in our production environment (ugh :)).

After 3 minutes of a leaderless cluster I stopped one of the "new" nodes and brought it back up in bootstrap mode, which made the cluster functional again. Restarting that same node again without bootstrap allowed one of the other "new" nodes to pick up leadership very quickly.
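
Roughly, the steps that recovered the cluster (config dir illustrative; -bootstrap is the standard single-node bootstrap flag):

    # On one of the "new" servers: stop the agent, then restart it once in
    # bootstrap mode so it can elect itself leader.
    consul agent -config-dir /etc/consul.d -bootstrap

    # Once the cluster is functional again, restart the same node WITHOUT
    # -bootstrap; one of the other "new" nodes picks up leadership.
    consul agent -config-dir /etc/consul.d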

We're running v0.4.0, which I see is no longer the latest.

Also, I noticed this ticket, which seems potentially related:
#360

@armon
Member

armon commented Nov 20, 2014

Yes, there was an issue in 0.4.0, fixed in 0.4.1, that could explain this behavior. Please try with the latest version, and let us know if it's still an issue!

@ctro
Author

ctro commented Nov 20, 2014

Thanks @armon. It'll be some days before I can get us upgraded to 0.4.1 but I will most def. report back.

@francois

francois commented Dec 6, 2014

I have a similar issue. Using the Vagrantfile in francois/consul-playground@642af4d: once the three Consul servers are booted, if I interrupt them all (Ctrl+C from the shell) and don't clean out /var/lib/consul (the -data-dir), then no new leader is elected when they come back up. Here's the info log:

# consul10, first server booted
vagrant@consul10:/home/vagrant  $ consul agent -config-dir /etc/consul.d
==> WARNING: Expect Mode enabled, expecting 3 servers
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
         Node name: 'consul10'
        Datacenter: 'dc1'
            Server: true (bootstrap: false)
       Client Addr: 127.0.0.1 (HTTP: 8500, DNS: 8600, RPC: 8400)
      Cluster Addr: 10.10.10.10 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2014/12/06 17:31:15 [INFO] serf: EventMemberJoin: consul10 10.10.10.10
    2014/12/06 17:31:15 [INFO] serf: EventMemberJoin: consul10.dc1 10.10.10.10
    2014/12/06 17:31:15 [INFO] raft: Node at 10.10.10.10:8300 [Follower] entering Follower state
    2014/12/06 17:31:15 [INFO] consul: adding server consul10 (Addr: 10.10.10.10:8300) (DC: dc1)
    2014/12/06 17:31:15 [INFO] consul: adding server consul10.dc1 (Addr: 10.10.10.10:8300) (DC: dc1)
    2014/12/06 17:31:15 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:31:15 [INFO] agent: Joining cluster...
    2014/12/06 17:31:15 [INFO] agent: (LAN) joining: [10.10.10.10 10.20.20.20 10.30.30.30]
    2014/12/06 17:31:16 [WARN] raft: EnableSingleNode disabled, and no known peers. Aborting election.
    2014/12/06 17:31:18 [INFO] serf: EventMemberJoin: consul20 10.20.20.20
    2014/12/06 17:31:18 [INFO] consul: adding server consul20 (Addr: 10.20.20.20:8300) (DC: dc1)
    2014/12/06 17:31:20 [INFO] serf: EventMemberJoin: consul30 10.30.30.30
    2014/12/06 17:31:20 [INFO] consul: adding server consul30 (Addr: 10.30.30.30:8300) (DC: dc1)
    2014/12/06 17:31:20 [INFO] consul: Attempting bootstrap with nodes: [10.10.10.10:8300 10.20.20.20:8300 10.30.30.30:8300]
    2014/12/06 17:31:20 [INFO] consul: New leader elected: consul20
    2014/12/06 17:31:21 [INFO] agent: (LAN) joined: 3 Err: <nil>
    2014/12/06 17:31:21 [INFO] agent: Join completed. Synced with 3 initial agents
    2014/12/06 17:31:23 [INFO] agent: Synced service 'consul'
# consul20, second server
vagrant@consul20:/home/vagrant  $ consul agent -config-dir /etc/consul.d
==> WARNING: Expect Mode enabled, expecting 3 servers
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
         Node name: 'consul20'
        Datacenter: 'dc1'
            Server: true (bootstrap: false)
       Client Addr: 127.0.0.1 (HTTP: 8500, DNS: 8600, RPC: 8400)
      Cluster Addr: 10.20.20.20 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2014/12/06 17:31:18 [INFO] serf: EventMemberJoin: consul20 10.20.20.20
    2014/12/06 17:31:18 [INFO] serf: EventMemberJoin: consul20.dc1 10.20.20.20
    2014/12/06 17:31:18 [INFO] raft: Node at 10.20.20.20:8300 [Follower] entering Follower state
    2014/12/06 17:31:18 [INFO] consul: adding server consul20 (Addr: 10.20.20.20:8300) (DC: dc1)
    2014/12/06 17:31:18 [INFO] consul: adding server consul20.dc1 (Addr: 10.20.20.20:8300) (DC: dc1)
    2014/12/06 17:31:18 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:31:18 [INFO] agent: Joining cluster...
    2014/12/06 17:31:18 [INFO] agent: (LAN) joining: [10.10.10.10 10.20.20.20 10.30.30.30]
    2014/12/06 17:31:18 [INFO] serf: EventMemberJoin: consul10 10.10.10.10
    2014/12/06 17:31:18 [INFO] consul: adding server consul10 (Addr: 10.10.10.10:8300) (DC: dc1)
    2014/12/06 17:31:19 [WARN] raft: EnableSingleNode disabled, and no known peers. Aborting election.
    2014/12/06 17:31:20 [INFO] serf: EventMemberJoin: consul30 10.30.30.30
    2014/12/06 17:31:20 [INFO] consul: adding server consul30 (Addr: 10.30.30.30:8300) (DC: dc1)
    2014/12/06 17:31:20 [INFO] consul: Attempting bootstrap with nodes: [10.20.20.20:8300 10.10.10.10:8300 10.30.30.30:8300]
    2014/12/06 17:31:20 [WARN] raft: Heartbeat timeout reached, starting election
    2014/12/06 17:31:20 [INFO] raft: Node at 10.20.20.20:8300 [Candidate] entering Candidate state
    2014/12/06 17:31:20 [INFO] raft: Election won. Tally: 2
    2014/12/06 17:31:20 [INFO] raft: Node at 10.20.20.20:8300 [Leader] entering Leader state
    2014/12/06 17:31:20 [INFO] consul: cluster leadership acquired
    2014/12/06 17:31:20 [INFO] raft: pipelining replication to peer 10.30.30.30:8300
    2014/12/06 17:31:20 [INFO] raft: pipelining replication to peer 10.10.10.10:8300
    2014/12/06 17:31:20 [INFO] consul: New leader elected: consul20
    2014/12/06 17:31:20 [INFO] consul: member 'consul10' joined, marking health alive
    2014/12/06 17:31:20 [INFO] consul: member 'consul30' joined, marking health alive
    2014/12/06 17:31:20 [INFO] consul: member 'consul20' joined, marking health alive
    2014/12/06 17:31:21 [INFO] agent: (LAN) joined: 3 Err: <nil>
    2014/12/06 17:31:21 [INFO] agent: Join completed. Synced with 3 initial agents
    2014/12/06 17:31:21 [INFO] agent: Synced service 'consul'
# consul30, third server
vagrant@consul30:/home/vagrant  $ consul agent -config-dir /etc/consul.d
==> WARNING: Expect Mode enabled, expecting 3 servers
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
         Node name: 'consul30'
        Datacenter: 'dc1'
            Server: true (bootstrap: false)
       Client Addr: 127.0.0.1 (HTTP: 8500, DNS: 8600, RPC: 8400)
      Cluster Addr: 10.30.30.30 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2014/12/06 17:31:20 [INFO] serf: EventMemberJoin: consul30 10.30.30.30
    2014/12/06 17:31:20 [INFO] serf: EventMemberJoin: consul30.dc1 10.30.30.30
    2014/12/06 17:31:20 [INFO] raft: Node at 10.30.30.30:8300 [Follower] entering Follower state
    2014/12/06 17:31:20 [INFO] consul: adding server consul30 (Addr: 10.30.30.30:8300) (DC: dc1)
    2014/12/06 17:31:20 [INFO] consul: adding server consul30.dc1 (Addr: 10.30.30.30:8300) (DC: dc1)
    2014/12/06 17:31:20 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:31:20 [INFO] agent: Joining cluster...
    2014/12/06 17:31:20 [INFO] agent: (LAN) joining: [10.10.10.10 10.20.20.20 10.30.30.30]
    2014/12/06 17:31:20 [INFO] serf: EventMemberJoin: consul20 10.20.20.20
    2014/12/06 17:31:20 [INFO] serf: EventMemberJoin: consul10 10.10.10.10
    2014/12/06 17:31:20 [INFO] consul: adding server consul20 (Addr: 10.20.20.20:8300) (DC: dc1)
    2014/12/06 17:31:20 [INFO] consul: Attempting bootstrap with nodes: [10.30.30.30:8300 10.20.20.20:8300 10.10.10.10:8300]
    2014/12/06 17:31:20 [INFO] consul: adding server consul10 (Addr: 10.10.10.10:8300) (DC: dc1)
    2014/12/06 17:31:20 [INFO] agent: (LAN) joined: 3 Err: <nil>
    2014/12/06 17:31:20 [INFO] agent: Join completed. Synced with 3 initial agents
    2014/12/06 17:31:20 [INFO] consul: New leader elected: consul20
    2014/12/06 17:31:22 [INFO] agent: Synced service 'consul'

Then, I stop consul10, 30 and 20, in that order (20 is the leader):

# consul10
^C==> Caught signal: interrupt
==> Gracefully shutting down agent...
    2014/12/06 17:32:42 [INFO] consul: server starting leave
    2014/12/06 17:32:42 [INFO] serf: EventMemberLeave: consul10.dc1 10.10.10.10
    2014/12/06 17:32:42 [INFO] consul: removing server consul10.dc1 (Addr: 10.10.10.10:8300) (DC: dc1)
    2014/12/06 17:32:43 [INFO] serf: EventMemberLeave: consul10 10.10.10.10
    2014/12/06 17:32:43 [INFO] consul: removing server consul10 (Addr: 10.10.10.10:8300) (DC: dc1)
    2014/12/06 17:32:43 [INFO] raft: Removed ourself, transitioning to follower
    2014/12/06 17:32:43 [INFO] agent: requesting shutdown
    2014/12/06 17:32:43 [INFO] consul: shutting down server
    2014/12/06 17:32:43 [INFO] agent: shutdown complete
# consul30
    2014/12/06 17:32:43 [INFO] serf: EventMemberLeave: consul10 10.10.10.10
    2014/12/06 17:32:43 [INFO] consul: removing server consul10 (Addr: 10.10.10.10:8300) (DC: dc1)
^C==> Caught signal: interrupt
==> Gracefully shutting down agent...
    2014/12/06 17:32:45 [INFO] consul: server starting leave
    2014/12/06 17:32:45 [INFO] serf: EventMemberLeave: consul30.dc1 10.30.30.30
    2014/12/06 17:32:45 [INFO] consul: removing server consul30.dc1 (Addr: 10.30.30.30:8300) (DC: dc1)
    2014/12/06 17:32:46 [INFO] serf: EventMemberLeave: consul30 10.30.30.30
    2014/12/06 17:32:46 [INFO] consul: removing server consul30 (Addr: 10.30.30.30:8300) (DC: dc1)
    2014/12/06 17:32:46 [INFO] raft: Removed ourself, transitioning to follower
    2014/12/06 17:32:46 [WARN] raft: Clearing log suffix from 12 to 12
    2014/12/06 17:32:46 [INFO] agent: requesting shutdown
    2014/12/06 17:32:46 [INFO] consul: shutting down server
    2014/12/06 17:32:46 [INFO] agent: shutdown complete
# consul20
    2014/12/06 17:32:43 [INFO] serf: EventMemberLeave: consul10 10.10.10.10
    2014/12/06 17:32:43 [INFO] consul: removing server consul10 (Addr: 10.10.10.10:8300) (DC: dc1)
    2014/12/06 17:32:43 [INFO] consul: server 'consul10' left, removing as peer
    2014/12/06 17:32:43 [INFO] raft: Removed peer 10.10.10.10:8300, stopping replication (Index: 10)
    2014/12/06 17:32:43 [INFO] consul: member 'consul10' left, deregistering
    2014/12/06 17:32:43 [INFO] raft: aborting pipeline replication to peer 10.10.10.10:8300
    2014/12/06 17:32:46 [INFO] serf: EventMemberLeave: consul30 10.30.30.30
    2014/12/06 17:32:46 [INFO] consul: removing server consul30 (Addr: 10.30.30.30:8300) (DC: dc1)
    2014/12/06 17:32:46 [INFO] consul: server 'consul30' left, removing as peer
    2014/12/06 17:32:46 [INFO] raft: Removed peer 10.30.30.30:8300, stopping replication (Index: 12)
    2014/12/06 17:32:46 [INFO] consul: member 'consul30' left, deregistering
    2014/12/06 17:32:46 [INFO] raft: aborting pipeline replication to peer 10.30.30.30:8300
    2014/12/06 17:32:46 [INFO] raft: pipelining replication to peer 10.30.30.30:8300
    2014/12/06 17:32:46 [INFO] raft: aborting pipeline replication to peer 10.30.30.30:8300
^C==> Caught signal: interrupt
==> Gracefully shutting down agent...
    2014/12/06 17:32:48 [INFO] consul: server starting leave
    2014/12/06 17:32:48 [INFO] serf: EventMemberLeave: consul20.dc1 10.20.20.20
    2014/12/06 17:32:48 [INFO] serf: EventMemberLeave: consul20 10.20.20.20
    2014/12/06 17:32:48 [INFO] agent: requesting shutdown
    2014/12/06 17:32:48 [INFO] consul: removing server consul20.dc1 (Addr: 10.20.20.20:8300) (DC: dc1)
    2014/12/06 17:32:48 [INFO] consul: removing server consul20 (Addr: 10.20.20.20:8300) (DC: dc1)
    2014/12/06 17:32:48 [INFO] consul: server 'consul20' left, removing as peer
    2014/12/06 17:32:48 [INFO] consul: shutting down server
    2014/12/06 17:32:48 [INFO] raft: Removed ourself, transitioning to follower
    2014/12/06 17:32:48 [INFO] raft: Node at 10.20.20.20:8300 [Follower] entering Follower state
    2014/12/06 17:32:48 [WARN] consul: deregistering self (consul20) should be done by follower
    2014/12/06 17:32:48 [INFO] agent: shutdown complete

Now, I will reboot the servers, consul20 (the old leader), then consul10 and 30, in that order:

# consul20
vagrant@consul20:/home/vagrant  $ consul agent -config-dir /etc/consul.d
==> WARNING: Expect Mode enabled, expecting 3 servers
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
         Node name: 'consul20'
        Datacenter: 'dc1'
            Server: true (bootstrap: false)
       Client Addr: 127.0.0.1 (HTTP: 8500, DNS: 8600, RPC: 8400)
      Cluster Addr: 10.20.20.20 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2014/12/06 17:34:41 [INFO] serf: EventMemberJoin: consul20 10.20.20.20
    2014/12/06 17:34:41 [INFO] serf: EventMemberJoin: consul20.dc1 10.20.20.20
    2014/12/06 17:34:41 [INFO] raft: Node at 10.20.20.20:8300 [Follower] entering Follower state
    2014/12/06 17:34:41 [INFO] consul: adding server consul20 (Addr: 10.20.20.20:8300) (DC: dc1)
    2014/12/06 17:34:41 [INFO] consul: adding server consul20.dc1 (Addr: 10.20.20.20:8300) (DC: dc1)
    2014/12/06 17:34:41 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:34:41 [INFO] agent: Joining cluster...
    2014/12/06 17:34:41 [INFO] agent: (LAN) joining: [10.10.10.10 10.20.20.20 10.30.30.30]
    2014/12/06 17:34:42 [WARN] raft: EnableSingleNode disabled, and no known peers. Aborting election.
    2014/12/06 17:34:53 [INFO] serf: EventMemberJoin: consul10 10.10.10.10
    2014/12/06 17:34:53 [INFO] consul: adding server consul10 (Addr: 10.10.10.10:8300) (DC: dc1)
    2014/12/06 17:34:57 [INFO] serf: EventMemberJoin: consul30 10.30.30.30
    2014/12/06 17:34:57 [INFO] consul: adding server consul30 (Addr: 10.30.30.30:8300) (DC: dc1)
    2014/12/06 17:34:58 [INFO] agent: (LAN) joined: 2 Err: <nil>
    2014/12/06 17:34:58 [INFO] agent: Join completed. Synced with 2 initial agents
# consul10
vagrant@consul10:/home/vagrant  $ consul agent -config-dir /etc/consul.d
==> WARNING: Expect Mode enabled, expecting 3 servers
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
         Node name: 'consul10'
        Datacenter: 'dc1'
            Server: true (bootstrap: false)
       Client Addr: 127.0.0.1 (HTTP: 8500, DNS: 8600, RPC: 8400)
      Cluster Addr: 10.10.10.10 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2014/12/06 17:34:53 [INFO] serf: EventMemberJoin: consul10 10.10.10.10
    2014/12/06 17:34:53 [INFO] serf: EventMemberJoin: consul10.dc1 10.10.10.10
    2014/12/06 17:34:53 [INFO] raft: Node at 10.10.10.10:8300 [Follower] entering Follower state
    2014/12/06 17:34:53 [INFO] consul: adding server consul10 (Addr: 10.10.10.10:8300) (DC: dc1)
    2014/12/06 17:34:53 [INFO] consul: adding server consul10.dc1 (Addr: 10.10.10.10:8300) (DC: dc1)
    2014/12/06 17:34:53 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:34:53 [INFO] agent: Joining cluster...
    2014/12/06 17:34:53 [INFO] agent: (LAN) joining: [10.10.10.10 10.20.20.20 10.30.30.30]
    2014/12/06 17:34:53 [INFO] serf: EventMemberJoin: consul20 10.20.20.20
    2014/12/06 17:34:53 [INFO] consul: adding server consul20 (Addr: 10.20.20.20:8300) (DC: dc1)
    2014/12/06 17:34:55 [WARN] raft: EnableSingleNode disabled, and no known peers. Aborting election.
    2014/12/06 17:34:57 [INFO] serf: EventMemberJoin: consul30 10.30.30.30
    2014/12/06 17:34:57 [INFO] consul: adding server consul30 (Addr: 10.30.30.30:8300) (DC: dc1)
    2014/12/06 17:35:00 [INFO] agent: (LAN) joined: 3 Err: <nil>
    2014/12/06 17:35:00 [INFO] agent: Join completed. Synced with 3 initial agents
    2014/12/06 17:35:12 [ERR] agent: failed to sync remote state: No cluster leader
# consul30
vagrant@consul30:/home/vagrant  $ consul agent -config-dir /etc/consul.d
==> WARNING: Expect Mode enabled, expecting 3 servers
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
         Node name: 'consul30'
        Datacenter: 'dc1'
            Server: true (bootstrap: false)
       Client Addr: 127.0.0.1 (HTTP: 8500, DNS: 8600, RPC: 8400)
      Cluster Addr: 10.30.30.30 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2014/12/06 17:34:57 [INFO] serf: EventMemberJoin: consul30 10.30.30.30
    2014/12/06 17:34:57 [INFO] serf: EventMemberJoin: consul30.dc1 10.30.30.30
    2014/12/06 17:34:57 [INFO] raft: Node at 10.30.30.30:8300 [Follower] entering Follower state
    2014/12/06 17:34:57 [INFO] consul: adding server consul30 (Addr: 10.30.30.30:8300) (DC: dc1)
    2014/12/06 17:34:57 [INFO] consul: adding server consul30.dc1 (Addr: 10.30.30.30:8300) (DC: dc1)
    2014/12/06 17:34:57 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:34:57 [INFO] agent: Joining cluster...
    2014/12/06 17:34:57 [INFO] agent: (LAN) joining: [10.10.10.10 10.20.20.20 10.30.30.30]
    2014/12/06 17:34:57 [INFO] serf: EventMemberJoin: consul10 10.10.10.10
    2014/12/06 17:34:57 [INFO] serf: EventMemberJoin: consul20 10.20.20.20
    2014/12/06 17:34:57 [INFO] consul: adding server consul10 (Addr: 10.10.10.10:8300) (DC: dc1)
    2014/12/06 17:34:57 [INFO] consul: adding server consul20 (Addr: 10.20.20.20:8300) (DC: dc1)
    2014/12/06 17:34:57 [INFO] agent: (LAN) joined: 3 Err: <nil>
    2014/12/06 17:34:57 [INFO] agent: Join completed. Synced with 3 initial agents
    2014/12/06 17:34:59 [WARN] raft: EnableSingleNode disabled, and no known peers. Aborting election.
    2014/12/06 17:35:24 [ERR] agent: failed to sync remote state: No cluster leader

And the "failed to sync remote state: No cluster leader" message repeats every few seconds. After 5 minutes, still no leader:

# consul10
    2014/12/06 17:35:12 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:35:39 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:35:56 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:36:18 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:36:39 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:36:58 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:37:16 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:37:44 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:38:03 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:38:32 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:38:54 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:39:16 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:39:35 [ERR] agent: failed to sync remote state: No cluster leader
    2014/12/06 17:40:04 [ERR] agent: failed to sync remote state: No cluster leader

The only way I can get a leader again is to stop all servers, destroy the contents of /var/lib/consul, and boot them again.
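
In other words, roughly (data dir per the Vagrantfile above):

    # On every server: stop the agent, wipe its state, then start it again.
    sudo rm -rf /var/lib/consul/*
    consul agent -config-dir /etc/consul.d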

@armon
Member

armon commented Dec 8, 2014

You are probably having the servers gracefully leave, which removes them from the peer set of Raft. Once 2 of the servers are removed, you lose quorum and the cluster goes into an outage. Outage recovery is done via: http://consul.io/docs/guides/outage.html

If you hard-stop them, or they crash or lose power, a new leader will be elected when they restart. A graceful leave of all servers will cause an outage.
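
For reference, a condensed sketch of the manual recovery in the linked guide for this era of Consul (addresses and data dir as used earlier in this thread; see the guide for the authoritative steps):

    # With all server agents stopped, write the full Raft peer set by hand
    # on every server, then start the agents again.
    cat > /var/lib/consul/raft/peers.json <<'EOF'
    ["10.10.10.10:8300", "10.20.20.20:8300", "10.30.30.30:8300"]
    EOF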

@AirbornePorcine

Same issue as @francois here, playing around with Consul (0.5.0 now) in a Vagrant environment.

@armon I had a look at the outage guide, but the problem is that the content of the peers.json file on all three nodes is the string "null". Removing the file didn't seem to make a difference, nor did restarting the server agents with or without -bootstrap-expect - they simply won't elect a leader without wiping my data directory.

@AirbornePorcine

Played with this a little more: I can get it to elect a new leader if I start two nodes without any bootstrap information, then start the third node with -bootstrap -join=OtherNodeIP.

Not sure if this is perhaps a unique situation?
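
In other words, roughly (config dir and addresses illustrative):

    # Nodes 2 and 3: start with no bootstrap flags at all.
    consul agent -config-dir /etc/consul.d

    # Node 1: bootstrap itself and join one of the others.
    consul agent -config-dir /etc/consul.d -bootstrap -join=10.20.20.20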

@armon
Member

armon commented Mar 23, 2015

@AirbornePorcine If the peers file is blank, this means you had the nodes do a graceful leave. This is expected behavior. If the nodes leave, you don't want them to disrupt the existing cluster unless they rejoin. At that point you would have to use the -bootstrap flag, as you discovered.

The -bootstrap-expect flag only operates if the cluster is fresh (i.e., no data in the Raft log); otherwise it is potentially unsafe.
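
A minimal sketch of the fresh-cluster setup this refers to (standard Consul JSON config keys; paths illustrative). The expect mechanism only kicks in when the data_dir holds no existing Raft state:

    cat > /etc/consul.d/server.json <<'EOF'
    {
      "server": true,
      "bootstrap_expect": 3,
      "data_dir": "/var/lib/consul"
    }
    EOF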

@AirbornePorcine

Great, thanks @armon !

@plombardi89

@ctro never reported back with his experience, unfortunately, but we're running a very similar immutable setup and have run into the same situation with 0.5.0. Is there a suggested process to follow here? I do not fully understand why the cluster goes into outage mode when, during the upgrade process, we balloon out from a cluster size of 3 to 6 nodes (1 leader, 5 peers). Shouldn't we keep quorum if we have six nodes and then shut down the older nodes, returning us to a size of three? Should I force-leave the old nodes instead?

@armon
Member

armon commented Apr 18, 2015

@plombardi89 It depends on how you are removing the older nodes. If they do not leave gracefully, then they are still part of the Raft replication group. This means only 3 of 6 members are reachable and quorum is lost. If they leave gracefully, or you remove one, force-leave it, remove another, and so on, then it will be okay as well.

As it stands you are dividing the cluster in half, and it is no longer able to safely commit new transactions.
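
The arithmetic, spelled out, plus the staged removal described above (node name illustrative):

    # 6 servers in the Raft peer set => quorum = floor(6/2) + 1 = 4.
    # Stopping 3 old servers without a graceful leave leaves only 3 of 6
    # reachable, so quorum is lost and the cluster is in an outage.
    #
    # Removing them one at a time keeps quorum the whole way down:
    consul leave                    # run on each old server, in turn
    # ...or, if an old server is already gone without leaving:
    consul force-leave old-node-1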

@ctro
Author

ctro commented Apr 28, 2015

@plombardi89
Well, that took me forever :)
I just upgraded our consul boxes to 0.5.0. Bringing up 3 new 0.5.0 nodes to join the old cluster worked.
Simply running consul leave on each of the "old" nodes successfully transferred leadership to one of the "new" nodes. Yippee :)

@armon, at this point I'll leave closing this issue up to you...

@armon closed this as completed May 27, 2015
duckhan pushed a commit to duckhan/consul that referenced this issue Oct 24, 2021
Remove the cleanup controller and replace it with the supporting logic from hashicorp#457.