
Split nomad cluster into two clusters 0.9.3 #5917

Open
jozef-slezak opened this issue Jul 3, 2019 · 11 comments

@jozef-slezak

jozef-slezak commented Jul 3, 2019

Rebooting the machines in a 3-node cluster caused a cluster split (it happened once; many other reboots went through without any problems).
It would be great if Nomad had automated CI coverage for restart scenarios.

Nomad version

0.9.3

Operating system and Environment details

Linux

Issue

Reproduction steps

Run 3 Nomad servers and 47 Nomad clients.
Quickly sudo reboot all 3 Nomad servers.

nomad server members shows one leader and no followers on one server:

nomad server members
Name   Address  Port  Status  Leader  Protocol  Build  Datacenter  Region
name1  ip1      4648  alive   true    2         0.9.3  dc2         global

My concern is that one node can report itself as leader even without a quorum. I am also not sure whether the server discovery continues after that (https://www.nomadproject.io/docs/configuration/consul.html#server_auto_join).

nomad server members shows two followers and a "no leader" error on the other two servers:

nomad server members
Name   Address  Port  Status  Leader  Protocol  Build  Datacenter  Region
name2  ip2      4648  alive   false   2         0.9.3  dc2         global
name3  ip3      4648  alive   false   2         0.9.3  dc2         global

After restarting one follower again, all three servers joined the cluster.

Could Nomad do some retries on its own, or should we configure something, maybe autopilot? How would non_voting_server help (would it also help minimize Nomad client job restarts)?
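
In case it helps, this is roughly how I can check what each server sees and inspect autopilot; only a sketch, assuming the default HTTP port 4646 and that the nomad operator subcommands are available in 0.9.3 (the addresses are our three server IPs):

for host in 172.16.23.51 172.16.23.67 172.16.23.83; do
  echo "== $host =="
  # serf membership as seen by this server
  NOMAD_ADDR="http://$host:4646" nomad server members
  # the raft peer set can differ from serf membership; -stale lets a non-leader answer
  NOMAD_ADDR="http://$host:4646" nomad operator raft list-peers -stale
done

# inspect or tune autopilot; it cleans up dead servers, but it cannot
# create a quorum out of a single node
nomad operator autopilot get-config
nomad operator autopilot set-config -cleanup-dead-servers=true -server-stabilization-time=10s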

Nomad Server logs (if appropriate)

@cgbaker
Contributor

cgbaker commented Jul 3, 2019

If possible, can you post the server config?

@jozef-slezak
Author

jozef-slezak commented Jul 4, 2019

Nomad config (nomad.hcl)

addresses = {
  http = "0.0.0.0"
}
advertise = {
  http = "172.16.23.67"
  rpc = "172.16.23.67"
  serf = "172.16.23.67"
}
bind_addr = "172.16.23.67"
client = {
  enabled = true
  network_interface = "bond0"
  options = {
    driver.raw_exec.enable = 1
    driver.raw_exec.no_cgroups = 1
  }
  meta = {
    "abis-manager" = true
  }
}
enable_syslog = true
data_dir = "/var/lib/nomad"
datacenter = "dc2"
disable_update_check = true
log_level = "INFO"
server = {
  bootstrap_expect = 3
  enabled = true
  encrypt = "JrVVHZY9wTMQvp107LpLAA=="
  }

Nomad systemd service file (nomad.service)

[Unit]
Description=HashiCorp Nomad
After=network-online.target
Requires=network-online.target

Wants=consul.service
After=consul.service

[Service]
Type=simple
ExecStart=/usr/sbin/nomad agent -config=/etc/nomad.d
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
LimitNOFILE=65536
LimitNPROC=65536

[Install]
WantedBy=multi-user.target

Consul config (consul.hcl)

bind_addr = "172.16.23.67"
client_addr = "0.0.0.0"
data_dir = "/var/lib/consul"
datacenter = "dc2"
disable_update_check = true
encrypt = "XeK8LHcwhHGf54lk8M4dpw=="
encrypt_verify_incoming = false
encrypt_verify_outgoing = false
log_level = "INFO"
retry_join = [  "172.16.23.51","172.16.23.67","172.16.23.83" ]
rejoin_after_leave = true
ui = true
server = true
bootstrap_expect = 3

Consul systemd file (consul.service)

[Unit]
Description=HashiCorp Consul
After=network-online.target
Requires=network-online.target

[Service]
Type=simple
ExecStart=/usr/sbin/consul agent -config-dir=/etc/consul.d
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
User=consul

[Install]
WantedBy=multi-user.target

@jozef-slezak
Author

We were able to simulate the same behaviour by running systemctl restart nomad on all three servers.

@jozef-slezak
Author

jozef-slezak commented Jul 4, 2019

Does it make sense to work around this issue by disabling server discovery via Consul and enumerating the server IP addresses instead (we have fixed IP addresses on physical infrastructure)? I am thinking about:

consul {
  server_auto_join    = false
}
server {
  enabled          = true
  bootstrap_expect = 3
  server_join {
    retry_join = [ "1.1.1.1", "2.2.2.2" ]
    retry_max = 0 # infinite
    retry_interval = "3s"
  }
}

@lfarnell
Contributor

lfarnell commented Jul 5, 2019

@jozef-slezak Are you running systemctl restart nomad on all 3 servers at the same time?

@jozef-slezak
Author

Yes, I am running systemctl restart nomad on all three servers at nearly the same time.

@lfarnell
Contributor

lfarnell commented Jul 5, 2019

I believe the problem you are facing is that you are effectively breaking consensus between the server nodes by restarting all the processes at the same time. If you need to restart server nodes, you should generally restart them one at a time, so that another server has the chance to become the leader, the remaining nodes can continue as followers, and state can be replicated safely for durability. This is also why it is advised to run an odd number of servers, to avoid the scenario you have described in this issue. Hope this helps.
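
As a rough sketch of what a one-at-a-time restart could look like (the hostnames and the use of ssh are just assumptions, adjust for your environment):

for host in server1 server2 server3; do
  ssh "$host" sudo systemctl restart nomad
  # wait until the cluster reports a leader again before restarting the next server
  until nomad server members | awk '$5 == "true"' | grep -q .; do
    sleep 5
  done
done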

@jozef-slezak
Author

I understand; restarting all servers at the same time simulates a power outage. I believe the implementation is meant to work properly even in this scenario (bootstrap_expect = 3 servers).

From my point of view, we reproduced a bug: one node reports itself as leader without a quorum (see the nomad server members output at the beginning of the issue). With three servers, at least two nodes must be up for there to be a leader.

@cgbaker
Contributor

cgbaker commented Jul 5, 2019

@jozef-slezak, thanks for the report. I've tried reproducing this without any success; I will continue to look into it. Furthermore, I'll bring this up with the team.

@jozef-slezak
Author

The best way to reproduce: automate cluster restarts and repeat until it breaks.
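
Something along these lines, for example; only a sketch, the hostnames and ssh access are assumptions:

while true; do
  # restart all three servers at roughly the same time, like a power outage
  for host in server1 server2 server3; do
    ssh "$host" sudo systemctl restart nomad &
  done
  wait
  sleep 30   # give the cluster time to elect (or fail to elect) a leader
  # then compare what each server reports
  for host in server1 server2 server3; do
    echo "== $host =="
    NOMAD_ADDR="http://$host:4646" nomad server members
  done
  # stop and inspect manually once the servers disagree about the leader
done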

@cgbaker
Contributor

cgbaker commented Jul 5, 2019

Okay, just saw it with the latest build of Nomad (11afd99). We will take a deeper look at this. Thanks for the report!

@cgbaker cgbaker removed their assignment Jul 5, 2019