Split nomad cluster into two clusters 0.9.3 #5917
Comments
If possible, can you post the server config?
Nomad config (nomad.hcl)
Nomad systemd service file (nomad.service)
Consul config (consul.hcl)
Consul systemd file (consul.service)
We were able to simulate the same behaviour by calling systemctl restart nomad on all three servers.
Does it make sense to work around this issue by disabling server discovery via Consul and enumerating the server IP addresses instead (we have fixed IP addresses on physical infrastructure)? I am thinking about:
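On the workaround idea above: Nomad (since 0.8.4, so including 0.9.3) supports pinning server addresses in a server_join block instead of relying on Consul discovery. A minimal sketch, with placeholder IP addresses that are not from this issue:

```hcl
# Sketch only; the addresses below are hypothetical placeholders.
server {
  enabled          = true
  bootstrap_expect = 3

  server_join {
    retry_join     = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
    retry_max      = 0      # 0 = retry joining indefinitely
    retry_interval = "15s"
  }
}
```

With retry_max = 0 the servers keep retrying the join after a simultaneous restart, independently of whether Consul is reachable yet.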
@jozef-slezak Are you running
Yes, I am running
So I believe the problem you are facing is that you are effectively breaking consensus between the server nodes by restarting all the processes at the same time. If you need to restart server nodes, you generally restart them one at a time, so that another server can become the leader, the other nodes can continue as followers, and state is replicated safely for durability. This is why it is advised to run an odd number of servers: to avoid the scenario you have described in this issue. Hope this helps.
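The quorum arithmetic behind that advice can be sketched quickly: a Raft cluster of n voting servers needs a majority, floor(n/2) + 1, to elect a leader, so 3 servers tolerate one failure and an even server count buys no extra tolerance.

```python
# Raft majority quorum: floor(n/2) + 1 voting servers must be reachable
# for leader election and log commits to succeed.
def quorum(n: int) -> int:
    return n // 2 + 1

def failure_tolerance(n: int) -> int:
    # How many servers can be lost while a leader can still be elected.
    return n - quorum(n)

for n in (3, 4, 5):
    print(f"{n} servers: quorum={quorum(n)}, tolerates {failure_tolerance(n)} failure(s)")
```

This is also why a single node claiming leadership in a 3-server cluster, as reported below, would be surprising: it should need at least one other reachable voter.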
I understand; restarting all servers at the same time simulates a power outage. I believe the implementation is meant to work properly even in this scenario (bootstrap_expect = 3 servers). From my point of view, we reproduced a bug: one node is a leader without a quorum (see
@jozef-slezak, thanks for the report. I've tried reproducing this without any success; I will continue to look into it. Furthermore, I'll bring this up with the team.
Best way to reproduce: automate cluster restarts and repeat until it breaks.
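A repro harness along those lines could look like the following dry-run sketch (the hostnames are made-up placeholders; drop the echo to actually run the commands):

```shell
#!/bin/sh
# Hypothetical repro sketch: reboot all three Nomad servers at once,
# then inspect membership once they come back. Printed as a dry run.
reboot_all() {
  for host in nomad-server-1 nomad-server-2 nomad-server-3; do
    echo "ssh $host sudo reboot"
  done
}

check_cluster() {
  # After the servers come back, one leader and two followers is healthy;
  # anything else is the split described in this issue.
  echo "nomad server members"
}

reboot_all
check_cluster
```

Looping reboot_all and check_cluster until the member list disagrees across servers would automate the "repeat until it breaks" step.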
Okay, just saw it with the latest build of Nomad (11afd99). We will take a deeper look at this. Thanks for the report! |
Rebooting the machines in a 3-node cluster caused a cluster split (it happened once; many other reboots completed without any problems).
It would be great if Nomad had automated CI covering restarts.
Nomad version
0.9.3
Operating system and Environment details
Linux
Issue
Reproduction steps
Running 3 Nomad servers and 47 Nomad clients.
Quick sudo reboot of all 3 Nomad servers.
nomad server members shows one leader and no followers on one server:
My concern is that one node can be a leader even without a quorum. I am not sure whether server discovery keeps retrying (https://www.nomadproject.io/docs/configuration/consul.html#server_auto_join).
nomad server members shows two followers and a "no leader" error on the other two servers:
After restarting one of the followers again, all three servers joined the cluster.
Could Nomad do some retries on its own? Or should we configure something, maybe Autopilot? How would non_voting_servers help (would they also help minimize Nomad client job restarts)?
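On the Autopilot question: Nomad servers do accept an autopilot stanza (Autopilot itself runs by default). A sketch of the relevant knobs is below; the values shown are, to my understanding, the defaults rather than a recommendation for this issue:

```hcl
# Sketch of the Nomad server autopilot stanza; values are the defaults.
autopilot {
  cleanup_dead_servers      = true     # reap servers that left the cluster
  last_contact_threshold    = "200ms"  # max leader round-trip before a server is unhealthy
  max_trailing_logs         = 250      # max Raft log lag before a server is unhealthy
  server_stabilization_time = "10s"    # time a new server must be healthy before promotion
}
```

Note that non_voting_server is a separate per-server flag, and I believe it requires Nomad Enterprise; it would reduce quorum size rather than prevent the split described here.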
Nomad Server logs (if appropriate)