raft: Allow Join to be called multiple times for the same cluster member #2198
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #2198      +/-   ##
==========================================
+ Coverage   61.07%   61.19%   +0.12%
==========================================
  Files         128      128
  Lines       20556    20581      +25
==========================================
+ Hits        12554    12595      +41
+ Misses       6627     6598      -29
- Partials     1375     1388      +13
```
@aluzzardi: This will handle the case where the user wants to manually update the advertise address, or rejoin a cluster where all the IPs have changed, as we discussed. However, for automatic IP address updating, I'm thinking now that it's better to include the advertise address in the
One potential issue with automatically updating the IP addresses (or encouraging people to script it) is that if someone wrongly copies the
I've added a commit that avoids skipping
I have some moby changes related to this at https://github.com/aaronlehmann/docker/tree/swarmkit-rejoin. It also covers #2199. I'm questioning whether allowing So now I'm leaning towards reverting the ability to run
@aaronlehmann It could be a worker that was down while the other managers changed IPs, or alternatively a worker that went down knowing about only one manager, which has since changed its IP. However, it seems less important for workers to be able to re-join the cluster as before: spinning up a new node seems fine if it's a worker. Other than that, I think I agree making
There's one use case I remembered where repeating I tested this use case with the linked moby fork and it seems to work. So, the question is whether repeated
@aaronlehmann Wondering if it'd make sense to put that in `node update`?
`node update` on the node itself or on another manager?
@aaronlehmann I'm thinking if you call it on another manager, you can either specify the address for that node, or tell the cluster to pick the address that the node is connected to the cluster from?
Sorry, I don't mean to bikeshed this PR - I'm fine with not supporting it at all as well. If we did want to support it, "updating" the address of the node makes more sense than re-joining the cluster as an operation, but I don't feel very strongly.
Using This would be in addition to supporting some other mechanism (such as repeated Despite being more technically complex, I'd be curious to hear @aluzzardi's thoughts, since he was the one who suggested allowing
The code mostly LGTM - I have 2 non-blocking questions though:
It won't fail in this case, it just will not call
That's a good point, but I'm not sure I see a good way to unify them. The use case for In theory
Ok, thank you for explaining! LGTM if
@aluzzardi: I think if we get this in, it will resolve the remaining problems seen with the e2e promotion/demotion tests. The old code won't let a node that is already part of the raft cluster join it again. If a node is demoted and removes its state, but is immediately re-promoted and never actually gets removed from the raft cluster, it won't be able to rejoin. With this PR there is no such problem, because repeated joins are allowed.
We had (have?) the same issue with classic Swarm. I believe @wsong had a workaround?
In Swarm's case, we were using the engine ID (which I believe is defined in |
I think the use cases to support multiple joins are:
I think both make sense: if the user passes an address, then we use that one no matter what, bypassing all detection. Thoughts?
This sounds good to me if it's the approach people are most comfortable with. I have some older thoughts here: #2198 (comment) #2198 (comment). Basically, of the use cases above, (1) doesn't seem very important to me, (2) is an extreme case that might require force-new-cluster anyway, and (3) is easy to fix with a leave/join cycle, since temporarily removing a worker is not disruptive (I assume this means "all other managers have changed IP"). So (4), specifying a different fixed advertise address or listen address, or turning address autodetection on or off, is the only use case that I see as valuable for repeated joins.
ping @aluzzardi any thoughts on the above?
@binman-docker @jakegdocker @thaJeztah: Any thoughts on the UX?
I like the behavior and the option; I can see it being handy in some environments (especially changing the IP without removing the node from the Swarm, if people don't want to churn the containers). For our use case it's not really relevant: IPs don't change for the lifetime of the instance, we just replace hosts, and they only join once. I'm not sure about the cases where a node would need to rejoin a cluster after all of the other IPs changed - is their change of IP not something that would have been communicated to the node as part of normal gossip? Or are we targeting the case where those other nodes lost contact and could not gossip the change of IP?
Gossip is only used for networking. The use case here is for a node to find the managers when it doesn't know any current manager IP addresses.
I think we should move forward with this PR. It fixes some actual problems; for example, see #2196 (comment). The UX discussion is more related to #2199, so we shouldn't let it block this PR, which is purely internal and doesn't change UX. The necessary changes in moby/moby were already merged some time ago in moby/moby#33361.
This still LGTM, although since it's been a while, would it be worth a rebase + CI run to make sure there are no semantic conflicts?
If Join is called and the member already exists in the cluster, instead of returning an error, it will update the node's address with the new one provided to the RPC. This will allow managers to update their addresses automatically on startup, if they were configured to autodetect the addresses. It will also be possible to manually repeat the "docker swarm join" command, to specify a different advertise address, or rejoin a cluster when all known manager IPs have changed. Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Rebased
cc @aluzzardi @cyli