-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow changing Node names since Node now have IDs #3983
Conversation
I basically removed the IP address limitation since node ID is supposed to be immutable, furthermore it is very likely that a nodes does change its IPs as well as its name at the same time for operation reasons (changing node's domain for instance) |
Any news on that PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the one concern. If you can address that then I think we can merge this.
agent/consul/state/catalog.go
Outdated
return fmt.Errorf("node ID %q for node %q aliases existing node %q", | ||
node.ID, node.Node, n.Node) | ||
// We are actually renaming a node, remove its reference first | ||
err := s.deleteNodeTxn(tx, idx, n.Node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pierresouchay One concern we have is that this will allow multiple nodes to use the same name with different UUIDs. This might break some things and at the very least could be confusing with regards to metrics and labels.
I think this could be remedied by refactoring this block of code and the lookup by name right after this. Essentially we would want to only delete the node when it has an id, the name is different and the name is not already associated with another node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mkeeler Great, I made the fix, I also fix another issue regarding case insensitivity that lead us to a major downtime of Consul.
Since DNS is case insensitive and DB as issues when similar names with different cases are added, check for unicity based on case insensitivity.
Following another big incident we had in our cluster, we also validate that adding/renaming a not does not conflicts with case insensitive matches.
We had the following error once:
- one node called: mymachine.MYDC.mydomain was shut off
- another node (different ID) was added with name: mymachine.mydc.mydomain before 72 hours
When restarting the consul server of domain, the consul server restarted failed to start since it detected an issue in RAFT database because mymachine.MYDC.mydomain and mymachine.mydc.mydomain had the same names.
Checking at registration time with case insensitivity should definitly fix those issues and avoid Consul DB corruption.
Since DNS is case insensitive and DB as issues when similar names with different cases are added, check for unicity based on case insensitivity. Following another big incident we had in our cluster, we also validate that adding/renaming a not does not conflicts with case insensitive matches. We had the following error once: - one node called: mymachine.MYDC.mydomain was shut off - another node (different ID) was added with name: mymachine.mydc.mydomain before 72 hours When restarting the consul server of domain, the consul server restarted failed to start since it detected an issue in RAFT database because mymachine.MYDC.mydomain and mymachine.mydc.mydomain had the same names. Checking at registration time with case insensitivity should definitly fix those issues and avoid Consul DB corruption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@mkeeler DONE in last commit (and improved, see commit message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Great Work @pierresouchay
@pierresouchay We had to revert this as we found a bunch of edge cases and test failures caused by it. In debugging them, we realised that this is actually a breaking change to the API even if we did fix the edge cases which means we can't merge it until next major release (1.3) anyway. We'll respond with more details in a bit, just in middle of getting 1.2.1 ready. |
@pierresouchay Now for the full details. We found a bunch of tests that were failing that went something like this:
Step 3 there was failing due to the else block you added (at our request unfortunately) that was ensuring we didn't add a node with an id when that node name was already taken. The thought was that other users could be using Consul in a way that could be similar to our tests with registering services for a node that consul is running on and flipping things back and forth between having an id and then not having one (and eventually having AE syncing issues). After more investigation current Consul behavior allows two things that we didn't want to change in a patch release:
The PR had removed the ability to add a UUID to a node without one. While I think this is correct behavior, it probably needed to wait until a 1.3 release. What I think we should do in the short term (1.2.2 probably) is to reduce the scope of this PR slightly. In order to preserve the behavior of being able to add a uuid to a node I think all that has to be done is to just remove the else block where there is no existing node with the new nodes uuid. Then longer term we can introduce other changes to prevent adding a uuid to a node without one (maybe optional unless an option is specified) and to prevent removing uuids from a node. I think the easiest option (and the one that gives you credit for the work in the git history) is for you to open a new PR based off master with almost all the changes besides the one mentioned above. |
@mkeeler ok, I think I understand. What about having the same exact patch, but with an option such as We had several incidents related to renaming:
Pierre Souchay |
@pierresouchay I think the behaviour for the cases where either the existing or the registering node have no ID are pretty debatable and need more thought regardless of the fact they are a breaking change. And I think your issue is serious enough that it should be prevented. The way I understand it, your patch only needs a slight change to allow it to prevent all the conditions you saw in production (assuming all your clients have IDs) without changing any of the questionable edge cases around nodes in state store or registration requests without IDs. I don't think it's a breaking change to allow rename where IDs are present in both request and state store node with matching name and that seems to fix your actual cases right? So if you made those changes to the PR (and added tests to verify that the behaviour when either incoming or existing node has a blank ID remains unchanged) I think we get a PR that fixes all your legit issues for everyone, and is not backward incompatible so can be released before 1.3. Or do I misunderstand? |
@banks Ok, I am gonna have a try to update a review with the option then |
@banks I think you and I are on the same page @pierresouchay The bit that the PR added extra that was problematic was that it preventing updating the node registration with an ID if it had been registered without an ID. Removing this hunk of code should be sufficient to get rid of any compatibility issues: https://github.com/pierresouchay/consul/blob/fecae3de21accbde40ded20a1a05be25b9e27787/agent/consul/state/catalog.go#L386-L391 If for some reason that extra hunk was providing some necessary protection then you may want to think about adding an option to turn that on/off with off being the default. |
…#4413 This change allow to rename any well behaving recent agent with an ID to be renamed safely, ie: without taking the name of another one with case insensitive comparison. Deprecated behaviour warning ---------------------------- Due to asceding compatibility, it is still possible however to "take" the name of another name by not providing any ID. Note that when not providing any ID, it is possible to have 2 nodes having similar names with case differences, ie: myNode and mynode which might lead to DB corruption on Consul server side and lead to server not properly restarting. See hashicorp#3983 and hashicorp#4399 for Context about this change. Disabling registration of nodes without IDs as specified in hashicorp#4414 should probably be the way to go eventually.
* Allow to rename nodes with IDs, will fix #3974 and #4413 This change allow to rename any well behaving recent agent with an ID to be renamed safely, ie: without taking the name of another one with case insensitive comparison. Deprecated behaviour warning ---------------------------- Due to asceding compatibility, it is still possible however to "take" the name of another name by not providing any ID. Note that when not providing any ID, it is possible to have 2 nodes having similar names with case differences, ie: myNode and mynode which might lead to DB corruption on Consul server side and lead to server not properly restarting. See #3983 and #4399 for Context about this change. Disabling registration of nodes without IDs as specified in #4414 should probably be the way to go eventually. * Removed the case-insensitive search when adding a node within the else block since it breaks the test TestAgentAntiEntropy_Services While the else case is probably legit, it will be fixed with #4414 in a later release. * Added again the test in the else to avoid duplicated names, but enforce this test only for nodes having IDs. Thus most tests without any ID will work, and allows us fixing * Added more tests regarding request with/without IDs. `TestStateStore_EnsureNode` now test registration and renaming with IDs `TestStateStore_EnsureNodeDeprecated` tests registration without IDs and tests removing an ID from a node as well as updated a node without its ID (deprecated behaviour kept for backwards compatibility) * Do not allow renaming in case of conflict, including when other node has no ID * Fixed function GetNodeID that was not working due to wrong type when searching node from its ID Thus, all tests about renaming were not working properly. Added the full test cas that allowed me to detect it. * Better error messages, more tests when nodeID is not a valid UUID in GetNodeID() * Added separate TestStateStore_GetNodeID to test GetNodeID. More complete test coverage for GetNodeID * Added new unit test `TestStateStore_ensureNoNodeWithSimilarNameTxn` Also fixed comments to be clearer after remarks from @banks * Fixed error message in unit test to match test case * Use uuid.ParseUUID to parse Node.ID as requested by @mkeeler
This is a proposal for being able to rename nodes when their IP does not change.
It should fix #3974