Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow changing Node names since Node now have IDs #3983

Merged
merged 3 commits into from
Jul 11, 2018

Conversation

pierresouchay
Copy link
Contributor

This is a proposal for being able to rename nodes when their IP does not change.

It should fix #3974

@pierresouchay
Copy link
Contributor Author

I basically removed the IP address limitation since node ID is supposed to be immutable, furthermore it is very likely that a nodes does change its IPs as well as its name at the same time for operation reasons (changing node's domain for instance)

@pierresouchay pierresouchay changed the title Added support for renaming nodes when their IP does not change Allow changing Node names since Node now have IDs Jun 1, 2018
@kamaradclimber
Copy link
Contributor

Any news on that PR?

Copy link
Member

@mkeeler mkeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the one concern. If you can address that then I think we can merge this.

return fmt.Errorf("node ID %q for node %q aliases existing node %q",
node.ID, node.Node, n.Node)
// We are actually renaming a node, remove its reference first
err := s.deleteNodeTxn(tx, idx, n.Node)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pierresouchay One concern we have is that this will allow multiple nodes to use the same name with different UUIDs. This might break some things and at the very least could be confusing with regards to metrics and labels.

I think this could be remedied by refactoring this block of code and the lookup by name right after this. Essentially we would want to only delete the node when it has an id, the name is different and the name is not already associated with another node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkeeler Great, I made the fix, I also fix another issue regarding case insensitivity that lead us to a major downtime of Consul.

Since DNS is case insensitive and DB as issues when similar names with different cases are added, check for unicity based on case insensitivity.

Following another big incident we had in our cluster, we also validate that adding/renaming a not does not conflicts with case insensitive matches.

We had the following error once:

  • one node called: mymachine.MYDC.mydomain was shut off
  • another node (different ID) was added with name: mymachine.mydc.mydomain before 72 hours

When restarting the consul server of domain, the consul server restarted failed to start since it detected an issue in RAFT database because mymachine.MYDC.mydomain and mymachine.mydc.mydomain had the same names.

Checking at registration time with case insensitivity should definitly fix those issues and avoid Consul DB corruption.

Since DNS is case insensitive and DB as issues when similar names with different
cases are added, check for unicity based on case insensitivity.

Following another big incident we had in our cluster, we also validate
that adding/renaming a not does not conflicts with case insensitive
matches.

We had the following error once:

 - one node called: mymachine.MYDC.mydomain was shut off
 - another node (different ID) was added with name: mymachine.mydc.mydomain before
   72 hours

When restarting the consul server of domain, the consul server restarted failed
to start since it detected an issue in RAFT database because
mymachine.MYDC.mydomain and mymachine.mydc.mydomain had the same names.

Checking at registration time with case insensitivity should definitly fix
those issues and avoid Consul DB corruption.
Copy link
Contributor

@yfouquet yfouquet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pierresouchay
Copy link
Contributor Author

Just the one concern. If you can address that then I think we can merge this.

@mkeeler DONE in last commit (and improved, see commit message)

Copy link
Member

@mkeeler mkeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Great Work @pierresouchay

@mkeeler mkeeler merged commit 98ead2a into hashicorp:master Jul 11, 2018
mkeeler added a commit that referenced this pull request Jul 12, 2018
This reverts commit 98ead2a, reversing
changes made to f8dcff8.
@banks
Copy link
Member

banks commented Jul 12, 2018

@pierresouchay We had to revert this as we found a bunch of edge cases and test failures caused by it.

In debugging them, we realised that this is actually a breaking change to the API even if we did fix the edge cases which means we can't merge it until next major release (1.3) anyway.

We'll respond with more details in a bit, just in middle of getting 1.2.1 ready.

@mkeeler
Copy link
Member

mkeeler commented Jul 13, 2018

@pierresouchay Now for the full details.

We found a bunch of tests that were failing that went something like this:

  1. Create a test agent (which generates a uuid for the node)
  2. Issue a Catalog.Register request without the UUID (this removes the UUID from the node)
  3. Later on do an anti-entropy state sync which includes the nodes UUID in an internal Catalog.Register RPC call.

Step 3 there was failing due to the else block you added (at our request unfortunately) that was ensuring we didn't add a node with an id when that node name was already taken.

The thought was that other users could be using Consul in a way that could be similar to our tests with registering services for a node that consul is running on and flipping things back and forth between having an id and then not having one (and eventually having AE syncing issues).

After more investigation current Consul behavior allows two things that we didn't want to change in a patch release:

  1. Removing a UUID from a node. This really shouldn't be allowed but it may be relied upon by other users accidentally (just like our tests)
  2. Adding a UUID to a node without one.

The PR had removed the ability to add a UUID to a node without one. While I think this is correct behavior, it probably needed to wait until a 1.3 release.

What I think we should do in the short term (1.2.2 probably) is to reduce the scope of this PR slightly. In order to preserve the behavior of being able to add a uuid to a node I think all that has to be done is to just remove the else block where there is no existing node with the new nodes uuid. Then longer term we can introduce other changes to prevent adding a uuid to a node without one (maybe optional unless an option is specified) and to prevent removing uuids from a node.

I think the easiest option (and the one that gives you credit for the work in the git history) is for you to open a new PR based off master with almost all the changes besides the one mentioned above.

@pierresouchay
Copy link
Contributor Author

@mkeeler ok, I think I understand.

What about having the same exact patch, but with an option such as allow_node_renaming defaulting to false for now (basically keeping old behaviour - maybe defaulting to true in later versions), and having the same exact behavior as my PR when set to true?

We had several incidents related to renaming:

  • several times, some minor ones (only nodes where impacted, but when those are your main DNS servers, it is not funny)
  • One very major as described in the commit message or node renaming, basically, one occurence of 0.7.3->0.8.4 upgrade leads to protocol version (2) is incompatible: [1, 0] #3217 ( => which forces us to restart each Consul server sequentially to fix it - the only workaround known to work in our experience) that did cause a database corruption on Consul servers (basically, upper/lower case in naming of 2 nodes alive at the same time) => we were unable to respawn our cluster server for several very long minutes (the leader was down, the other ones did not want to take lead due to DB corruption)

Pierre Souchay

@banks
Copy link
Member

banks commented Jul 13, 2018

@pierresouchay I think the behaviour for the cases where either the existing or the registering node have no ID are pretty debatable and need more thought regardless of the fact they are a breaking change. And I think your issue is serious enough that it should be prevented.

The way I understand it, your patch only needs a slight change to allow it to prevent all the conditions you saw in production (assuming all your clients have IDs) without changing any of the questionable edge cases around nodes in state store or registration requests without IDs. I don't think it's a breaking change to allow rename where IDs are present in both request and state store node with matching name and that seems to fix your actual cases right?

So if you made those changes to the PR (and added tests to verify that the behaviour when either incoming or existing node has a blank ID remains unchanged) I think we get a PR that fixes all your legit issues for everyone, and is not backward incompatible so can be released before 1.3.

Or do I misunderstand?

@pierresouchay
Copy link
Contributor Author

@banks Ok, I am gonna have a try to update a review with the option then

@mkeeler
Copy link
Member

mkeeler commented Jul 16, 2018

@banks I think you and I are on the same page

@pierresouchay
Allowing to rename a node when the ID matches AND the name isn't used elsewhere is safe and is something we can put in today for the next patch release. So I dont think we need a new option here.

The bit that the PR added extra that was problematic was that it preventing updating the node registration with an ID if it had been registered without an ID.

Removing this hunk of code should be sufficient to get rid of any compatibility issues: https://github.com/pierresouchay/consul/blob/fecae3de21accbde40ded20a1a05be25b9e27787/agent/consul/state/catalog.go#L386-L391

If for some reason that extra hunk was providing some necessary protection then you may want to think about adding an option to turn that on/off with off being the default.

@pierresouchay
Copy link
Contributor Author

@mkeeler @banks I made another try :) #4399

The PR is far more complicated since there were no options in Store, but it might be useful in the future anyway.

I kept the exact same semantics as it used to be, but you are free to activate it in a future Consul release by default

pierresouchay added a commit to pierresouchay/consul that referenced this pull request Jul 18, 2018
…#4413

This change allow to rename any well behaving recent agent with an
ID to be renamed safely, ie: without taking the name of another one
with case insensitive comparison.

Deprecated behaviour warning
----------------------------

Due to asceding compatibility, it is still possible however to
"take" the name of another name by not providing any ID.

Note that when not providing any ID, it is possible to have 2 nodes
having similar names with case differences, ie: myNode and mynode
which might lead to DB corruption on Consul server side and
lead to server not properly restarting.

See hashicorp#3983 and hashicorp#4399 for Context about this change.

Disabling registration of nodes without IDs as specified in hashicorp#4414
should probably be the way to go eventually.
mkeeler pushed a commit that referenced this pull request Aug 10, 2018
* Allow to rename nodes with IDs, will fix #3974 and #4413

This change allow to rename any well behaving recent agent with an
ID to be renamed safely, ie: without taking the name of another one
with case insensitive comparison.

Deprecated behaviour warning
----------------------------

Due to asceding compatibility, it is still possible however to
"take" the name of another name by not providing any ID.

Note that when not providing any ID, it is possible to have 2 nodes
having similar names with case differences, ie: myNode and mynode
which might lead to DB corruption on Consul server side and
lead to server not properly restarting.

See #3983 and #4399 for Context about this change.

Disabling registration of nodes without IDs as specified in #4414
should probably be the way to go eventually.

* Removed the case-insensitive search when adding a node within the else
block since it breaks the test TestAgentAntiEntropy_Services

While the else case is probably legit, it will be fixed with #4414 in
a later release.

* Added again the test in the else to avoid duplicated names, but
enforce this test only for nodes having IDs.

Thus most tests without any ID will work, and allows us fixing

* Added more tests regarding request with/without IDs.

`TestStateStore_EnsureNode` now test registration and renaming with IDs

`TestStateStore_EnsureNodeDeprecated` tests registration without IDs
and tests removing an ID from a node as well as updated a node
without its ID (deprecated behaviour kept for backwards compatibility)

* Do not allow renaming in case of conflict, including when other node has no ID

* Fixed function GetNodeID that was not working due to wrong type when searching node from its ID

Thus, all tests about renaming were not working properly.

Added the full test cas that allowed me to detect it.

* Better error messages, more tests when nodeID is not a valid UUID in GetNodeID()

* Added separate TestStateStore_GetNodeID to test GetNodeID.

More complete test coverage for GetNodeID

* Added new unit test `TestStateStore_ensureNoNodeWithSimilarNameTxn`

Also fixed comments to be clearer after remarks from @banks

* Fixed error message in unit test to match test case

* Use uuid.ParseUUID to parse Node.ID as requested by @mkeeler
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Conflicts appears when changing node_name on agents
5 participants