Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to rename nodes with IDs, will fix #3974 and #4413 #4415

Merged
merged 11 commits into from
Aug 10, 2018

Conversation

pierresouchay
Copy link
Contributor

This change allow to rename any well behaving recent agent with an
ID to be renamed safely, ie: without taking the name of another one
with case insensitive comparison.

Deprecated behaviour warning

Due to asceding compatibility, it is still possible however to
"take" the name of another name by not providing any ID.

Note that when not providing any ID, it is possible to have 2 nodes
having similar names with case differences, ie: myNode and mynode
which might lead to DB corruption on Consul server side and
lead to server not properly restarting.

See #3983 and #4399 for Context about this change.

Disabling registration of nodes without IDs as specified in #4414
should probably be the way to go eventually.

…#4413

This change allow to rename any well behaving recent agent with an
ID to be renamed safely, ie: without taking the name of another one
with case insensitive comparison.

Deprecated behaviour warning
----------------------------

Due to asceding compatibility, it is still possible however to
"take" the name of another name by not providing any ID.

Note that when not providing any ID, it is possible to have 2 nodes
having similar names with case differences, ie: myNode and mynode
which might lead to DB corruption on Consul server side and
lead to server not properly restarting.

See hashicorp#3983 and hashicorp#4399 for Context about this change.

Disabling registration of nodes without IDs as specified in hashicorp#4414
should probably be the way to go eventually.
@pierresouchay
Copy link
Contributor Author

pierresouchay commented Jul 18, 2018

@banks @mkeeler Ok, I implemented exactly what you said in #4399 (comment)

The only differences are (apart from being a single commit):

Hope you will be satisfied with this

Kind regards

block since it breaks the test TestAgentAntiEntropy_Services

While the else case is probably legit, it will be fixed with hashicorp#4414 in
a later release.
@pierresouchay
Copy link
Contributor Author

@mkeeler I had to remove the else test for case insensitive search in my second commit since it broke test TestAgentAntiEntropy_Services. While that's probably fine for renaming, it means it won't fix properly #4413 :(

Still some room for further PRs then

enforce this test only for nodes having IDs.

Thus most tests without any ID will work, and allows us fixing
@pierresouchay
Copy link
Contributor Author

@mkeeler I think I got it right now. I just added && enode.ID != "" to avoid collisions with nodes without IDs and all tests do pass.

So, it basically fixes #4413 and #3974 again !

`TestStateStore_EnsureNode` now test registration and renaming with IDs

`TestStateStore_EnsureNodeDeprecated` tests registration without IDs
and tests removing an ID from a node as well as updated a node
without its ID (deprecated behaviour kept for backwards compatibility)
node.ID, n.Node, node.Node)
}
}
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The else statement here is one of the problematic ones. I would recommend replacing with this statement.

}
// TODO: else statement missing to ensure name is not already taken by another node
// If added right now it could break existing behavior which allows for adding
// a UUID to a node that doesn't have one when the name is the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkeeler I think the else statement is now SAFE since s.ensureNoNodeWithSimilarNameTxn will not match any node without an ID. It means it will break ONLY if: a node having an ID clashes (case insensitive) with another node having an ID.

In all other cases, the behaviour is unchanged as shown in tests deprecatedEnsureNodeWithoutIDCanRegister of agent/consul/state/catalog_test.go.

What works:

When request has an ID:

  • renaming a node when having an ID, without colliding with the name of another node (case insensitive)
  • creating a node with an ID for which the name is not taken by anybody (case insensitive)

When request has not ID (more permissive regarding case insensitivity)

  • Create a new node without ID with name not colliding
  • Remove a node ID of a node with an ID if name EXACTLY (case sensitive) matches
  • Create a new node without ID with name colliding with case insensitive differences

What does not work anymore due to this change:

  • Create a node with an ID, but this name collides with another one (case insensitive) having an ID (works if other node DOES not have any ID)

Previously, given the following configuration

id42 - node1

Registering same name with different ID is blocked
id43 - node2 => previously would 'steal' node2 to id42 ==> now blocked
id43 - NoDe2 => previously would create a second node (id42: node2 ; id43: NoDe2) => the big crash I had described here #4413 => now the registration is blocked and id43 won't be able to join

Conclusion

  • A request with an ID can "steal" a node without any ID, but cannot steal a node with an ID if ID is different
  • A request without any ID can do whatever it wants

So, basically, the ONLY new possible issue we have is:

  • if you change manually the ID of a node, this node won't be able to join the cluster
  • if 2 nodes have different IDs and case insensitive differences, only ONE can join, which is good new for DNS as well as it avoid the DB corruption I had.

What do you think of this ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand now.

We will error out if both if an existing node with a uuid != new uuid where there is a name match but we will still allow adding a uuid to a node that doesn't have one. You just moved the logic to only enforce it when they both have uuids into that function.

return fmt.Errorf("node ID %q for node %q aliases existing node %q",
node.ID, node.Node, n.Node)
// Lets first get all nodes and check whether name do match
dupNameError := s.ensureNoNodeWithSimilarNameTxn(tx, node)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a bug here.

Because the name check only will match if the other node has a UUID and the UUID doesn't match as well, this will potentially allow merging two existing nodes into 1.

An example of this is if there are two existing nodes: node foo has a uuid and node bar does not. If you attempt to rename foo to bar it will not error out for the name match since the bar node has no uuid. It will delete node foo and then add the uuid to the bar node and update all its info. This is definitely not desired.

Adding a uuid to a node (the else case where we had no uuid match) is okay to allow for upgrading to uuids. However we need to prevent this case where we have two existing nodes and we merge them into one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkeeler Well spotted, I introduced this bug after I saw TestAgentAntiEntropy_Services not working => should be fixed in next patch

Renaming is not allowed when another node, including without nodeID exists in next patch.

@pierresouchay
Copy link
Contributor Author

@mkeeler Something is wrong... I saw by adding some tests for renaming. It seems that index ("nodes", "uuid") is not properly updated... I think indexing using "uuid" for nodes does not work well.

I'll keep you updated

@pierresouchay
Copy link
Contributor Author

@mkeeler The last patch fix an existing bug: GetNodeID() and similar lookups were not working properly since index uuid was UUIDIndex while lookups were performed on strings. Thus, we were never entering in the test for renaming.

This fixes both and include tests cases that includes all we discussed, this time should be the good one ! :-)

…searching node from its ID

Thus, all tests about renaming were not working properly.

Added the full test cas that allowed me to detect it.
@pierresouchay
Copy link
Contributor Author

@mkeeler the last patch explains why I never got the "aliases existing node" in prod when renaming nodes, basically, the feature was not working at all since lookup per node uuid did not work at all, what do you think about the change now?

@pierresouchay
Copy link
Contributor Author

Hello @banks & @mkeeler

In my last patches, I just improved the error messages when GetNodeId() get a wrong UUID and improved the test coverage for GetNodeId() function

So basically this review:

  • "node ID %q for node %q aliases existing node %q" => was never working due to wrong lookup by UUID since GetNodeID() nor the test for entering this condition was working=> always in the case of renaming anyway (and not protected against case insensitivity)
  • GetNodeID() is now working as expected (was not working due to indexation working on UUID binary representation while lookup performed on string
  • Changed the behavior as explained above: Allow to rename nodes with IDs, will fix #3974 and #4413 #4415 (comment)

I will wait for a review anyway now, I think we are good

@pierresouchay
Copy link
Contributor Author

Travis test are not stable once again :(

@pierresouchay pierresouchay force-pushed the node_renaming_simple branch 5 times, most recently from e602ccc to 4ec58fb Compare July 26, 2018 10:50
More complete test coverage for GetNodeID
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @pierresouchay I could be wrong still but this is getting pretty complicated to understand. I think one of the specific breaking changes from before is at-best untested here and potentially broken again (although the tests it would incidentally have broken before seem to pass for now).

Can we add an explicit test both to document the intent and to be convinced this is safe now? I tried to explain what I mean inline - hope it's clear enough!

for nodeIt := enodes.Next(); nodeIt != nil; nodeIt = enodes.Next() {
enode := nodeIt.(*structs.Node)
if strings.EqualFold(node.Node, enode.Node) && node.ID != enode.ID {
if !(enode.ID == "" && allowClashWithoutID) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this cover the case where the new node (node.ID == "") but clashing node isn't? That is specifically what broke tests before.

I'll read on and may well find a reason why that can't happen here but just noting. Even if that is a true invariant might be worth commenting because we want to allow either the existing or the new node to have blank ID without changing behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function does not, but basically yes since this function is called ONLY when node.ID != ""

In my next and (hopefuly) last patch, I did not change anything but added tests to this function.

To summarize

When ID is provided:

  • if the node already exists, you rename the node if no conflicts with another node
  • if node does not exists, check for conflicts, but ONLY for nodes having ID (thus we allow to "upgrade" a node without ID and add an ID)

When ID is not provided:

  • free lunch, you can "downgrade" a node by removing its ID - BUT ONLY if you have the exact same name as an existing node without ID (Kept for your tests and compatibility reasons)
  • you can create a node of course

So, it means this PR is fixing #4413 (not being able to have two different agents with with that differs only by their case) ONLY in clusters where everybody is playing nice and has an ID (which is your case in our clusters :-)

As long as you have agents that register without any ID, you are in danger, you may have node2 and NoDe2. But we keep all of this for the ascending compatibility with old clusters as well as your tests

}
}
// TODO: a else statement is missing here to check we are not stealing
// a similar case-insensitive name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment still accurate? I thought that is what this PR is meant to fix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am removing it in next patch.

if out.CreateIndex != 1 || out.ModifyIndex != 7 {
t.Fatalf("[DEPRECATED] bad CreateIndex/ModifyIndex returned: %#v", out)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no test for the reverse case which is behaviour our test suite (and possibly other clients) rely on:

  • A node is registered with an ID (by a real agent starting)
  • Later tooling re-registers the same node (say to update it's node meta) but doesn't bother to specify the ID field (because it never had to before)

The above should continue to work and replace the existing node with the new one without an id.

All that said, CI shows that the same tests are not broken right now but they were only reliably broken before due to another change we reverted. They could still be non-deterministically broken since timing was involved.

Specifically in tests like TestPrepaparedQuery_Execute:

  • The test agent starts up assigns itself a UUID and in the background Anti-entroypy eventually registers that with the server (normally in milliseconds but that is enough of a delay to make it non-deterministic...)
  • Then actual test code like this:
    if err := msgpackrpc.CallWithCodec(codec1, "Catalog.Register", &req, &reply); err != nil {
  • Re-registers the same node name to change it's health status. This registration happens not to have an ID so any of the following orderings can happen:
    • Agent AE gets there first so the existing node in state does have an ID while the incoming registration doesn't - this is the case that is missing an explicit test here I think
    • test code registers first with no ID and then the AE upgrades it to have an ID. Then subsequent code in setHealth registers again without an ID.

It's pretty clear that the test is racey anyway regardless of the ID thing, but this is why we think it's important that we don't break the case where and existing node DOES have an ID but a registration doesn't specify one. We can always fix this test and arguably should, but other people might have legitimately built something that does something similar to this test and we can't break them.

Is that clear?

Now it might be that your code is correct already and will always pass that test regardless of timing/interleaving - esp. given that it's not failing now. but I don't see an explicit test case here that verifies this behaviour and the logic is complicated enough I'm not sure I trust my self to reason about if it works alone 😄.

Could we add that?

Also fixed comments to be clearer after remarks from @banks
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Pierre, this is looking good. I still had to think really hard here and a minor tweak noted. I think Matt has another suggestion but looks like we are getting close so I'll approve my concerns here.

t.Fatalf("Should return an error since another name with similar name exists")
}
if err := s.ensureNoNodeWithSimilarNameTxn(tx, node, true); err != nil {
t.Fatalf("Should return an error since another name with similar name exists")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh I was about to write a whole other thing about how this still doesn't exercise the no-ID free lunch case we've been talking about but then I looked closer and spotted this final case has err != nil 😄 .

The message here though still asserts the opposite of the condition though which is what confused me - it should be "Should not return an error since re-registering an identical node name without an ID should not be broken" or something.

Other than that I think this is good now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@banks Thank you, fixed the error message in test case in next patch!

@pierresouchay
Copy link
Contributor Author

@mkeeler Do you feel comfortable with the PR now?

Kind Regards

@pierresouchay
Copy link
Contributor Author

@mkeeler Is the review Ok now?

We had 3 incidents (including one major one) to to the "renaming", this PR is really important to us.

Copy link
Member

@mkeeler mkeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the one remaining thing with using go-uuid to parse the UUID instead of custom code.

@@ -421,6 +463,32 @@ func (s *Store) GetNode(id string) (uint64, *structs.Node, error) {
return idx, nil, nil
}

func getNodeIDTxn(tx *memdb.Txn, id types.NodeID) (*structs.Node, error) {
strnode := string(id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The uuid parsing should be done with the ParseUUID function from github.com/hashicorp/go-uuid. An example of this is in the Catalog Register RPC endpoint: https://github.com/hashicorp/consul/blob/master/agent/consul/catalog_endpoint.go#L37

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE

@mkeeler
Copy link
Member

mkeeler commented Aug 3, 2018

Sorry about that @pierresouchay . Apparently I had done everything besides hit the "Submit Review" button on Monday

@pierresouchay
Copy link
Contributor Author

Sorry about that @pierresouchay . Apparently I had done everything besides hit the "Submit Review" button on Monday

@mkeeler No problem :-) The fix is DONE regarding the parsing of uuid anyway.

Have a great Week-End or Holidays

@mkeeler
Copy link
Member

mkeeler commented Aug 3, 2018

Before merging this I am going to pull down the branch and make sure all unit tests run as expected. Travis seems to always have different errors each time.

@pierresouchay
Copy link
Contributor Author

@mkeeler Sure... I am trying to fix Travis, but it takes time: #4472 :( almost there

@pierresouchay
Copy link
Contributor Author

@mkeeler Since #4472 has been merged, do you want me to rebase so we can see whether the tests do pass now?

@pierresouchay
Copy link
Contributor Author

@mkeeler any news on your tests?
That's currently the Consul bug that causes the most incidents in our clusters, we really would like to include this fix in next releases, that's would be great

@mkeeler
Copy link
Member

mkeeler commented Aug 10, 2018

@pierresouchay All tests pass locally. So I will go ahead and merge this.

@mkeeler mkeeler merged commit ef3b81a into hashicorp:master Aug 10, 2018
@pierresouchay
Copy link
Contributor Author

@mkeeler \o/ Thank you so much

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants