Continue to push assignment updates to nodes that were removed from the list #419
When we shrink a cluster and remove nodes from the coordinator config files, we stop the NodeController so that we don't react to errors coming from a failed node.
The problem is that some of these nodes might still be online and used by clients. For example, they might still be marked as "ready" by K8S and still be serving the assignments dispatch to clients.
If the coordinator node controller stops, the node will not receive any update on new leader elections, and if a client is connected to an old (removed) node, it will still operate based on the old leader assignment.
Modifications
When a node is removed, we leave the NodeController running, but we change its state to Draining. When this node stops responding to health checks, the node controller will not retry and will instead clean up the removed node completely.
This makes sure that nodes removed from the coordinator stay up to date with the current assignments until the moment they are finally shut down.
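The lifecycle described above can be sketched roughly as follows. This is a simplified illustration and not the code from this PR: the `Coordinator` and `NodeController` types, their fields, and the method names `RemoveNode`, `PushAssignments` and `OnHealthCheckFailed` are all hypothetical, and the real controller runs asynchronously over the network rather than in-process.

```go
package main

import "fmt"

// NodeState is a hypothetical lifecycle state for a per-node controller.
type NodeState int

const (
	Running  NodeState = iota // node is part of the configured cluster
	Draining                  // node was removed from the config but may still serve clients
)

// NodeController is a stand-in for the coordinator's per-node controller.
// Received records the assignment updates pushed to the node.
type NodeController struct {
	Addr     string
	State    NodeState
	Received []string
}

// Coordinator keeps one controller per known node (assumed structure).
type Coordinator struct {
	controllers map[string]*NodeController
}

func NewCoordinator(addrs ...string) *Coordinator {
	c := &Coordinator{controllers: make(map[string]*NodeController)}
	for _, a := range addrs {
		c.controllers[a] = &NodeController{Addr: a, State: Running}
	}
	return c
}

// RemoveNode marks the node as Draining instead of stopping its controller.
func (c *Coordinator) RemoveNode(addr string) {
	if nc, ok := c.controllers[addr]; ok {
		nc.State = Draining
	}
}

// PushAssignments sends the update to every controller, including draining ones.
func (c *Coordinator) PushAssignments(assignments string) {
	for _, nc := range c.controllers {
		nc.Received = append(nc.Received, assignments)
	}
}

// OnHealthCheckFailed reports whether the controller should keep retrying.
// A draining node is cleaned up on its first failure; a running node is retried.
func (c *Coordinator) OnHealthCheckFailed(addr string) (retry bool) {
	nc, ok := c.controllers[addr]
	if !ok {
		return false
	}
	if nc.State == Draining {
		delete(c.controllers, addr)
		return false
	}
	return true
}

func main() {
	c := NewCoordinator("node-1", "node-2")
	c.RemoveNode("node-2")
	c.PushAssignments("shard-0 leader=node-1")

	fmt.Println("updates seen by removed node:", len(c.controllers["node-2"].Received))
	fmt.Println("retry after failed health check:", c.OnHealthCheckFailed("node-2"))
	fmt.Println("controllers left:", len(c.controllers))
}
```

In this sketch the removed node keeps receiving assignment updates for as long as it answers health checks, and the first failed health check after removal is what triggers the final cleanup.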