Add Managing Control Plane machines proposal #292
I think what we do intend to support should be moved to goals, namely "simplify disaster recovery of a cluster that has lost quorum" (and then later describe what we do offer).
I would like to see this in scope as well. This is the critical scenario where we have lost etcd and the cluster is broken. We hit this with an OSD cluster last weekend: 2 masters offline and quorum lost. Guarding against this with the other goals noted in this enhancement is an improvement, but without the ability to recover from lost quorum there is still a huge risk carried in supporting clusters.
This only affects scaling and ensuring compute resources. Any operational aspect of etcd recovery should be handled orthogonally by the etcd operator.
I think this should be automatic by the platform instead of install: install creates instances, and machine sets are created by a more core operator that respects infrastructure.spec.managedMasters / masters. That supports customers upgrading.
I think it’d be very complex to materialise an opinionated expected state for something like infra.spec.managedMasters in existing clusters.
There are no strong guarantees about cluster topology, and no safe assumptions we can make about already existing clusters/environments, for a flag to instantiate and manage these objects. I’m concerned this would create more rough edges than it would solve.
I think we should make no distinction from workers. I don’t think we should have a semantic for managedMasters, the same way we don’t have one for managedWorkers. MachineSets are building blocks for compute resources.
Previously there was a limitation in etcd that prevented master compute resources from being machineSets. Now that’s fixed at the right layer and shipped on upgrades (etcd operator).
So we just want all compute resources to be machineSets out of the box. As for existing clusters, as an admin I now know that creating machineSets for masters is safe, so it’s my recommended choice, as for any other machine.
The installer is just defaulting the initial HA topology for all the machines in a cluster. MachineSets are nothing more than that. They could actually supersede the current master machine objects and terraform code in the installer, just like we do for workers (but we want to keep that step separate from this proposal to maintain scope).
To mitigate these concerns and relieve the user of the upgrade burden, I introduced a controlPlane CRD and controller managed by the Machine API Operator. Opting in for existing clusters requires no more than instantiating this object.
This also gives us more control to limit the underlying machineSet capabilities, preventing an undesirably high degree of flexibility while reserving the ability to relax the limits later.
See 5ef75fd
PTAL @smarterclayton
I think it would be good to detail a reasonably complete test plan for each of these (for instance, alay already owns a test for unhealthy node emulation; describe the scenario here so we can review it).
Done