WIP: Add proposal for API to automatically spread MachineSets across AZs. #483
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dgoodwin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Rough start on this; I need a lot of collaboration with the cloud team to fill it out. Only the first few sections are worked on so far, just to get the discussion going.
Left some initial thoughts. I think we should investigate whether keeping the AZ spread within MachineSets is viable before we start work on a new CRD and controller. I will bring this topic up in the cloud team architecture call next week to discuss.
### Implementation Details/Notes/Constraints [optional]

Implementation would require a new CRD and controller, hopefully aligned with a pre-existing operator in OCP.
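For illustration only, here is a minimal sketch of what the Go types for such a CRD could look like. The kind, group, and field names below are assumptions made for discussion, not a settled API:

```go
// Hypothetical API sketch for discussion only; the kind, group, and field
// names are assumptions, not part of this proposal's final design.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ZonedMachinePoolSpec describes a pool of workers that a controller would
// fan out into one MachineSet per availability zone.
type ZonedMachinePoolSpec struct {
	// Replicas is the total number of machines, spread across zones.
	Replicas int32 `json:"replicas"`

	// InstanceType is the cloud instance type for all machines in the pool.
	InstanceType string `json:"instanceType"`

	// Zones optionally restricts the pool to specific availability zones.
	// If empty, the controller would discover the zones for the region.
	Zones []string `json:"zones,omitempty"`

	// Labels are applied to the resulting Nodes.
	Labels map[string]string `json:"labels,omitempty"`
}

// ZonedMachinePool is the top-level object a customer would create instead
// of hand-writing one MachineSet per zone.
type ZonedMachinePool struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec ZonedMachinePoolSpec `json:"spec,omitempty"`
}
```

All provider-specific detail (AMI, subnets, security groups, userData) would be left to the controller and the existing defaulting behavior rather than exposed here.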
There is upstream discussion about adding this first part into MachineSets (kubernetes-sigs/cluster-api#3358). Do we want to explore that as an alternative before we make a decision on implementing a new CRD?
I would 100% prefer this; it's dramatically simpler. If there's a chance of that, it would be great.
I need to walk back my comment above. After discussing with my team, this would not really help us with either use case, as a MachineSet still requires substantial, non-trivial, and delicate settings that vary from OpenShift release to release. This makes it difficult for customers to just add their own, and for Hive to add them for customers and stay in sync with what the product expects.
Examples: the AMI ID; many fields with the infra ID embedded in the name (label selectors, iamInstanceProfile, security group filters, subnet, tags); and userDataSecret (its name changed twice during the 4.6 cycle, once to a new name and then back, which nearly caused a 4.6 blocker this week because OSD could not install on GCP as a result).
We still think the best approach here is an API that lets customers easily add new pools of machines.
Not sure if it helps or not, but we added a number of default options for MachineSets in the 4.6 cycle, which has drastically reduced the number of options that have to be specified (the defaulting webhook will fill them in on create anyway).
My concern with creating an extra CRD that creates MachineSets is that it will still need these kinds of options, won't it? E.g. what if a customer wants to change something such as the subnet? Or will this just be a convenience with a limited set of options?
Could the defaulting webhook cover everything I mentioned above? In conjunction with spreading over AZs, that might be workable. I will run this past the team.
No, the machine pool API would not expose those settings; it's meant to be a dead simple interface, as it is today in Hive and the InstallConfig. If you need control over subnets etc., you can go straight to MachineSet.
The code for spreading MachineSets across AZs [already exists today](https://github.com/openshift/installer/blob/master/pkg/asset/machines/aws/machinesets.go) and could be removed from the installer, which would then just pass through an instance of the new API.

The [Hive MachinePool API](https://github.com/openshift/hive/blob/master/pkg/apis/hive/v1/machinepool_types.go) also includes support for enabling auto-scaling on the resulting MachineSets, and this functionality should be carried over to the new API even if it is not initially supported by the Installer.
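As a rough illustration of the spreading logic referenced above (this is a sketch, not the installer's actual code), the total replica count could be divided evenly across zones, with any remainder going to the zones listed first:

```go
package main

import "fmt"

// spreadReplicas splits a total replica count across the given zones,
// giving each zone an equal share and handing any remainder to the zones
// listed first. This only illustrates the general idea behind the
// installer's per-AZ MachineSet generation.
func spreadReplicas(total int, zones []string) map[string]int {
	out := make(map[string]int, len(zones))
	if len(zones) == 0 {
		return out
	}
	base := total / len(zones)
	extra := total % len(zones)
	for i, z := range zones {
		out[z] = base
		if i < extra {
			out[z]++
		}
	}
	return out
}

func main() {
	// Example: 5 replicas over 3 zones -> 1a:2, 1b:2, 1c:1.
	fmt.Println(spreadReplicas(5, []string{"us-east-1a", "us-east-1b", "us-east-1c"}))
}
```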
This seems like a tangential issue to the main aim of the proposal. Do we want to include it here, or can we drop it for now and narrow the scope to just the multi-AZ support?
It somewhat depends on the outcome here. If we end up with upstream cluster-api MachineSets supporting multiple AZs, Hive likely then just has to sync one stable MachineSet and one stable autoscaler config, which seems OK.
However, if that is not the case and we do end up with a new MachinePool API of some form, then Hive would be syncing a single MachinePool but would need to introspect what was done in order to sync a per-MachineSet autoscaling config. If we have a higher-level construct in-cluster, IMO it needs to encompass the autoscaling config to apply to what it generates.
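To make the per-MachineSet autoscaling point concrete, here is a rough sketch of how a higher-level controller might stamp out one MachineAutoscaler per generated MachineSet. The group/version and spec field names are assumed from the OCP cluster-autoscaler-operator and should be verified, and the namespace and naming convention are purely illustrative:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// machineAutoscalerFor builds one MachineAutoscaler object targeting the
// named MachineSet. A pool controller would create one of these per
// MachineSet it generates when autoscaling is enabled on the pool.
// The apiVersion and spec fields are assumptions based on the OCP
// cluster-autoscaler-operator and are not guaranteed to be exact.
func machineAutoscalerFor(machineSetName, namespace string, min, max int64) *unstructured.Unstructured {
	u := &unstructured.Unstructured{}
	u.SetAPIVersion("autoscaling.openshift.io/v1beta1")
	u.SetKind("MachineAutoscaler")
	u.SetName(machineSetName)
	u.SetNamespace(namespace)
	u.Object["spec"] = map[string]interface{}{
		"minReplicas": min,
		"maxReplicas": max,
		"scaleTargetRef": map[string]interface{}{
			"apiVersion": "machine.openshift.io/v1beta1",
			"kind":       "MachineSet",
			"name":       machineSetName,
		},
	}
	return u
}

func main() {
	// Example: a pool that fanned out into two zonal MachineSets.
	for _, ms := range []string{"worker-us-east-1a", "worker-us-east-1b"} {
		ma := machineAutoscalerFor(ms, "openshift-machine-api", 1, 4)
		fmt.Println(ma.GetName(), ma.Object["spec"])
	}
}
```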
This is where to call out areas of the design that require closure before deciding to implement the design. For instance,

> 1. The upstream cluster-api project has begun using the MachinePool name for an API that is backed by native cloud provider auto-scaling groups (ASGs, MIGs). However, we are already using this term in both the Installer and Hive. Is it acceptable to push this naming in-cluster even if it may one day overlap with a cluster-api type in a different API group?
I would prefer we come up with a new name for this, as I would expect that at some point we will adopt MachinePools from the cluster-api project, and I think having two resources in different API groups which do rather different things would be confusing for users.
Yeah, this is really unfortunate. MachinePool is a thing in the installer InstallConfig and in Hive.
We can work with a new API name, though it's going to be confusing for a long time to come.
> 1. How does this feature relate to MachineDeployments?
> 1. Do we have an existing operator which would be a good home for this controller?
I would suggest the MAO; however, it's getting a bit bloated lately 🤔
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close
@openshift-bot: Closed this PR in response to the /close above.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.