
WIP: Add proposal for API to automatically spread MachineSets across AZs. #483

Closed
wants to merge 1 commit into from

Conversation

dgoodwin
Contributor

No description provided.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 21, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 21, 2020
@dgoodwin
Contributor Author

This is a rough start; it needs a lot of collaboration with the cloud team to fill it out. Only the first few sections are worked on so far, just to get the discussion going.

Contributor

@JoelSpeed JoelSpeed left a comment


Left some initial thoughts. I think we should investigate whether keeping the AZ spread within MachineSets is viable before we start work on a new CRD and controller. I will bring the topic up in the cloud team architecture call next week to discuss.


### Implementation Details/Notes/Constraints [optional]

Implementation would require a new CRD and controller, hopefully aligned with a pre-existing operator in OCP.
Contributor


There is upstream discussion about adding this first part into MachineSets (kubernetes-sigs/cluster-api#3358). Do we want to explore this as an alternative before we make a decision on implementing a new CRD?

Contributor Author


I would 100% prefer this; it's dramatically simpler. If there's a chance of that, it would be great.

Contributor Author


I need to walk back my comment above. After discussing with my team, this would not really help us with either use case, as MachineSet still requires substantial, non-trivial, and delicate settings that vary from OpenShift release to release. That makes it difficult for customers to just add their own, and for Hive to add them for customers and stay in sync with what the product expects.

Examples: the AMI ID; many fields with the infra ID embedded in the name (label selectors, iamInstanceProfile, security group filters, subnet, tags); and userDataSecret (the name changed twice during the 4.6 cycle, once to a new value and then back, which nearly caused a 4.6 blocker this week because OSD could not install on GCP as a result).

We still think the best approach here is an API that lets customers easily add new pools of machines.
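
For illustration only, here is a rough sketch (not the actual installer or Hive code) of how many of those MachineSet fields end up derived from the cluster's infra ID; the specific names and field paths below are assumptions:

```go
package main

import "fmt"

// workerMachineSetName roughly mirrors the installer-style convention of
// naming worker MachineSets "<infraID>-worker-<zone>".
func workerMachineSetName(infraID, zone string) string {
	return fmt.Sprintf("%s-worker-%s", infraID, zone)
}

// infraDerivedFields sketches the kinds of values that are all coupled to
// the cluster's infra ID and so must be regenerated per cluster and kept
// in sync with what each OpenShift release expects. The field paths here
// are illustrative assumptions, not the real provider-spec schema.
func infraDerivedFields(infraID, zone string) map[string]string {
	return map[string]string{
		"metadata.name":                   workerMachineSetName(infraID, zone),
		"spec.selector cluster label":     infraID,
		"providerSpec.iamInstanceProfile": infraID + "-worker-profile",
		"providerSpec.securityGroups":     infraID + "-worker-sg",
		"providerSpec.subnet":             fmt.Sprintf("%s-private-%s", infraID, zone),
		"providerSpec.tags (cluster tag)": "kubernetes.io/cluster/" + infraID,
	}
}

func main() {
	for path, value := range infraDerivedFields("mycluster-x7k2p", "us-east-1a") {
		fmt.Printf("%-35s %s\n", path, value)
	}
}
```

Every one of these has to be stamped out per cluster (and per zone), which is the coupling that makes hand-maintained MachineSets fragile across releases.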

Contributor


Not sure if it helps, but we added a number of default options for MachineSets in the 4.6 cycle, which has drastically reduced the number of options that have to be specified (the defaulting webhook will fill them in on create anyway).

My concern with creating an extra CRD that creates MachineSets is that it will still need to have these kinds of options, won't it? E.g. what if a customer wants to change something such as the subnet? Or will this just be a convenience with a limited set of options?
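
For context on the defaulting behavior mentioned here, a minimal sketch of what create-time defaulting does conceptually; the struct and field names are assumptions for illustration, not the actual machine-api types or webhook:

```go
package main

import "fmt"

// awsProviderSpec is a cut-down, illustrative stand-in for an AWS provider
// spec; the fields are assumptions for this sketch only.
type awsProviderSpec struct {
	AMI                string
	InstanceType       string
	IAMInstanceProfile string
	UserDataSecret     string
}

// applyDefaults sketches the idea of create-time defaulting: any field the
// user leaves empty is filled in from cluster/release defaults, so a
// hand-written MachineSet can omit it.
func applyDefaults(spec *awsProviderSpec, infraID, releaseAMI string) {
	if spec.AMI == "" {
		spec.AMI = releaseAMI // per-release RHCOS image
	}
	if spec.InstanceType == "" {
		spec.InstanceType = "m5.large"
	}
	if spec.IAMInstanceProfile == "" {
		spec.IAMInstanceProfile = infraID + "-worker-profile"
	}
	if spec.UserDataSecret == "" {
		spec.UserDataSecret = "worker-user-data"
	}
}

func main() {
	spec := awsProviderSpec{InstanceType: "m5.xlarge"} // user only sets the size
	applyDefaults(&spec, "mycluster-x7k2p", "ami-0123456789abcdef0")
	fmt.Printf("%+v\n", spec)
}
```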

Contributor Author


Could the default hook cover everything I mentioned above? In conjunction with spreading over AZs, that might be workable. I will run this past the team.

No, the machine pool API would not expose those settings; it's meant to be a dead-simple interface, as it is today in Hive and the InstallConfig. If you need control over subnets, etc., you can go straight to MachineSet.
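
As a hypothetical sketch of the kind of "dead simple" interface being described, loosely modeled on the Hive/InstallConfig MachinePool (these type and field names are illustrative, not an actual proposed API):

```go
package main

import "fmt"

// machinePoolSpec is a hypothetical, cut-down sketch of a simple pool
// definition; it deliberately exposes only the high-level knobs.
type machinePoolSpec struct {
	Name         string   // pool name, e.g. "worker" or "infra"
	Replicas     int      // total replicas, spread across zones
	InstanceType string   // cloud instance size, e.g. "m5.xlarge"
	Zones        []string // optional; empty could mean "all discovered AZs"
}

func main() {
	// A controller would expand this into one MachineSet per zone, filling
	// in the infra-ID- and release-specific details itself.
	pool := machinePoolSpec{Name: "infra", Replicas: 3, InstanceType: "m5.xlarge"}
	fmt.Printf("%+v\n", pool)
}
```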


The code for spreading MachineSets across AZs [already exists today](https://github.com/openshift/installer/blob/master/pkg/asset/machines/aws/machinesets.go) and could be removed from the installer, which would then just pass through an instance of the new API.

The [Hive MachinePool API](https://github.com/openshift/hive/blob/master/pkg/apis/hive/v1/machinepool_types.go) also includes support for enabling auto-scaling on the resulting MachineSets, and this functionality should be carried over to the new API even if not initially supported by the Installer.
Contributor


This seems like a tangential issue to the main aim of the proposal. Do we want to include it here, or can we drop it for now and narrow the scope to just the multi-AZ support?

Contributor Author


It somewhat depends on the outcome here. If we end up with upstream cluster-api MachineSets supporting multiple AZs, Hive likely then just has to sync one stable MachineSet and one stable autoscaler config, which seems OK.

However, if that is not the case and we do end up with a new MachinePool API of some form, then Hive would be syncing a single MachinePool but would need to introspect what was generated in order to sync a per-MachineSet autoscaling config. If we've got a higher-level construct in-cluster, IMO it needs to encompass the autoscaling config to apply to what it generates.
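
To make the generation step concrete, a minimal sketch (an assumption-laden illustration, not the installer's actual code) of spreading a pool's total replicas across zones; a controller doing this could also emit one MachineAutoscaler per generated MachineSet when autoscaling is enabled:

```go
package main

import "fmt"

// spreadReplicas splits total replicas across zones as evenly as possible,
// giving the remainder to the earliest zones (the same general approach the
// installer takes when stamping out one MachineSet per AZ).
func spreadReplicas(total int, zones []string) map[string]int {
	out := make(map[string]int, len(zones))
	if len(zones) == 0 {
		return out
	}
	base, extra := total/len(zones), total%len(zones)
	for i, zone := range zones {
		n := base
		if i < extra {
			n++
		}
		out[zone] = n
	}
	return out
}

func main() {
	infraID := "mycluster-x7k2p" // hypothetical infra ID
	zones := []string{"us-east-1a", "us-east-1b", "us-east-1c"}
	for zone, replicas := range spreadReplicas(4, zones) {
		// One MachineSet per zone; with autoscaling enabled, one
		// MachineAutoscaler per MachineSet could split min/max similarly.
		fmt.Printf("%s-worker-%s: replicas=%d\n", infraID, zone, replicas)
	}
}
```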


This is where to call out areas of the design that require closure before deciding
to implement the design. For instance,
> 1. Upstream cluster-api project has begun using the MachinePool naming for an API that is backed by native cloud provider auto-scaling groups. (ASGs, MIGs) We however are already using this term in both the Installer and Hive. Is it acceptable to push this naming in cluster even if it may one day overlap with a cluster-api type in a different apigroup?
Contributor


I would prefer we come up with a new name for this, as I would expect that at some point we will adopt MachinePools from the cluster-api project, and I think having two resources in different API groups that do rather different things would be confusing for users.

Contributor Author


Yeah, this is really unfortunate. MachinePool is a thing in the installer's InstallConfig and in Hive.

We can work with a new API name, though it's going to be confusing for a long time to come.

to implement the design. For instance,
> 1. Upstream cluster-api project has begun using the MachinePool naming for an API that is backed by native cloud provider auto-scaling groups. (ASGs, MIGs) We however are already using this term in both the Installer and Hive. Is it acceptable to push this naming in cluster even if it may one day overlap with a cluster-api type in a different apigroup?
> 1. How does this feature relate to MachineDeployments?
> 1. Do we have an existing operator which would be a good home for this controller?
Contributor


I would suggest the MAO; however, it's getting a bit bloated lately 🤔

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 7, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 6, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
