Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for spot instances #1876

Closed
enxebre opened this issue Dec 11, 2019 · 40 comments
Closed

Support for spot instances #1876

enxebre opened this issue Dec 11, 2019 · 40 comments
Assignees
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Milestone

Comments

@enxebre
Copy link
Member

enxebre commented Dec 11, 2019

User Story

As a user I would like to leverage the cluster-api and particularly machines to run "spot" instances so I can run specific workloads on cheaper instances.

Detailed Description
This is to get started a discussion for possible paths forward for supporting spot instances.

Spot instances (aka preemptible vms in gcp or low-priority vms in azure) have different API exposure and different behavioural peculiarities for each cloud provider.
We'd need to come up with an API/behavioural cross provider abstraction that we want to expose and see if that fits with the existing machine/machineSet/machineDeployment abstraction or otherwise a new resource abstraction in addition should emerge. E.g:
Do we want to support the concept of fleets as in aws? Do we only support individual requests?
What do we do when a spot instance is terminated? Do we let the spot request keep trying?

  • As a very rough guess we could expose a gated list of each specific provider parameters for each infra providerSpec level and standardise/enforce the behaviour at the machine controller level as follows:

    • We only integrate with the provider semantics for individual spot requests, i.e 1 machine - 1 spot instance request is an invariant, i.e same mapping that's done for regular instances.
      For groups of spot instances this might be relevant 📖CAEP: MachinePool API (provider agnostic scale types) #1703
    • We try to ensure spot requests are fire and forget. When the instance is terminated by the provider, the controller cancel the request, the Machine goes failed.
    • MachineSet constantly reconciles towards number of replicas may be ignoring failed machines. Therefore new machine/spot requests are constantly re-created after one is canceled.
  • Alternatively a new abstraction might be created the likes of spotMachine or spotPool

  • Pods running on a spot instance won’t be shutdown gracefully. This needs a particular solution for each provider. E.g https://github.com/aws/aws-node-termination-handler/blob/master/README.md

Anything else you would like to add:

[Miscellaneous information that will assist in solving the issue.]

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 11, 2019
@enxebre
Copy link
Member Author

enxebre commented Dec 11, 2019

@michaelgugino
Copy link
Contributor

I think we should start with machines on a per-provider basis. I'm not sure we need a common abstraction at this point. Here's what I came up with for AWS: https://github.com/openshift/cluster-api-provider-aws/pull/255/files#diff-d1ff143ff12272410f75d9d545d45ec1R144-R152

Each provider is going to be handled differently. For instance, on AWS, you're creating a spot instance request, that may or may not be filled depending on current usage and your max bid. This might be desireable by the end user: I have some work I need done, and it can wait until the price is X. GCP is different. If the machine is deleted before the request is fulfilled, we need to delete the request rather than an instance (this is trivial, hasn't been implemented yet). The other consideration here is the IAM policies for Spot Instances require additional privileges.

Once we have spot instance support in a particular cloud, a user could create machinesets to easily add and remove spot instances/requests. I think regular machinesets are fine for now. Users might want to taint those nodes to ensure only interruptible workloads are on thoses machines. AWS will notify a machine that it's about to go down, and there's already some existing code to drain the node: https://github.com/kube-aws/kube-spot-termination-notice-handler

Obviously, you shouldn't deploy your entire cluster on spot instances (in 80% use case I'd argue), so we just need to document what the current behavior is, and people can opt-in to using them if that's useful.

From there, I think we should build provider-specific machine-spot-set controllers, or some other type of abstraction that can balance machines across AZs and instance-types, as per the recommended practices from AWS to minimize disruption. Or, another abstraction would be to create the ability for the machineset to 'deletegate' function. Leave the template empty, just use the replica count, and something else responds to the desired number of instances by watching that machineset.

@ncdc ncdc added this to the Next milestone Dec 11, 2019
@ncdc ncdc added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Dec 11, 2019
@enxebre
Copy link
Member Author

enxebre commented Dec 11, 2019

what I'd like is to consolidate and enforce behaviour for the machine resource across providers at the machine controller level for the case when a given cloud provider decides to interrupt and terminate a spot instance. So this behaviour is consistent, predictable and upper level controllers can react appropriately, i.e I'd probably expect the machine to go failed in such a case.

@michaelgugino
Copy link
Contributor

what I'd like is to consolidate and enforce behaviour for the machine resource across providers at the machine controller level for the case when a given cloud provider decides to interrupt and terminate a spot instance. So this behaviour is consistent, predictable and upper level controllers can react appropriately, i.e I'd probably expect the machine to go failed in such a case.

From the machine-controller's POV, there shouldn't be anything new. The instance goes away for one reason or another, and we react accordingly, in the default case, do nothing, that's up to the remediation layer if desired.

The day-1 use case I'm thinking of is, I scale up say 20 spot instances, and they process through my batch jobs. After 24 hours, I might scale them down. During that time, if a handful of instances go away, I'm not really concerned about it. After that 24 hours, I scale the machineset down to zero, and I might repeat the same thing next week.

@detiber
Copy link
Member

detiber commented Dec 11, 2019

I think one potential issue here is that we'd need to react to the signal to shutdown the spot instance in order to fasttrack the cordon/drain of the Node faster than we would otherwise need to do for other types of remediation.

@michaelgugino
Copy link
Contributor

I think one potential issue here is that we'd need to react to the signal to shutdown the spot instance in order to fasttrack the cordon/drain of the Node faster than we would otherwise need to do for other types of remediation.

@detiber I've linked a project that does just that in my first comment. Someone could optionally deploy that thing (I believe it runs as a daemonset) if they so chose. Or we can develop a different daemonset to notify some other system. In any case, I don't view this as day-1 feature, it's perfectly fine to have workers that might disappear quickly, I would suggest taints as a starting point.

@enxebre
Copy link
Member Author

enxebre commented Feb 24, 2020

/assign @JoelSpeed

@JoelSpeed
Copy link
Contributor

I've put together a proposal which I discussed a bit with @vincepri last week. I've added a large section towards the bottom of this about the state of Spot instances across AWS, Azure and GCP, so if anyone is unfamiliar with them or would like a refresher on how they work, please read the additional details I've added.

The proposal TL;DR is basically that:

  • Each cloud provider (AWS, Azure, GCP) supports semantics for starting single instances using spot instances using the same mechanism as is currently used for Machines
  • A field can be added to each provider's MachineSpec (eg AWSMachineSpec which would make the machines be created as spot instances
  • There are other fields that need to be set on certain providers to make the Spot Instances behave in the same way On-Demand do currently, but this can be done internally, no need to expose to the user
  • I believe we can introduce support for spot instances without breaking the existing assumptions for Machine{Deployment,Set,}s
  • Support for termination events should be included as an MVP item

Vince had some thoughts around this (correct me if I'm wrong Vince):

  • Adding extra fields to MachineSpec might hinder ability for types to be promoted from Alpha
  • Termination handling can probably be pushed to a second stage and excluded from this proposal
  • Perhaps the configuration semantics could be abstracted from the provider and added to MahcinePoolSpecs (which is an unstable API already, so could make breaking changes here easily) and then handled by MachinePool implementations, skipping support initially for individual spot instances
  • Cloud providers have operational expertise for running lots of spot instances, makes sense that we leverage this expertise, but this is done via MachinePool

I'd like to gather feedback from the Owners of the Azure, GCP and AWS providers on the proposal to add fields to their MachineSpecs and their thoughts on the stability and implications of adding these fields as Vince has raised. I don't know who to poke to gather this feedback however, so if anyone knows who I should be poking, please let me know 😃

enxebre added a commit to enxebre/cluster-api that referenced this issue Mar 4, 2020
enxebre added a commit to enxebre/cluster-api that referenced this issue Mar 4, 2020
@JoelSpeed
Copy link
Contributor

To give an update on this, I synced up with the CAPA maintainers at their call on Monday (March 9th), and with @CecileRobertMichon representing CAPZ today (if anyone else seeing this has questions and wants a call to sync, please ping me)

Both seemed to support the initial proposal and direction of trying to get consistent looking approaches to the feature across AWS/GCP/Azure, the current direction I've been agreeing with people is:

  1. Implement support for single Machine spot instances in this proposal via adding a field to each InfrastructureTemplate
  2. Defer TerminationHandling to a second proposal
  3. Follow up with MachinePool proposal once MachinePool is stabilised better

From these calls I have a couple of questions to answer:

  • Need to test spot on Azure single instances to verify the assumptions within the doc (Already tested for AWS and GCP)
  • What happens if Azure Spot VMs don't make it out of Preview, what do we do then? Can we drop the field?

If there are more questions, please add to the doc and I will try to address

@vincepri
Copy link
Member

Implement support for single Machine spot instances in this proposal via adding a field to each InfrastructureTemplate

Does this have any impact on Cluster API controllers and types?

@JoelSpeed
Copy link
Contributor

Does this have any impact on Cluster API controllers and types?

No it shouldn't do, in terms of implementation it should basically be adding a field to each of the InfrastructureMachineTemplate and plumbing it through to the call that creates the Cloud VM.

The only changes I expect within this repo are within the docs

@vincepri
Copy link
Member

Got it, it sounds like we might need to follow-up in each infrastructure provider with some issues and keep this one open for the next cycle once we look at MachinePool

@CecileRobertMichon
Copy link
Contributor

Linked the capz issue ^

@michaelgugino
Copy link
Contributor

What happens if Azure Spot VMs don't make it out of Preview, what do we do then? Can we drop the field?

From my POV, the provider fields are less rigid. They shouldn't be forced to support things they don't want to support. It's up to their each provider to specify an API migration story. "We don't think anyone is using this and it's broken anyway" is reason enough for me to drop it from a particular provider.

@vincepri
Copy link
Member

the provider fields are less rigid

I can see that, but within SIG sponsored sub-projects we're trying to adhere to the usual Kubernetes API guarantees. In this case, after adding a field, it should wait until the next cycle to be dropped / deprecated which is probably ok-ish

@detiber
Copy link
Member

detiber commented Mar 12, 2020

The providers should adhere to API contracts, so a change to remove a field should be handled accordingly.

@michaelgugino
Copy link
Contributor

In this case, after adding a field, it should wait until the next cycle to be dropped / deprecated which is probably ok-ish

I agree with deprecating and removing over a release, or just removing the next release if it's clear that the field is not GA. I'm not advocating we just remove it mid release.

The providers should adhere to API contracts

What the providers do as far as their fields and implementations should be transparent to cluster-api. We've got some standards, but it's outside of our domain to dictate what is best for each provider IMO. In the case of a provider that lives in k8s-sigs org, yeah, we should probably do the k8s way of things. But I view providers as things you replace and/or modify to your own needs.

@CecileRobertMichon
Copy link
Contributor

What happens if Azure Spot VMs don't make it out of Preview

to clarify, my comment was more about how to expose the field while the feature is still in preview. I'm not too concerned about the feature not making it out of preview.

@vincepri
Copy link
Member

This is why I originally asked to make all of this an experiment and have it only be under MachinePool, it's way easier to make changes to something that hasn't been published yet

@CecileRobertMichon
Copy link
Contributor

This is why I originally asked to make all of this an experiment and have it only be under MachinePool, it's way easier to make changes to something that hasn't been published yet

@JoelSpeed what do you think about that? It does make sense to use MachinePool since the Spot feature would also be experimental at first. Also it might be a forcing function for all the infra providers to adopt MachinePools :)

@michaelgugino
Copy link
Contributor

For just Azure, using MachinePools is fine. For the other providers, I don't think we should gate on how MachinePools works because they are the least mature.

@enxebre
Copy link
Member Author

enxebre commented Mar 13, 2020

@CecileRobertMichon @vincepri I'd expect this to be exposed at the InfrastructureMachineTemplate level. This can then be complementarily leveraged by an infraMachine implementation, a machinePool implementation or anything wanting to consume that API. But regardless where we choose to implement it the API changes would go into InfrastructureMachineTemplate.

Personally I don't see an issue with Azure Spot Virtual Machines being Preview https://azure.microsoft.com/en-us/pricing/spot/. I see that orthogonal to the maturity of our API. We would just refer CAPI users to the cloud provider docs for that particular feature expectations while guaranteeing our API contracts nevertheless if Azure Cloud ever drops support for this.
Edit: Also if there's strong concerns about this, we could just defer exposing this in capz for now.

@detiber
Copy link
Member

detiber commented Mar 13, 2020

I agree that I don't think we want to tie this to heavily into MachinePools, since not all users will want the MachinePool semantics.

@JoelSpeed
Copy link
Contributor

I'm in the camp with Michael, Alberto and Jason on this one. We know the APIs for AWS and GCP are stable as they haven't changed in a long time, so this problem of perhaps being uncertain about the Cloud API or having to signal to users that this is a preview feature only really applies to Azure, it makes me beg the question as to whether we should just delay implementing in Azure but that also isn't ideal as I wanted to try and make sure we are consistent with this feature across the providers who do support it

Honestly I don't think the API field changing is likely to be a problem. Azure Spot is their second iteration of this (they're stopping low-priority which was the first iteration) so I would imagine this will be more stable and probably make it out of preview without many user facing changes

I don't really want to hold up support on AWS and GCP to wait for MachinePool as we currently have no timeline for when those are due. And as Jason says, I think we will likely have users who want to use this feature with individual machines, and probably, machinepools in the future too.

to clarify, my comment was more about how to expose the field while the feature is still in preview. I'm not too concerned about the feature not making it out of preview.

Apologies, I may have misunderstood that when I was typing this up, my note on it from the call wasn't particularly great. In terms of signalling it, I think we just make a note in the docs to users saying this is a preview feature and please don't use it in production and link to the Azure docs which say the same.

@JoelSpeed
Copy link
Contributor

I've just gone through and updated the doc based on comments given in the doc and moved the termination handling stuff to a future work section so that we can focus on just the API fields and patching the options through to the create instances calls within the first pass of spot support

@vincepri
Copy link
Member

I'm fine with either approach. The potential issue I'd like to avoid is to have changes made to the CAPI controllers (especially Machine) to take into account spot semantics. This doesn't seem to be the case today given the conversations above, so 👍 if the infrastructure providers are fine supporting it.

@JoelSpeed
Copy link
Contributor

I've opened a PR to add the proposal formally #2817

@CecileRobertMichon
Copy link
Contributor

to clarify, my comment was more about how to expose the field while the feature is still in preview. I'm not too concerned about the feature not making it out of preview.

https://azure.microsoft.com/en-us/updates/azure-spot-virtual-machines-are-now-generally-available/

Looks like we don't need to worry about that anymore 🙂 cc @JoelSpeed

@JoelSpeed
Copy link
Contributor

Thanks for the heads up @CecileRobertMichon, will update the proposal 🙂

@vincepri
Copy link
Member

@JoelSpeed can we close this one now that the proposal has been merged?

@vincepri
Copy link
Member

We probably should follow-up with a documentation task to write up in the book what infrastructure providers are expected to follow as convention

@JoelSpeed
Copy link
Contributor

JoelSpeed commented Apr 21, 2020

Yep, next steps I would say are:

  • Ensure CAPA, CAPG, CAPZ have tracking issues for adding support
  • Update documentation to ensure convention is documented

@vincepri
Copy link
Member

/kind documentation

@vincepri
Copy link
Member

/milestone v0.3.x

@k8s-ci-robot k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Apr 21, 2020
@k8s-ci-robot k8s-ci-robot modified the milestones: Next, v0.3.x Apr 21, 2020
@vincepri
Copy link
Member

@JoelSpeed @enxebre can this issue be closed?

@JoelSpeed
Copy link
Contributor

I think we still need to put some docs into the cluster-api book for this

@vincepri
Copy link
Member

/help

@k8s-ci-robot
Copy link
Contributor

@vincepri:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jun 11, 2020
@vincepri
Copy link
Member

Given that the work here has been done, the proposal has been merged, I'll go ahead and close this issue for now. Let's follow-up for provider implementer docs separately.

/close

@k8s-ci-robot
Copy link
Contributor

@vincepri: Closing this issue.

In response to this:

Given that the work here has been done, the proposal has been merged, I'll go ahead and close this issue for now. Let's follow-up for provider implementer docs separately.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Projects
None yet
Development

No branches or pull requests

8 participants