Spot VM Support #27

Closed
5 tasks
vlerenc opened this issue Feb 10, 2018 · 16 comments
Assignees
Labels
area/cost Cost related component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) effort/1m Effort for issue is around 1 month kind/enhancement Enhancement, improvement, extension kind/roadmap Roadmap BLI lifecycle/stale Nobody worked on this for 6 months (will further age) needs/planning Needs (more) planning with other MCM maintainers priority/1 Priority (lower number equals higher priority) size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) status/accepted Issue was accepted as something we need to work on

Comments

@vlerenc
Member

vlerenc commented Feb 10, 2018

Stories

  • As a user/operator, I want to use AWS spot, Azure Low-Priority and/or GCP Preemptible VM instances, so that my landscape runs at lower cost.

Motivation

Money, sure, but also some form of chaos monkey that should help train the application developers that all resources will eventually fail.

Acceptance Criteria

Remarks

Looks like Bosh had the same idea (well, everybody can if they have cattle VMs).

Enhancement/Implementation Proposal (optional)

Ideally, link to EP, e.g. a GEP in Gardener (https://github.com/gardener/gardener/tree/master/docs/proposals), alternatively prose here.

Challenges

@prashanth26 prashanth26 added the kind/enhancement Enhancement, improvement, extension label Apr 26, 2018
@hardikdr
Member

GKE enabled support for Pre-emptible VMs: https://cloud.google.com/kubernetes-engine/docs/concepts/preemptible-vm

@vlerenc
Member Author

vlerenc commented Jun 19, 2018

Yes, I saw that quite some time ago. That's why I said in one of our syncs that we won't be the first anymore. It really does make a lot of sense, too. On the other hand, our priorities are right. We know we would like to have it eventually, but we can't do everything at the same time.

@vlerenc
Member Author

vlerenc commented Jun 20, 2018

Funny, today I even saw this (thanks @afritzler): https://cloudplatform.googleblog.com/2018/06/Cloud-TPU-now-offers-preemptible-pricing-and-global-availability.html

Maybe we can beat GKE with preemptible TPUs in Kubernetes clusters then? ;-) Just kidding, but TPU support is definitely also interesting and somewhat different from how AWS handles GPU support (GPU support already works because MCM doesn't care, but TPUs must be assigned, as @afritzler and @rfranzke told me a couple of days ago).

@vlerenc vlerenc added the area/cost Cost related label Jul 10, 2018
@vlerenc vlerenc added the status/accepted Issue was accepted as something we need to work on label Aug 5, 2018
@gardener-robot-ci-1 gardener-robot-ci-1 added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Oct 5, 2018
@gardener-robot-ci-1 gardener-robot-ci-1 added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Dec 5, 2018
@gardener-robot-ci-1 gardener-robot-ci-1 added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Feb 4, 2019
@gardener-robot-ci-1 gardener-robot-ci-1 added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Apr 6, 2019
@gardener-robot-ci-1 gardener-robot-ci-1 added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jun 6, 2019
@vlerenc
Member Author

vlerenc commented Jul 2, 2019

Once we have the time to work on this one (GKE and others support that, too - just saw it with Banzai as well), we might leverage this here: org:banzaicloud repo:spot-price-exporter.

@gardener-robot-ci-1 gardener-robot-ci-1 added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Sep 1, 2019
@ghost ghost added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Nov 1, 2019
@ghost ghost added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jan 1, 2020
@samedguener

@vlerenc Are there any updates regarding node pools with hotspot instances? We are looking forward to this!

Best,
Samed

@vlerenc
Member Author

vlerenc commented Jan 27, 2020

No, no update. So far, nobody has even contacted us with a concrete need. You are the first. Most workloads can't cope with that kind of infrastructure. Can you elaborate on your use case a bit?

cc @hardikdr @prashanth26 @amshuman-kr @juergenschneider

@samedguener

We are planning to have node pools with hotspot instances:

  • to run future batch jobs, such as the training of machine learning models (long-running batch jobs), on hotspot instances and reschedule them when the instances become unavailable. This will allow us to reduce our TCO in the long term.
  • to reduce our development costs by scheduling the pods for PR tests onto such node pools.

Best,
Samed

@ghost ghost added the component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) label Mar 7, 2020
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jan 18, 2022
@vlerenc vlerenc removed this from the 2022-Q1 milestone May 25, 2022
@vlerenc vlerenc added this to the 2022-Q3 milestone Jul 5, 2022
@vlerenc vlerenc modified the milestones: 2022-Q3, 2022-Q4, 2023-Q1 Oct 18, 2022
@jscarney

Is there any update on the ability to support 'spot' instances across Azure and GCP? It would be useful for cost savings.

@himanshu-kun himanshu-kun added priority/1 Priority (lower number equals higher priority) and removed lifecycle/rotten Nobody worked on this for 12 months (final aging stage) priority/important-longterm labels Feb 20, 2023
@gardener-robot gardener-robot removed the priority/4 Priority (lower number equals higher priority) label Feb 20, 2023
@himanshu-kun himanshu-kun added size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) needs/planning Needs (more) planning with other MCM maintainers labels Feb 20, 2023
@himanshu-kun himanshu-kun removed their assignment Feb 20, 2023
@himanshu-kun himanshu-kun removed this from the 2023-Q1 milestone Feb 22, 2023
@gardener-robot gardener-robot added kind/roadmap Roadmap BLI and removed roadmap/cloud labels Mar 23, 2023
@vlerenc
Member Author

vlerenc commented Oct 10, 2023

There have been quite a few updates: AWS, Azure, and GCP now all support spot instances with dynamic prices (Azure and GCP deprecated their old models in favour of new ones, all of which are called spot VMs). GCP doesn't support a threshold, though, which is less than optimal (you can always look up the price and act accordingly). Grace periods vary (AWS 120s, Azure and GCP 30s), but all providers send a notification, which we could use for an immediate drain.
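To make the "notify, then drain immediately" idea concrete, here is a minimal sketch of how a drain budget could be derived from the notice periods quoted above. The names `NOTICE_SECONDS`, `DrainPlan`, `plan_drain` and the `overhead_seconds` allowance are hypothetical and not part of MCM; the notice values are simply the ones from this comment, not independently verified.

```python
from dataclasses import dataclass

# Notice periods in seconds, taken from the comment above (AWS 120s,
# Azure and GCP 30s). Treat them as assumptions, not verified constants.
NOTICE_SECONDS = {"aws": 120, "azure": 30, "gcp": 30}


@dataclass(frozen=True)
class DrainPlan:
    provider: str
    notice_seconds: int
    pod_grace_seconds: int  # time left for pods to terminate


def plan_drain(provider: str, overhead_seconds: int = 5) -> DrainPlan:
    """Derive a pod termination budget from the provider's notice period.

    overhead_seconds is a hypothetical allowance for detecting the notice
    and issuing the drain; it is an assumption, not a measured value.
    """
    key = provider.lower()
    notice = NOTICE_SECONDS[key]
    # Never return a negative budget, even if the overhead exceeds the notice.
    budget = max(notice - overhead_seconds, 0)
    return DrainPlan(key, notice, budget)
```

With these assumptions, a drain on Azure or GCP would have to finish within roughly 25 seconds, which is much tighter than the usual pod termination grace period.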

I also looked up auto-scaling groups: they now all support multiple zones, but only AWS and Azure support mixing on-demand and spot instances. AWS's feature seems strange, though, because, unlike with Azure and GCP, the spot price may even exceed the on-demand price. When the user sets a limit, e.g. at the regular on-demand price, AWS won't add capacity and you are left with the on-demand baseline, whereas Azure fulfils the request, capped at the on-demand price, so you still get your machines. That's at least how I understood the docs.
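The difference described above can be written down as a toy model. It encodes only the reading of the docs given in this comment, with the user limit assumed to equal the on-demand price; it ignores capacity shortages and is not a statement of how either provider actually behaves. `Outcome` and `mixed_group_outcome` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Outcome:
    on_demand: int                    # machines from the on-demand baseline
    spot: int                         # spot machines actually added
    spot_unit_price: Optional[float]  # price paid per spot machine, if any


def mixed_group_outcome(provider: str, baseline: int, spot_wanted: int,
                        market_spot_price: float,
                        on_demand_price: float) -> Outcome:
    """Toy model; the user limit is assumed to equal on_demand_price."""
    if provider == "aws":
        # As understood above: if the market price exceeds the limit, no
        # spot capacity is added and only the baseline remains.
        if market_spot_price > on_demand_price:
            return Outcome(baseline, 0, None)
        return Outcome(baseline, spot_wanted, market_spot_price)
    if provider == "azure":
        # As understood above: the request is fulfilled and the price is
        # capped at the on-demand price.
        price = min(market_spot_price, on_demand_price)
        return Outcome(baseline, spot_wanted, price)
    raise ValueError(f"mixing not modelled for provider {provider!r}")
```

For a baseline of 2 and 3 requested spot machines at a market price above on-demand, the model yields 2 machines on AWS and 5 on Azure, which is the gap a mixed-pool design would need to account for.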

Rebalancing is another open point, e.g. never, always, grace_period and/or cost_gap maybe?
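One possible reading of these four options, sketched as a decision function for moving a machine from on-demand back to spot. The semantics are assumptions for illustration only: `grace_period` is taken to mean "wait a fixed time after falling back to on-demand" and `cost_gap` to mean "rebalance only if the relative saving is large enough". The default values are arbitrary, and none of the names exist in MCM.

```python
from enum import Enum


class Rebalance(Enum):
    NEVER = "never"
    ALWAYS = "always"
    GRACE_PERIOD = "grace_period"
    COST_GAP = "cost_gap"


def should_rebalance(policy: Rebalance, *, seconds_since_fallback: float,
                     on_demand_price: float, spot_price: float,
                     grace_period_seconds: float = 600.0,
                     min_cost_gap: float = 0.2) -> bool:
    """Decide whether to replace an on-demand machine with a spot one.

    The thresholds are hypothetical defaults, not recommendations.
    """
    if policy is Rebalance.NEVER:
        return False
    if policy is Rebalance.ALWAYS:
        return True
    if policy is Rebalance.GRACE_PERIOD:
        # Assumed meaning: stay on on-demand for a while to avoid flapping.
        return seconds_since_fallback >= grace_period_seconds
    if policy is Rebalance.COST_GAP:
        if on_demand_price <= 0:
            raise ValueError("on_demand_price must be positive")
        # Assumed meaning: relative saving of spot versus on-demand.
        saving = (on_demand_price - spot_price) / on_demand_price
        return saving >= min_cost_gap
    raise ValueError(f"unknown policy {policy!r}")
```

Since the comment says "and/or", a real design might combine the grace period and the cost gap; this sketch keeps them separate to make each option's effect visible.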

@vlerenc vlerenc changed the title from "Spot / Low-Priority / Preemptible VM Support" to "Spot VM Support" Oct 10, 2023
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Aug 6, 2024
@vlerenc
Member Author

vlerenc commented Nov 12, 2024

There are some stakeholders that could still benefit from spot instances as they have enormous fluctuations during the day/week:

Image

@elankath
Contributor

elankath commented Nov 12, 2024

Will need to investigate stakeholder clusters to get an idea of their workload behaviour.

@aaronfern
Contributor

Closing this issue since it's very old, and there seems to be no traction.

Proper support can only be added once the MachineDeployment is overhauled to support the use of both spot and regular instances. A new issue for spot VM support can be created after that.

/close
