[aws] Spot Instances support #481

zuzzas · 2020-06-29T11:04:22Z

What this PR does / why we need it:

This adds the AWS Spot Instances part of #27.

Which issue(s) this PR fixes:
AWS part of #27

Release note:

Support for Spot Instances is available in the AWS driver. If `spotPrice` is empty, the maximum price defaults to the on-demand price so that the Spot Instance can launch immediately.

Short notes

Single Spot Instance support is enabled for the AWS driver.
SpotFleet is not supported; whether the MachineAPI can support it still needs to be looked into.

Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>

gardener-robot · 2020-06-29T11:04:26Z

@zuzzas Thank you for your contribution.

hardikdr · 2020-07-07T07:30:01Z

cc @dank79430

hardikdr · 2020-07-07T18:55:04Z

Thanks a lot for the PR @zuzzas, it looks good on a first-cut review.
I'll need a bit of time to test this out.

bwagner5 · 2020-07-07T19:14:13Z

pkg/driver/driver_aws.go

-		return "Error", "Error", err
+		runResult, err := svc.RunInstances(&inputConfig)
+		if err != nil {
+			metrics.APIFailedRequestCount.With(prometheus.Labels{"provider": "aws", "service": "ecs"}).Inc()


I'm not super familiar with this codebase so apologies if this is completely wrong, but is the service supposed to be "ec2" or is "ecs" correct here?

You are most certainly right, but this is out of the scope of this PR.

bwagner5 · 2020-07-07T19:21:47Z

pkg/driver/driver_aws.go

+			spotPrice = nil
+		}
+
+		spotInstanceRequestInput := &ec2.RequestSpotInstancesInput{


Did you consider using a Spot Fleet one-time request rather than a Spot request? I'm not sure it makes much of a difference with the current implementation, since it only accepts one instance type anyway. But if multiple instance types could be passed, a Spot Fleet one-time request could be used with the capacity-optimized allocation strategy to select the best instance type to use. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet-requests.html

We have a machine-object which I believe maps to the SpotRequest of a single instance here. That could block the natural use of the SpotFleet. @zuzzas do you think there could be a way to get the SpotFleet in with the current machine-model?

ClusterAPI recently came up with the MachinePool abstraction: https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20190919-machinepool-api.md . I am curious if we could also adopt something similar if that helps us bring the SpotFleet in.

Another prominent benefit would be for Azure: we could then directly consume VMSS. cc @MSSedusch

If we agree overall, we could open a design-doc to discuss MachinePool+SpotFleet for MCM. cc @rewiko @zuzzas @prashanth26 @bwagner5

Yes, I was thinking for the current machine-model. Spot Fleet has request types of maintain or request. The "maintain" type will maintain the desired capacity similar to how autoscaling groups work. The request type will launch the instances at the desired capacity, but won't maintain the capacity. So a one-time spot fleet request with a desired capacity of 1 would mirror the functionality that RequestSpotInstances provides but would also allow for the use of SpotAllocationStrategies. The capacity-optimized spot allocation strategy allows Spot Fleet to accept multiple instance-types and then it will choose one with a lot of capacity thereby reducing the chance of a spot interruption.

As far as machine-pool is concerned, ASG should take care of all of the use-cases even with spot, without the need to pull in spot fleet.

As a third option, we recently implemented spot instances within the OpenShift cluster-api provider for AWS by leveraging RunInstancesInput. I think we get the same effect as the current implementation, but AWS handles a lot of it for us by setting up the spot instance request and checking it for us. Did you consider this at all?

You may be able to simplify this a lot if you follow the same route as us: create the InstanceMarketOptions as below and add that to RunInstancesInput. If the instance can't be fulfilled immediately, AWS returns a 400, so you can tell that the creation failed. Otherwise it accepts the request and you get a response like a normal RunInstances.
https://github.com/JoelSpeed/cluster-api-provider-aws/blob/e1b2632853623790c11d60f1bf2a34cc7d697eec/pkg/actuators/machine/instances.go#L379-L407

Behind the scenes it relies on an individual spot instance request as you have implemented here.
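
For readers following along, here is a minimal sketch of the approach described above. It is not the code from this PR or from the linked provider. The struct types are local stand-ins that mirror the field names of the aws-sdk-go (v1) EC2 request types so the snippet compiles without the SDK, and `spotMarketOptions` is a hypothetical helper name. It assumes that a nil `spotPrice` means "launch on-demand" and that an empty string means "Spot with the default maximum price".

```go
package main

// Local stand-ins mirroring the shape of the aws-sdk-go (v1) EC2 types.
// In real code these would come from the SDK's ec2 package.
type SpotMarketOptions struct {
	// MaxPrice left nil lets AWS apply its documented default,
	// which is the On-Demand price.
	MaxPrice *string
}

type InstanceMarketOptionsRequest struct {
	MarketType  *string
	SpotOptions *SpotMarketOptions
}

type RunInstancesInput struct {
	InstanceType          *string
	InstanceMarketOptions *InstanceMarketOptionsRequest
}

// spotMarketOptions (hypothetical helper) builds the market options to attach
// to RunInstancesInput.InstanceMarketOptions.
//   - nil spotPrice:       returns nil, i.e. a regular on-demand launch (assumption)
//   - empty spotPrice:     Spot request without MaxPrice
//   - non-empty spotPrice: Spot request with MaxPrice set to that string
func spotMarketOptions(spotPrice *string) *InstanceMarketOptionsRequest {
	if spotPrice == nil {
		return nil
	}
	marketType := "spot"
	opts := &SpotMarketOptions{}
	if *spotPrice != "" {
		opts.MaxPrice = spotPrice
	}
	return &InstanceMarketOptionsRequest{
		MarketType:  &marketType,
		SpotOptions: opts,
	}
}
```

A caller would assign the result to `input.InstanceMarketOptions` before calling RunInstances and treat an error from that call as a failed creation.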

@JoelSpeed, well, that's embarrassing. I've completely glossed over this attribute in RunInstancesInput. I'll refactor my PR today, should shave off a lot of useless code.

Thank you!

No worries, happy to help 😅

@zuzzas fwiw you can get more context on our reasoning and choices (which seem pretty aligned with this PR) here: https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/spot-instances.md and https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20200330-spot-instances.md

Thanks to colleagues from Red Hat, I've significantly simplified this PR.

Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>

zuzzas · 2020-07-10T11:49:58Z

@hardikdr
I've simplified this PR. I apologize for the initial over-engineering. I promise to conduct proper research next time. :(

hardikdr · 2020-07-16T05:49:06Z

@zuzzas Thanks, the PR is much simplified now. I believe these changes enable a single spot request. I am curious whether there is a feasible way to also include SpotFleets, and how that could fit with the machine model we have. Of course, SpotFleet could be out of scope for this PR if significant changes are expected.

I also had a couple of behavioral questions:

Should we have a mechanism to fall back to an on-demand instance if the spot price cannot be met? Or is that already the case?
What happens to the machine object if creation fails? Does it keep retrying at the moment? We should then have a good back-off on failures.

hardikdr · 2020-07-16T05:51:00Z

I've simplified this PR. I apologize for the initial over-engineering. I promise to conduct proper research next time. :(

Absolutely no issues, in fact, big thanks again for enabling the Spot-support, really appreciate it. :)

prashanth26 · 2020-07-18T11:28:29Z

Hi @zuzzas ,

Thank you for the great PR. Apologies for the delay in testing this PR. I just tested it. Works pretty well :) However, it would be great if you could provide some validation for this field (it could be as simple as making sure that it is a string, as people might get confused and enter a floating-point number here). The validation code can be found here

hardikdr

Lgtm with Prashanth's suggestion.

Other improvements I believe can be done in subsequent PRs.

zuzzas · 2020-07-22T10:30:50Z

@hardikdr
I'd like to request a timeout until this weekend. I am preparing a video for KubeCon Europe.

I've yet to provide:

Validation
Spot Instance Creation back-off if they cannot be created at the time (due to price or datacenter capacity)
~~Check the pricing on-demand fallback~~

JoelSpeed · 2020-07-22T10:44:28Z

Should we have a mechanism to fall back to an on-demand instance if the spot price cannot be met? Or is that already the case?

This is built into the AWS API: send an empty value for inputConfig.InstanceMarketOptions.SpotOptions.MaxPrice and the spot request that gets created defaults to the on-demand price for that instance. See the docs for the field:

MaxPrice
The maximum hourly price you're willing to pay for the Spot Instances. The default is the On-Demand price.

Type: String

Required: No

I can confirm from my experiments when implementing this feature within OpenShift that it does indeed work, and from the look of your implementation, this should work for you too.

zuzzas · 2020-07-22T10:46:37Z

Then there is one less thing to test and worry about. Thanks, @JoelSpeed!

hardikdr · 2020-07-23T03:48:03Z

@hardikdr
I'd like to request a timeout until this weekend. I am preparing a video for KubeCon Europe.

Nice, that's pretty cool, and absolutely no hurry.
And the very best of luck for the talk. The topic sounds really interesting, eagerly waiting for it :)

hardikdr · 2020-07-23T03:54:37Z

Yes, I was thinking for the current machine-model. Spot Fleet has request types of maintain or request. The "maintain" type will maintain the desired capacity similar to how autoscaling groups work.

@bwagner5 I got a chance to read up a bit, and I'm really excited to enable support for SpotFleet, specifically the maintain part. I'm not very clear, though, on how it could be mapped efficiently to the MachineAPI model. One of the prominent ways, of course, is to support MachinePool CRDs and support ASG along with single machine requests. I am, though, interested in investigating whether there is a reliable (and of course not hacky) way to reuse the existing MachineAPI model. I'll keep this thread posted on updates. Thanks for the suggestion above.

zuzzas · 2020-08-01T07:57:35Z

@prashanth26

However, it would be great if you could provide some validation for this field (it could be as simple as making sure that it is a string, as people might get confused and enter a floating-point number here).

It'll output an error on JSON unmarshal. And if a user inputs a garbage string, the AWS API will return an error on instance creation. And we don't have to deal with floating-point numbers!
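
To illustrate the JSON point, here is a small standalone sketch, not taken from this PR. `machineClassSpec` is a hypothetical, trimmed-down stand-in for the machine class spec, and it assumes the field is declared as a `*string` with the JSON key `spotPrice`. With that declaration, Go's `encoding/json` rejects a bare number, so the value never reaches the driver as a float.

```go
package main

import "encoding/json"

// machineClassSpec is a hypothetical stand-in holding only the field
// discussed here. The real spec type has many more fields.
type machineClassSpec struct {
	SpotPrice *string `json:"spotPrice,omitempty"`
}

// parseSpec decodes raw JSON into the stand-in spec. A JSON number supplied
// for spotPrice causes json.Unmarshal to return an error, because a number
// cannot be decoded into a Go string.
func parseSpec(raw string) (*machineClassSpec, error) {
	var spec machineClassSpec
	if err := json.Unmarshal([]byte(raw), &spec); err != nil {
		return nil, err
	}
	return &spec, nil
}
```

Calling `parseSpec` with `{"spotPrice": 0.05}` returns an error, while `{"spotPrice": "0.05"}` decodes normally and `{}` leaves `SpotPrice` nil. A string that is not a valid price still passes this step and is only rejected later by the AWS API, as noted above.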

@hardikdr

What happens to the machine object if creation fails? Does it keep retrying at the moment? We should then have a good back-off on failures.

The problem is that not a single provider in the MCM implements a back-off procedure. I believe such a change should be placed into the MachineSet Controller, and it certainly is outside of the scope of this PR.

hardikdr · 2020-08-03T13:23:11Z

The problem is that not a single provider in the MCM implements a back-off procedure. I believe such a change should be placed into the MachineSet Controller, and it certainly is outside of the scope of this PR.

Agreed, I plan to pick up the topic of back-off on failure in general for MCM soon.

I am overall happy with the current state of the PR, would you want to take a final look @prashanth26 ?

prashanth26 · 2020-08-04T04:24:24Z

It'll output an error on JSON unmarshal. And if a user inputs a garbage string, an AWS API will return an error on instance creation. And we don't have to deal with floating numbers!

Okay sure. Let's drop it for now.

[aws] Spot Instances support

b76e84f

Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>

zuzzas requested review from ggaurav10 and a team as code owners June 29, 2020 11:04

hardikdr added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 7, 2020

gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jul 7, 2020

bwagner5 reviewed Jul 7, 2020

Simplification

48bb661

Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>

hardikdr added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 16, 2020

gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 16, 2020

hardikdr approved these changes Jul 22, 2020

prashanth26 approved these changes Aug 4, 2020

hardikdr merged commit 9b19688 into gardener:master Aug 4, 2020

zuzzas deleted the upstreaming-spot-instances branch August 5, 2020 07:14

prashanth26 mentioned this pull request Aug 13, 2020

Spot VM Support #27

Open

5 tasks
