[aws] Spot Instances support #481

Merged
merged 2 commits into from
Aug 4, 2020

Conversation

zuzzas
Contributor

@zuzzas zuzzas commented Jun 29, 2020

What this PR does / why we need it:

This takes care of the AWS Spot Instances support part of #27.

Which issue(s) this PR fixes:
AWS part of #27

Release note:

Support for Spot Instances is now available in the AWS driver. If `spotPrice` is empty, the price is automatically set to the on-demand price so that the Spot Instance can launch immediately.

Short notes

  • single spot instance support enabled for AWS driver
  • SpotFleet is not supported, as the MachineAPI's potential for supporting it still needs to be looked into

Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>
@zuzzas zuzzas requested review from ggaurav10 and a team as code owners June 29, 2020 11:04
@gardener-robot

@zuzzas Thank you for your contribution.

@hardikdr
Member

hardikdr commented Jul 7, 2020

cc @dank79430

@hardikdr hardikdr added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 7, 2020
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jul 7, 2020
@hardikdr
Member

hardikdr commented Jul 7, 2020

Thanks a lot for the PR @zuzzas, looks good from the first cut review.
I'll need a bit of time to test this out.

pkg/driver/driver_aws.go (Outdated)
return "Error", "Error", err
runResult, err := svc.RunInstances(&inputConfig)
if err != nil {
metrics.APIFailedRequestCount.With(prometheus.Labels{"provider": "aws", "service": "ecs"}).Inc()

I'm not super familiar with this codebase so apologies if this is completely wrong, but is the service supposed to be "ec2" or is "ecs" correct here?

Contributor Author

You are most certainly right, but this is out of the scope of this PR.

spotPrice = nil
}

spotInstanceRequestInput := &ec2.RequestSpotInstancesInput{

Did you consider using a spot-fleet one-time request rather than a spotRequest? I'm not sure it makes much of a difference with the current implementation, since it only accepts one instance type anyway. But if multiple instance types could be passed, a spot-fleet one-time request could be used with the capacity-optimized spot allocation strategy to select the best instance type to use. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet-requests.html

Member

We have a machine-object which I believe maps to the SpotRequest of a single instance here. That could block the natural use of the SpotFleet. @zuzzas do you think there could be a way to get the SpotFleet in with the current machine-model?

ClusterAPI recently came up with the MachinePool abstraction: https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20190919-machinepool-api.md . I am curious if we could also adopt something similar if that helps us bring the SpotFleet in.

  • Another prominent benefit would be for Azure: we could then directly consume VMSS. [@MSSedusch]

If we agree overall, we could open a design-doc to discuss MachinePool+SpotFleet for MCM. cc @rewiko @zuzzas @prashanth26 @bwagner5

@bwagner5 bwagner5 Jul 8, 2020

Yes, I was thinking for the current machine-model. Spot Fleet has request types of maintain or request. The "maintain" type will maintain the desired capacity similar to how autoscaling groups work. The request type will launch the instances at the desired capacity, but won't maintain the capacity. So a one-time spot fleet request with a desired capacity of 1 would mirror the functionality that RequestSpotInstances provides but would also allow for the use of SpotAllocationStrategies. The capacity-optimized spot allocation strategy allows Spot Fleet to accept multiple instance-types and then it will choose one with a lot of capacity thereby reducing the chance of a spot interruption.

As far as machine-pool is concerned, ASG should take care of all of the use-cases even with spot, without the need to pull in spot fleet.

As a third option, we recently implemented spot instances within the OpenShift cluster-api provider for AWS by leveraging the RunInstancesInput. I think we get the same effect as the current implementation, but AWS handles a lot of it for us by setting up the spot instance request and checking it for us. Did you consider this at all?

You may be able to simplify this a lot if you follow the same route as us: create the InstanceMarketOptions as below and add that to RunInstancesInput. If the instance can't be fulfilled immediately, AWS returns a 400 so you can tell that the creation failed; otherwise it accepts the request and you get a response like a normal RunInstances.
https://github.com/JoelSpeed/cluster-api-provider-aws/blob/e1b2632853623790c11d60f1bf2a34cc7d697eec/pkg/actuators/machine/instances.go#L379-L407

Behind the scenes it relies on an individual spot instance request as you have implemented here.
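
The InstanceMarketOptions route described above can be sketched in Go roughly as follows. This is a minimal sketch, not the code from this PR: the two struct types are local stand-ins that mirror the shape of aws-sdk-go's ec2.InstanceMarketOptionsRequest and ec2.SpotMarketOptions so that the example compiles without the SDK, and buildSpotMarketOptions is a hypothetical helper. It assumes that a nil spotPrice means a plain on-demand instance and that an empty spotPrice means a spot instance with no explicit MaxPrice.

```go
package main

import "fmt"

// SpotMarketOptions is a local stand-in for ec2.SpotMarketOptions
// (only the field used here is reproduced).
type SpotMarketOptions struct {
	MaxPrice *string
}

// InstanceMarketOptionsRequest is a local stand-in for
// ec2.InstanceMarketOptionsRequest.
type InstanceMarketOptionsRequest struct {
	MarketType  *string
	SpotOptions *SpotMarketOptions
}

// buildSpotMarketOptions is a hypothetical helper. Assumptions:
//   - nil spotPrice   -> on-demand instance, no market options at all
//   - empty spotPrice -> spot instance, MaxPrice left unset
//   - otherwise       -> spot instance with the given MaxPrice
func buildSpotMarketOptions(spotPrice *string) *InstanceMarketOptionsRequest {
	if spotPrice == nil {
		return nil
	}
	marketType := "spot" // value of ec2.MarketTypeSpot in aws-sdk-go
	opts := &InstanceMarketOptionsRequest{
		MarketType:  &marketType,
		SpotOptions: &SpotMarketOptions{},
	}
	if *spotPrice != "" {
		opts.SpotOptions.MaxPrice = spotPrice
	}
	return opts
}

func main() {
	empty, price := "", "0.05"
	for _, p := range []*string{nil, &empty, &price} {
		opts := buildSpotMarketOptions(p)
		switch {
		case opts == nil:
			fmt.Println("on-demand: no InstanceMarketOptions set")
		case opts.SpotOptions.MaxPrice == nil:
			fmt.Println("spot: MaxPrice unset, left to the AWS default")
		default:
			fmt.Printf("spot: MaxPrice=%s\n", *opts.SpotOptions.MaxPrice)
		}
	}
}
```

With the real SDK, the returned value would be assigned to the InstanceMarketOptions field of RunInstancesInput before calling RunInstances.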

Contributor Author

@zuzzas zuzzas Jul 9, 2020

@JoelSpeed, well, that's embarrassing. I've completely glossed over this attribute in RunInstancesInput. I'll refactor my PR today, should shave off a lot of useless code.

Thank you!

No worries, happy to help 😅

@enxebre enxebre Jul 9, 2020

Contributor Author

Thanks to colleagues from Red Hat, I've significantly simplified this PR.

Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>
@zuzzas
Contributor Author

zuzzas commented Jul 10, 2020

@hardikdr
I've simplified this PR. I apologize for the initial over-engineering. I promise to conduct proper research next time. :(

@hardikdr
Member

@zuzzas Thanks, the PR is much simpler now. I believe these changes enable the single spot-request case. I am curious whether there could be a feasible way to also include SpotFleets, and how that could fit with the machine-model we have. Of course, SpotFleet could be out of the scope of this PR if significant changes are expected.

I also had a couple of behavioral questions:

  • Should we have a mechanism to fall back on the on-demand instance if the spot price cannot be met? Or is that already the case?
  • What happens to the machine object if creation fails? Does it keep retrying at the moment? We should then have a good back-off on failures.

@hardikdr
Member

I've simplified this PR. I apologize for the initial over-engineering. I promise to conduct proper research next time. :(

Absolutely no issues, in fact, big thanks again for enabling the Spot-support, really appreciate it. :)

@hardikdr hardikdr added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 16, 2020
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jul 16, 2020
@prashanth26
Contributor

prashanth26 commented Jul 18, 2020

Hi @zuzzas ,

Thank you for the great PR. Apologies for the delay in testing this PR. I just tested it. Works pretty well :) However, it would be great if you could provide some validation for this field (it could be as simple as making sure that it is a string, as people might get confused and enter a floating-point number here). The validation code can be found here

Member

@hardikdr hardikdr left a comment

Lgtm with Prashanth's suggestion.

Other improvements I believe can be done in subsequent PRs.

@zuzzas
Contributor Author

zuzzas commented Jul 22, 2020

@hardikdr
I'd like to request a timeout until this weekend. I am preparing a video for KubeCon Europe.

I've yet to provide:

  1. Validation
  2. Back-off for Spot Instance creation if instances cannot be created at the time (due to price or datacenter capacity)
  3. A check of the on-demand pricing fallback

@JoelSpeed

Should we have a mechanism to fall back on the on-demand instance if the spot price cannot be met? Or is that already the case?

This is built into the AWS API: send an empty value for inputConfig.InstanceMarketOptions.SpotOptions.MaxPrice, and the spot request that gets created defaults to the on-demand price for that instance. See the docs for the field:

MaxPrice
The maximum hourly price you're willing to pay for the Spot Instances. The default is the On-Demand price.

Type: String

Required: No

And I can confirm from my experiments when implementing this feature within OpenShift that it does indeed work. From the look of your implementation, this should work for you too.

@zuzzas
Contributor Author

zuzzas commented Jul 22, 2020

Then there is one less thing to test and worry about. Thanks, @JoelSpeed!

@hardikdr
Member

@hardikdr
I'd like to request a timeout until this weekend. I am preparing a video for KubeCon Europe.

Nice, that's pretty cool, and absolutely no hurries.
And very best of luck for the talk, topic sounds really interesting - eagerly waiting for it :)

@hardikdr
Member

Yes, I was thinking for the current machine-model. Spot Fleet has request types of maintain or request. The "maintain" type will maintain the desired capacity similar to how autoscaling groups work.

@bwagner5 I got a chance to read up a bit, and I'm really excited to enable support for SpotFleet, specifically the maintain part. I'm not yet clear, though, on how it could be mapped efficiently to the MachineAPI model. One of the prominent ways, of course, is to support MachinePool CRDs and support ASG along with single machine-requests. I am, though, interested in investigating whether there could be a reliable (and of course not hacky) way to consume the existing MachineAPI model. I'll keep this thread posted on updates. Thanks for the suggestion above.

@zuzzas
Contributor Author

zuzzas commented Aug 1, 2020

@prashanth26

However, it would be great if you could provide some validation for this field (it could be as simple as making sure that it is a string, as people might get confused and enter a floating-point number here).

It'll output an error on JSON unmarshal. And if a user inputs a garbage string, an AWS API will return an error on instance creation. And we don't have to deal with floating numbers!
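
The unmarshalling behaviour mentioned above can be illustrated with a small sketch. It assumes the price is carried in a string-typed field; machineClassSpec and parseSpec are hypothetical names used only for this example and are not the actual MCM types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// machineClassSpec is a hypothetical, pared-down stand-in for a machine
// class spec that carries the spot price as an optional string.
type machineClassSpec struct {
	SpotPrice *string `json:"spotPrice,omitempty"`
}

// parseSpec decodes raw JSON into the spec. encoding/json refuses to put
// a JSON number into a string field, so a float price is rejected here.
func parseSpec(raw string) (*machineClassSpec, error) {
	var spec machineClassSpec
	if err := json.Unmarshal([]byte(raw), &spec); err != nil {
		return nil, err
	}
	return &spec, nil
}

func main() {
	inputs := []string{`{"spotPrice": "0.05"}`, `{"spotPrice": 0.05}`, `{}`}
	for _, raw := range inputs {
		spec, err := parseSpec(raw)
		switch {
		case err != nil:
			fmt.Printf("%s -> rejected: %v\n", raw, err)
		case spec.SpotPrice == nil:
			fmt.Printf("%s -> accepted, spotPrice not set\n", raw)
		default:
			fmt.Printf("%s -> accepted, spotPrice=%q\n", raw, *spec.SpotPrice)
		}
	}
}
```

A quoted but meaningless value such as "abc" would still pass this step; as noted above, it is left to the AWS API to reject it at instance creation.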

@hardikdr

What happens to the machine object if creation fails? Does it keep retrying at the moment? We should then have a good back-off on failures.

The problem is that not a single provider in the MCM implements a back-off procedure. I believe such a change should be placed into the MachineSet Controller, and it certainly is outside of the scope of this PR.

@hardikdr
Member

hardikdr commented Aug 3, 2020

The problem is that not a single provider in the MCM implements a back-off procedure. I believe such a change should be placed into the MachineSet Controller, and it certainly is outside of the scope of this PR.

Agreed, I plan to pick up the topic of back-off on failure in general for MCM soon.

I am overall happy with the current state of the PR, would you want to take a final look @prashanth26 ?

@prashanth26
Contributor

It'll output an error on JSON unmarshal. And if a user inputs a garbage string, an AWS API will return an error on instance creation. And we don't have to deal with floating numbers!

Okay sure. Let's drop it for now.

@hardikdr hardikdr merged commit 9b19688 into gardener:master Aug 4, 2020
@zuzzas zuzzas deleted the upstreaming-spot-instances branch August 5, 2020 07:14
@prashanth26 prashanth26 mentioned this pull request Aug 13, 2020
5 tasks
Labels
needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD)
8 participants