-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[aws] Spot Instances support #481
Conversation
Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>
@zuzzas Thank you for your contribution. |
cc @dank79430 |
Thanks a lot for the PR @zuzzas, looks good from the first cut review. |
pkg/driver/driver_aws.go
Outdated
return "Error", "Error", err | ||
runResult, err := svc.RunInstances(&inputConfig) | ||
if err != nil { | ||
metrics.APIFailedRequestCount.With(prometheus.Labels{"provider": "aws", "service": "ecs"}).Inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super familiar with this codebase so apologies if this is completely wrong, but is the service supposed to be "ec2" or is "ecs" correct here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are most certainly right, but this is out of the scope of this PR.
pkg/driver/driver_aws.go
Outdated
spotPrice = nil | ||
} | ||
|
||
spotInstanceRequestInput := &ec2.RequestSpotInstancesInput{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you consider using spot-fleet one-time request rather than spotRequest? I'm not sure if makes much of a difference with the current implementation since it only accepts 1 instance type anyways. But if multiple instance types could be passed, spot-fleet one-time request could be used with the capacity-optimized spot allocation strategy to select the best instance type to use. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet-requests.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a machine-object which I believe maps to the SpotRequest of a single instance here. That could block the natural use of the SpotFleet. @zuzzas do you think there could be a way to get the SpotFleet in with the current machine-model?
ClusterAPI recently came up with the MachinePool abstraction: https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20190919-machinepool-api.md . I am curious if we could also adopt something similar if that helps us bring the SpotFleet in.
- Another prominent benefit then would be for Azure, we can then directly consume VMSS. [@MSSedusch ] .
If we agree overall, we could open a design-doc to discuss MachinePool+SpotFleet for MCM. cc @rewiko @zuzzas @prashanth26 @bwagner5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was thinking for the current machine-model. Spot Fleet has request types of maintain
or request
. The "maintain" type will maintain the desired capacity similar to how autoscaling groups work. The request
type will launch the instances at the desired capacity, but won't maintain the capacity. So a one-time spot fleet request
with a desired capacity of 1 would mirror the functionality that RequestSpotInstances provides but would also allow for the use of SpotAllocationStrategies. The capacity-optimized spot allocation strategy allows Spot Fleet to accept multiple instance-types and then it will choose one with a lot of capacity thereby reducing the chance of a spot interruption.
As far as machine-pool is concerned, ASG should take care of all of the use-cases even with spot, without the need to pull in spot fleet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a third option, we recently implemented spot instances within the openshift cluster-api provider for AWS by leveraging the RunInstancesInput
. I think we have the same effect as the current implementation but AWS handles a lot of it for us by setting up the spot instance request and checking it for us. Don't you know if you considered this at all?
You may be able to simplify this a lot if you followed the same route as us, create the InstanceMarketOptions
as below, add that to RunInstancesInput
, and then if the instance can't be fulfilled immediately, AWS returns a 400 so you can tell that the creation failed, otherwise it accepts it and you get a ressponse like a normal RunInstances
.
https://github.com/JoelSpeed/cluster-api-provider-aws/blob/e1b2632853623790c11d60f1bf2a34cc7d697eec/pkg/actuators/machine/instances.go#L379-L407
Behind the scenes it relies on an individual spot instance request as you have implemented here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JoelSpeed, well, that's embarrassing. I've completely glossed over this attribute in RunInstancesInput
. I'll refactor my PR today, should shave off a lot of useless code.
Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, happy to help 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zuzzas fwiw you can get more context on our reasoning and choices (which seems pretty aligned with this PR) here https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/spot-instances.md and https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20200330-spot-instances.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks to colleagues from Red Hat, I've significantly simplified this PR.
Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>
@hardikdr |
@zuzzas Thanks, the PR is much simplified now. I believe, these changes enable us for the single spot-request. I am curious if there could be a feasible way to also include SpotFleets, and how could that fit well with the machine-model we have. Of course, SpotFleet could be out of the scope of this PR, if significant changes are expected. I also had couple of behavioral questions:
|
Absolutely no issues, in fact, big thanks again for enabling the Spot-support, really appreciate it. :) |
Hi @zuzzas , Thank you for the great PR. Apologies for the delay in testing this PR. I just tested it. Works pretty well :) However, it would be great if you provide some validation (could be as simple as even making sure that it is a string, as people might confused and enter a floating number here) for this field. The validation code can be found here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm with Prashanth's suggestion.
Other improvements I believe can be done in subsequent PRs.
@hardikdr I've yet to provide:
|
This is built into the AWS API, send an empty value
And I can confirm from my experimenting when implementing this feature within Openshift that it does indeed work and from the look of your implementation, this should work for you too |
Then there is one less thing to test and worry about. Thanks, @JoelSpeed! |
Nice, that's pretty cool, and absolutely no hurries. |
@bwagner5 I got a chance to read up a bit, and really excited to enable support for SpotFleet, specifically |
It'll output an error on JSON unmarshal. And if a user inputs a garbage string, an AWS API will return an error on instance creation. And we don't have to deal with floating numbers!
The problem is that not a single provider in the MCM implements a back-off procedure. I believe such a change should be placed into the MachineSet Controller, and it certainly is outside of the scope of this PR. |
Agreed, I plan to pick up the topic of back-off on failure in general for MCM soon. I am overall happy with the current state of the PR, would you want to take a final look @prashanth26 ? |
Okay sure. Let's drop it for now. |
What this PR does / why we need it:
This takes care of the AWS Spot Instances support part of the #27.
Which issue(s) this PR fixes:
AWS part of #27
Release note:
Short notes