feat(eks): support INF2 instance types #27373
The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.
A comment requesting an exemption should contain the text Exemption Request. Additionally, if clarification is needed, add Clarification Request to a comment.
Are the SageMaker changes and the EKS changes independent of each other? If so, I'd prefer them in two separate PRs. Otherwise, this looks largely OK. The integ test will need to be run, however.
   * ml.inf2.48xlarge
   */
  public static readonly INF2_48XLARGE = InstanceType.of('ml.inf2.48xlarge');
Do we have a place to unit test at least one of these in SageMaker, just for sanity?
cluster.addAutoScalingGroupCapacity('InferenceInstances', {
  instanceType: new ec2.InstanceType('inf2.xlarge'),
  minCapacity: 1,
});
You will have to run the integ test to update the snapshots. Do you have capacity to do that?
I made further changes: I duplicated the integ test, so one now uses an inf1 ASG and the other an inf2 ASG.
It is currently in a failed state, and I'm not sure whether I need to do something manually somewhere:
aws-cdk-eks-cluster-inf1-test: destroy failed Error: The stack named aws-cdk-eks-cluster-inf1-test is in a failed state. You may need to delete it from the AWS console : DELETE_FAILED (The following resource(s) failed to delete: [ClusterNodegroupDefaultCapacityNodeGroupRole55953B04, ClusterInf1InstancesInstanceRole67C931E4]. )
Does it say why it failed? The integ test should deploy and delete successfully.
Actually, I can try to run this for you. Our EKS integ tests take forever and are wonky :(
✅ Updated pull request passes all PRLinter validations. Dismissing previous PRLinter review.
Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).
fix: add INF2 support to (1) isGpuInstanceType, so that the correct AMI is selected, and (2) neuron-device-plugin-daemonset
INF2 is currently (wrongly) missing from the list of instance types that map to GPU AMIs.
This change adds it to that list.
INF2 was also not present in neuron-device-plugin-daemonset; it has been added there as well.
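
To illustrate the kind of check described above, here is a minimal, self-contained sketch of matching an instance type against a list of accelerated families. It is not the CDK's actual implementation: `GPU_FAMILIES`, `INFERENTIA_FAMILIES`, `instanceFamily`, and `needsAcceleratedAmi` are hypothetical names chosen for this example, and the family lists are deliberately incomplete.

```typescript
// Hypothetical sketch; these names and lists are illustrative assumptions,
// not the internals of aws-cdk's EKS module.
const GPU_FAMILIES: readonly string[] = ['p2', 'p3', 'g4dn'];
const INFERENTIA_FAMILIES: readonly string[] = ['inf1', 'inf2'];

// Extracts the family from an instance type string.
// 'inf2.xlarge' -> 'inf2'; 'ml.inf2.48xlarge' -> 'inf2' (SageMaker-style prefix).
function instanceFamily(instanceType: string): string {
  const [first = '', second = ''] = instanceType.toLowerCase().split('.');
  return first === 'ml' ? second : first;
}

// True when the instance type belongs to a family that, in this sketch,
// should be paired with an accelerated (GPU-style) AMI.
function needsAcceleratedAmi(instanceType: string): boolean {
  const family = instanceFamily(instanceType);
  return GPU_FAMILIES.includes(family) || INFERENTIA_FAMILIES.includes(family);
}

// True when the instance type is an Inferentia family, i.e. one for which
// a Neuron device plugin would be relevant in this sketch.
function isInferentia(instanceType: string): boolean {
  return INFERENTIA_FAMILIES.includes(instanceFamily(instanceType));
}
```

If `'inf2'` were missing from `INFERENTIA_FAMILIES`, `needsAcceleratedAmi('inf2.xlarge')` would return `false` and a standard AMI would be chosen, which is the class of bug this PR describes.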
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license