feat(eks): support INF2 instance types #27373

Merged: 7 commits merged on Oct 4, 2023

@freschri (Contributor) commented Oct 2, 2023

fix: add INF2 support to (1) `isGpuInstanceType`, so that the correct AMI is selected, and (2) the `neuron-device-plugin-daemonset`.

INF2 is currently (wrongly) missing from the list of instance types that map to GPU AMIs. This change adds it to the list.
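For illustration, the kind of check described above can be sketched as a lookup on the instance family. This is a hypothetical, standalone sketch: the names `isAcceleratedInstanceType` and `instanceFamily` and the family list are assumptions for the example, not the actual `isGpuInstanceType` implementation in `aws-eks`, which operates on `ec2.InstanceType` objects.

```typescript
// Hypothetical sketch: families assumed to need the EKS accelerated (GPU) AMI.
// The list is illustrative only; the real aws-eks list may differ.
const ACCELERATED_FAMILIES: ReadonlySet<string> = new Set([
  'p2', 'p3', 'p4d', 'g4dn', 'g5',
  'inf1',
  'inf2', // the family this PR adds
]);

/** Extracts the family from an instance type string such as "inf2.xlarge". */
function instanceFamily(instanceType: string): string {
  const dot = instanceType.indexOf('.');
  return (dot === -1 ? instanceType : instanceType.slice(0, dot)).toLowerCase();
}

/** Returns true when the instance type's family maps to the accelerated AMI. */
function isAcceleratedInstanceType(instanceType: string): boolean {
  return ACCELERATED_FAMILIES.has(instanceFamily(instanceType));
}

console.log(isAcceleratedInstanceType('inf2.xlarge')); // true
console.log(isAcceleratedInstanceType('m5.large'));    // false
```

Comparing the whole family string, rather than a fixed-length prefix, keeps `inf1` and `inf2` distinct from each other and from unrelated families, so adding a family is a one-line change to the set.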

INF2 is also missing from the `neuron-device-plugin-daemonset`; this change adds it.
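Why the DaemonSet needs the new family can be sketched as follows, assuming the manifest restricts the Neuron device plugin to Inferentia nodes through a node affinity on the well-known `node.kubernetes.io/instance-type` label with operator `In`. The `MatchExpression` interface and `matches` function are hypothetical models of that rule, not Kubernetes or CDK APIs, and the listed sizes are illustrative.

```typescript
// Hypothetical model of a Kubernetes node-affinity matchExpression ("In" only).
interface MatchExpression {
  key: string;
  operator: 'In';
  values: string[];
}

// Assumed shape of the DaemonSet's instance-type restriction (illustrative values).
const neuronAffinity: MatchExpression = {
  key: 'node.kubernetes.io/instance-type',
  operator: 'In',
  values: [
    'inf1.xlarge', 'inf1.2xlarge', 'inf1.6xlarge', 'inf1.24xlarge',
    // INF2 sizes: without these entries the plugin pod is never scheduled on INF2 nodes.
    'inf2.xlarge', 'inf2.8xlarge', 'inf2.24xlarge', 'inf2.48xlarge',
  ],
};

/** Returns true if a node carrying these labels satisfies the expression. */
function matches(expr: MatchExpression, nodeLabels: Record<string, string>): boolean {
  const value = nodeLabels[expr.key];
  return value !== undefined && expr.values.includes(value);
}

const inf2Node = { 'node.kubernetes.io/instance-type': 'inf2.xlarge' };
console.log(matches(neuronAffinity, inf2Node)); // true
```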


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@github-actions github-actions bot added the p2 label Oct 2, 2023
@aws-cdk-automation aws-cdk-automation requested a review from a team October 2, 2023 12:32
@github-actions github-actions bot added the beginning-contributor [Pilot] contributed between 0-2 PRs to the CDK label Oct 2, 2023
@aws-cdk-automation (Collaborator) left a comment

The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.

A comment requesting an exemption should contain the text Exemption Request. Additionally, if clarification is needed add Clarification Request to a comment.

@freschri freschri changed the title added INF2 support to isGpuInstanceType to correctly select AMI fix: added INF2 support to isGpuInstanceType to correctly select AMI Oct 2, 2023
@freschri freschri changed the title fix: added INF2 support to isGpuInstanceType to correctly select AMI fix: added INF2 support Oct 2, 2023
@scanlonp scanlonp added the pr/needs-community-review This PR needs a review from a Trusted Community Member or Core Team Member. label Oct 2, 2023
@freschri freschri changed the title fix: added INF2 support fix (aws-eks): INF2 support incomplete Oct 3, 2023
@freschri freschri changed the title fix (aws-eks): INF2 support incomplete fix: INF2 support incomplete Oct 3, 2023
@kaizencc kaizencc changed the title fix: INF2 support incomplete feat(sagemaker): support INF2 instance types Oct 3, 2023
@kaizencc kaizencc added the pr-linter/exempt-integ-test The PR linter will not require integ test changes label Oct 3, 2023
@kaizencc (Contributor) left a comment

Are the SageMaker changes and the EKS changes independent of each other? If so, I'd prefer them to go in two separate PRs. Otherwise, this looks largely OK. The integ test will need to be run, however.

```typescript
  /**
   * ml.inf2.48xlarge
   */
  public static readonly INF2_48XLARGE = InstanceType.of('ml.inf2.48xlarge');
```

Contributor commented:

Do we have a place to unit test at least one of these in SageMaker, just as a sanity check?
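A sanity check of the kind asked about can be very small: confirm that a new constant renders to the expected instance type string. The sketch below uses a minimal stand-in class, `InstanceTypeSketch` (hypothetical), assuming the real SageMaker `InstanceType` stores the string passed to `of()` and returns it from `toString()`. It runs without CDK; a real test would import the actual class, presumably from `@aws-cdk/aws-sagemaker-alpha`.

```typescript
// Minimal stand-in for the SageMaker InstanceType class, for illustration only.
class InstanceTypeSketch {
  /** Mirrors the constant added in the diff. */
  public static readonly INF2_48XLARGE = InstanceTypeSketch.of('ml.inf2.48xlarge');

  public static of(instanceType: string): InstanceTypeSketch {
    return new InstanceTypeSketch(instanceType);
  }

  private constructor(private readonly instanceType: string) {}

  public toString(): string {
    return this.instanceType;
  }
}

/** Throws if the constant does not render to the expected identifier. */
function checkInstanceTypeConstant(actual: InstanceTypeSketch, expected: string): void {
  if (actual.toString() !== expected) {
    throw new Error(`expected ${expected}, got ${actual.toString()}`);
  }
}

checkInstanceTypeConstant(InstanceTypeSketch.INF2_48XLARGE, 'ml.inf2.48xlarge');
```

Checking one constant is enough to catch a typo in the identifier pattern; checking every constant adds little beyond that.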

```typescript
cluster.addAutoScalingGroupCapacity('InferenceInstances', {
  instanceType: new ec2.InstanceType('inf2.xlarge'),
  minCapacity: 1,
});
```
Contributor commented:

You will have to run the integ test to update the snapshots. Do you have capacity to do that?

Contributor (Author) commented:

I made further changes: I duplicated the integ test, so one copy uses an Auto Scaling group of INF1 instances and the other INF2. It is currently in a failed state, and I'm not sure whether I need to do something manually:

```
aws-cdk-eks-cluster-inf1-test: destroy failed Error: The stack named aws-cdk-eks-cluster-inf1-test is in a failed state. You may need to delete it from the AWS console : DELETE_FAILED (The following resource(s) failed to delete: [ClusterNodegroupDefaultCapacityNodeGroupRole55953B04, ClusterInf1InstancesInstanceRole67C931E4]. )
```
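When triaging a failure like this, the logical IDs of the resources that could not be deleted can be pulled out of the error text and then looked up in the stack's events (for example, with `aws cloudformation describe-stack-events`). The helper below is hypothetical and assumes the message keeps the `The following resource(s) failed to delete: [A, B]` format shown in the log.

```typescript
/**
 * Hypothetical helper: extracts the logical IDs listed in a CloudFormation
 * "failed to delete" message. Returns [] if the pattern is not found.
 */
function failedLogicalIds(message: string): string[] {
  const match = /failed to delete: \[([^\]]*)\]/.exec(message);
  if (!match) {
    return [];
  }
  const inner = match[1] ?? '';
  return inner
    .split(',')
    .map((id) => id.trim())
    .filter((id) => id.length > 0);
}

const sampleError =
  'DELETE_FAILED (The following resource(s) failed to delete: ' +
  '[ClusterNodegroupDefaultCapacityNodeGroupRole55953B04, ClusterInf1InstancesInstanceRole67C931E4]. )';
console.log(failedLogicalIds(sampleError));
```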

Contributor commented:

Does it say why it failed? The integ test should deploy and delete successfully.

Contributor commented:

Actually, I can try to run this for you. Our EKS integ tests take forever and are wonky :(

@kaizencc kaizencc added the pr-linter/exempt-readme The PR linter will not require README changes label Oct 3, 2023
@aws-cdk-automation aws-cdk-automation dismissed their stale review October 3, 2023 20:02

✅ Updated pull request passes all PRLinter validations. Dismissing previous PRLinter review.

@kaizencc kaizencc changed the title feat(sagemaker): support INF2 instance types feat(eks): support INF2 instance types Oct 3, 2023
@mergify mergify bot dismissed kaizencc’s stale review October 4, 2023 22:02

Pull request has been modified.

@mergify bot (Contributor) commented Oct 4, 2023

Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

@aws-cdk-automation aws-cdk-automation removed the pr/needs-community-review This PR needs a review from a Trusted Community Member or Core Team Member. label Oct 4, 2023
@aws-cdk-automation (Collaborator) commented:
AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: 3eb5e43
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@mergify mergify bot merged commit bed9b8d into aws:main Oct 4, 2023
9 checks passed
Labels:
  • beginning-contributor: [Pilot] contributed between 0-2 PRs to the CDK
  • p2
  • pr-linter/exempt-integ-test: The PR linter will not require integ test changes
  • pr-linter/exempt-readme: The PR linter will not require README changes
4 participants