sagemaker: domains fail to deploy in either govcloud region. #31055

Closed
RossMeyers opened this issue Aug 7, 2024 · 9 comments
Assignees
Labels
@aws-cdk/aws-sagemaker Related to AWS SageMaker bug This issue is a bug. effort/medium Medium work item – several days of effort p2

Comments

@RossMeyers

Describe the bug

When trying to deploy a new SageMaker domain into us-gov-west-1 or us-gov-east-1, I get an Internal Failure error. I confirmed that deploying the same domain as is in us-east-1 works without issue.

Expected Behavior

The SageMaker domain is created successfully from CDK code.

Current Behavior

The domain fails to create with an Internal Failure error.

Reproduction Steps

  1. Define domain resource
      const sagemakerDomain = new sagemaker.CfnDomain(this, 'SandboxDomain1', {
        authMode: 'IAM',
        domainName: 'sandbox-domain',
        defaultUserSettings: {
          executionRole: domainRole.roleArn,
        },
        subnetIds: [privateSubnetId0, privateSubnetId1],
        vpcId: vpcId,
      });
  2. cdk synth
  3. cdk deploy

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.151.0

Framework Version

No response

Node.js Version

v22.5.

OS

macos

Language

TypeScript

Language Version

No response

Other information

No response

@RossMeyers RossMeyers added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Aug 7, 2024
@github-actions github-actions bot added the @aws-cdk/aws-sagemaker Related to AWS SageMaker label Aug 7, 2024
@ashishdhingra
Contributor

ashishdhingra commented Aug 7, 2024

@RossMeyers Good afternoon. Thanks for opening the issue. Could you please share the following:

  • Does the cdk synth command generate the CF template successfully?
  • Does the error occur after running cdk deploy? Is the error reported by CloudFormation?
  • Was the deployment working earlier, and did it start failing after upgrading to a particular CDK version? If yes, could you please share the CDK version in which it was working?
  • Is it possible to share the error log? Please try adding the -v flag when running cdk deploy to emit verbose error messages.

Thanks,
Ashish

@ashishdhingra ashishdhingra added p1 needs-reproduction This issue needs reproduction. labels Aug 7, 2024
@ashishdhingra ashishdhingra self-assigned this Aug 7, 2024
@ashishdhingra ashishdhingra removed the needs-triage This issue or PR still needs to be triaged. label Aug 7, 2024
@RossMeyers
Author

@ashishdhingra -

  1. cdk synth does successfully generate the CF template.
  2. The error is reported by CloudFormation.
  3. No, the deployment has never worked in either GovCloud region, only in us-east-1.
  4. I don't see anything new when passing -v:
SandboxSagemakerStack | 1/3 | 4:38:16 PM | CREATE_IN_PROGRESS   | AWS::SageMaker::Domain | SandboxDomain1 
SandboxSagemakerStack | 1/3 | 4:38:17 PM | CREATE_FAILED        | AWS::SageMaker::Domain | SandboxDomain1 Internal Failure
SandboxSagemakerStack | 1/3 | 4:38:17 PM | ROLLBACK_IN_PROGRESS | AWS::CloudFormation::Stack | SandboxSagemakerStack The following resource(s) failed to create: [SandboxDomain1]. Rollback requested by user.
[16:38:26] Stack SandboxSagemakerStack has an ongoing operation in progress and is not stable (ROLLBACK_IN_PROGRESS)
[16:38:31] Stack SandboxSagemakerStack has an ongoing operation in progress and is not stable (ROLLBACK_IN_PROGRESS)
[16:38:37] Stack SandboxSagemakerStack has an ongoing operation in progress and is not stable (ROLLBACK_IN_PROGRESS)
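
For longer deploy logs, the failing resource can be picked out mechanically. The following is a minimal sketch, assuming the pipe-separated event layout shown above (stack, progress, time, status, resource type, logical ID plus reason). The names `StackEvent`, `parseEventLine`, and `failedEvents` are hypothetical helpers for illustration and are not part of the CDK CLI.

```typescript
// Hypothetical helper types and functions; not part of aws-cdk.
interface StackEvent {
  stack: string;
  status: string;
  resourceType: string;
  detail: string; // logical ID followed by any status reason text
}

// Parse one pipe-separated event line; returns undefined for other lines
// (for example the "[16:38:26] Stack ... is not stable" progress messages).
function parseEventLine(line: string): StackEvent | undefined {
  const parts = line.split('|').map((p) => p.trim());
  if (parts.length < 6) return undefined;
  return {
    stack: parts[0],
    status: parts[3],
    resourceType: parts[4],
    detail: parts.slice(5).join(' | '),
  };
}

// Keep only events whose status ends in _FAILED (e.g. CREATE_FAILED).
function failedEvents(log: string): StackEvent[] {
  return log
    .split('\n')
    .map((line) => parseEventLine(line))
    .filter((e): e is StackEvent => e !== undefined && e.status.endsWith('_FAILED'));
}

// Sample input copied from the log lines above.
const sampleLog = [
  'SandboxSagemakerStack | 1/3 | 4:38:16 PM | CREATE_IN_PROGRESS   | AWS::SageMaker::Domain | SandboxDomain1 ',
  'SandboxSagemakerStack | 1/3 | 4:38:17 PM | CREATE_FAILED        | AWS::SageMaker::Domain | SandboxDomain1 Internal Failure',
  '[16:38:26] Stack SandboxSagemakerStack has an ongoing operation in progress and is not stable (ROLLBACK_IN_PROGRESS)',
].join('\n');

for (const e of failedEvents(sampleLog)) {
  console.log(`${e.status}: ${e.resourceType} (${e.detail})`);
}
```

This only filters what CloudFormation already reported; it does not surface any detail beyond the "Internal Failure" reason string.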

@ashishdhingra
Contributor

ashishdhingra commented Aug 7, 2024

@RossMeyers Thanks for your response. The same CDK stack works fine in the us-east-1 region but fails to deploy in the GovCloud regions. Since cdk synth runs fine and the Internal Failure is returned by CloudFormation, the error is most likely not on the CDK side. To troubleshoot further:

Thanks,
Ashish

@ashishdhingra ashishdhingra added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Aug 7, 2024

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

@github-actions github-actions bot added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Aug 10, 2024
@RossMeyers
Author

RossMeyers commented Aug 10, 2024

@ashishdhingra

  1. I am not seeing anything in CloudTrail that points to more detail on the error.
  2. I have attached the CFN output for the supporting network stack and sagemaker stack where I am trying to create the domain.

SandboxNetworkStack.assets.json
SandboxNetworkStack.template.json
SandboxSagemakerStack.assets.json
SandboxSagemakerStack.template.json
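
When comparing a synthesized template across regions, it can help to list exactly which properties are sent for the domain resource. The following is a minimal sketch, assuming a template object shaped like a CloudFormation template. `resourcesOfType` and `exampleTemplate` are hypothetical names, and the fragment below is illustrative only; it is not the content of the attached files.

```typescript
// Minimal shapes for the parts of a CloudFormation template used here.
interface CfnResourceEntry {
  Type: string;
  Properties?: Record<string, unknown>;
}
interface CfnTemplate {
  Resources?: Record<string, CfnResourceEntry>;
}

// Hypothetical helper: return [logicalId, resource] pairs of a given type.
function resourcesOfType(
  template: CfnTemplate,
  type: string,
): Array<[string, CfnResourceEntry]> {
  return Object.entries(template.Resources ?? {}).filter(([, r]) => r.Type === type);
}

// Illustrative fragment modelled on the CfnDomain definition in this issue.
// In practice this would be loaded from cdk.out/SandboxSagemakerStack.template.json.
const exampleTemplate: CfnTemplate = {
  Resources: {
    SandboxDomain1: {
      Type: 'AWS::SageMaker::Domain',
      Properties: {
        AuthMode: 'IAM',
        DomainName: 'sandbox-domain',
        SubnetIds: [
          { 'Fn::ImportValue': 'SandboxNetworkStack-PrivateSubnetId0' },
          { 'Fn::ImportValue': 'SandboxNetworkStack-PrivateSubnetId1' },
        ],
        VpcId: { 'Fn::ImportValue': 'SandboxNetworkStack-VpcId' },
      },
    },
  },
};

for (const [logicalId, resource] of resourcesOfType(exampleTemplate, 'AWS::SageMaker::Domain')) {
  console.log(logicalId, Object.keys(resource.Properties ?? {}));
}
```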

below is the raw cdk code:

network.ts

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

export class SandboxNetworkStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    //Network Configuration
    const vpc = new ec2.Vpc(this, 'SandboxVpc1',{
      ipAddresses: ec2.IpAddresses.cidr('10.0.0.0/16'),
      subnetConfiguration: [
        {
          cidrMask: 24,
          name: 'public',
          subnetType: ec2.SubnetType.PUBLIC,
        },
        {
          cidrMask: 24,
          name: 'private',
          subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
        },
      ],
      restrictDefaultSecurityGroup: false
    })

    // Security Groups
    const sagemakerSecurityGroup =  new ec2.SecurityGroup(this, 'SagemakerSecurityGroup', {
      securityGroupName: "SagemakerSecurityGroup",
      vpc: vpc,
      allowAllOutbound: false,
    });
    sagemakerSecurityGroup.addEgressRule(ec2.Peer.anyIpv4(), ec2.Port.tcpRange(1000,65535))
    new ec2.CfnSecurityGroupIngress(this, 'SagemakerSecurityGroupIngress', {
      groupId: sagemakerSecurityGroup.securityGroupId,
      sourceSecurityGroupId: sagemakerSecurityGroup.securityGroupId,
      ipProtocol: '-1'
    })

    // //// Exports ////

    // VPC 
    new cdk.CfnOutput(this, 'VpcId', {
      value: vpc.vpcId,
      exportName: `${this.stackName}-VpcId`,
    });

    // VPC Private Subnet IDs 
    const privateSubnets = vpc.privateSubnets;
    privateSubnets.forEach((subnet, index) => {
      new cdk.CfnOutput(this, `PrivateSubnetId${index}`, {
        value: subnet.subnetId,
        exportName: `${this.stackName}-PrivateSubnetId${index}`,
      });
    });

    // Sagemaker Security Group
    new cdk.CfnOutput(this, 'SagemakerSecurityGroupOutput', {
      value: sagemakerSecurityGroup.securityGroupId,
      exportName: `${this.stackName}-SagemakerSecurityGroup`,
    });
  }
}

sagemaker.ts

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as sagemaker from 'aws-cdk-lib/aws-sagemaker';
import * as iam from 'aws-cdk-lib/aws-iam';

export class SandboxSagemakerStack extends cdk.Stack {
    constructor(scope: Construct, id: string, props?: cdk.StackProps) {
      super(scope, id, props);

      const networkStackName = 'SandboxNetworkStack'

      const domainRole = new iam.Role(this, 'SandboxDomainRole1', {
        assumedBy: new iam.ServicePrincipal('sagemaker.amazonaws.com'),
        roleName: 'sandbox-domain-role',
        managedPolicies: [iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonSageMakerFullAccess')]
      });

      const privateSubnetId0 = cdk.Fn.importValue(`${networkStackName}-PrivateSubnetId0`);
      const privateSubnetId1 = cdk.Fn.importValue(`${networkStackName}-PrivateSubnetId1`);
      const vpcId = cdk.Fn.importValue(`${networkStackName}-VpcId`);


      const sagemakerDomain = new sagemaker.CfnDomain(this, 'SandboxDomain1', {
        authMode: 'IAM',
        domainName: 'sandbox-domain',
        defaultUserSettings: {
          executionRole: domainRole.roleArn,
        },
        subnetIds: [privateSubnetId0, privateSubnetId1],
        vpcId: vpcId,
      });
    }
}

@github-actions github-actions bot removed closing-soon This issue will automatically close in 4 days unless further comments are made. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Aug 10, 2024
@ashishdhingra
Contributor

Created internal ticket with CloudFormation team: P148328805

@ashishdhingra ashishdhingra added p2 effort/medium Medium work item – several days of effort and removed p1 needs-reproduction This issue needs reproduction. labels Aug 22, 2024
@ashishdhingra
Contributor

Ticket to Sagemaker team: V1490394553

@ashishdhingra
Contributor

Closing this issue since assistance is awaited from Sagemaker team.

@ashishdhingra ashishdhingra closed this as not planned Won't fix, can't repro, duplicate, stale Oct 9, 2024

github-actions bot commented Oct 9, 2024

Comments on closed issues and PRs are hard for our team to see.
If you need help, please open a new issue that references this one.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 9, 2024