Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SageMaker Studio long start-up time #352

Closed
atzev opened this issue Mar 6, 2023 · 4 comments
Closed

SageMaker Studio long start-up time #352

atzev opened this issue Mar 6, 2023 · 4 comments
Labels
priority: medium status: not-picked-yet At the moment we have not picked this item. Anyone can pick it up type: bug Something isn't working type: enhancement Feature enhacement

Comments

@atzev
Copy link

atzev commented Mar 6, 2023

Describe the bug

As a user, when I open SageMaker Studio from the data.all portal I have to wait several minutes before I can start using Studio.

How to Reproduce

*P.S. Please do not attach files as it's considered a security risk. Add code snippets directly in the message body as much as possible.*

New user:

  1. Log in to data.all with a user who has access to create SageMaker Studio profiles
  2. Go to ML Studio and create new ML Studio profile
  3. Click on launch JupyterLab
  4. Amazon SageMaker Studio creates the app
  5. Studio starts loading and displays "Loading... the loading screen is taking a long time. Would you like to clear the workspace or keep waiting?"
  6. The user has to click Keep Waiting or clear workspace for 2-3 times until Studio opens probably 10 or more minutes later

Returning user:

  1. Log in to data.all with a user who has access to SageMaker Studio profiles
  2. Go to ML Studio and click on launch JupyterLab
  3. Studio starts loading and displays "Loading... the loading screen is taking a long time. Would you like to clear the workspace or keep waiting?"
  4. The user has to click Keep Waiting or clear workspace for 2-3 times until Studio opens probably 10 or more minutes later

Expected behavior

For existing users, Studio load within seconds after the use clicks to launch Jupyter Lab icon.
For new users, Studio should launch as soon as the app is created.

Your project

No response

Screenshots

No response

OS

Mac

Python version

3.8

AWS data.all version

1.4.1

Additional context

No response

@dlpzx dlpzx added type: enhancement Feature enhacement priority: medium status: not-picked-yet At the moment we have not picked this item. Anyone can pick it up labels Mar 7, 2023
@dlpzx
Copy link
Contributor

dlpzx commented Mar 7, 2023

Hi @atzev, I can confirm this behavior. I will pass this information internally to see if we can pick up this task and improve the user experience when using ML Studio from data.all

@tasneemfazl
Copy link

Hello Team, do we have any fix or workaround for this issue?

@dlpzx dlpzx added the type: bug Something isn't working label Apr 18, 2023
@dlpzx
Copy link
Contributor

dlpzx commented Apr 18, 2023

A colleague has done some investigation and has a explanation for the issue. Basically in the CDK code, the SageMaker Domain is defined with “VPCOnly” together with the default VPC (which has only public subnets):
https://github.com/awslabs/aws-dataall/blob/main/backend/dataall/cdkproxy/stacks/environment.py#L581
SageMaker Domain doesn’t support this (VPCOnly + public subnet):
Doc link stating the requirements (no 1.) that when using VPCOnly mode the Subnets must be private
All the subnets available in the chosen VPC are used to create the Domain (might include a mix of public, private and isolated - probably we just want private and isolated and fail the deployment otherwise if the list is empty)
https://github.com/awslabs/aws-dataall/blob/main/backend/dataall/cdkproxy/stacks/environment.py#L555
In other words, it assumes that the environment account has a default VPC with Private Subnets. From my tests (I created 2 environment accounts from scratch in Isengard) the default VPC does not have private subnets by default, so you have to manually modify the VPC to create them.

A snippet of code that would work is to create the right VPC for SageMaker Studio with VPCOnly mode:

# LINE 553
            # try: # COMMENTING THIS PIECE WHERE WE DO NOT WANT TO USE A DEFAULT VPC
            #     default_vpc = ec2.Vpc.from_lookup(self, 'VPCStudio', is_default=True)
            #     vpc_id = default_vpc.vpc_id
            #     subnet_ids = [private_subnet.subnet_id for private_subnet in default_vpc.private_subnets]
            #     subnet_ids += [public_subnet.subnet_id for public_subnet in default_vpc.public_subnets]
            #     subnet_ids += [isolated_subnet.subnet_id for isolated_subnet in default_vpc.isolated_subnets]
            # except Exception as e:
            #     logger.error(f"Default VPC not found, Exception: {e}. If you don't own a default VPC, modify the networking configuration, or disable ML Studio upon environment creation.")

            # Create VPC with 3 Public Subnets and 3 Private subnets wit NAT Gateways
            log_group = logs.LogGroup(
                self,
                f'SageMakerStudio{self._environment.name}',
                log_group_name=f'/{self._environment.resourcePrefix}/{self._environment.name}/vpc/sagemakerstudio',
                retention=logs.RetentionDays.ONE_MONTH,
                removal_policy=RemovalPolicy.DESTROY,
            )
            vpc_flow_role = iam.Role(
                self, 'FlowLog',
                assumed_by=iam.ServicePrincipal('vpc-flow-logs.amazonaws.com')
            )
            vpc = ec2.Vpc(
                self,
                "VPC",
                max_azs=3,
                cidr="10.10.0.0/16",
                subnet_configuration=[
                    ec2.SubnetConfiguration(
                        subnet_type=ec2.SubnetType.PUBLIC,
                        name="Public",
                        cidr_mask=24
                    ),
                    ec2.SubnetConfiguration(
                        subnet_type=ec2.SubnetType.PRIVATE_WITH_NAT,
                        name="Private",
                        cidr_mask=24
                    ),
                ],
            )
            ec2.FlowLog(
                self, "StudioVPCFlowLog",
                resource_type=ec2.FlowLogResourceType.from_vpc(vpc),
                destination=ec2.FlowLogDestination.to_cloud_watch_logs(log_group, vpc_flow_role)
            )

            vpc_id = vpc.vpc_id
            subnet_ids = [private_subnet.subnet_id for private_subnet in vpc.private_subnets]
            sg = vpc.vpc_default_security_group
            sagemaker_domain = sagemaker.CfnDomain( # this is the same as before from here to the end

dlpzx added a commit that referenced this issue Apr 20, 2023
…for SageMaker domain (#420)

### Feature or Bugfix
- Feature
- Bugfix

### Detail
- Instead of creating the SageMaker Studio domain as a nested stack we
create it as part of the environment stack. To clearly show that the
resources created for SageMaker are part of the ML Studio functionality
they `check_existing_sagemaker_studio_domain` and
`create_sagemaker_domain_resources` are class methods of
`SageMakerDomain` placed in
`backend/dataall/cdkproxy/stacks/sagemakerstudio.py`.
- As reported in #352 data.all uses the default VPC of the account,
which does not fill the requirements for SM Studio. This results in long
start times. This PR also adds the creation of a dedicated VPC that
solves the issue of slow starts.
- It is not possible to modify the networking configuration of an
existing SageMaker Studio domain. In CloudFormation it deletes and
re-creates the domain (replacement= True), and if it has Studio users it
results in failure of the CloudFormation stack. For this reason I kept
the previous implementation using the default VPC. If a customer opts to
use a dedicated networking they need to delete the default VPC. This is
an interim solution and we will look for better ways to migrate to a
dedicated SM VPC once we get more info on how customers are using
data.all ML Studio

### Relates
- #409 
- #352 

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
dlpzx added a commit that referenced this issue Apr 20, 2023
### Feature or Bugfix
- Enhancement

### Detail
- Added method to check if default VPC exists instead of relying in CFN
stack failing. Since it uses the cdk-look-up role we do not need to add
any ec2 permissions on the pivotRole

### Relates
- #352 #409 

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
@dlpzx
Copy link
Contributor

dlpzx commented Apr 25, 2023

Fixed in #409

@dlpzx dlpzx closed this as completed Apr 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: medium status: not-picked-yet At the moment we have not picked this item. Anyone can pick it up type: bug Something isn't working type: enhancement Feature enhacement
Projects
None yet
Development

No branches or pull requests

3 participants