
Fix redeploying SageMaker Model Monitoring module. #168

Merged
merged 5 commits into awslabs:main from sagemaker-monitoring-redeploy
Jul 5, 2024

Conversation

clokep
Collaborator

@clokep clokep commented Jul 3, 2024

Describe your changes

Fixes redeploying SageMaker Model Monitoring. Previously, if you modified a parameter, CloudFormation would error because it attempted to replace resources (via a create, then a delete) using the same name. This adds a timestamp to the names so that they'll be unique. Ideally we would let CloudFormation generate the names, but this does not seem to work -- CloudFormation does not properly wait for the job definition to be created before attempting to access the name and propagate it to the monitoring schedule.
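For illustration only, a minimal sketch of the timestamp-based naming described above, assuming the names are built in Python (the helper and resource names here are hypothetical, not the module's actual identifiers):

```python
import time


def unique_name(base: str) -> str:
    """Suffix a base name with a deploy-time timestamp (hypothetical helper).

    CloudFormation can then create the replacement resource before deleting
    the old one, since the names never clash. SageMaker resource names are
    limited to 63 characters, so the base is truncated to make room.
    """
    suffix = str(int(time.time()))
    return f"{base[: 63 - len(suffix) - 1]}-{suffix}"


job_definition_name = unique_name("model-quality-job-definition")
monitoring_schedule_name = unique_name("model-quality-monitoring-schedule")
```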

The downside of this approach is that the monitoring stack will essentially be rebuilt on every update, even if no changes are made. This is doubly bad since updates will fail if any of the monitoring jobs are running. A potential solution could be to use a hash of the input parameters instead of a timestamp.

Issue ticket number and link

Checklist before requesting a review

  • I updated CHANGELOG.MD with a description of my changes - not updated since the SageMaker Monitoring module hasn't yet been in a release
  • If the change was to a module, I ran the code validation script (scripts/validate.sh)
  • If the change was to a module, I have added thorough tests
  • If the change was to a module, I have added/updated the module's README.md
  • If a module was added, I added a reference to the module to the repository's README.md
  • I verified that my code deploys successfully using seedfarmer apply

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: AIOpsModuleTestsInfrastruct-doMLXEYsnmxr
  • Commit ID: 22fdfdb
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: AIOpsModuleTestsInfrastruct-doMLXEYsnmxr
  • Commit ID: 5d792a2
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@clokep
Collaborator Author

clokep commented Jul 3, 2024

The downside of this approach is that the monitoring stack will essentially be rebuilt on every update, even if no changes are made. This is doubly bad since updates will fail if any of the monitoring jobs are running. A potential solution could be to use a hash of the input parameters instead of a timestamp.

This was easy enough to fix by using a hash of the input parameters. (Note that the hash doesn't need to be cryptographically secure since it isn't used in a security context.) There's a risk of a collision, especially since the hash gets truncated due to the 63-character limit on resource names. Unsure whether this is better or not. Seems like the options are:

  1. Always redeploy by generating a new job definition name each time seedfarmer is run. This risks a failed deploy if the monitoring jobs are running during the deploy.
  2. Deploy only when input parameters change. This risks a failed deploy if different input parameters happen to produce the same (truncated) hash.

Is there another option? Any opinions on which option is better? Option 1 is safer; option 2 will result in less churn.
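For reference, a minimal sketch of option 2 under the same assumptions (Python, hypothetical helper and parameter names); the hash only needs to be deterministic, not cryptographically secure, and the digest is truncated so the full name stays within the 63-character limit:

```python
import hashlib
import json


def hashed_name(base: str, parameters: dict) -> str:
    """Derive a short, deterministic suffix from the input parameters (hypothetical helper).

    The name only changes when the parameters change, so the job definition is
    replaced only on real changes; truncating the digest keeps the name short
    at the cost of a small collision risk.
    """
    digest = hashlib.sha256(
        json.dumps(parameters, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return f"{base}-{digest[:8]}"[:63]


job_definition_name = hashed_name(
    "model-quality-job-definition",
    {"instance_type": "ml.m5.xlarge", "schedule": "cron(0 * ? * * *)"},
)
```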

@clokep clokep marked this pull request as ready for review July 3, 2024 15:41
@clokep clokep requested a review from kukushking July 3, 2024 15:41
@kukushking
Contributor

Thanks @clokep, I think option 2 works. I would not sweat too much about the risk in approach 1 if we instruct the user in the README to stop the schedule if there is a pending deployment, but I agree option 2 is less churn. We can always review and update this based on user feedback.

@clokep
Collaborator Author

clokep commented Jul 3, 2024

Good call, we should document that failure mode either way. I think it could still fail even with option 2; I'll need to trigger that on purpose to confirm and then document it.

kukushking
kukushking previously approved these changes Jul 4, 2024
Contributor

@kukushking kukushking left a comment


Please update CHANGELOG, README & ship it :)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: AIOpsModuleTestsInfrastruct-doMLXEYsnmxr
  • Commit ID: 9f31e56
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@clokep
Collaborator Author

clokep commented Jul 5, 2024

Please update CHANGELOG, README & ship it :)

We chatted a bit about this offline and will leave this out of the changelog, since the unreleased changes already note that this module was just added.

@clokep clokep requested a review from kukushking July 5, 2024 12:22
@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: AIOpsModuleTestsInfrastruct-doMLXEYsnmxr
  • Commit ID: 19428e7
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

Contributor

@kukushking kukushking left a comment


Thank you!

@clokep clokep merged commit 8d0c0ad into awslabs:main Jul 5, 2024
20 checks passed
@clokep clokep deleted the sagemaker-monitoring-redeploy branch July 5, 2024 12:46