
[Feature] A 'feature pipeline' template #72

Open
athewsey opened this issue May 19, 2023 · 0 comments
For my use case, we're looking for a deployable project template/stack which:

  • Creates & updates a SageMaker Feature Store Feature Group through the CI/CD pipeline, based on a configuration file, and
  • Creates and runs a sample SageMaker Pipeline to transform raw data (can assume S3 source) and ingest it into the Feature Group

We started from SageMaker's built-in 'model building and training' template, which already helps a lot with the SageMaker Pipeline / processing jobs aspect. But as far as I can tell, there isn't a template that covers managing a feature group (schema, tags, feature-level metadata) via CI/CD, which is a shame.

I've put some initial thought into what a transferable sample could look like, but haven't yet had time to draft an attempt!

Design ideas

Feature Group management

  1. As of today, CloudFormation AWS::SageMaker::FeatureGroup gives a native mechanism to create/update/delete Feature Groups, but as far as I can tell it doesn't support feature-level metadata (parameters and descriptions) which are important for enterprise usage.
  2. I suspect different organizations might wish to apply stricter safety checks than the CloudFormation default for actions that would delete and replace the Feature Group (I assume CFn wouldn't automatically replicate the data in this case?)
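For reference, the native CloudFormation resource looks roughly like the fragment below. This is a minimal sketch (names and bucket are placeholders), and note there's no property here for per-feature descriptions or parameters, which is point 1 above:

```yaml
Resources:
  DemoFeatureGroup:
    Type: AWS::SageMaker::FeatureGroup
    Properties:
      FeatureGroupName: demo-features
      RecordIdentifierFeatureName: record_id
      EventTimeFeatureName: event_time
      FeatureDefinitions:
        - FeatureName: record_id
          FeatureType: String
        - FeatureName: event_time
          FeatureType: String
      OfflineStoreConfig:
        S3StorageConfig:
          S3Uri: s3://my-example-bucket/feature-store/
```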

For these two reasons, I was leaning towards defining a custom feature group+feature metadata configuration e.g. in JSON/YAML, and using custom Python code in CodeBuild to reconcile the current vs target feature group config and either perform the necessary updates or fail.
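As a sketch of what that reconciliation step might look like: the function below compares a "current" vs "target" feature group config and either returns the net-new features to apply, or fails on changes the service doesn't support. The config shape (a dict with a `FeatureDefinitions` list) mirrors the DescribeFeatureGroup response, but the function name and overall schema are assumptions for illustration, not an established design:

```python
# Hypothetical sketch of the CodeBuild reconciliation logic: diff the target
# feature group config (from the repo's JSON/YAML file) against the live one
# (e.g. from a DescribeFeatureGroup call) and decide what to do.

def plan_feature_updates(current: dict, target: dict) -> list[dict]:
    """Return the FeatureDefinitions to add, failing on unsupported changes."""
    current_defs = {f["FeatureName"]: f["FeatureType"] for f in current["FeatureDefinitions"]}
    target_defs = {f["FeatureName"]: f["FeatureType"] for f in target["FeatureDefinitions"]}

    removed = set(current_defs) - set(target_defs)
    if removed:
        # Removing features from an existing feature group isn't supported,
        # so fail the build loudly rather than silently drift
        raise ValueError(f"Cannot remove features: {sorted(removed)}")

    changed = {n for n in target_defs if n in current_defs and current_defs[n] != target_defs[n]}
    if changed:
        raise ValueError(f"Cannot change feature types: {sorted(changed)}")

    # Only net-new features are safe to apply
    return [
        {"FeatureName": name, "FeatureType": ftype}
        for name, ftype in target_defs.items()
        if name not in current_defs
    ]
```

The returned list could then be passed to the SageMaker `UpdateFeatureGroup` API (boto3's `update_feature_group`, `FeatureAdditions` parameter), with feature-level descriptions/parameters applied separately via `UpdateFeatureMetadata`.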

To avoid overloading the parameters of the Service Catalog template itself, I was thinking such a template could take only basic parameters like the feature group name, ID field name and type, etc., not creating any actual features until the initial 'seed code' config file is updated to add them (since adding features to a group is a supported operation but removing them is not).
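A seed config along these lines might look like the following. The schema here is purely illustrative (every key name is an assumption), but it shows where the feature-level descriptions and parameters that CloudFormation can't express would live:

```yaml
# Hypothetical seed config - schema is illustrative only
feature_group:
  name: customers
  record_identifier: customer_id
  event_time: event_time
features:
  - name: customer_id
    type: String
    description: Unique customer identifier
  - name: tenure_days
    type: Integral
    description: Days since first purchase
    parameters:
      owner: analytics-team
```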

Feature transformation and ingestion

In our environment we're currently using two-step SageMaker Pipelines for feature engineering: one processing job to extract and transform raw data (from S3), and a separate one to do the Feature Store ingestion. This keeps the ingestion component as easily re-usable as possible and lets us right-size the infrastructure for each step separately. We've so far experimented with both PySparkProcessor and Data Wrangler for these steps, so that general architecture was my assumed pattern for the actual feature ingestion pipeline.
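The core loop of that ingestion step can be sketched as below. `client` is anything with a boto3-style `put_record(FeatureGroupName=..., Record=...)` method, so the same logic runs against the real `sagemaker-featurestore-runtime` client or a stub in tests; the function names and row shape are assumptions for the sketch:

```python
# Illustrative sketch of the ingestion job's core loop: turn tabular rows
# into Feature Store records and write them one at a time via PutRecord.

def row_to_record(row: dict) -> list[dict]:
    # Feature Store records are FeatureName/ValueAsString pairs;
    # skip nulls, since a record may simply omit missing features
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
        if value is not None
    ]

def ingest_rows(client, feature_group_name: str, rows: list[dict]) -> int:
    """Write each row to the feature group; return the number written."""
    written = 0
    for row in rows:
        client.put_record(
            FeatureGroupName=feature_group_name,
            Record=row_to_record(row),
        )
        written += 1
    return written
```

In a real processing job the rows would come from the transformed dataset in S3, and you'd likely parallelize or batch the writes rather than looping serially.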

One or many feature groups?

I guess there's probably no harm in such a template theoretically supporting multiple pipelines and multiple feature groups, just like the model building & training template currently supports multiple pipelines... But at this stage, our planned process is pretty much to deploy a separate project template per feature group.
