
[Feature] A 'feature pipeline' template #72

Open
athewsey opened this issue May 19, 2023 · 0 comments
For my use case, we're looking for a deployable project template/stack which:

  • Creates & updates a SageMaker Feature Store Feature Group through the CI/CD pipeline, based on a configuration file, and
  • Creates and runs a sample SageMaker Pipeline to transform raw data (can assume S3 source) and ingest it into the Feature Group

We started from SageMaker's built-in 'model building and training' template, which already helps a lot with the SageMaker Pipeline / processing jobs aspect. But as far as I can tell, there isn't a template that covers managing a feature group (schema, tags, feature-level metadata) via CI/CD, which is a shame.

I've put some initial thought into what a transferable sample could look like, but haven't yet had time to draft an attempt!

Design ideas

Feature Group management

  1. As of today, CloudFormation AWS::SageMaker::FeatureGroup gives a native mechanism to create/update/delete Feature Groups, but as far as I can tell it doesn't support feature-level metadata (parameters and descriptions) which are important for enterprise usage.
  2. I suspect different organizations might wish to apply stricter safety checks than the CloudFormation default for actions that would delete and replace the Feature Group (I assume CFn wouldn't automatically replicate the data in this case?)
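For reference, the native CloudFormation resource looks roughly like the fragment below. This is a minimal sketch (names and bucket are placeholders), and note there's no property here for per-feature descriptions or parameters, which is point 1 above:

```yaml
Resources:
  DemoFeatureGroup:
    Type: AWS::SageMaker::FeatureGroup
    Properties:
      FeatureGroupName: demo-features
      RecordIdentifierFeatureName: record_id
      EventTimeFeatureName: event_time
      FeatureDefinitions:
        - FeatureName: record_id
          FeatureType: String
        - FeatureName: event_time
          FeatureType: String
      OfflineStoreConfig:
        S3StorageConfig:
          S3Uri: s3://my-example-bucket/feature-store/
```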

For these two reasons, I was leaning towards defining a custom feature group+feature metadata configuration e.g. in JSON/YAML, and using custom Python code in CodeBuild to reconcile the current vs target feature group config and either perform the necessary updates or fail.
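As a sketch of what that reconciliation step might look like: the function below compares a "current" vs "target" feature group config and either returns the net-new features to apply, or fails on changes the service doesn't support. The config shape (a dict with a `FeatureDefinitions` list) mirrors the DescribeFeatureGroup response, but the function name and overall schema are assumptions for illustration, not an established design:

```python
# Hypothetical sketch of the CodeBuild reconciliation logic: diff the target
# feature group config (from the repo's JSON/YAML file) against the live one
# (e.g. from a DescribeFeatureGroup call) and decide what to do.

def plan_feature_updates(current: dict, target: dict) -> list[dict]:
    """Return the FeatureDefinitions to add, failing on unsupported changes."""
    current_defs = {f["FeatureName"]: f["FeatureType"] for f in current["FeatureDefinitions"]}
    target_defs = {f["FeatureName"]: f["FeatureType"] for f in target["FeatureDefinitions"]}

    removed = set(current_defs) - set(target_defs)
    if removed:
        # Removing features from an existing feature group isn't supported,
        # so fail the build loudly rather than silently drift
        raise ValueError(f"Cannot remove features: {sorted(removed)}")

    changed = {n for n in target_defs if n in current_defs and current_defs[n] != target_defs[n]}
    if changed:
        raise ValueError(f"Cannot change feature types: {sorted(changed)}")

    # Only net-new features are safe to apply
    return [
        {"FeatureName": name, "FeatureType": ftype}
        for name, ftype in target_defs.items()
        if name not in current_defs
    ]
```

The returned list could then be passed to the SageMaker `UpdateFeatureGroup` API (boto3's `update_feature_group`, `FeatureAdditions` parameter), with feature-level descriptions/parameters applied separately via `UpdateFeatureMetadata`.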

To avoid overloading the parameters of the Service Catalog template itself, I was thinking such a template could take only basic parameters like the feature group name, ID field name and type, etc., not creating any actual features until the initial 'seed code' config file is updated to add them (since adding features to a group is a supported operation but removing them is not).
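A seed config along these lines might look like the following. The schema here is purely illustrative (every key name is an assumption), but it shows where the feature-level descriptions and parameters that CloudFormation can't express would live:

```yaml
# Hypothetical seed config - schema is illustrative only
feature_group:
  name: customers
  record_identifier: customer_id
  event_time: event_time
features:
  - name: customer_id
    type: String
    description: Unique customer identifier
  - name: tenure_days
    type: Integral
    description: Days since first purchase
    parameters:
      owner: analytics-team
```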

Feature transformation and ingestion

In our environment we're currently using two-step SageMaker Pipelines for feature engineering: one processing job to extract and transform raw data (from S3), and a separate one to do the Feature Store ingestion. This keeps the ingestion component as easily re-usable as possible and lets us right-size the infrastructure for each step separately. We've so far experimented with both PySparkProcessor and Data Wrangler for these steps, so that general architecture was my assumed pattern for the actual feature ingestion pipeline.
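The core loop of that ingestion step can be sketched as below. `client` is anything with a boto3-style `put_record(FeatureGroupName=..., Record=...)` method, so the same logic runs against the real `sagemaker-featurestore-runtime` client or a stub in tests; the function names and row shape are assumptions for the sketch:

```python
# Illustrative sketch of the ingestion job's core loop: turn tabular rows
# into Feature Store records and write them one at a time via PutRecord.

def row_to_record(row: dict) -> list[dict]:
    # Feature Store records are FeatureName/ValueAsString pairs;
    # skip nulls, since a record may simply omit missing features
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
        if value is not None
    ]

def ingest_rows(client, feature_group_name: str, rows: list[dict]) -> int:
    """Write each row to the feature group; return the number written."""
    written = 0
    for row in rows:
        client.put_record(
            FeatureGroupName=feature_group_name,
            Record=row_to_record(row),
        )
        written += 1
    return written
```

In a real processing job the rows would come from the transformed dataset in S3, and you'd likely parallelize or batch the writes rather than looping serially.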

One or many feature groups?

I guess there's probably no harm in such a template theoretically supporting multiple pipelines and multiple feature groups, just like the model building & training template currently supports multiple pipelines... But at this stage, our planned process is pretty much to deploy a separate project template per feature group.
