A dbt package for standardising data sets. You can use it to build a feature store in your data warehouse, without external libraries such as Spark's MLlib or Python's scikit-learn.
The package contains a set of macros that mirror the functionality of the scikit-learn preprocessing module. They were originally developed for the 2019 Medium article *Feature Engineering in Snowflake*.
The macros have been tested on Snowflake, Redshift and BigQuery. The test-case expectations were generated with scikit-learn (see the `*.py` files in `integration_tests/data/sql`), so you can expect behavioural parity with it.
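For example, min-max scaling can be expressed directly in warehouse SQL using window functions. This hand-written sketch (not the macro's actual generated output) shows the kind of SQL the `min_max_scaler` macro saves you from writing; the table and column names are placeholders:

```sql
select
    my_column,
    -- scale to [0, 1]: (x - min) / (max - min), guarding against a zero range
    (my_column - min(my_column) over ())
        / nullif(max(my_column) over () - min(my_column) over (), 0) as my_column_scaled
from my_source_table
```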
The macros are:
scikit-learn function | macro name | Snowflake | BigQuery | Redshift |
---|---|---|---|---|
KBinsDiscretizer | k_bins_discretizer | Y | Y | Y |
LabelEncoder | label_encoder | Y | Y | Y |
MaxAbsScaler | max_abs_scaler | Y | Y | Y |
MinMaxScaler | min_max_scaler | Y | Y | Y |
Normalizer | normalizer | Y | Y | Y |
OneHotEncoder | one_hot_encoder | Y | Y | Y |
QuantileTransformer | quantile_transformer | Y | Y | N |
RobustScaler | robust_scaler | Y | Y | Y |
StandardScaler | standard_scaler | Y | Y | Y |
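Once installed, each macro is called from within a dbt model and renders the preprocessing SQL for you. A minimal sketch, assuming the macro accepts a source relation and a column name (the table and column names here are placeholders — check the generated docs for each macro's exact signature):

```sql
{{ dbt_ml_preprocessing.one_hot_encoder( ref('my_source_table'), 'my_category_column' ) }}
```

The model that contains this call can then be materialised and queried like any other dbt model.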
To use this in your dbt project, create or modify `packages.yml` to include:

```yaml
packages:
  - package: "omnata-labs/dbt_ml_preprocessing"
    version: [">=1.0.0"]
```

(replace the version number with the latest release)

Then run:

```shell
dbt deps
```

to import the package.
To read the macro documentation and see examples, generate your dbt docs; the macro documentation will appear in the Projects tree under `dbt_ml_preprocessing`.