# dbt-ml-preprocessing

A dbt package for standardizing data sets. You can use it to build a feature store in your data warehouse, without using external processing libraries like Spark's MLlib or Python's scikit-learn.

The package contains a set of macros that mirror the functionality of the scikit-learn preprocessing module. They were originally developed as part of the 2019 Medium article *Feature Engineering in Snowflake*.

The macros have currently been tested on Snowflake, Redshift and BigQuery. The test case expectations were generated with scikit-learn (see the *.py files in integration_tests/data/sql), so you can expect behavioural parity with it.

The macros are:

| scikit-learn function | macro name | Snowflake | BigQuery | Redshift | Example |
| --- | --- | --- | --- | --- | --- |
| KBinsDiscretizer | `k_bins_discretizer` | Y | Y | Y | example |
| LabelEncoder | `label_encoder` | Y | Y | Y | example |
| MaxAbsScaler | `max_abs_scaler` | Y | Y | Y | example |
| MinMaxScaler | `min_max_scaler` | Y | Y | Y | example |
| Normalizer | `normalizer` | Y | Y | Y | example |
| OneHotEncoder | `one_hot_encoder` | Y | Y | Y | example |
| QuantileTransformer | `quantile_transformer` | Y | Y | N | example |
| RobustScaler | `robust_scaler` | Y | Y | Y | example |
| StandardScaler | `standard_scaler` | Y | Y | Y | example |

\* 2D charts in the example pages are taken from scikit-learn.org; the GIFs are my own.
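
To give a sense of how these are used: each macro is called inside a dbt model and compiles to warehouse-native SQL. The sketch below is illustrative only; `customers` and `favourite_colour` are placeholder names, and the exact arguments each macro accepts are documented in its example page and in the generated docs.

```sql
-- models/customers_preprocessed.sql
-- Illustrative sketch: `customers` and `favourite_colour` are placeholder names;
-- check the macro's example page for its actual argument list.
select
    customer_id,
    {{ dbt_ml_preprocessing.one_hot_encoder( ref('customers'), 'favourite_colour' ) }}
from {{ ref('customers') }}
```

At `dbt run` time the macro expands into plain SQL against the source table, producing output roughly equivalent to the corresponding scikit-learn transform.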

## Installation

To use this in your dbt project, create or modify `packages.yml` to include:

```yaml
packages:
  - package: "omnata-labs/dbt_ml_preprocessing"
    version: [">=1.0.0"]
```

(replace the version number with the latest release)

Then run `dbt deps` to import the package.

## Usage

To read the macro documentation and see examples, simply generate your docs; the macro documentation appears in the Projects tree under `dbt_ml_preprocessing`:

*(screenshot of the generated dbt docs site, showing the dbt_ml_preprocessing macros)*
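
The docs are produced with dbt's standard documentation commands; a typical sequence:

```shell
# Build the documentation artifacts, then serve the site locally to browse
# the dbt_ml_preprocessing macros under the Projects tree.
dbt docs generate
dbt docs serve
```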