# dbt-ml-preprocessing

A dbt package for standardizing data sets. You can use it to build a feature store in your data warehouse, without using external processing libraries like Spark's MLlib or Python's scikit-learn.

The package contains a set of macros that mirror the functionality of the scikit-learn preprocessing module. They were originally developed as part of the 2019 Medium article *Feature Engineering in Snowflake*.

The macros have currently been tested on Snowflake, Redshift and BigQuery. The test case expectations were generated with scikit-learn (see the *.py files in integration_tests/data/sql), so you can expect behavioural parity with it.

The macros are:

| scikit-learn function | macro name | Snowflake | BigQuery | Redshift | Example |
| --- | --- | --- | --- | --- | --- |
| KBinsDiscretizer | `k_bins_discretizer` | Y | Y | Y | example |
| LabelEncoder | `label_encoder` | Y | Y | Y | example |
| MaxAbsScaler | `max_abs_scaler` | Y | Y | Y | example |
| MinMaxScaler | `min_max_scaler` | Y | Y | Y | example |
| Normalizer | `normalizer` | Y | Y | Y | example |
| OneHotEncoder | `one_hot_encoder` | Y | Y | Y | example |
| QuantileTransformer | `quantile_transformer` | Y | Y | N | example |
| RobustScaler | `robust_scaler` | Y | Y | Y | example |
| StandardScaler | `standard_scaler` | Y | Y | Y | example |

\* 2D charts in the example pages are taken from scikit-learn.org; the GIFs are my own.
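
To give a sense of how these are used: each macro is called inside a dbt model and compiles to warehouse-native SQL. The sketch below is illustrative only; `customers` and `favourite_colour` are placeholder names, and the exact arguments each macro accepts are documented in its example page and in the generated docs.

```sql
-- models/customers_preprocessed.sql
-- Illustrative sketch: `customers` and `favourite_colour` are placeholder names;
-- check the macro's example page for its actual argument list.
select
    customer_id,
    {{ dbt_ml_preprocessing.one_hot_encoder( ref('customers'), 'favourite_colour' ) }}
from {{ ref('customers') }}
```

At `dbt run` time the macro expands into plain SQL against the source table, producing output roughly equivalent to the corresponding scikit-learn transform.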

## Installation

To use this in your dbt project, create or modify `packages.yml` to include:

```yaml
packages:
  - package: "omnata-labs/dbt_ml_preprocessing"
    version: [">=1.0.0"]
```

(replace the version number with the latest release)

Then run `dbt deps` to import the package.

## Usage

To read the macro documentation and see examples, simply generate your docs; the macro documentation appears in the Projects tree under `dbt_ml_preprocessing`:

*(screenshot of the generated dbt docs site, showing the dbt_ml_preprocessing macros)*
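
The docs are produced with dbt's standard documentation commands; a typical sequence:

```shell
# Build the documentation artifacts, then serve the site locally to browse
# the dbt_ml_preprocessing macros under the Projects tree.
dbt docs generate
dbt docs serve
```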