How do I convert a Jupyter notebook into a Kedro project? #2461

Closed
yetudada opened this issue Mar 23, 2023 · 6 comments
Assignees
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@yetudada
Contributor

Description

This issue actions part of #410 and describes a workflow where users learn how to convert a Jupyter notebook into a Kedro project. The scope of this work includes:

  • Creating the guide or tutorial
  • Deciding where to publish it

Context

Users often ask this question when they have existing notebooks and want to use Kedro as part of their refactoring cycle.

We have seen interest in this from the views on these videos:

@yetudada yetudada added the Issue: Feature Request New feature or improvement to existing feature label Mar 23, 2023
@yetudada yetudada moved this to In Progress in Kedro Framework Mar 23, 2023
@noklam
Contributor

noklam commented Mar 23, 2023

I like this! I have done it so many times :)

@NeroOkwa NeroOkwa moved this from In Progress to In Review in Kedro Framework Apr 3, 2023
@amandakys

amandakys commented Apr 4, 2023

I've dumped the content here just to see what it looks like formatted in GitHub. If you'd like to give feedback, I've created a box note to make it easier :) https://mckinsey.box.com/s/xvzwguj8hy37436xdgtold0me5kaiya3 and the link to the companion Jupyter notebook that this tutorial refers to can be found here: https://mckinsey.box.com/s/86r7op40jx9i3oxy89unvpjyqsmld7bm


Have you just finished doing some modelling in a Jupyter notebook? Are you interested in how it might be converted into a Kedro project?

Being new to Kedro, I started my learning with the Kedro Spaceflights tutorial. However, it wasn’t immediately clear how the different steps in the data science process I was familiar with mapped to Kedro.

Hopefully, this step-by-step walkthrough helps. It starts with a Jupyter notebook of the Spaceflights project and walks through how we might convert it to a Kedro project, all while following the flow of a typical data science project:

  1. Import Data
  2. Process Data
  3. Train Model
  4. Evaluate Model

Kedro Setup

This tutorial assumes you have Python 3.7+ and conda installed on your machine. It also assumes basic knowledge of the terminal/command line and Python virtual environments. We use conda, but you are free to use your preferred alternative.

Set up a virtual environment for your project

It is recommended to create a new virtual environment for each new Kedro project to isolate its dependencies from those of other projects.

To create a new environment called kedro-spaceflights:

conda create --name kedro-spaceflights python=3.10 -y

You can replace 3.10 with the specific Python version you have installed. Check your Python version with: python3 --version (macOS and Linux) or python --version (Windows).

To activate the environment:

conda activate kedro-spaceflights

Install Kedro

pip install kedro

Create new Kedro project

kedro new

When prompted for a project name, enter Spaceflights. This should create a directory called spaceflights, which is your project’s root directory. The contents of the project should look like this (the files we’ll be editing or creating in this guide live under conf/ and src/spaceflights/pipelines/):

spaceflights        # Parent directory of the template
├── .gitignore      # Hidden file that prevents staging of unnecessary files to `git`
├── conf            # Project configuration files
│   ├── local
│   └── base
│       ├── catalog.yml
│       ├── logging.yml
│       └── parameters.yml
├── data            # Local project data (not committed to version control)
├── docs            # Project documentation
├── logs            # Project output logs (not committed to version control)
├── notebooks       # Project-related Jupyter notebooks (can be used for experimental code before moving the code to src)
├── pyproject.toml  # Identifies the project root and contains configuration information
├── README.md       # Project README
├── setup.cfg       # Configuration options for `pytest` when doing `kedro test` and for the `isort` utility when doing `kedro lint`
└── src             # Project source code
    ├── tests
    ├── setup.py
    ├── requirements.txt
    └── spaceflights
        └── pipelines
            ├── data_processing
            │   ├── nodes.py
            │   └── pipeline.py
            └── data_science
                ├── nodes.py
                └── pipeline.py

Now you are ready to start converting your Jupyter notebook into a brand new Kedro project!


1. Import Data

In our Jupyter notebook, we begin by reading in our data sources.

import pandas as pd

companies = pd.read_csv('data/companies.csv')
reviews = pd.read_csv('data/reviews.csv')
shuttles = pd.read_excel('data/shuttles.xlsx', engine='openpyxl')

In Kedro, we store our raw data files in data/01_raw. Go ahead and copy companies.csv, reviews.csv and shuttles.xlsx into that directory.

Registering our Data

For our Kedro project to be able to access and use that data, we then have to ‘register our datasets’. It’s a little more complicated than calling pd.read_csv, but not by much.

In conf/base/catalog.yml, we simply add:

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv

reviews:
  type: pandas.CSVDataSet
  filepath: data/01_raw/reviews.csv

shuttles:
  type: pandas.ExcelDataSet
  filepath: data/01_raw/shuttles.xlsx
  load_args:
    engine: openpyxl # Use modern Excel engine (the default since Kedro 0.18.0)
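
Before writing any pipeline code, you can optionally check that these entries are wired up correctly by starting a Kedro-aware interactive session (for example with kedro ipython or kedro jupyter notebook), which injects a ready-made catalog object. A minimal sketch, assuming the three files are already in data/01_raw:

# Run inside a `kedro ipython` session started from the project root.
# `catalog` is provided by Kedro and resolves names against conf/base/catalog.yml.
companies = catalog.load("companies")
shuttles = catalog.load("shuttles")

print(companies.head())
print(shuttles.dtypes)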

2. Process Data

Now that our data is loaded, we can start processing it to prepare for model building.

In our Jupyter notebook, we have 3 cells, fixing up different columns in each dataset. Then we merge them into one big table suitable for input into a model.

# process data from companies.csv
companies["iata_approved"] = companies["iata_approved"] == "t"
companies["company_rating"] = companies["company_rating"].str.replace("%", "")
companies["company_rating"] = companies["company_rating"].astype(float) / 100
# process data from shuttles.xlsx
shuttles["d_check_complete"] = shuttles["d_check_complete"] == "t" 
shuttles["moon_clearance_complete"] = shuttles["moon_clearance_complete"] == "t"
shuttles["price"] = shuttles["price"].str.replace("$", "").str.replace(",", "")
shuttles["price"] = shuttles["price"].astype(float)
# join three datasets into single model input table
rated_shuttles = shuttles.merge(reviews, left_on="id", right_on="shuttle_id")
model_input_table = rated_shuttles.merge(
    companies, left_on="company_id", right_on="id"
)
model_input_table = model_input_table.dropna()
model_input_table.head()

Creating Nodes

In Kedro, each of these cells would become a function, which is then encapsulated by a node to be placed in a pipeline. A function, given an input (in this case, a dataset), performs a set of actions to generate some output (in this case, a cleaned dataset). Its behaviour is consistent, repeatable and predictable: putting the same dataset into a node will always return the same cleaned dataset.

In src/spaceflights/pipelines/data_processing, create a file called nodes.py. Here we define two functions, preprocess_companies and preprocess_shuttles. They take in a pandas DataFrame and perform the same transformations as the notebook cells to return a cleaned DataFrame. Don’t forget to include the relevant imports at the top of the file.

import pandas as pd

def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    companies["iata_approved"] = companies["iata_approved"] == "t"
    companies["company_rating"] = companies["company_rating"].str.replace("%", "")
    companies["company_rating"] = companies["company_rating"].astype(float) / 100
    return companies

def preprocess_shuttles(shuttles: pd.DataFrame) -> pd.DataFrame:
    shuttles["d_check_complete"] = shuttles["d_check_complete"] == "t"
    shuttles["moon_clearance_complete"] = shuttles["moon_clearance_complete"] == "t"
    shuttles["price"] = shuttles["price"].str.replace("$", "").str.replace(",", "")
    shuttles["price"] = shuttles["price"].astype(float)
    return shuttles

As you can see, the contents of these functions can be copied directly from the relevant cells in our Jupyter notebook, but a few minor changes will improve code quality.

We extract some utility functions to reduce code duplication (these can be pasted into the top of nodes.py):

def _is_true(x: pd.Series) -> pd.Series:
    return x == "t"

def _parse_percentage(x: pd.Series) -> pd.Series:
    x = x.str.replace("%", "")
    x = x.astype(float) / 100
    return x

def _parse_money(x: pd.Series) -> pd.Series:
    x = x.str.replace("$", "").str.replace(",", "")
    x = x.astype(float)
    return x

Then we adjust our preprocessing functions to use these utility functions.

def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:

    companies["iata_approved"] = _is_true(companies["iata_approved"])
    companies["company_rating"] = _parse_percentage(companies["company_rating"])
    return companies

def preprocess_shuttles(shuttles: pd.DataFrame) -> pd.DataFrame:

    shuttles["d_check_complete"] = _is_true(shuttles["d_check_complete"])
    shuttles["moon_clearance_complete"] = _is_true(shuttles["moon_clearance_complete"])
    shuttles["price"] = _parse_money(shuttles["price"])
    return shuttles

💡 **Atomicity of functions**: Functions should have an atomic design, meaning they should do only one thing, which should be described in the function name. The scope of this ‘one thing’ is difficult to define; it’s reasonable for it to take multiple steps, but the broader the scope, the longer and more complex the function becomes. Should a particular step be performed repeatedly, or by several other functions, it might be worth considering separating that step out into its own function.
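
As an optional sanity check (not part of the tutorial files), you can exercise the refactored functions on a tiny hand-made DataFrame in a notebook or Python shell; a minimal sketch with made-up values:

import pandas as pd

# Two made-up rows mimicking the raw companies.csv columns we transform
sample = pd.DataFrame({"iata_approved": ["t", "f"], "company_rating": ["90%", "35%"]})

print(preprocess_companies(sample))
# iata_approved becomes booleans (True/False); company_rating becomes floats (0.9/0.35)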

Assembling the Pipeline

Now that we have moved our data pre-processing functions into the Kedro framework, how do we tell it the order in which to execute them? In a Jupyter notebook, two functions in the same cell would execute one after the other. If they were in different cells, we could choose to run them in cell order, but also in any order if we run each cell manually.

When it comes to data processing, it is easy to see why executing cells in a specific order is important, as we do not want to construct model_input_table from unprocessed datasets. With Kedro, the pipeline controls the order of execution: by assembling nodes into a pipeline, Kedro works out the order from each node’s inputs and outputs and guarantees that a node only runs once its inputs are available.

The pipeline is assembled in the create_pipeline function, which we define in src/spaceflights/pipelines/data_processing/pipeline.py:

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            )
        ]
    )

This creates a pipeline containing the preprocess_companies and preprocess_shuttles nodes (since neither depends on the other’s output, Kedro may run them in either order). The functions need to be imported from nodes.py using from .nodes import preprocess_companies, preprocess_shuttles, and the Pipeline, node and pipeline building blocks come from kedro.pipeline, as shown below.
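
Putting that together, the top of the file would look something like this (the kedro.pipeline imports are the same ones used by the data-science pipeline later in this guide):

# src/spaceflights/pipelines/data_processing/pipeline.py
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import preprocess_companies, preprocess_shuttles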

Persisting Output Datasets

Each node defines an output, in this case a processed dataset. If we want these to persist beyond each run of the pipeline, we need to register the dataset. This is similar to how we registered our input datasets in step 1, Import Data.

We do this by adding them to the Data Catalog. In our conf/base/catalog.yml file we add:

preprocessed_companies:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/preprocessed_companies.pq

preprocessed_shuttles:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/preprocessed_shuttles.pq

This registers them as Parquet Datasets and the outputs from the corresponding nodes will be saved into them on every run.

If we choose not to register a dataset, the data will be stored in memory as temporary Python objects during the run and cleared after the run is complete.
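
Once the pipeline has been run (see the note on the pipeline registry and kedro run in the recap below), one way to confirm that the persisted outputs were written is to load them back through the catalog in a kedro ipython session; a minimal sketch, assuming the run completed successfully:

# `catalog` is injected by Kedro's ipython/Jupyter integration
preprocessed_companies = catalog.load("preprocessed_companies")
print(preprocessed_companies.head())

# The same data also sits on disk at the filepath declared in catalog.yml:
# data/02_intermediate/preprocessed_companies.pq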

Creating the Model Input Table

Next, we will walk through all the steps needed to add a node to a pipeline, using the node that outputs the model input table as an example.

1. Create Node Function (nodes.py)

We take our model_input_table code from the Jupyter Notebook…

# join three datasets into single model input table
rated_shuttles = shuttles.merge(reviews, left_on="id", right_on="shuttle_id")
model_input_table = rated_shuttles.merge(companies, left_on="company_id", right_on="id")
model_input_table = model_input_table.dropna()
model_input_table.head()

and wrap it in a function called create_model_input_table, which we append to data_processing/nodes.py:

def create_model_input_table(
    shuttles: pd.DataFrame, companies: pd.DataFrame, reviews: pd.DataFrame
) -> pd.DataFrame:
    rated_shuttles = shuttles.merge(reviews, left_on="id", right_on="shuttle_id")
    model_input_table = rated_shuttles.merge(companies, left_on="company_id", right_on="id")
    model_input_table = model_input_table.dropna()
    return model_input_table

Here we see that the function takes the three relevant datasets as inputs and returns the model input table.

2. Add Node to Pipeline (pipeline.py)

In data_processing/pipeline.py we create a node that calls the create_model_input_table function and append it to the list of nodes that this pipeline executes.

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ]
    )

3. Register the Output Dataset (catalog.yml)

We want the model input table to persist beyond each individual run, as it is the final output of our data-processing pipeline. We register it by appending the following to conf/base/catalog.yml:

model_input_table:
  type: pandas.ParquetDataSet
  filepath: data/03_primary/model_input_table.pq

At this point our catalog.yml file looks like this:

# Node Inputs 

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv

reviews:
  type: pandas.CSVDataSet
  filepath: data/01_raw/reviews.csv

shuttles:
  type: pandas.ExcelDataSet
  filepath: data/01_raw/shuttles.xlsx
  load_args:
    engine: openpyxl # Use modern Excel engine (the default since Kedro 0.18.0)

# Node Outputs

preprocessed_companies:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/preprocessed_companies.pq

preprocessed_shuttles:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/preprocessed_shuttles.pq

model_input_table:
  type: pandas.ParquetDataSet
  filepath: data/03_primary/model_input_table.pq

🎉 Congratulations! We have just finished writing our data-processing pipeline!

💡 **Quick Recap** At this point you should have created or changed the following files:

  • `conf/base/catalog.yml` (to register datasets)
  • `src/spaceflights/pipelines/data_processing/nodes.py` (to define what to do)
  • `src/spaceflights/pipelines/data_processing/pipeline.py` (to define the order in which it is done)
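
One thing worth checking before you run anything: Kedro finds your create_pipeline functions through the project’s pipeline registry. In recent Kedro versions (0.18.3+), the src/spaceflights/pipeline_registry.py generated by kedro new autodiscovers them, provided each pipeline directory is an importable package (i.e. contains an __init__.py). A minimal sketch of what that generated file typically looks like (check your own project, as the template varies by version):

# src/spaceflights/pipeline_registry.py (as generated by `kedro new`; shown for reference)
from typing import Dict

from kedro.framework.project import find_pipelines
from kedro.pipeline import Pipeline


def register_pipelines() -> Dict[str, Pipeline]:
    """Collect every pipeline package that exposes a create_pipeline() function."""
    pipelines = find_pipelines()
    pipelines["__default__"] = sum(pipelines.values())
    return pipelines

With the pipelines registered, running kedro run from the project root executes the default pipeline end to end.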

3. Train/Evaluate Model

Now that we’ve successfully created one pipeline. Let’s use the next part of the Jupyter notebook (starting from 4. Data Modelling), where we train the model, as an exercise. These cells will be ported into our Kedro project as part of a new pipeline, the data-science pipeline.

You’ll need to create a new directory called data_science in src/spaceflights/pipelines

Inside, you’ll need two files nodes.py and pipeline.py

Go ahead and see if you can port the cells in 4. Data Modelling into the relevant files to create a pipeline. Don’t forget to use these steps as reference:

  1. Create Node Function (nodes.py)
  2. Add Node to Pipeline (pipeline.py)
  3. Register Input/Output Datasets (catalog.yml)

Check your work

# nodes.py

import logging
from typing import Dict, Tuple

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def split_data(model_input_table: pd.DataFrame, parameters: Dict) -> Tuple:
    features = [
        "engines",
        "passenger_capacity",
        "crew",
        "d_check_complete",
        "moon_clearance_complete",
        "iata_approved",
        "company_rating",
        "review_scores_rating",
    ]
    X = model_input_table[features]
    y = model_input_table["price"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=parameters["test_size"], random_state=parameters["random_state"]
    )
    return X_train, X_test, y_train, y_test

def train_model(X_train: pd.DataFrame, y_train: pd.Series) -> LinearRegression:

    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    return regressor

def evaluate_model(
    regressor: LinearRegression, X_test: pd.DataFrame, y_test: pd.Series
):

    y_pred = regressor.predict(X_test)
    score = r2_score(y_test, y_pred)
    logger = logging.getLogger(__name__)
    logger.info("Model has a coefficient R^2 of %.3f on test data.", score)
💡 You’ll notice here that each function corresponds to one step in the modelling process: 1. split data, where we call the `train_test_split` function 2. train model, where we call `.fit()` 3. evaluate model, where we call `.predict()` on the test data and calculate an evaluation metric.
# pipeline.py

from kedro.pipeline import Pipeline, node, pipeline

from .nodes import evaluate_model, split_data, train_model

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=split_data,
                inputs=["model_input_table", "params:model_options"],
                outputs=["X_train", "X_test", "y_train", "y_test"],
                name="split_data_node",
            ),
            node(
                func=train_model,
                inputs=["X_train", "y_train"],
                outputs="regressor",
                name="train_model_node",
            ),
            node(
                func=evaluate_model,
                inputs=["regressor", "X_test", "y_test"],
                outputs=None,
                name="evaluate_model_node",
            ),
        ]
    )
# catalog.yml

.
.
# registered datasets from previous work
.
.

regressor:
  type: pickle.PickleDataSet
  filepath: data/06_models/regressor.pickle
  versioned: true

# If you choose to register other outputs too (X_train, X_test, etc.), that is a valid choice.
# It is just particularly important that you persist the model output.

Configuring Parameters

If you took it straight from the Jupyter notebook, your split_data node function might look something like this:

def split_data(model_input_table: pd.DataFrame) -> Tuple:
    features = [
        "engines",
        "passenger_capacity",
        "crew",
        "d_check_complete",
        "moon_clearance_complete",
        "iata_approved",
        "company_rating",
        "review_scores_rating",
    ]
    X = model_input_table[features]
    y = model_input_table["price"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=3
    )
    return X_train, X_test, y_train, y_test

💡 **Magic Numbers**: Magic numbers are unique values with arbitrary meaning or multiple occurrences which could (preferably) be replaced with a named constant.

Good software engineering practice suggests that we extract ‘magic numbers’ into named constants, sometimes defined at the top of a file or in a separate utility file. In this example, the numbers supplied to test_size and random_state are 0.3 and 3. These are values that we can imagine might need to be used elsewhere, or quickly changed. If we had test_size=0.3 dotted around our code, we’d have to find and replace them all if we wanted to change them. Wouldn’t it be convenient if there was somewhere to collect these values for ease of inspection and modification?

We now introduce conf/base/parameters.yml. Here we define parameters that can be referenced when writing nodes and pipelines and are resolved when running the pipeline.

# parameters.yml

model_options:
  test_size: 0.3
  random_state: 3

To use these parameters in our split_data function:

def split_data(model_input_table: pd.DataFrame, parameters: Dict) -> Tuple:
    features = [
        "engines",
        "passenger_capacity",
        "crew",
        "d_check_complete",
        "moon_clearance_complete",
        "iata_approved",
        "company_rating",
        "review_scores_rating",
    ]
    X = model_input_table[features]
    y = model_input_table["price"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=parameters["test_size"], random_state=parameters["random_state"]
    )
    return X_train, X_test, y_train, y_test
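
For the parameters dictionary to reach the function, the node’s inputs must include the "params:model_options" entry, which is exactly how split_data_node was wired in pipeline.py in the previous section:

# Excerpt from data_science/pipeline.py: Kedro resolves the "params:" prefix
# against parameters.yml and passes the resulting dictionary as the second argument
node(
    func=split_data,
    inputs=["model_input_table", "params:model_options"],
    outputs=["X_train", "X_test", "y_train", "y_test"],
    name="split_data_node",
),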

If we extend the concept of magic numbers, we might notice that the features list also looks like something we might want to reuse or adjust elsewhere.

# parameters.yml 

model_options:
  test_size: 0.3
  random_state: 3
  features:
    - engines
    - passenger_capacity
    - crew
    - d_check_complete
    - moon_clearance_complete
    - iata_approved
    - company_rating
    - review_scores_rating

Using features as a parameter in split_data would look like:

def split_data(model_input_table: pd.DataFrame, parameters: Dict) -> Tuple:
    X = model_input_table[parameters["features"]]
    y = model_input_table["price"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=parameters["test_size"], random_state=parameters["random_state"]
    )
    return X_train, X_test, y_train, y_test
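
Kedro can also address nested parameter keys directly, so a node that only needs the feature list (rather than the whole model_options block) can request just that sub-key. A minimal sketch, where select_features and model_features are hypothetical names used only for illustration:

# Hypothetical excerpt to slot into create_pipeline: only the nested "features"
# value is injected, not the entire model_options dict
node(
    func=select_features,  # hypothetical helper taking (model_input_table, features)
    inputs=["model_input_table", "params:model_options.features"],
    outputs="model_features",  # hypothetical dataset name
    name="select_features_node",
),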

Our parameters.yml file can grow quickly, so you may find it easier to keep parameters organised by having a separate file per pipeline, e.g. a file called data_science.yml for parameters used in this data-science pipeline. In this case, we would rename our parameters.yml file to data_science.yml and store it in a parameters folder. Our project directory would look like this:

spaceflights        # Parent directory of the template
├── conf            # Project configuration files
│   ├── local
│   └── base
│       ├── catalog.yml
│       ├── logging.yml
│       └── parameters
│           ├── data_science.yml
│           └── data_processing.yml
.
.
# the rest of the project dir
.
.

4. Next Steps

Kedro pipelines are a powerful tool that can make your data science code less error-prone and more reusable, helping you do more with less!

Here are some resources to help you learn more about them:

  • Running just one of multiple pipelines in a project
  • Modular Pipelines
  • Running from or to a specific node in a pipeline

@astrojuanlu
Member

One thing to consider in this workflow:

As a user, my starting point could be

  • Some data
  • A notebook analyzing it
  • The needed requirements to get it working

(not all projects will have their requirements explicitly declared, but they will have implicit requirements nonetheless).

If my requirements.txt looks like this:

jupyterlab
jupyterlab-lsp
python-lsp-server[all]

(which is JupyterLab, plus the Language Server Protocol extensions for JLab and Python, which add IDE-like features to JLab: https://github.com/jupyter-lsp/jupyterlab-lsp)

by default, pip and any other tool will install the latest versions. However, kedro new would pull a src/requirements.txt containing some upper bounds:

jupyter~=1.0
jupyterlab_server>=2.11.1, <2.16.0
jupyterlab~=3.0, <3.6.0

Reconciling these requirements is a bit of work, which I'm not sure people other than Python packaging nerds could successfully do.

At the moment I don't have specific proposals on how to address this, but wanted to write down the problem anyway.

Related: #2276

@stichbury
Contributor

Just to comment that I've promised to take a look at the draft on Notion and the feedback received (plus the work that @astrojuanlu did on a similar project with a notebook using Polars, and the MLOps article shared on Slack) to work out next steps.

I won't get to this until sometime in w/c 24/04 since I'm on leave after today.

@stichbury stichbury self-assigned this Apr 19, 2023
@astrojuanlu
Member

astrojuanlu commented Apr 19, 2023

@stichbury
Contributor

I'm closing this ticket now as we have a new direction for the post -- it'll be split into a series of 4 smaller posts and these will be published on the blog in turn. See kedro-org/kedro-devrel#80 for more detail.
