
# [KED-2917] Set up experiment tracking tutorial #1144

Merged Jan 17, 2022. 22 commits; the diff below shows changes from the first 10 commits.

## Commits
- `5a45d2b` set up experiment tracking tutorial (studioswong, Jan 8, 2022)
- `7caf926` added release note (studioswong, Jan 8, 2022)
- `2f53cf5` Merge branch 'main' into feature/setup-experiment-tracking-tutorial (studioswong, Jan 8, 2022)
- `552400a` first set of comments (studioswong, Jan 12, 2022)
- `6d242cf` second set of comments (studioswong, Jan 12, 2022)
- `1791c92` third round of comments (studioswong, Jan 12, 2022)
- `b516061` addition of demo gif in intro section (studioswong, Jan 12, 2022)
- `44f8b62` Merge branch 'main' into feature/setup-experiment-tracking-tutorial (studioswong, Jan 12, 2022)
- `93e14c5` minor wording changes (studioswong, Jan 12, 2022)
- `5da2bbb` further minor changes (studioswong, Jan 12, 2022)
- `429752e` update wording changes and added experiment icon (studioswong, Jan 13, 2022)
- `ca1c687` minor wording change (studioswong, Jan 13, 2022)
- `3eb06a9` Merge branch 'main' into feature/setup-experiment-tracking-tutorial (studioswong, Jan 13, 2022)
- `f6ce6b3` minor formatting change (studioswong, Jan 13, 2022)
- `5199711` further wording changes and adding new doc to index.rst (studioswong, Jan 14, 2022)
- `3cc6c67` update experiment tracking icon (studioswong, Jan 14, 2022)
- `3b45700` updated release notes, edited plotly icon (studioswong, Jan 14, 2022)
- `e1856dc` rename icon (studioswong, Jan 14, 2022)
- `3806c3b` lint changes (studioswong, Jan 17, 2022)
- `ccc8488` further minor wording changes (studioswong, Jan 17, 2022)
- `f2fe994` further minor change (studioswong, Jan 17, 2022)
- `1e3176b` Merge branch 'main' into feature/setup-experiment-tracking-tutorial (studioswong, Jan 17, 2022)
5 changes: 5 additions & 0 deletions RELEASE.md
@@ -1,3 +1,8 @@
# Release 0.17.7
**Comment from studioswong (Contributor, Author):** I noticed that there is no new section in the release notes for the upcoming release, so I have temporarily named the upcoming release 0.17.7 for now. Feel free to suggest a different version otherwise.

## Bug fixes and other changes
* Added tutorial documentation for experiment tracking (`03_tutorial/07_set_up_experiment_tracking.md`).

# Release 0.17.6

## Major features and improvements
156 changes: 156 additions & 0 deletions docs/source/03_tutorial/07_set_up_experiment_tracking.md
@@ -0,0 +1,156 @@
# Set up experiment tracking

Experiment tracking is the process of saving all machine-learning related experiment information so that it is easy to find and compare past runs. [Kedro-Viz](https://github.com/quantumblacklabs/kedro-viz) supports native experiment tracking from [version 4.1.1](https://github.com/quantumblacklabs/kedro-viz/releases/tag/v4.1.1) onwards. When experiment tracking is enabled in your Kedro project, you will be able to access, edit and compare your experiments directly from the Kedro-Viz web app.

![](../meta/images/experiment-tracking_demo_small.gif)

Enabling experiment tracking features on Kedro-Viz relies on [setting up a session store to capture experiment metadata](#set-up-session-store), [experiment tracking datasets to let Kedro know what metrics should be tracked](#set-up-tracking-datasets) and [modifying your nodes and pipelines to output those metrics](#setting-up-your-nodes-and-pipelines-to-log-metrics).

This tutorial provides a step-by-step process to set up experiment tracking and access the logged metrics of each run on Kedro-Viz. It uses the spaceflights starter project outlined in [this tutorial](../03_tutorial/01_spaceflights_tutorial.md). You can also jump directly to [the experiment tracking setup reference](../08_logging/02_experiment_tracking.md) for your Kedro project.

You can also access a more detailed demo [here](https://kedro-viz-live-demo.hfa4c8ufrmn4u.eu-west-2.cs.amazonlightsail.com/).

## Project setup
**Comment from a Contributor:**

Suggested change:

```diff
- ## Project setup
+ ## Set up a project
```

Just so it matches the other headers you have created. They have all followed the format of "Set up ...".


We assume that you have already [installed Kedro](../02_get_started/02_install.md) and [Kedro-Viz](../03_tutorial/06_visualise_pipeline.md). Set up a new Kedro project using spaceflights starter by running:
**Comment from a Contributor:**

Suggested change:

```diff
- We assume that you have already [installed Kedro](../02_get_started/02_install.md) and [Kedro-Viz](../03_tutorial/06_visualise_pipeline.md). Set up a new Kedro project using spaceflights starter by running:
+ We assume that you have already [installed Kedro](../02_get_started/02_install.md) and [Kedro-Viz](../03_tutorial/06_visualise_pipeline.md). Set up a new Kedro project using the spaceflights starter by running:
```


```bash
kedro new --starter=spaceflights
```

Feel free to name your project as you like, but this guide assumes the project is named **Kedro Experiment Tracking Tutorial**, and that `kedro new` created it in a sub-folder of your working directory named `kedro-experiment-tracking-tutorial`. Keep the default names for `repo_name` and `python_package` by pressing the enter key when prompted.

## Set up session store
**Comment from a Contributor:**

Suggested change:

```diff
- ## Set up session store
+ ## Set up the session store
```

I realised there is a definite article missing here and have also updated @rashidakanchwala's comment above so that the hyperlink will still work.


In the domain of experiment tracking, each pipeline run is considered a session. A session store records all related metadata for each pipeline run, from logged metrics to other run-related data such as timestamp, git username and branch. The session store is a SQLite database that gets generated during your first pipeline run after it has been set up in your project.

To set up the session store, go to the `src/settings.py` file and add the following:

```python
from kedro_viz.integrations.kedro.sqlite_store import SQLiteStore
from pathlib import Path
SESSION_STORE_CLASS = SQLiteStore
SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / "data")}
```
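To see why `Path(__file__).parents[2]` points at the project root, it helps to walk up the tree from `settings.py`. A minimal sketch, assuming the standard layout `kedro new` generates (with `settings.py` at `src/<package>/settings.py`; the project path below is made up for illustration):

```python
from pathlib import Path

# Hypothetical location of settings.py inside a freshly generated project.
settings_file = Path(
    "/tmp/kedro-experiment-tracking-tutorial/src/kedro_experiment_tracking_tutorial/settings.py"
)

# parents[0] -> src/<package>, parents[1] -> src, parents[2] -> project root,
# so the session store lands in <project>/data.
store_path = settings_file.parents[2] / "data"
print(store_path)  # /tmp/kedro-experiment-tracking-tutorial/data
```

If your `settings.py` sits at a different depth, adjust the `parents` index accordingly.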

This specifies that the SQLiteStore is created under the `data/` subfolder, using the SQLiteStore setup from your installed Kedro-Viz plugin (ensure your installed Kedro-Viz version is at least 4.1.1). This step is crucial for enabling experiment tracking on Kedro-Viz, as this database serves all run data to the Kedro-Viz front end.
## Set up tracking datasets

There are two types of tracking datasets: [`tracking.MetricsDataSet`](/kedro.extras.datasets.tracking.MetricsDataSet) and [`tracking.JSONDataSet`](/kedro.extras.datasets.tracking.JSONDataSet). The `tracking.MetricsDataSet` should be used for tracking numerical metrics, and the `tracking.JSONDataSet` can be used for tracking any other JSON-compatible data like boolean or text-based data.
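A quick way to check whether a value qualifies for these datasets is whether it survives a JSON round-trip. A minimal sketch (the metric values and column names below are made up for illustration):

```python
import json

# tracking.MetricsDataSet expects a flat mapping of metric name -> number,
# while tracking.JSONDataSet accepts any JSON-serialisable structure.
metrics = {"r2_score": 0.462, "mae": 4.2, "max_error": 9.7}
extra = {"columns": ["id", "company_rating", "iata_approved"], "data_type": "companies"}

# Both must survive a JSON round-trip to be written to disk.
assert json.loads(json.dumps(metrics)) == metrics
assert json.loads(json.dumps(extra)) == extra

# Non-JSON-compatible values (e.g. a set) would fail:
try:
    json.dumps({"bad": {1, 2}})
except TypeError:
    print("sets are not JSON-serialisable")
```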

Let's set up the following 2 datasets to log our r2 scores and parameters for each run by adding the following in `catalog.yml` under `/conf/base`:
**Comment from merelcht (Member, Jan 10, 2022):**

Suggested change:

```diff
- Let's set up the following 2 datasets to log our r2 scores and parameters for each run by adding the following in `catalog.yml` under `/conf/base`:
+ Set up two datasets to log `r2 scores` and `parameters` for each run by adding the following in the `conf/base/catalog.yml` file:
```

**Comment from a Contributor:** Should setting up `catalog.yml` come after the node and pipeline section?

**Reply from merelcht (Member):** The order doesn't really matter, so I'd leave it like this.

**Reply from a Contributor:** @rashidakanchwala might have a point in terms of workflow, but we can indicate that with a note. I've added a note in the previous section.


```yaml
metrics:
  type: tracking.MetricsDataSet
  filepath: data/09_tracking/metrics.json

companies_columns:
  type: tracking.JSONDataSet
  filepath: data/09_tracking/companies_columns.json
```

**Comment from a Contributor:** I think YAML uses two spaces for indents.

**Reply from studioswong (Contributor, Author):** Yep, the prettier on my IDE keeps messing up my formatting 😆 I've updated this.

## Setting up your nodes and pipelines to log metrics
**Comment from a Contributor:**

Suggested change:

```diff
- ## Setting up your nodes and pipelines to log metrics
+ ## Set up your nodes and pipelines to log metrics
```

This change removes the gerund, the "-ing", from the header. I made this suggestion on the other headers too.


Now that we have set up the tracked datasets to log our experiment tracking data, the next step is to ensure that the data is returned from your nodes.
**Comment from merelcht (Member, Jan 10, 2022):**

Suggested change:

```diff
- Now that we have set up the tracked datasets to log our experiment tracking data, the next step is to ensure that the data is returned from your nodes.
+ Now that you have set up the tracking datasets to log experiment tracking data, the next step is to ensure that the data is returned from your nodes.
```


Let's set up the data to be logged for the metrics dataset - under `nodes.py` of your `data_processing` pipeline (`/src/kedro-experiment-tracking-tutorial/pipelines/data_processing/nodes.py`), modify your `evaluate_model` function by adding in three different metrics: `score` to log your r2 score, `mae` to log your mean absolute error, and `me` to log your max error, and returning those 3 metrics as a key value pair.
**Comment from merelcht (Member, Jan 10, 2022):**

Suggested change:

```diff
- Let's set up the data to be logged for the metrics dataset - under `nodes.py` of your `data_processing` pipeline (`/src/kedro-experiment-tracking-tutorial/pipelines/data_processing/nodes.py`), modify your `evaluate_model` function by adding in three different metrics: `score` to log your r2 score, `mae` to log your mean absolute error, and `me` to log your max error, and returning those 3 metrics as a key value pair.
+ Set up the data to be logged for the `metrics` dataset - under `nodes.py` of your `data_processing` pipeline (`src/kedro-experiment-tracking-tutorial/pipelines/data_processing/nodes.py`), modify your `evaluate_model` function by adding in three different metrics:
+ - `score` to log the r2 score
+ - `mae` to log the mean absolute error
+ - `me` to log the max error
+ You will return those three metrics as a key-value pair.
```


The new `evaluate_model` function would look like this:

```python
def evaluate_model(
    regressor: LinearRegression, X_test: pd.DataFrame, y_test: pd.Series
) -> Dict[str, float]:
    """Calculates and logs the coefficient of determination.

    Args:
        regressor: Trained model.
        X_test: Testing data of independent features.
        y_test: Testing data for price.
    """
    y_pred = regressor.predict(X_test)
    score = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    me = max_error(y_test, y_pred)
    logger = logging.getLogger(__name__)
    logger.info("Model has a coefficient R^2 of %.3f on test data.", score)
    return {"r2_score": score, "mae": mae, "max_error": me}
```
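To make clear what the three logged values represent, the sklearn metrics above can be written out by hand. A pure-Python sketch on toy data (the test and prediction values below are made up for illustration):

```python
# Hand-rolled versions of the formulas behind sklearn's r2_score,
# mean_absolute_error and max_error, applied to toy data.

def r2(y_true, y_pred):
    # 1 - (residual sum of squares / total sum of squares)
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def mae(y_true, y_pred):
    # mean of the absolute residuals
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def max_err(y_true, y_pred):
    # the single largest absolute residual
    return max(abs(t - p) for t, p in zip(y_true, y_pred))

y_test = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]

metrics = {
    "r2_score": r2(y_test, y_pred),
    "mae": mae(y_test, y_pred),
    "max_error": max_err(y_test, y_pred),
}
print(metrics)  # {'r2_score': 0.925, 'mae': 0.5, 'max_error': 1.0}
```

This is exactly the shape of dictionary the `metrics` tracking dataset persists for each run.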

The next step is to ensure that the dataset is also specified as an output of your `evaluate_model` node. Under `/pipelines/data_processing/pipeline.py`, specify the `output` of your `evaluate_model` to be the `metrics` dataset. Note that it is crucial that the output dataset exactly matches the name of the tracking dataset specified in the catalog file.
**Comment from a Contributor:**

Suggested change:

```diff
- The next step is to ensure that the dataset is also specified as an output of your `evaluate_model` node. Under `/pipelines/data_processing/pipeline.py`, specify the `output` of your `evaluate_model` to be the `metrics` dataset. Note that it is crucial that the output dataset exactly matches the name of the tracking dataset specified in the catalog file.
+ The next step is to ensure that the dataset is also specified as an output of your `evaluate_model` node. Under `src/kedro-experiment-tracking-tutorial/pipelines/data_processing/pipeline.py`, specify the `output` of your `evaluate_model` to be the `metrics` dataset. Note that it is crucial that the output dataset exactly matches the name of the tracking dataset specified in the catalog file.
```


The `evaluate_model` node in the pipeline should look like this:

```python
node(
    func=evaluate_model,
    inputs=["regressor", "X_test", "y_test"],
    name="evaluate_model_node",
    outputs="metrics",
)
```

You have to repeat the same steps for setting up the `companies_column` dataset. For this dataset you should log the column that contains the list of companies as outlined in `companies.csv` under `/data/01_raw`. Modify the `preprocess_companies` function under the `data_processing` pipeline to return the data under a key value pair, as shown below:
**Comment from a Contributor:**

Suggested change:

```diff
- You have to repeat the same steps for setting up the `companies_column` dataset. For this dataset you should log the column that contains the list of companies as outlined in `companies.csv` under `/data/01_raw`. Modify the `preprocess_companies` function under the `data_processing` pipeline to return the data under a key value pair, as shown below:
+ You have to repeat the same steps for setting up the `companies_column` dataset. For this dataset you should log the column that contains the list of companies as outlined in `companies.csv` under `data/01_raw`. Modify the `preprocess_companies` node under the `data_processing` pipeline (`src/kedro-experiment-tracking-tutorial/pipelines/data_processing/nodes.py`) to return the data under a key-value pair, as shown below:
```


```python
def preprocess_companies(companies: pd.DataFrame) -> Tuple[pd.DataFrame, Dict]:
    """Preprocesses the data for companies.

    Args:
        companies: Raw data.
    Returns:
        Preprocessed data, with `company_rating` converted to a float and
        `iata_approved` converted to boolean, plus a dictionary of column metadata.
    """
    companies["iata_approved"] = _is_true(companies["iata_approved"])
    companies["company_rating"] = _parse_percentage(companies["company_rating"])
    return companies, {"columns": companies.columns.tolist(), "data_type": "companies"}
```

Again, you will need to ensure that the dataset is also specified as an output on `pipeline.py` under the `data_processing` pipeline, as follows:
**Comment from a Contributor:**

Suggested change:

```diff
- Again, you will need to ensure that the dataset is also specified as an output on `pipeline.py` under the `data_processing` pipeline, as follows:
+ Again, you will need to ensure that the dataset is also specified as an output on `pipeline.py` under the `data_processing` pipeline (`src/kedro-experiment-tracking-tutorial/pipelines/data_processing/pipeline.py`), as follows:
```
Again, you will need to ensure that the dataset is also specified as an output on `pipeline.py` under the `data_processing` pipeline (`src/kedro-experiment-tracking-tutorial/pipelines/data_processing/pipeline.py`), as follows:


```python
node(
    func=preprocess_companies,
    inputs="companies",
    outputs=["preprocessed_companies", "companies_columns"],
    name="preprocess_companies_node",
)
```

Having set up both datasets, you are now ready to generate your first set of experiment tracking data!

## Generating Run data
**Comment from a Contributor:**

Suggested change:

```diff
- ## Generating Run data
+ ## Generate run data
```


One of the beauty of native experiment tracking in Kedro is that all tracked data are generated and stored each time you do a Kedro run. Hence, to generat the data, simply do:
**Comment from merelcht (Member, Jan 10, 2022):**

Suggested change:

```diff
- One of the beauty of native experiment tracking in Kedro is that all tracked data are generated and stored each time you do a Kedro run. Hence, to generat the data, simply do:
+ The beauty of native experiment tracking in Kedro is that all tracked data is generated and stored each time you do a Kedro run. Hence, to generate the data, you only need to execute:
```


```bash
kedro run
```

After the run completes, you will see two folders under `data/09_tracking`: `companies_columns.json` and `metrics.json`. Once the tracking datasets are set up, each pipeline run generates a folder named after each tracked dataset, containing a subfolder named with the timestamp of the run. Each subsequent pipeline run adds a new timestamped folder in the same directory as its tracked dataset.

You will also see `session_store.db`, generated on your first pipeline run after enabling experiment tracking. It stores all run metadata alongside the tracking datasets and is used to expose experiment tracking data to Kedro-Viz.
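The folder layout described above can be sketched with plain `pathlib`. This illustration builds a fake `data/09_tracking` tree in a temporary directory and reads back the latest run; the timestamps, their exact format, and the metric values are made up for illustration:

```python
import json
import tempfile
from pathlib import Path

# One folder per tracked dataset, one timestamped subfolder per run,
# each holding the actual JSON file.
root = Path(tempfile.mkdtemp()) / "data" / "09_tracking" / "metrics.json"
for ts, score in [("2022-01-17T10.00.00.000Z", 0.41), ("2022-01-17T11.30.00.000Z", 0.46)]:
    run_dir = root / ts
    run_dir.mkdir(parents=True)
    (run_dir / "metrics.json").write_text(json.dumps({"r2_score": score}))

# With sortable timestamp names, the latest run is the greatest folder name.
latest = max(p.name for p in root.iterdir())
print(latest, json.loads((root / latest / "metrics.json").read_text()))
```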

![](../meta/images/experiment-tracking_folder.png)

Try to execute `kedro run` a few times to generate a larger set of experiment data. You can also play around with setting up different tracking datasets, and check the logged data via the generated JSON data files.

## Accessing Run data and comparing runs on Kedro-Viz
**Comment from a Contributor:**

Suggested change:

```diff
- ## Accessing Run data and comparing runs on Kedro-Viz
+ ## Access run data and compare runs
```


Here comes the fun part of accessing your run data on Kedro-Viz. Having ensured that you are using Kedro-Viz `>=4.1.1` (you can confirm your Kedro-Viz version by running `kedro info`), run:

```bash
kedro viz
```

When you open the Kedro-Viz webapp, you will see an `experiment tracking` icon on your left. Clicking the icon will bring you to the experiment tracking page (you can also access the page via `http://127.0.0.1:4141/runsList`), where you will now see the set of experiment data generated from your previous runs, as shown below:
**Comment from merelcht (Member):** Should we add a picture of the icon? I think Rashida did something like that in the plotly docs.

**Reply from studioswong (Contributor, Author, Jan 14, 2022):** Sure thing - I've added the icon.

**Reply from merelcht (Member):** I think you might need a black version of the icon 😅 (screenshot attached)

**Reply from studioswong (Contributor, Author, Jan 14, 2022):** Ooh, good spot - I actually use dark mode on GitHub and completely forgot about that! This reminds me that perhaps I should replace it with a PNG with a background colour instead - I'll update that. P.S. I had a quick glance, and the different GitHub modes might also affect the icon in the plotly doc under dark mode - @rashidakanchwala, you might want to have a look too.

**Reply from studioswong (Contributor, Author, Jan 14, 2022):** After discussion with Rashida, I have updated the icon in the plotly docs as well in my latest commit 😉 - let me know if you spot anything else with this icon.

**Reply from merelcht (Member):** Awesome, yes this looks much better!


![](../meta/images/experiment-tracking_runsList.png)

You will now be able to access, compare and pin your runs by toggling the `Compare runs` button, as shown below:

![](../meta/images/experiment-tracking_demo.gif)

The Kedro-Viz team will be adding new features in the coming weeks, such as editing run titles, adding notes, and bookmarking and searching runs. Keep an eye on the [Kedro-Viz release page](https://github.com/quantumblacklabs/kedro-viz/releases) for upcoming releases.
2 changes: 2 additions & 0 deletions docs/source/08_logging/02_experiment_tracking.md
@@ -10,6 +10,8 @@
However, Kedro was missing a way to log metrics and capture all this logged data

Experiment tracking in Kedro adds in the missing pieces and will be developed incrementally.

The following section outlines the setup within your Kedro project to enable experiment tracking. You can also refer to [this tutorial](../03_tutorial/07_set_up_experiment_tracking.md) for a step by step process to access your tracking datasets on Kedro-Viz.
**Comment from a Contributor:**

Suggested change:

```diff
- The following section outlines the setup within your Kedro project to enable experiment tracking. You can also refer to [this tutorial](../03_tutorial/07_set_up_experiment_tracking.md) for a step by step process to access your tracking datasets on Kedro-Viz.
+ The following section outlines the setup within your Kedro project to enable experiment tracking. You can also refer to [this tutorial](../03_tutorial/07_set_up_experiment_tracking.md) for a step-by-step process to access your tracking datasets on Kedro-Viz.
```


## Enable experiment tracking
Use either one of the [`tracking.MetricsDataSet`](/kedro.extras.datasets.tracking.MetricsDataSet) or [`tracking.JSONDataSet`](/kedro.extras.datasets.tracking.JSONDataSet) in your data catalog. These datasets are versioned by default to ensure a historical record is kept of the logged data.
The `tracking.MetricsDataSet` should be used for tracking numerical metrics and the `tracking.JSONDataSet` can be used for tracking any other JSON-compatible data. In Kedro-Viz these datasets will be visualised in the metadata side panel.