[KED-2917] Set up experiment tracking tutorial (kedro-org#1144)
* set up experiment tracking tutorial

* added release note

* first set of comments

* second set of comments

* third round of comments

* addition of demo gif in intro section

* minor wording changes

* further minor changes

* update wording changes and added experiment icon

* minor wording change

* minor formatting change

* further wording changes and adding new doc to index.rst

* update experiment tracking icon

* updated release notes, edited plotly icon

* rename icon

* lint changes

* further minor wording changes

* further minor change

Signed-off-by: Laurens Vijnck <laurens_vijnck@mckinsey.com>
studioswong authored and lvijnck committed Apr 7, 2022
1 parent 49082f1 commit 34f8738
Showing 11 changed files with 173 additions and 2 deletions.
7 changes: 6 additions & 1 deletion RELEASE.md
@@ -1,3 +1,9 @@
# Release 0.17.7

## Bug fixes and other changes
* Added tutorial documentation for experiment tracking in Kedro docs (`03_tutorial/07_set_up_experiment_tracking.md`).
* Added Plotly documentation in Kedro docs (`03_tutorial/06_visualise_pipeline.md`).

# Release 0.17.6

## Major features and improvements
@@ -13,7 +19,6 @@
| `pandas.GenericDataSet` | Provides a 'best effort' facility to read / write any format provided by the `pandas` library | `kedro.extras.datasets.pandas` |
| `pandas.GBQQueryDataSet` | Loads data from a Google Bigquery table using provided SQL query | `kedro.extras.datasets.pandas` |
| `spark.DeltaTableDataSet` | Dataset designed to handle Delta Lake Tables and their CRUD-style operations, including `update`, `merge` and `delete` | `kedro.extras.datasets.spark` |
* Added the Plotly documentation on Kedro docs.

## Bug fixes and other changes
* Fixed an issue where `kedro new --config config.yml` was ignoring the config file when `prompts.yml` didn't exist.
2 changes: 1 addition & 1 deletion docs/source/03_tutorial/06_visualise_pipeline.md
@@ -200,7 +200,7 @@ shuttle_speed_comparison_plot:
filepath: data/08_reporting/shuttle_speed_comparison_plot.json
```

Once the above setup is completed, you can do a `kedro run` followed by `kedro viz` and your Kedro-Viz pipeline will show a new dataset type with icon ![](../meta/images/icon-image-dataset.svg) . Once you click on the node, you can see a small preview of your Plotly chart in the metadata panel.
Once the above setup is completed, you can do a `kedro run` followed by `kedro viz` and your Kedro-Viz pipeline will show a new dataset type with icon ![](../meta/images/plotly-icon.png) . Once you click on the node, you can see a small preview of your Plotly chart in the metadata panel.

![](../meta/images/pipeline_visualisation_plotly.png)

163 changes: 163 additions & 0 deletions docs/source/03_tutorial/07_set_up_experiment_tracking.md
@@ -0,0 +1,163 @@
# Set up experiment tracking

Experiment tracking is the process of saving all machine-learning related experiment information so that it is easy to find and compare past runs. [Kedro-Viz](https://github.com/quantumblacklabs/kedro-viz) supports native experiment tracking from [version 4.1.1](https://github.com/quantumblacklabs/kedro-viz/releases/tag/v4.1.1) onwards. When experiment tracking is enabled in your Kedro project, you will be able to access, edit and compare your experiments directly from the Kedro-Viz web app.

![](../meta/images/experiment-tracking_demo_small.gif)

Enabling experiment tracking features on Kedro-Viz relies on:
* [setting up a session store to capture experiment metadata](#set-up-the-session-store)
* [setting up experiment tracking datasets to let Kedro know what metrics should be tracked](#set-up-tracking-datasets)
* [modifying your nodes and pipelines to output those metrics](#set-up-your-nodes-and-pipelines-to-log-metrics)

This tutorial provides a step-by-step process to set up experiment tracking and access the logged metrics of each run on Kedro-Viz. It uses the spaceflights starter project outlined in [the spaceflights tutorial](../03_tutorial/01_spaceflights_tutorial.md). For a condensed reference on setting up experiment tracking in an existing Kedro project, see [the experiment tracking documentation](../08_logging/02_experiment_tracking.md).

You can also access a more detailed demo [here](https://kedro-viz-live-demo.hfa4c8ufrmn4u.eu-west-2.cs.amazonlightsail.com/).

## Set up a project

We assume that you have already [installed Kedro](../02_get_started/02_install.md) and [Kedro-Viz](../03_tutorial/06_visualise_pipeline.md). Set up a new project using the spaceflights starter by running:

```bash
kedro new --starter=spaceflights
```

Feel free to name your project as you like, but this guide assumes the project is named **Kedro Experiment Tracking Tutorial** and sits in a sub-folder of your working directory named `kedro-experiment-tracking-tutorial`, created by `kedro new`. When prompted, press the enter key to keep the default names for `repo_name` and `python_package`.

## Set up the session store

In the domain of experiment tracking, each pipeline run is considered a session. A session store records all related metadata for each pipeline run, from logged metrics to other run-related data such as timestamp, git username and branch. The session store is a [SQLite](https://www.sqlite.org/index.html) database that gets generated during your first pipeline run after it has been set up in your project.

To set up the session store, go to the `src/kedro_experiment_tracking_tutorial/settings.py` file and add the following:

```python
from kedro_viz.integrations.kedro.sqlite_store import SQLiteStore
from pathlib import Path

SESSION_STORE_CLASS = SQLiteStore
SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / "data")}
```

This creates the `SQLiteStore` under the `data` subfolder of your project, using the `SQLiteStore` class from your installed Kedro-Viz plugin.
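
If you want to sanity-check where the database will be created, the following minimal sketch (pure path arithmetic, assuming the default layout produced by `kedro new` with the project name used in this tutorial) evaluates the same expression as the settings above:

```python
# A minimal sketch, assuming settings.py lives at
# <project>/src/kedro_experiment_tracking_tutorial/settings.py.
from pathlib import Path

# Stand-in for `Path(__file__)` as seen from inside settings.py.
settings_file = Path(
    "kedro-experiment-tracking-tutorial/src/kedro_experiment_tracking_tutorial/settings.py"
)

# parents[0] -> src/kedro_experiment_tracking_tutorial, parents[1] -> src,
# parents[2] -> the project root, so the store lands in <project>/data.
print(settings_file.parents[2] / "data")  # kedro-experiment-tracking-tutorial/data
```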

Ensure that your installed version of Kedro-Viz is at least 4.1.1. This step is crucial for enabling experiment tracking features on Kedro-Viz, as the session store is the database used to serve all run data to the Kedro-Viz front-end. Once this step is complete, you can [set up the tracking datasets](#set-up-tracking-datasets) and [set up your nodes and pipelines to log metrics](#set-up-your-nodes-and-pipelines-to-log-metrics) in either order.

## Set up tracking datasets

There are two types of tracking datasets: [`tracking.MetricsDataSet`](/kedro.extras.datasets.tracking.MetricsDataSet) and [`tracking.JSONDataSet`](/kedro.extras.datasets.tracking.JSONDataSet). The `tracking.MetricsDataSet` should be used for tracking numerical metrics, and the `tracking.JSONDataSet` can be used for tracking any other JSON-compatible data, such as booleans or text.

Set up two datasets, one to log the model metrics (including the r2 score) and one to log the columns of the `companies` dataset for each run, by adding the following to the `conf/base/catalog.yml` file:

```yaml
metrics:
  type: tracking.MetricsDataSet
  filepath: data/09_tracking/metrics.json

companies_columns:
  type: tracking.JSONDataSet
  filepath: data/09_tracking/companies_columns.json
```
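
Optionally, if you want to see what kind of data these catalog entries will accept before wiring them into your pipeline, you can instantiate the dataset classes directly. The sketch below is not part of the tutorial and the saved values are invented for illustration:

```python
# An optional sketch: save hypothetical values with the two tracking dataset
# classes referenced in the catalog entries above.
from kedro.extras.datasets.tracking import JSONDataSet, MetricsDataSet

metrics_ds = MetricsDataSet(filepath="data/09_tracking/metrics.json")
metrics_ds.save({"r2_score": 0.46, "mae": 1200.0})  # numerical values only

columns_ds = JSONDataSet(filepath="data/09_tracking/companies_columns.json")
columns_ds.save({"columns": ["id", "company_rating"], "data_type": "companies"})
```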

## Set up your nodes and pipelines to log metrics

Now that you have set up the tracking datasets to log experiment tracking data, the next step is to ensure that the data is returned from your nodes.

To set up the data to be logged for the metrics dataset, modify the `evaluate_model` function in the `nodes.py` of your `data_science` pipeline (`src/kedro_experiment_tracking_tutorial/pipelines/data_science/nodes.py`): add three metrics, `score` for your r2 score, `mae` for your mean absolute error and `me` for your max error, and return the three metrics as key-value pairs.

The new `evaluate_model` function would look like this:

```python
# Requires `from sklearn.metrics import max_error, mean_absolute_error, r2_score`
# at the top of nodes.py.
def evaluate_model(
    regressor: LinearRegression, X_test: pd.DataFrame, y_test: pd.Series
) -> Dict[str, float]:
    """Calculates and logs the coefficient of determination.

    Args:
        regressor: Trained model.
        X_test: Testing data of independent features.
        y_test: Testing data for price.

    Returns:
        A dictionary containing the r2 score, mean absolute error and max error.
    """
    y_pred = regressor.predict(X_test)
    score = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    me = max_error(y_test, y_pred)
    logger = logging.getLogger(__name__)
    logger.info("Model has a coefficient R^2 of %.3f on test data.", score)
    return {"r2_score": score, "mae": mae, "max_error": me}
```

The next step is to ensure that the dataset is also specified as an output of your `evaluate_model` node. In `src/kedro_experiment_tracking_tutorial/pipelines/data_science/pipeline.py`, set the `outputs` of the `evaluate_model` node to the `metrics` dataset. Note that the output name must exactly match the name of the tracking dataset specified in the catalog file.

The `evaluate_model` node in the pipeline should look like this:

```python
node(
    func=evaluate_model,
    inputs=["regressor", "X_test", "y_test"],
    name="evaluate_model_node",
    outputs="metrics",
)
```
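
If it helps to see where this node sits, here is a minimal sketch of the `create_pipeline()` function of the `data_science` pipeline; the starter's other nodes are omitted and only indicated by a comment, so treat it as an outline rather than the full pipeline definition:

```python
# A minimal sketch of pipeline.py for the data_science pipeline; the starter's
# data-splitting and model-training nodes are omitted for brevity.
from kedro.pipeline import Pipeline, node

from .nodes import evaluate_model


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            # ... the starter's data-splitting and model-training nodes remain unchanged ...
            node(
                func=evaluate_model,
                inputs=["regressor", "X_test", "y_test"],
                outputs="metrics",  # must match the dataset name in conf/base/catalog.yml
                name="evaluate_model_node",
            ),
        ]
    )
```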

Repeat the same steps to set up the `companies_columns` dataset. For this dataset, you should log the list of columns of the `companies` dataset (`companies.csv` under `/data/01_raw`). Modify the `preprocess_companies` node of the `data_processing` pipeline (`src/kedro_experiment_tracking_tutorial/pipelines/data_processing/nodes.py`) to return this data as an additional key-value pair, as shown below:

```python
# The return annotation changes because the function now returns two values;
# add `from typing import Dict, Tuple` at the top of nodes.py if it is not there already.
def preprocess_companies(companies: pd.DataFrame) -> Tuple[pd.DataFrame, Dict]:
    """Preprocesses the data for companies.

    Args:
        companies: Raw data.

    Returns:
        Preprocessed data, with `company_rating` converted to a float and
        `iata_approved` converted to boolean, along with a dictionary listing
        the columns of the data.
    """
    companies["iata_approved"] = _is_true(companies["iata_approved"])
    companies["company_rating"] = _parse_percentage(companies["company_rating"])
    return companies, {"columns": companies.columns.tolist(), "data_type": "companies"}
```

Again, you will need to ensure that the dataset is specified as an output of the node in `pipeline.py` of the `data_processing` pipeline (`src/kedro_experiment_tracking_tutorial/pipelines/data_processing/pipeline.py`). Note that the order of the values returned by `preprocess_companies` must match the order of the node's `outputs` list:

```python
node(
    func=preprocess_companies,
    inputs="companies",
    outputs=["preprocessed_companies", "companies_columns"],
    name="preprocess_companies_node",
)
```

Having set up both datasets, you are now ready to generate your first set of experiment tracking data!

## Generate the Run data

The beauty of native experiment tracking in Kedro is that all tracked data is generated and stored each time you do a Kedro run. Hence, to generate the data, you only need to execute:

```bash
kedro run
```

After the run completes, under `data/09_tracking` you will see two folders, `companies_columns.json` and `metrics.json`. Kedro generates one such folder, named after the dataset, for each tracked dataset; inside it, every pipeline run adds a subfolder named with the run's timestamp, containing the JSON file of the data saved for that run.
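
To illustrate this layout, the following sketch (assuming the default `data/09_tracking` location and at least one completed run) finds the most recent timestamped version of the tracked metrics and prints it:

```python
# A sketch that inspects the versioned folder layout described above: the
# dataset "folder" is named metrics.json, and each run adds a timestamped
# subfolder containing the actual JSON file.
import json
from pathlib import Path

dataset_dir = Path("data/09_tracking/metrics.json")  # a folder, despite the .json name
latest_run = sorted(p for p in dataset_dir.iterdir() if p.is_dir())[-1]

with open(latest_run / "metrics.json") as f:
    print(json.load(f))  # e.g. {"r2_score": ..., "mae": ..., "max_error": ...}
```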

You will also see a `session_store.db`, generated from your first pipeline run after enabling experiment tracking. It stores all the run metadata which, alongside the tracking datasets, is used to expose experiment tracking in Kedro-Viz.

![](../meta/images/experiment-tracking_folder.png)

Execute `kedro run` a few times to generate a larger set of experiment data. You can also experiment with different tracking datasets and inspect the logged data in the generated JSON files.

## Access run data and compare runs

Now comes the fun part: accessing your run data on Kedro-Viz. Having ensured that you are using Kedro-Viz `>=4.1.1` (you can confirm your Kedro-Viz version by running `kedro info`), run:

```bash
kedro viz
```

When you open the Kedro-Viz web app, you will see an experiment tracking icon ![](../meta/images/experiment-tracking-icon.png) on your left. Clicking the icon will bring you to the experiment tracking page (you can also access the page via `http://127.0.0.1:4141/runsList`), where you will now see the set of experiment data generated from your previous runs, as shown below:

![](../meta/images/experiment-tracking_runsList.png)

You will now be able to access, compare and pin your runs by toggling the `Compare runs` button, as shown below:

![](../meta/images/experiment-tracking_demo.gif)

Keep an eye on the [Kedro-Viz release page](https://github.com/quantumblacklabs/kedro-viz/releases) for upcoming releases of this experiment tracking functionality.
2 changes: 2 additions & 0 deletions docs/source/08_logging/02_experiment_tracking.md
@@ -10,6 +10,8 @@ However, Kedro was missing a way to log metrics and capture all this logged data

Experiment tracking in Kedro adds in the missing pieces and will be developed incrementally.

The following section outlines the setup within your Kedro project to enable experiment tracking. You can also refer to [this tutorial](../03_tutorial/07_set_up_experiment_tracking.md) for a step-by-step process to access your tracking datasets on Kedro-Viz.

## Enable experiment tracking
Use either one of the [`tracking.MetricsDataSet`](/kedro.extras.datasets.tracking.MetricsDataSet) or [`tracking.JSONDataSet`](/kedro.extras.datasets.tracking.JSONDataSet) in your data catalog. These datasets are versioned by default to ensure a historical record is kept of the logged data.
The `tracking.MetricsDataSet` should be used for tracking numerical metrics and the `tracking.JSONDataSet` can be used for tracking any other JSON-compatible data. In Kedro-Viz these datasets will be visualised in the metadata side panel.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -75,6 +75,7 @@ Welcome to Kedro's documentation!
03_tutorial/04_create_pipelines
03_tutorial/05_package_a_project
03_tutorial/06_visualise_pipeline
03_tutorial/07_set_up_experiment_tracking

.. toctree::
:maxdepth: 2
Binary file added docs/source/meta/images/plotly-icon.png