
Quick-start for ranking with Merlin Models #915

Merged · 16 commits · Apr 24, 2023

Commits
bf05dee
Moving quick-start for ranking from models repo to Merlin repo
gabrielspmoreira Apr 19, 2023
419d1b7
Updating quick-start doc and gitignore
gabrielspmoreira Apr 19, 2023
b45b7a5
Remove outputs from ranking script
gabrielspmoreira Apr 20, 2023
a83d700
Created tutorial of hpo with Quick-Start and W&B sweeps. Refined docs
gabrielspmoreira Apr 20, 2023
143a1e2
Added option to run preprocessing using CPU. But NVTabular import is …
gabrielspmoreira Apr 20, 2023
641515c
Discovering automatically if GPUs are available in preprocessing script
gabrielspmoreira Apr 20, 2023
aa99d10
Refined docs on hypertuning
gabrielspmoreira Apr 21, 2023
7cd76cd
Refactored CLI args for preproc and ranking to better support loading…
gabrielspmoreira Apr 21, 2023
20c7557
Adjustments in the markdown documentation
gabrielspmoreira Apr 24, 2023
649b4e3
Having quick-start dynamic args to support space separated command li…
gabrielspmoreira Apr 24, 2023
861a76f
Fix on the arg parsing of quick-start ranking training
gabrielspmoreira Apr 24, 2023
92219e8
Raising an exception when a target not found in the schema is provided
gabrielspmoreira Apr 24, 2023
c01f3bf
Additional fixes in the documentation
gabrielspmoreira Apr 24, 2023
86f9b4f
Fixed an issue when on --train_data_path or --eval_data_path is provi…
gabrielspmoreira Apr 24, 2023
7fb5d57
Printing the folder where prediction file will be saved
gabrielspmoreira Apr 24, 2023
a10e668
Printing the folder where prediction file will be saved
gabrielspmoreira Apr 24, 2023
7 changes: 7 additions & 0 deletions .gitignore
@@ -20,3 +20,10 @@ __pycache__/
/.*_checkpoints/
.ipynb_checkpoints/*
*/.jupyter/*

# IDEs
.vscode/

# Quick-start log folders
wandb/
tb_logs/
18 changes: 18 additions & 0 deletions examples/quick_start/README.md
@@ -0,0 +1,18 @@
## Quick-start for Merlin

This folder contains quick-start guides for Recommender System use cases. It provides scripts, powered by Merlin libraries, and best-practices documentation designed to allow you to go quickly from a dataset of user interactions through preprocessing, training, evaluation, and deployment with Triton.

These guides walk through a typical data science process: data preprocessing / feature engineering, model training, evaluation, and hyperparameter tuning.
<center>
<img src="images/quick_start_process.png" alt="Data science process with Merlin" >
</center>

We demonstrate the scripts with public datasets, but they are designed to work with your own dataset with little effort.

Here are the quick-start guides currently available:
- [Quick-start for Ranking](./ranking.md)

In upcoming releases we plan to add:
- Quick-start for Session-based recommendation
- Quick-start for Retrieval
- Quick-start for Multi-Stage RecSys
Binary file added examples/quick_start/images/mtl_benchmark.png
Binary file added examples/quick_start/images/stl_benchmark.png
Binary file added examples/quick_start/images/tenrec_dataset.png
Binary file added examples/quick_start/images/wandb_sweeps.png
134 changes: 134 additions & 0 deletions examples/quick_start/ranking.md
@@ -0,0 +1,134 @@
# Quick-start for ranking models with Merlin
**Do you want to get the best possible accuracy for your ranking problem?**
**This guide will teach you best practices for preprocessing data and for training and hypertuning ranking models with Merlin.**

This guide walks through a typical data science process: data preprocessing / feature engineering, model training, evaluation, and hyperparameter tuning.
<center>
<img src="images/quick_start_process.png" alt="Data science process with Merlin" >
</center>

We use the [TenRec dataset](https://static.qblv.qq.com/qblv/h5/algo-frontend/tenrec_dataset.html) as our example. It is large (140 million positive interactions from 5 million users), contains explicit negative feedback (items exposed to the user but not interacted with), and has multiple target columns (click, like, share, follow).

## Setup
You can run these scripts either using the latest Merlin TensorFlow image or by installing the necessary Merlin libraries according to their documentation ([core](https://github.com/NVIDIA-Merlin/core), [NVTabular](https://github.com/NVIDIA-Merlin/NVTabular), [dataloader](https://github.com/NVIDIA-Merlin/dataloader), [models](https://github.com/NVIDIA-Merlin/models/)).
In this doc we provide the commands for setting up a Docker container from the Merlin TensorFlow image, as it comes with all the necessary libraries preinstalled.

### Download the TenRec dataset
You can find the TenRec dataset at this [link](https://static.qblv.qq.com/qblv/h5/algo-frontend/tenrec_dataset.html). You can switch the page language to *English* using the link in the top-right corner, if you prefer.
To download the data, you first need to agree to the terms and register your e-mail. After downloading the zipped file (4.2 GB), just uncompress the data.


## Preparing the data
The TenRec dataset contains a number of CSV files. We will be using the `QK-video.csv`, which logs user interactions with different videos.

Here is an example of what the data looks like. For ranking models, you typically have user, item, and contextual features, plus one or more targets, which can be binary (e.g. whether the customer clicked or liked an item) or regression targets (e.g. watch time).

![TenRec dataset structure](images/tenrec_dataset.png)

As `QK-video.csv` is fairly large (~15 GB with ~493 million examples), feel free to reduce it to fewer rows if you want to test the pipeline more quickly or if you don't have a powerful GPU available (V100 with 32 GB or better). For example, the following command truncates the file, keeping the first 10 million rows (header line included); a quick way to inspect the result follows the command.

```bash
head -n 10000001 QK-video.csv > QK-video-10M.csv
```
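
If you want to double-check the truncated file before preprocessing, a quick inspection with pandas works. This is a minimal sketch, assuming pandas is available and that the file uses the column names referenced by the preprocessing command later in this guide:

```python
import pandas as pd

# Peek at the first rows only, to avoid loading a multi-GB file into memory.
# The dataset encodes missing values as \N, matching the --csv_na_values=\\N
# flag passed to the preprocessing script below.
sample = pd.read_csv("QK-video-10M.csv", nrows=5, na_values="\\N")
print(sample.columns.tolist())  # expect user_id, item_id, click, like, share, follow, ...
print(sample.head())
```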

### Start Docker container

1. Pull the latest [Merlin TensorFlow image](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow).

```bash
docker pull nvcr.io/nvidia/merlin/merlin-tensorflow:latest
```

2. Set the `INPUT_DATA_PATH` variable to the folder where `QK-video.csv` was saved. The `OUTPUT_PATH` is where the preprocessed dataset and trained model will be saved.

```bash
INPUT_DATA_PATH=/path/to/input/dataset/
OUTPUT_PATH=/path/to/output/path/
```

3. Start a Merlin TensorFlow container in interactive mode:
```bash
docker run --gpus all --rm -it --ipc=host -v $INPUT_DATA_PATH:/data -v $OUTPUT_PATH:/outputs \
nvcr.io/nvidia/merlin/merlin-tensorflow:latest /bin/bash
```

4. Inside the container, go to the `/Merlin/examples/quick_start` folder and install the Quick-start dependencies:
```bash
cd /Merlin/examples/quick_start
pip install -r requirements.txt
```


## Preprocessing

To make it easy to get the data ready for model training, we provide a generic script: [preprocessing.py](scripts/preproc/preprocessing.py). The script is based on the **dask_cudf** and **NVTabular** libraries, which leverage GPUs for accelerated and distributed preprocessing.
Note: **NVTabular** also supports CPU, which is suitable for prototyping in dev environments.

The preprocessing script outputs preprocessed data as a number of parquet files, as well as a *schema* that stores output features metadata like statistics and tags.

In this example, we set some options for preprocessing. Here is an explanation of the main arguments; you can check the [full documentation and best practices for preprocessing](scripts/preproc/README.md) for the complete list.

- `--categorical_features` - Names of the categorical/discrete features (concatenated with "," without space).
- `--binary_classif_targets` - Names of the available target columns for the binary classification task
- `--regression_targets` - Names of the available target columns for the regression task
- `--user_id_feature` - Name of the user id feature
- `--item_id_feature` - Name of the item id feature
- `--to_int32`, `--to_int16`, `--to_int8` - Allows casting columns to the lowest possible precision, which may avoid memory issues with large datasets.
- `--dataset_split_strategy` - Strategy for splitting train and eval sets. In this case, `random_by_user` is chosen, which means that train and test will have the same users with some random examples reserved for evaluation.
- `--random_split_eval_perc` - Percentage of data to reserve for eval set
- `--filter_query` - A filter query condition compatible with dask-cudf `DataFrame.query()`. In this example, we keep only examples where all targets are 0 or where click=1, removing examples where another target equals 1 but click=0 (see the sketch after this list).
- `--min_item_freq`, `--min_user_freq`, `--max_user_freq` - Filters out training examples from users or items based on min or max frequency thresholds
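
To make the `--filter_query` semantics concrete, here is a tiny illustration of the same expression evaluated with `DataFrame.query()` (shown with pandas for brevity; the script applies it with dask-cudf):

```python
import pandas as pd

df = pd.DataFrame({
    "click":  [1, 0, 0],
    "follow": [0, 0, 1],
    "like":   [0, 0, 0],
    "share":  [0, 0, 0],
})
# Keeps rows 0 and 1; drops row 2, where follow=1 but click=0.
kept = df.query("click==1 or (click==0 and follow==0 and like==0 and share==0)")
print(kept)
```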

For larger datasets (like the full TenRec dataset), in particular when using options that require dask_cudf filtering (e.g. `--filter_query`, `--min_item_freq`), we recommend the following options to avoid out-of-memory errors (see the sketch after this list):
- `--enable_dask_cuda_cluster` - Initializes a dask-cudf `LocalCUDACluster` for managed single or multi-GPU preprocessing
- `--persist_intermediate_files` - Persists/caches intermediate files to disk during preprocessing (in particular after filtering).
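
For reference, a managed dask-cuda cluster is roughly what `--enable_dask_cuda_cluster` turns on inside the script. A minimal sketch, assuming `dask-cuda` is installed (it ships with the Merlin containers):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker per visible GPU; device memory can spill to host when needed.
cluster = LocalCUDACluster()
client = Client(cluster)
print(client)  # shows the number of GPU workers and their memory limits
```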


```bash
cd /Merlin/examples/quick_start/scripts/preproc/
OUT_DATASET_PATH=/outputs/dataset
python preprocessing.py --input_data_format=csv --csv_na_values=\\N --data_path /data/QK-video.csv --filter_query="click==1 or (click==0 and follow==0 and like==0 and share==0)" --min_item_freq=30 --min_user_freq=30 --max_user_freq=150 --num_max_rounds_filtering=5 --enable_dask_cuda_cluster --persist_intermediate_files --output_path=$OUT_DATASET_PATH --categorical_features=user_id,item_id,video_category,gender,age --binary_classif_targets=click,follow,like,share --regression_targets=watching_times --to_int32=user_id,item_id --to_int16=watching_times --to_int8=gender,age,video_category,click,follow,like,share --user_id_feature=user_id --item_id_feature=item_id --dataset_split_strategy=random_by_user --random_split_eval_perc=0.2
```

After you execute this script, a `dataset` folder will be created in `--output_path`, containing the preprocessed `train` and `eval` datasets. You will find a number of partitioned parquet files inside those folders, as well as a `schema.pbtxt` file produced by `NVTabular`, which is very important for automated model building in the next step (see the sketch below).
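
You can verify the output before training. A minimal sketch using `merlin.io.Dataset`, which reads the partitioned parquet files together with the generated schema:

```python
from merlin.io import Dataset
from merlin.schema import Tags

train = Dataset("/outputs/dataset/train", engine="parquet")
print(train.schema)                             # all features, with dtypes and tags
print(train.schema.select_by_tag(Tags.TARGET))  # targets tagged during preprocessing
print(train.to_ddf().head())                    # peek at a few preprocessed rows
```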

## Training a ranking model
Merlin Models is a Merlin library that makes it easy to build and train RecSys models. It is built on top of TensorFlow, and provides building blocks for creating input layers based on the features in the schema, different feature interaction layers and output layers based on the target columns defined in the schema.

A number of popular ranking models are available in the Merlin Models API, such as **DLRM**, **DCN-v2**, **Wide&Deep**, and **DeepFM**. This Quick-start provides a generic ranking script, [ranking.py](scripts/ranking/ranking.py), for building and training those models with the Models API; a sketch of what it does under the hood follows.
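
For orientation, here is a minimal hand-rolled sketch with the Merlin Models API, assuming the preprocessed dataset from the previous step (the script adds many more options, callbacks, and metrics, so treat this as illustrative only):

```python
import merlin.models.tf as mm
from merlin.io import Dataset

train = Dataset("/outputs/dataset/train", engine="parquet")
valid = Dataset("/outputs/dataset/eval", engine="parquet")

# The schema drives model building: input and embedding layers are created
# automatically from the features and tags produced by preprocessing.
model = mm.DLRMModel(
    train.schema,
    embedding_dim=64,                    # analogous to --embeddings_dim 64
    top_block=mm.MLPBlock([64, 32]),     # analogous to --mlp_layers 64,32
    prediction_tasks=mm.BinaryClassificationTask("click"),
)
model.compile(optimizer="adam")
model.fit(train, batch_size=16384, epochs=1)
metrics = model.evaluate(valid, batch_size=16384, return_dict=True)
print(metrics)
```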

In the following command example, you can easily train the popular **DLRM** model, which performs second-order feature interactions. It sets `--model dlrm` and `--embeddings_dim 64`, because DLRM requires all categorical columns to be embedded with the same dimension for the feature interaction. Notice that we can set many common model hyperparameters (e.g. the top `--mlp_layers`) and training hyperparameters like the learning rate (`--lr`) and its decay (`--lr_decay_rate`, `--lr_decay_steps`), L2 regularization (`--l2_reg`, `--embeddings_l2_reg`), and `--dropout`, among others. We set `--epochs 1` and `--train_steps_per_epoch 10` to train on just 10 batches for a faster runtime. If you have a GPU with more memory (e.g. V100 with 32 GB), you can use a large batch size for `--train_batch_size` and `--eval_batch_size`, for example `65536`.
There are many target columns available in the dataset; you can select one of them for training by setting `--tasks=click`. In this dataset, there are about 3.7 negative examples (`click=0`) for each positive example (`click=1`), which leads to some class imbalance. We can deal with that by setting `--stl_positive_class_weight` to give more weight in the loss to positive examples, which are rarer (a sketch for deriving this weight from the data follows the command).


```bash
cd /Merlin/examples/quick_start/scripts/ranking/
OUT_DATASET_PATH=/outputs/dataset
CUDA_VISIBLE_DEVICES=0 TF_GPU_ALLOCATOR=cuda_malloc_async python ranking.py --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click --stl_positive_class_weight 3 --model dlrm --embeddings_dim 64 --l2_reg 1e-4 --embeddings_l2_reg 1e-6 --dropout 0.05 --mlp_layers 64,32 --lr 1e-4 --lr_decay_rate 0.99 --lr_decay_steps 100 --train_batch_size 65536 --eval_batch_size 65536 --epochs 1 --train_steps_per_epoch 10
```
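
The positive class weight above is hard-coded to 3. If you prefer to derive it from the actual label distribution, here is a sketch, assuming the preprocessed dataset path used above:

```python
from merlin.io import Dataset

train = Dataset("/outputs/dataset/train", engine="parquet")
counts = train.to_ddf()["click"].value_counts().compute()
pos_weight = counts.loc[0] / counts.loc[1]  # negatives per positive (~3.7 for TenRec)
print(f"Suggested --stl_positive_class_weight: {pos_weight:.1f}")
```
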
You can explore the [full documentation and best practices for ranking models](scripts/ranking/README.md), which contains details about the command line arguments.

## Training a ranking model with multi-task learning
When multiple targets are available for the same features, models typically benefit from jointly training a single model with multiple heads and losses. Merlin Models supports architectures designed specifically for multi-task learning based on expert sub-networks. You can find an example notebook with detailed explanations [here](https://github.com/NVIDIA-Merlin/models/blob/main/examples/usecases/ranking_with_multitask_learning.ipynb).


The [ranking.py](scripts/ranking/ranking.py) script makes it easy to train ranking models with multi-task learning by setting more than one target, e.g. `--tasks="click,like,follow,share"`.


### Training an MMOE model
In the following example, we use the popular **MMOE** (`--model mmoe`) architecture for multi-task learning. It creates independent expert sub-networks (configured by `--mmoe_num_mlp_experts` and `--expert_mlp_layers`) that each process the input features independently. Each task has a gate (with dimension `--gate_dim`) that averages the expert outputs using learned softmax weights, so that each task can harvest the information relevant to its predictions. Each task can also have an independent tower, configured with `--tower_layers` (see the conceptual sketch below).
You can also balance the tasks' losses with the `--mtl_loss_weight_*` arguments and set per-task positive class weights with the `--mtl_pos_class_weight_*` arguments.
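
For intuition, here is a conceptual sketch of the MMOE wiring in plain Keras. This is not the Merlin Models implementation that `ranking.py` uses (which builds its inputs from the schema); it only illustrates how experts, gates, and towers fit together:

```python
import tensorflow as tf

num_experts, gate_dim = 3, 32          # --mmoe_num_mlp_experts 3, --gate_dim 32
tasks = ["click", "like", "follow", "share"]

inputs = tf.keras.Input(shape=(128,))  # stand-in for the concatenated input features

# Independent experts (--expert_mlp_layers 128), all fed the same input.
experts = tf.stack(
    [tf.keras.layers.Dense(128, activation="relu")(inputs) for _ in range(num_experts)],
    axis=1,
)  # shape: (batch, num_experts, 128)

outputs = {}
for task in tasks:
    # Per-task gate: learned softmax weights over the experts.
    gate = tf.keras.layers.Dense(gate_dim, activation="relu")(inputs)
    weights = tf.keras.layers.Dense(num_experts, activation="softmax")(gate)
    mixed = tf.reduce_sum(experts * weights[:, :, None], axis=1)
    # Per-task tower (--tower_layers 64) and binary prediction head.
    tower = tf.keras.layers.Dense(64, activation="relu")(mixed)
    outputs[task] = tf.keras.layers.Dense(1, activation="sigmoid", name=task)(tower)

model = tf.keras.Model(inputs, outputs)
model.summary()
```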

```bash
cd /Merlin/examples/quick_start/scripts/ranking/

CUDA_VISIBLE_DEVICES=0 TF_GPU_ALLOCATOR=cuda_malloc_async python ranking.py --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click,like,follow,share --model mmoe --mmoe_num_mlp_experts 3 --expert_mlp_layers 128 --gate_dim 32 --use_task_towers=True --tower_layers 64 --embedding_sizes_multiplier 4 --l2_reg 1e-5 --embeddings_l2_reg 1e-6 --dropout 0.05 --lr 1e-4 --lr_decay_rate 0.99 --lr_decay_steps 100 --train_batch_size 65536 --eval_batch_size 65536 --epochs 1 --mtl_pos_class_weight_click=1 --mtl_pos_class_weight_like=2 --mtl_pos_class_weight_share=3 --mtl_pos_class_weight_follow=4 --mtl_loss_weight_click=3 --mtl_loss_weight_like=3 --mtl_loss_weight_follow=1 --mtl_loss_weight_share=1 --train_steps_per_epoch 10
```

You can find more quick-start information on multi-task learning and MMOE architecture [here](scripts/ranking/README.md).

## Hyperparameter tuning
We provide a [tutorial](scripts/ranking/hypertuning/tutorial_with_wb_sweeps.md) on how to do **hyperparameter tuning with Merlin models and Weights&Biases Sweeps**.

We also make available a [benchmark](scripts/ranking/hypertuning/README.md) resulting from our own hyperparameter tuning on the TenRec dataset. It compares the different single-task and multi-task learning models, and provides empirical information on the improvements obtained with hyperparameter tuning, a curated search space for the modeling hyperparameters of `ranking.py`, and the most important hyperparameters (see the sweep sketch below).
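
As a taste of what the tutorial covers, here is a minimal Sweeps sketch. The config keys mirror `ranking.py` arguments, the metric name is hypothetical, and the `train()` wrapper is a hypothetical stub (the tutorial provides the real integration):

```python
import subprocess
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "auc", "goal": "maximize"},  # hypothetical metric name
    "parameters": {
        "lr": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-2},
        "dropout": {"values": [0.0, 0.05, 0.1]},
        "mlp_layers": {"values": ["64,32", "128,64"]},
    },
}

def train():
    run = wandb.init()
    cfg = run.config
    # Hypothetical stub: launch ranking.py with the sampled hyperparameters.
    subprocess.run(
        ["python", "ranking.py", f"--lr={cfg.lr}",
         f"--dropout={cfg.dropout}", f"--mlp_layers={cfg.mlp_layers}"],
        check=True,
    )

sweep_id = wandb.sweep(sweep_config, project="quick-start-ranking")
wandb.agent(sweep_id, function=train, count=20)
```
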
1 change: 1 addition & 0 deletions examples/quick_start/requirements.txt
@@ -0,0 +1 @@
wandb