Quick-start for ranking with Merlin Models #915
Conversation
`--timestamp_feature >= value`

### CUDA cluster options
As we discussed, I think this should be optional, and users should be able to use the CPU as well, which they may prefer with a small dataset. That is the case for the recsys23 competition dataset; it is not that big.
I changed the preprocess.py script to detect whether GPUs are available and, if not, to configure `Dataset(..., cpu=True)`. But when testing this setting in the Merlin TF container 23.02 without GPUs available, importing NVTabular raised some errors due to a known issue with cuda-python 11.7.0 and earlier (used by cudf). According to @oliverholworthy, we won't have this issue with release 23.04, because it uses a more recent version of cudf and cuda-python.
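For reference, a minimal sketch of that fallback logic, assuming numba is available (it ships with the Merlin containers); the actual preprocess.py implementation may differ:

```python
# Hedged sketch of the GPU-detection fallback described above; the actual
# preprocess.py logic may differ. Assumes numba is installed.
from merlin.io import Dataset


def gpu_is_available() -> bool:
    try:
        from numba import cuda

        return cuda.is_available()
    except Exception:
        # numba missing or CUDA driver unusable -> fall back to CPU
        return False


# cpu=True makes merlin.io.Dataset use pandas on host memory instead of cuDF on GPU
dataset = Dataset("raw_data/*.parquet", cpu=not gpu_is_available())
```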
In addition to that, I created an `--enable_dask_cuda_cluster` option in the preprocess.py script to enable/disable the use of a Dask cluster, since for smaller datasets not using `LocalCUDACluster` might be faster.
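For context, a hedged sketch of the multi-GPU pattern that flag toggles (the internals of preprocess.py may differ); the Dask-based NVTabular preprocessing then runs against the active client:

```python
# Hedged sketch of what --enable_dask_cuda_cluster turns on. With the cluster
# enabled, NVTabular's Dask-based preprocessing can spread work across all
# visible GPUs; for small datasets the cluster start-up overhead may outweigh
# the gain, which is why the flag exists.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

import nvtabular as nvt
from merlin.io import Dataset

cluster = LocalCUDACluster()   # one Dask worker per visible GPU
client = Client(cluster)       # becomes the active (default) Dask client

dataset = Dataset("raw_data/*.parquet")
workflow = nvt.Workflow(["user_id", "item_id"] >> nvt.ops.Categorify())
workflow.fit_transform(dataset).to_parquet("preproc/")
```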
### Inputs

`--data_path`
If one has already split the train and val sets up front, `--data_path` would be the train set path, right? And to avoid any further split, should users also set `--dataset_split_strategy` to None, or is it None by default? If it is None by default, that's fine.
Yes, the default `--dataset_split_strategy` is None. If the eval and test sets were already split, you just need to provide them in `--eval_data_path` and `--test_data_path`.
# Quick-start for ranking models with Merlin
It could be there, but maybe renamed to something like `ranking.md`?
The idea is that we are going to add other quick-start documents next, like Quick-start for session-based recommendation, for retrieval, and for Two/Multiple Stages.
In that case, I think there should be a `README.md` that works as an index for our quick-starts. What do you think?
sounds good to me.
@gabrielspmoreira One thing I think we can improve is the prediction step. I tested the script you shared with me for prediction, but it retrains the model. Is there a prediction script where the user can provide the saved model path and then run batch prediction automatically, without training again? It would be better if we could provide an example code snippet of how one can do the prediction.

Indeed. Following your suggestion, I made it possible to save the trained model with ...
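For readers looking for such a snippet, a hedged sketch of batch prediction with a previously saved Merlin Models (TF) model, without retraining; the paths are hypothetical and the exact save/load API depends on your Merlin Models version:

```python
# Hedged sketch: load a previously trained ranking model and generate batch
# predictions without retraining. Paths are hypothetical; assumes the model
# was saved with model.save(...) and that your Merlin Models version exposes
# Model.load and Model.batch_predict.
import merlin.models.tf as mm
from merlin.io import Dataset

model = mm.Model.load("outputs/saved_model")        # hypothetical path
predict_ds = Dataset("preproc/predict/*.parquet")   # hypothetical path

# batch_predict returns a merlin.io.Dataset with the input columns plus the
# prediction column(s); persist it as parquet for downstream use.
predictions = model.batch_predict(predict_ds, batch_size=16 * 1024)
predictions.to_ddf().to_parquet("outputs/predictions")
```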
* Moving quick-start for ranking from models repo to Merlin repo
* Updating quick-start doc and gitignore
* Remove outputs from ranking script
* Created tutorial of HPO with Quick-Start and W&B sweeps. Refined docs
* Added option to run preprocessing using CPU. But NVTabular import is failing in that case
* Discovering automatically if GPUs are available in preprocessing script
* Refined docs on hypertuning
* Refactored CLI args for preproc and ranking to better support loading trained models and generating preds without retraining
* Adjustments in the markdown documentation
* Having quick-start dynamic args to support space-separated command line keys and values
* Fix on the arg parsing of quick-start ranking training
* Raising an exception when a target not found in the schema is provided
* Additional fixes in the documentation
* Fixed an issue when no --train_data_path or --eval_data_path is provided, but just --predict_data_path
* Printing the folder where the prediction file will be saved
Fixes #916, fixes #986, fixes #918, fixes #680, fixes #681, fixes #666
Goals ⚽
This PR introduces a quick-start example for preprocessing, training, evaluating and deploying ranking models.
It is composed of a set of scripts and markdown documents. We use the TenRec dataset in the example, but the scripts are generic and can be used with customers' own data, provided it has the right shape: positive and potentially negative user-item events with tabular features.
Implementation Details 🚧
`preprocessing.py` - Generic script for preprocessing a raw dataset (CSV or parquet) with NVTabular. It provides CLI arguments to configure the input path and format, the categorical and continuous features, the feature tagging (user_id, item_id, ...), the filtering of interactions by min/max frequency of users or items, and the dataset split.

Example command line for TenRec dataset:
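The full command line is not reproduced here. As a rough, hedged illustration of the kind of NVTabular workflow the script assembles from those arguments (column names are illustrative, not necessarily the real TenRec schema):

```python
# Hedged illustration of the kind of NVTabular workflow preprocessing.py
# builds from its CLI arguments; column names are illustrative only.
import nvtabular as nvt
from merlin.io import Dataset
from merlin.schema import Tags

user_id = ["user_id"] >> nvt.ops.Categorify() >> nvt.ops.TagAsUserID()
item_id = ["item_id"] >> nvt.ops.Categorify() >> nvt.ops.TagAsItemID()
user_feats = ["gender", "age"] >> nvt.ops.Categorify() >> nvt.ops.TagAsUserFeatures()
item_feats = ["video_duration"] >> nvt.ops.Normalize() >> nvt.ops.TagAsItemFeatures()
targets = ["click"] >> nvt.ops.AddTags([Tags.BINARY_CLASSIFICATION, Tags.TARGET])

workflow = nvt.Workflow(user_id + item_id + user_feats + item_feats + targets)

train = Dataset("raw/train.parquet")
workflow.fit_transform(train).to_parquet("preproc/train")
workflow.save("preproc/workflow")  # reused to transform eval/test consistently
```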
`ranking_train_eval.py` - Generic script for training and evaluating ranking models. It takes the preprocessed dataset and schema produced by `preprocessing.py` as input. You can set many different training and model hparams to train both single-task learning models (MLP, DCN, DLRM, Wide&Deep, DeepFM) and multi-task learning models (e.g. MMOE, CGC, PLE).
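As a complement to the script, a hedged sketch of what a single-task ranking run looks like with Merlin Models (TF); the target column name and hyperparameters are illustrative, and the script exposes them as CLI arguments instead:

```python
# Hedged sketch of single-task ranking training with Merlin Models (TF).
# The target column name and hyperparameters are illustrative.
import tensorflow as tf

import merlin.models.tf as mm
from merlin.io import Dataset

train = Dataset("preproc/train/*.parquet")
valid = Dataset("preproc/eval/*.parquet")

model = mm.DLRMModel(
    train.schema,
    embedding_dim=64,
    bottom_block=mm.MLPBlock([128, 64]),
    top_block=mm.MLPBlock([128, 64, 32]),
    prediction_tasks=mm.BinaryClassificationTask("click"),
)
model.compile(optimizer="adam", metrics=[tf.keras.metrics.AUC(name="auc")])
model.fit(train, validation_data=valid, batch_size=16 * 1024, epochs=1)
metrics = model.evaluate(valid, batch_size=16 * 1024, return_dict=True)
```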
Testing Details 🔍

Tasks

Implementation
- … (`PredictionBlock` instead of `PredictionTask`) #917
- `preprocessing.py` to provide additional dataset split strategies (e.g. `random_by_user`, `temporal`)
- `preprocessing.py` to use Dask Distributed client for preprocessing larger/full datasets (single or multiple GPUs)

Experimentation

Documentation
You can check the Quick-start for ranking documentation starting from this main page