
Quick-start example for preprocessing, training and deploying ranking models #988

Closed
wants to merge 28 commits into from

Conversation

gabrielspmoreira
Member

@gabrielspmoreira gabrielspmoreira commented Feb 14, 2023

Fixes NVIDIA-Merlin/Merlin#916 , fixes NVIDIA-Merlin/Merlin#917 , fixes NVIDIA-Merlin/Merlin#918, fixes #680, fixes #681, fixes #666

Goals ⚽

This PR introduces a quick-start example for preprocessing, training, evaluating and deploying ranking models.
It is composed of a set of scripts and markdown documents. The example uses the TenRec dataset, but the scripts are generic and can be used with customers' own data, provided it has the right shape: positive and (optionally) negative user-item events with tabular features.

Implementation Details 🚧

  • preprocessing.py - Generic script for preprocessing a raw dataset (CSV or Parquet) with NVTabular. Its CLI arguments configure the input path and format, the categorical and continuous features, feature tagging (user_id, item_id, ...), filtering of interactions by min/max frequency for users or items, and the dataset split.
    Example command line for the TenRec dataset:

```shell
python preprocessing.py --input_data_format=csv --csv_na_values=\\N --input_data_path /data/QK-video.csv --output_path=$OUT_DATASET_PATH --categorical_features=user_id,item_id,video_category,gender,age --binary_classif_targets=click,follow,like,share --regression_targets=watching_times --to_int32=user_id,item_id --to_int16=watching_times --to_int8=gender,age,video_category,click,follow,like,share --user_id_feature=user_id --item_id_feature=item_id --min_user_freq 5 --persist_intermediate_files --dataset_split_strategy=random --random_split_eval_perc=0.2
```
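For illustration, the core of the `--min_user_freq` filter and the `--dataset_split_strategy=random` split can be sketched in plain pandas (this is a simplified sketch of the logic, not the script's actual NVTabular implementation; the column names and the values `min_user_freq=5` / `eval_perc=0.2` follow the example command above):

```python
import pandas as pd

def filter_min_user_freq(df: pd.DataFrame, min_user_freq: int) -> pd.DataFrame:
    """Keep only interactions from users with at least min_user_freq events."""
    counts = df.groupby("user_id")["user_id"].transform("size")
    return df[counts >= min_user_freq]

def random_split(df: pd.DataFrame, eval_perc: float, seed: int = 42):
    """Randomly hold out a fraction of interactions for evaluation."""
    eval_df = df.sample(frac=eval_perc, random_state=seed)
    train_df = df.drop(eval_df.index)
    return train_df, eval_df

# Toy interactions: user 1 has 5 events, user 2 has only 2
df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1, 2, 2],
    "item_id": [10, 11, 12, 13, 14, 10, 15],
    "click":   [1, 0, 1, 0, 1, 1, 0],
})
filtered = filter_min_user_freq(df, min_user_freq=5)  # drops user 2's events
train_df, eval_df = random_split(filtered, eval_perc=0.2)
```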
  • ranking_train_eval.py - Generic script for training and evaluating ranking models. It takes the preprocessed dataset and schema produced by preprocessing.py as input. You can set many training and model hyperparameters to train both single-task learning models (MLP, DCN, DLRM, Wide&Deep, DeepFM) and multi-task learning models (e.g. MMOE, CGC, PLE).
```shell
python ranking_train_eval.py --train_path $OUT_DATASET_PATH/final_dataset/train --eval_path $OUT_DATASET_PATH/final_dataset/eval --output_path ./outputs/ --tasks=click --stl_positive_class_weight 4 --model dlrm --embeddings_dim 64 --l2_reg 1e-5 --embeddings_l2_reg 1e-6 --dropout 0.05 --mlp_layers 64,32 --lr 1e-4 --lr_decay_rate 0.99 --lr_decay_steps 100 --train_batch_size 4096 --eval_batch_size 4096 --epochs 1 --train_steps_per_epoch 10
```
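The `--stl_positive_class_weight 4` flag up-weights the positive class in the binary classification loss, which helps with imbalanced targets like click. A minimal numpy sketch of per-class-weighted binary cross-entropy (the weighting scheme shown is the standard one and is assumed here, not taken from the script's source):

```python
import numpy as np

def weighted_bce(y_true, y_pred, positive_class_weight=1.0):
    """Binary cross-entropy with positive examples scaled by positive_class_weight."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 1e-7, 1 - 1e-7)
    # Weight of 1.0 for negatives, positive_class_weight for positives
    weights = np.where(y_true == 1, positive_class_weight, 1.0)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(weights * losses))

# With weight 4 (as in the example command), a missed positive costs 4x a missed negative
base = weighted_bce([1, 0], [0.5, 0.5])
weighted = weighted_bce([1, 0], [0.5, 0.5], positive_class_weight=4.0)
```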

Testing Details 🔍

  • The preprocessing and ranking training scripts will be added as integration tests.

Tasks

  • Implementation
  • Experimentation
  • Documentation
  • Deployment and inference with Triton
  • Testing

@github-actions

Documentation preview

https://nvidia-merlin.github.io/models/review/pr-988

@gabrielspmoreira
Member Author

Closing this PR, as it was moved to Merlin Repo: NVIDIA-Merlin/Merlin#915
