Quick-start example for preprocessing, training and deploying ranking models #988
Closed
gabrielspmoreira force-pushed the tf/quick_start_ranking branch from 9df52a6 to 026ffe2 on March 17, 2023 14:47
gabrielspmoreira force-pushed the tf/quick_start_ranking branch 2 times, most recently from 32ec264 to c5fb270 on April 14, 2023 17:05
…Distributed support for larger datasets and also two additional sampling strategies: random_per_user and temporal
…test sets. It now also apply targets transformation only for train and eval, but not for test. Adjusted also arg names for temporal dataset split
…ing to file. Parsing properly list CLI args
…ion), support to TSV as input, options to fill null with a median or a value
gabrielspmoreira force-pushed the tf/quick_start_ranking branch from 6948312 to 89dc479 on April 19, 2023 21:15
Closing this PR, as it was moved to the Merlin repo: NVIDIA-Merlin/Merlin#915
Fixes NVIDIA-Merlin/Merlin#916, fixes NVIDIA-Merlin/Merlin#917, fixes NVIDIA-Merlin/Merlin#918, fixes #680, fixes #681, fixes #666
Goals ⚽
This PR introduces a quick-start example for preprocessing, training, evaluating and deploying ranking models.
It consists of a set of scripts and markdown documents. The example uses the TenRec dataset, but the scripts are generic and can be used with customers' own data, provided that it has the right shape: positive and potentially negative user-item events with tabular features.
Implementation Details 🚧
preprocessing.py
- Generic script, driven by CLI arguments, for preprocessing a raw dataset (CSV or parquet) with NVTabular. Its arguments configure the input path and format, the categorical and continuous features, the feature tagging (user_id, item_id, ...), the filtering of interactions by min/max frequency for users or items, and the dataset split. Example command line for TenRec dataset:
ranking_train_eval.py
- Generic script for training and evaluation of ranking models. It takes the preprocessed dataset from preprocessing.py and the schema as input. You can set many different training and model hyperparameters to train both single-task learning models (MLP, DCN, DLRM, Wide&Deep, DeepFM) and multi-task learning models (e.g. MMOE, CGC, PLE).
Testing Details 🔍
Tasks
Implementation
- PredictionBlock instead of PredictionTask) Merlin#917
- preprocessing.py to provide additional dataset split strategies (e.g. random_by_user, temporal).
- preprocessing.py to use Dask Distributed client for preprocessing larger/full dataset (single or multiple GPU)
Experimentation
Documentation
Deployment and inference with Triton
- OutputBlock instead of PredictionsTasks #914
Testing
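
To make the split strategies named under Implementation (random_by_user, temporal) more concrete, here is a minimal sketch in plain Python. It is not the code from preprocessing.py, which uses NVTabular. The function names, the `eval_fraction` parameter, the column names and the toy data are all assumptions made for illustration, and the semantics shown (newest rows go to eval for temporal; a random share of each user's rows goes to eval for random_by_user) are one plausible reading of those strategy names.

```python
import random
from collections import defaultdict


def temporal_split(rows, ts_key, eval_fraction):
    """Hypothetical helper: the most recent `eval_fraction` of rows go to eval."""
    ordered = sorted(rows, key=lambda r: r[ts_key])
    n_eval = int(len(ordered) * eval_fraction)
    cut = len(ordered) - n_eval
    return ordered[:cut], ordered[cut:]


def random_by_user_split(rows, user_key, eval_fraction, seed=0):
    """Hypothetical helper: hold out a random share of each user's rows."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[user_key]].append(row)

    train, eval_rows = [], []
    for user_rows in by_user.values():
        shuffled = user_rows[:]  # copy, so the caller's data is not reordered
        rng.shuffle(shuffled)
        n_eval = int(len(shuffled) * eval_fraction)
        eval_rows.extend(shuffled[:n_eval])
        train.extend(shuffled[n_eval:])
    return train, eval_rows


# Toy interactions (assumed columns): two users with four events each.
pairs = [(1, 10), (2, 10), (1, 11), (2, 12), (1, 13), (2, 11), (1, 12), (2, 13)]
interactions = [
    {"user_id": user, "item_id": item, "ts": ts}
    for ts, (user, item) in enumerate(pairs, start=1)
]

train_t, eval_t = temporal_split(interactions, "ts", eval_fraction=0.25)
train_u, eval_u = random_by_user_split(interactions, "user_id", eval_fraction=0.25)

print("temporal eval timestamps:", [r["ts"] for r in eval_t])
print("random_by_user eval users:", sorted(r["user_id"] for r in eval_u))
```

The temporal variant keeps every eval event later than every train event, which mimics predicting future interactions. The per-user variant guarantees that each user with enough events appears in both sets.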