Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements in Quick-start for Ranking #1014

Merged
merged 4 commits into from
Jun 15, 2023

Conversation

gabrielspmoreira
Copy link
Member

@gabrielspmoreira gabrielspmoreira commented Jun 14, 2023

This PR adds some improvements to the Quick-start for ranking scripts and documentation

In preprocessing.py:

  • Adds support to target encoding features, configurable through these new CLI arguments: --target_encoding_features, --target_encoding_targets, --target_encoding_kfold, --target_encoding_smoothing.

In ranking.py:

  • Adds support to selecting some columns to keep (--keep_columns) or remove (--ignore_columns) from at dataloading / training / evaluation.

This PR also converts those scripts to Python modules, to make it easier to import/extend their classes and to test them.
So now, instead of being run like python preprocessing.py --args ..., they need to be run as a Python module, e.g.

cd /Merlin/examples/
python -m quick_start.scripts.preproc.preprocessing --args ...

…s classes can be tested). Added target encoding args to preprocessing.py. Added args to keep or filter columns in ranking.py. Documentation was updated.
@gabrielspmoreira gabrielspmoreira self-assigned this Jun 14, 2023
@gabrielspmoreira gabrielspmoreira added the enhancement New feature or request label Jun 14, 2023
@gabrielspmoreira gabrielspmoreira added this to the Merlin 23.06 milestone Jun 14, 2023
@github-actions
Copy link

Documentation preview

https://nvidia-merlin.github.io/Merlin/review/pr-1014

@gabrielspmoreira gabrielspmoreira requested a review from rnyak June 14, 2023 21:15
--target_encoding_targets is, all categorical
features will be used.
--target_encoding_targets
Columns (comma-sep) with target columns that will be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wont giving multiple targets create issue? you were facing issues for that.. was that fixed? also what about test set needs target column issue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I split the targets and create one TargetEncoding op for each to avoid the issue

@@ -0,0 +1,23 @@
13-06-2023 12:45:41:756 [pid=1014 tid=1014] ERROR cufio-drv:716 nvidia-fs.ko driver not loaded
13-06-2023 12:45:52:861 [pid=1156 tid=1156] ERROR cufio-drv:716 nvidia-fs.ko driver not loaded
13-06-2023 12:57:13:36 [pid=1737 tid=1737] ERROR cufio-drv:716 nvidia-fs.ko driver not loaded
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guess you might want to remove cufile.log file.

args.target_encoding_features = args.categorical_features
if not args.target_encoding_targets:
args.target_encoding_targets = (
args.binary_classif_targets + args.regression_targets
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you check if a target col is float (not an int) and target encoding works properly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gonna add the integration tests in another PR and check for those cases.

@@ -263,11 +301,36 @@ def generate_nvt_workflow_targets(self, client=None):
[Tags.REGRESSION, Tags.TARGET, Tags.BINARY]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why tagged as Binary as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Just removed it.

eval_dataset_preproc.to_parquet(
output_eval_dataset_path,
output_files=args.output_num_partitions,
)

if args.predict_data_path:
# Adding to predict set dummy target columns that are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does not read well. may be rephrase as Adding a dummy target column(s) to the test set to perform target encoding op while this issue ...

@rnyak
Copy link
Contributor

rnyak commented Jun 15, 2023

@gabrielspmoreira I approved in case you want to merge once you push your final changes.

@gabrielspmoreira gabrielspmoreira merged commit e131376 into main Jun 15, 2023
gabrielspmoreira added a commit that referenced this pull request Jun 15, 2023
* Adding target encoding features support to quick-start preprocessing

* Converting the quick-start for ranking to a Python module (so that its classes can be tested). Added target encoding args to preprocessing.py. Added args to keep or filter columns in ranking.py. Documentation was updated.

* Fixed bbut when casting the columns (it was shuffling the cols in some cases). Adjusted the command line examples

* Small fix and comment adjustment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants