-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Add options to reserve categorical indices in the Categorify() op #1074
Comments
Related to NVIDIA-Merlin/Transformers4Rec#160 |
@gabrielspmoreira Do we really need that level of complexity? Can't we just specify the offset? Or does categorify work differently if you specify oov, null, and start_index ranges. Also do we want to handle the case where people specify ridiculous values? What if I say oov = 100000000? I think a simple reserved offset is more straightforward and less error prone. |
@EvenOldridge I agree we can simplify the solution. |
I think we should simplify, and just specify a start_index with a default value of 0. The implementation of Categorify has gotten pretty complex - so I thought it might help to put some pointers on where to get started with this:
|
I assigned this issue to myself, but to follow what is in |
@benfred @rjzamora Would one of you have some time today that I could meet with you to ask about how to implement this additional argument? I read over the current implementation of the class, but I am not fully understanding what the particular methods Ben outlined are supposed to do. I tagged you two since it looks like you have spent the most time working on this module. I would appreciate it if so. |
@gabrielspmoreira, someone else, would you have a test case available to test the appropriate desired functionality? I found this test from
I would like to implement and test a feature add for this issue, but I am not exactly sure how to make a test to test the desired addition of |
Add tests for issue #1074 on adding a start_index argument to the Categorify op. The motivation for this issue is to allow an offset for encoding out-of-vocabulary and other special values. In this commit, we add a test function to test various values of our new start_index arg in tests/unit/ops/text_ops.py and add a start_value arg to the class signature for Categorify(), documentation for its intended function in Categorify()'s docstring, a start_value arg to FitOptions(), and documentation for this new argument in FitOptions()'s docstring.
@gabrielspmoreira I did not see how to add you as a reviewer for my draft PR, but when you have a chance, would you be able to see whether the implemented tests test your intended functionality? This draft PR consists of the commit 642e5f5. |
This commit implements the feature in issue #1074. This issue asks to add an argument start_index to Categorify to give an offset for translating vocabulary items to categorical values. We update nvtabular/ops/categorify.py to add a start_index arg in the implementation of Categorify(). This update touches the categorify.py module in various places. We also add docstrings to the _encode() and _write_uniques() methods for improved readability in categorify.py. We also update the test_categorify_lists_with_start_index() test method in tests/unit/test_ops.py to test various start_index values.
Change start_indexi's default value to 0 from 1. The motivation for this is to have more Pythonic code, and for increased consistency with the rest of the updated modules. This implements the suggestion made by @benfred towards merging the implementation of issue #1074's feature request. This change includes modifying what Categorify()'s _encode() method does, default arg value updates in some other methods, docstring updating, and updating Categorify()'s tests that test start_index's behavior.
Adjust _get_embeddings_dask() to add start_index in output embedding. This is towards implementing issue #1074.
Add unit test for start_index and get_embedding_size() for the change made in commit fa172a. This is for implementing the feature request of issue #1074.
Is your feature request related to a problem? Please describe.
Some pipelines trust that categorical features will have some reserved values after preprocessing.
For example, for sequential features (e.g. session-based recommendation, time series, text data) it is important to reserve a value for padding (e.g. 0) in the model side, as not all sequences might have the same length.
It is also important to reserve a value (e.g. 1) to map Out-Of-Vocabulary (OOV) categorical values that might appear in the transform() (e.g. during evaluation or inference)
Describe the solution you'd like
Create one argument for
Categorify()
op to set the desiredoov_index
(for which all OOV values will be mapped during the preproc) and another argumentstart_index
to set the first value that should be used to encode known values (e.g. 2). We can also have a similar option for null values (null_index
), so that they are mapped to a known index.In this case, if we call
Categorify(oov_index=1, null_index=2, start_index=3)
, the index 0 could be safely used for padding for example.I think we don't need an
padding_index
, because it is specific to sequential features. It is better to give users flexibility to reserve a number of categorical values by usingstart_index
The text was updated successfully, but these errors were encountered: