[FEA] Add options to reserve categorical indices in the Categorify() op #1074

gabrielspmoreira · 2021-08-27T18:21:02Z

Is your feature request related to a problem? Please describe.
Some pipelines trust that categorical features will have some reserved values after preprocessing.
For example, for sequential features (e.g. session-based recommendation, time series, text data) it is important to reserve a value for padding (e.g. 0) in the model side, as not all sequences might have the same length.
It is also important to reserve a value (e.g. 1) to map Out-Of-Vocabulary (OOV) categorical values that might appear in the transform() (e.g. during evaluation or inference)

Describe the solution you'd like
Create one argument for Categorify() op to set the desired oov_index (for which all OOV values will be mapped during the preproc) and another argument start_index to set the first value that should be used to encode known values (e.g. 2). We can also have a similar option for null values (null_index), so that they are mapped to a known index.
In this case, if we call Categorify(oov_index=1, null_index=2, start_index=3), the index 0 could be safely used for padding for example.
I think we don't need an padding_index, because it is specific to sequential features. It is better to give users flexibility to reserve a number of categorical values by using start_index

The text was updated successfully, but these errors were encountered:

rnyak · 2021-08-27T18:28:12Z

Related to NVIDIA-Merlin/Transformers4Rec#160

EvenOldridge · 2021-08-30T21:57:52Z

@gabrielspmoreira Do we really need that level of complexity? Can't we just specify the offset? Or does categorify work differently if you specify oov, null, and start_index ranges.

Also do we want to handle the case where people specify ridiculous values? What if I say oov = 100000000?

I think a simple reserved offset is more straightforward and less error prone.

gabrielspmoreira · 2021-08-30T22:28:05Z

@EvenOldridge I agree we can simplify the solution.
With the offset (start_index) we solve the issue of reserving 0 for padding.
And we could reserve the first two indices after start_index / offset for null and OOV during transform() (inference).
What do you think @benfred ?

benfred · 2021-08-31T00:11:27Z

I think we should simplify, and just specify a start_index with a default value of 0.

The implementation of Categorify has gotten pretty complex - so I thought it might help to put some pointers on where to get started with this:

There are two phases in Categorify: the 'fit' which calculates the mapping of input label to categorical id, and the 'transform' which applies the mapping. We'll need to change both of these to handle this option
For the 'fit' stage, we will need to store this option in the 'FitOptions' struct and then use in the _write_uniques function https://github.com/NVIDIA/NVTabular/blob/57855b872444fee1c249ad6f6f190a73aa8a81f3/nvtabular/ops/categorify.py#L889-L904 .
The 'transform' code needs updated in the _encode function to set the unknown values to the start_index specified by the user. The simple case where we aren't doing any OOV hashing will involve updating these lines https://github.com/NVIDIA/NVTabular/blob/57855b872444fee1c249ad6f6f190a73aa8a81f3/nvtabular/ops/categorify.py#L1110-L1116 , but this will get more complicated with the hash bucket code. It might be worth just trying to get this working without hashing first, and then tackle hashing next.
There is currently a 'na_sentinel' parameter on the categorify op https://github.com/NVIDIA/NVTabular/blob/57855b872444fee1c249ad6f6f190a73aa8a81f3/nvtabular/ops/categorify.py#L146-L147 . Unfortunately I don't think this works appropriately right now, and probably should be removed as an option from Categorify.

lesnikow · 2021-09-02T16:03:52Z

I assigned this issue to myself, but to follow what is in CONTRIBUTING.md and to make it explicit for tracking who is working on what, I am working on this issue currently.

lesnikow · 2021-09-02T16:58:07Z

@benfred @rjzamora Would one of you have some time today that I could meet with you to ask about how to implement this additional argument? I read over the current implementation of the class, but I am not fully understanding what the particular methods Ben outlined are supposed to do. I tagged you two since it looks like you have spent the most time working on this module. I would appreciate it if so.

lesnikow · 2021-09-02T19:54:15Z

@gabrielspmoreira, someone else, would you have a test case available to test the appropriate desired functionality? I found this test from tests/unit/ops/test_ops.py that tests Categorify that looks like a helpful template.

def test_categorify_lists(tmpdir, freq_threshold, cpu, dtype, vocabs):
    df = dispatch._make_df(
        {
            "Authors": [["User_A"], ["User_A", "User_E"], ["User_B", "User_C"], ["User_C"]],
            "Engaging User": ["User_B", "User_B", "User_A", "User_D"],
            "Post": [1, 2, 3, 4],
        }
    ) 

    cat_names = ["Authors", "Engaging User"]
    label_name = ["Post"]

    cat_features = cat_names >> ops.Categorify(
        out_path=str(tmpdir), freq_threshold=freq_threshold, dtype=dtype, vocabs=vocabs
    )

    workflow = nvt.Workflow(cat_features + label_name)
    df_out = workflow.fit_transform(nvt.Dataset(df, cpu=cpu)).to_ddf().compute()

    # Columns are encoded independently
    if cpu:
        assert df_out["Authors"][0].dtype == np.dtype(dtype) if dtype else np.dtype("int64")
        compare = [list(row) for row in df_out["Authors"].tolist()]
    else:
        assert df_out["Authors"].dtype == cudf.core.dtypes.ListDtype(dtype if dtype else "int64")
        compare = df_out["Authors"].to_arrow().to_pylist()

    if freq_threshold < 2 or vocabs is not None:
        assert compare == [[1], [1, 4], [3, 2], [2]]
    else:
        assert compare == [[1], [1, 0], [0, 2], [2]]

I would like to implement and test a feature add for this issue, but I am not exactly sure how to make a test to test the desired addition of Categorify with a new start_index argument.

Add tests for issue #1074 on adding a start_index argument to the Categorify op. The motivation for this issue is to allow an offset for encoding out-of-vocabulary and other special values. In this commit, we add a test function to test various values of our new start_index arg in tests/unit/ops/text_ops.py and add a start_value arg to the class signature for Categorify(), documentation for its intended function in Categorify()'s docstring, a start_value arg to FitOptions(), and documentation for this new argument in FitOptions()'s docstring.

lesnikow · 2021-09-02T22:57:40Z

@gabrielspmoreira I did not see how to add you as a reviewer for my draft PR, but when you have a chance, would you be able to see whether the implemented tests test your intended functionality? This draft PR consists of the commit 642e5f5.

This commit implements the feature in issue #1074. This issue asks to add an argument start_index to Categorify to give an offset for translating vocabulary items to categorical values. We update nvtabular/ops/categorify.py to add a start_index arg in the implementation of Categorify(). This update touches the categorify.py module in various places. We also add docstrings to the _encode() and _write_uniques() methods for improved readability in categorify.py. We also update the test_categorify_lists_with_start_index() test method in tests/unit/test_ops.py to test various start_index values.

@benfred

Change start_indexi's default value to 0 from 1. The motivation for this is to have more Pythonic code, and for increased consistency with the rest of the updated modules. This implements the suggestion made by @benfred towards merging the implementation of issue #1074's feature request. This change includes modifying what Categorify()'s _encode() method does, default arg value updates in some other methods, docstring updating, and updating Categorify()'s tests that test start_index's behavior.

Adjust _get_embeddings_dask() to add start_index in output embedding. This is towards implementing issue #1074.

Add unit test for start_index and get_embedding_size() for the change made in commit fa172a. This is for implementing the feature request of issue #1074.

rnyak added the session-based label Aug 27, 2021

EvenOldridge assigned EvenOldridge and unassigned EvenOldridge Aug 30, 2021

viswa-nvidia added this to the NVTabular-v21.09 milestone Aug 31, 2021

lesnikow self-assigned this Sep 1, 2021

lesnikow mentioned this issue Sep 2, 2021

[HELP-REQ] Add tests for issue #1074 and update Categorify args #1093

Merged

lesnikow linked a pull request Sep 11, 2021 that will close this issue

[HELP-REQ] Add tests for issue #1074 and update Categorify args #1093

Merged

lesnikow pushed a commit that referenced this issue Sep 13, 2021

Adjust _get_embeddings_dask() to add start_index in output

fa172a7

Adjust _get_embeddings_dask() to add start_index in output embedding. This is towards implementing issue #1074.

lesnikow pushed a commit that referenced this issue Sep 13, 2021

Add unit test for start_index and get_embedding_size()

b8458ce

Add unit test for start_index and get_embedding_size() for the change made in commit fa172a. This is for implementing the feature request of issue #1074.

benfred closed this as completed in #1093 Sep 13, 2021

benfred pushed a commit that referenced this issue Sep 13, 2021

Add tests for issue #1074 and update Categorify args (#1093)

5af5f8e

mikemckiernan pushed a commit that referenced this issue Nov 24, 2022

Add tests for issue #1074 and update Categorify args (#1093)

5dd2820

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Add options to reserve categorical indices in the Categorify() op #1074

[FEA] Add options to reserve categorical indices in the Categorify() op #1074

gabrielspmoreira commented Aug 27, 2021

rnyak commented Aug 27, 2021

EvenOldridge commented Aug 30, 2021

gabrielspmoreira commented Aug 30, 2021

benfred commented Aug 31, 2021

lesnikow commented Sep 2, 2021

lesnikow commented Sep 2, 2021

lesnikow commented Sep 2, 2021 •

edited

Loading

lesnikow commented Sep 2, 2021

[FEA] Add options to reserve categorical indices in the Categorify() op #1074

[FEA] Add options to reserve categorical indices in the Categorify() op #1074

Comments

gabrielspmoreira commented Aug 27, 2021

rnyak commented Aug 27, 2021

EvenOldridge commented Aug 30, 2021

gabrielspmoreira commented Aug 30, 2021

benfred commented Aug 31, 2021

lesnikow commented Sep 2, 2021

lesnikow commented Sep 2, 2021

lesnikow commented Sep 2, 2021 • edited Loading

lesnikow commented Sep 2, 2021

lesnikow commented Sep 2, 2021 •

edited

Loading