
updates notebooks for multistage with subgraphs #1022

Merged

Conversation

@jperez999 (Collaborator) commented Jun 21, 2023

This PR updates the multi-stage RecSys notebooks to use subgraphs. There are quite a few changes. A major one is a shift in what information is stored in the feature store: it now holds raw item and raw user information. These values have not been preprocessed, so when they are retrieved in the systems graph they must go through a preprocessing step. In the first notebook, we add usage of the Subgraph operator and create two subgraphs, one for items and one for users. We also create another subgraph for item categorification, so that we can categorify the item features separately when they are used to retrieve item embeddings. In the second notebook, the use of subgraphs requires the ensemble to introduce NVTabular workflows that preprocess the data after it is retrieved, for both users and items. This is where we introduce subworkflows, which are based on subgraphs.
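For illustration, the new pattern looks roughly like the sketch below. This is a hedged outline, not the notebooks' exact code: it assumes the Subgraph operator is importable from merlin.dag.ops.subgraph and uses placeholder column names.

```python
import nvtabular as nvt
from merlin.dag.ops.subgraph import Subgraph
from nvtabular import ops

# Placeholder column selections; the notebooks use more features than this.
user_cat = ["user_id", "user_shops"] >> ops.Categorify() >> ops.TagAsUserFeatures()
item_cat = ["item_id", "item_brand"] >> ops.Categorify() >> ops.TagAsItemFeatures()

# Named subgraphs: one per entity, plus a nested one for item categorification
# so item features can be categorified on their own when exporting embeddings.
subgraph_user = Subgraph("user", user_cat)
subgraph_item = Subgraph("item", Subgraph("items_cat", item_cat))

nvt_wkflow = nvt.Workflow(subgraph_user + subgraph_item)
```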

This PR depends on the following PRs:
NVIDIA-Merlin/core#349
NVIDIA-Merlin/systems#372
NVIDIA-Merlin/core#350
NVIDIA-Merlin/core#353
NVIDIA-Merlin/systems#378

@jperez999 jperez999 added enhancement New feature or request chore Infrastructure update labels Jun 21, 2023
@jperez999 jperez999 added this to the Merlin 23.07 milestone Jun 21, 2023
@jperez999 jperez999 requested review from karlhigley and rnyak June 21, 2023 22:27
@jperez999 jperez999 self-assigned this Jun 21, 2023
@review-notebook-app
Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter notebooks.

@github-actions
Documentation preview

https://nvidia-merlin.github.io/Merlin/review/pr-1022

@@ -7,7 +7,7 @@
"metadata": {},
@rnyak (Contributor) commented Jun 23, 2023

Maybe we can add some text right above this cell to explain what we are doing here and what we are registering to the Feast repo. I can do that.


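For reference, registering a feature view to a Feast repo looks roughly like the sketch below. This is a hedged example assuming feast >= 0.26; the paths and column names are placeholders, and the notebook generates its own definitions in the feature repo.

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Int32

# Placeholder entity and source; the notebook registers raw user features.
user = Entity(name="user_id", join_keys=["user_id"])

user_source = FileSource(
    path="feature_repo/data/user_features.parquet",
    timestamp_field="datetime",
)

user_features_view = FeatureView(
    name="user_features",  # QueryFeast's view= argument must match this name
    entities=[user],
    ttl=timedelta(days=1),
    schema=[Field(name="user_shops", dtype=Int32)],
    source=user_source,
)
```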

@@ -7,7 +7,7 @@
"metadata": {},
@rnyak (Contributor) commented Jun 23, 2023

We need to explain what's going on in this cell, since cat_wkflow = nvt_wkflow.get_subworkflow("items_cat") is new syntax.


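For context, the new syntax pulls a named subgraph out of the fitted workflow so it can be used as a standalone workflow. A minimal sketch, assuming nvt_wkflow is the fitted workflow from the first notebook and item_features is a merlin.io.Dataset of raw item rows:

```python
# Retrieve the "items_cat" subgraph registered in the first notebook.
cat_wkflow = nvt_wkflow.get_subworkflow("items_cat")

# Categorify the raw item features with just that piece of the pipeline,
# e.g. before looking up item embeddings.
item_cat_df = cat_wkflow.transform(item_features).compute()
```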

@@ -7,7 +7,7 @@
"metadata": {},
@rnyak (Contributor) commented Jun 23, 2023

Line #5.        view="user_features",

I guess view should be user_attributes?



Contributor replied:

@jperez999 and @karlhigley changing the view to user_attributes gives an error since it was saved to the feature store as user_features, so we should leave it as it is.

@jperez999 (Collaborator, Author) replied:

Yeah, you can't change the name of the view; that comes from the Feast configs at the bottom of notebook 1.

Contributor replied:

You can change it if it’s changed in both places though, right?

@@ -7,7 +7,7 @@
"metadata": {},
@rnyak (Contributor) commented Jun 23, 2023

Line #3.        view="item_features",

Similarly, here the view should be item_attributes? But if we use "item_attributes" it gives an error, since we registered it as item_features in the feature store.



@jperez999 (Collaborator, Author) replied:

Same as above for this change. You cannot make this change unless we change the configs.

Contributor replied:

I think we probably should change both, for the sake of clarity.

@@ -7,7 +7,7 @@
"metadata": {},
@rnyak (Contributor) commented Jun 26, 2023

Line #14.    os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

I guess we need to move this line right after import os to make it effective.



@jperez999 (Collaborator, Author) replied:

As long as import os is done beforehand, the assignment does not have to be immediately after it. If you look further up in that block, os is the first import.

@rnyak (Contributor) replied Jun 28, 2023

@jperez999 I wanted to say that if os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async" comes after import tensorflow, it might NOT work.

Contributor replied:

I think it has to happen with the os call first and import tensorflow second, so that TF picks up the config from the environment when it loads.

@rnyak (Contributor) replied Jun 28, 2023

Based on my experience, I say let's move os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async" right after import os and before import tensorflow :)
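For reference, the ordering being proposed looks like this (a minimal sketch; TensorFlow reads TF_GPU_ALLOCATOR from the environment when it initializes the GPU, so setting it before the import is the safe pattern):

```python
import os

# Set the allocator before TensorFlow is imported/initialized.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

import tensorflow as tf  # noqa: E402
```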

@@ -7,7 +7,7 @@
"metadata": {},
@rnyak (Contributor) commented Jun 26, 2023

Line #5.    item_embs_df["item_id"] = item_features["item_id"]

I am skeptical that this step is correct; it is not doing the mapping correctly. If we are trying to replace encoded items with their original raw item_ids, I think we should use unique.item_id.parquet, so we can do something like:

```python
import pandas as pd

# Map encoded item IDs back to raw IDs via Categorify's lookup table
raw_item_ids = pd.read_parquet("./categories/unique.item_id.parquet")
item_embs_df["item_id"] = raw_item_ids["item_id"].values
```



@rnyak (Contributor) replied Jun 26, 2023

In addition, are we sure we want to save item embeddings keyed by raw item_ids? Aren't we supposed to do the KNN search after the retrieval stage based on the encoded items? If yes, we should then get rid of the mapping scripts above.

@jperez999 (Collaborator, Author) replied:

You need to save the embeddings with raw item IDs because those embeddings should correspond to raw data. That way, when you do the lookup in the second notebook, you are getting the actual items; those items are then processed through the item subworkflow to get the features the model needs to rank them. Suppose you wanted to use that similarity search with a different feature engineering workflow: if you had stored the data with the item IDs already categorified, that would not work. On the item_id to raw item_id mapping, you are correct that it is not merged correctly. However, I pose the question: why are we sorting IDs by default? I see that Categorify uses it to make NAs/nulls the first items in the dataframe, but as a user I expect to see the categories assigned in first-come, first-served order, unless I elect to use freq_threshold or a kwarg that changes the strategy. Anyway, I will fix it for now.

@@ -7,7 +7,7 @@
"metadata": {},
@rnyak (Contributor) commented Jun 27, 2023

Line #1.    user_id_raw = ["user_id"] >> Rename(postfix='_raw') >> LambdaOp(lambda col: col.astype("int32")) >> TagAsUserFeatures()

I think we no longer need user_id_raw and item_id_raw since we do not register them in the feature store.



@jperez999 (Collaborator, Author) replied:

That is correct, but I wanted to leave it so people can see the incremental changes. I will remove it.

@@ -7,7 +7,7 @@
"metadata": {},
@rnyak (Contributor) commented Jun 27, 2023

Line #2.        store=feature_store,

I am not sure how this step works without registering any encoded IDs in the feature store. I assume retrieval["candidate_ids"] returns encoded item IDs? If so, how can the feature store map encoded IDs to raw IDs if it does not have that mapping stored?



@jperez999 (Collaborator, Author) replied:

Those candidate IDs come from an embedding space, and that embedding space is loaded into a vector database to enable similarity search. The IDs returned here are the raw item IDs (i.e., the actual item identifiers) from the previous notebook (note they are currently misrepresented, because Categorify always sorts the values in a dataframe; I will fix that shortly). So you take the user features and pass them to the retrieval model to get the user embedding. That embedding is projected into the item embedding space, and we run a similarity search to find the items closest (most similar) to the supplied user embedding. Once we have the real item IDs (which should be item names/UUIDs), we can query the feature store for the features of those items. The records we end up with are real item ID/feature values that have not been processed yet, which is why the next step is to process them with the item subworkflow.
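For context, the flow described above corresponds roughly to the sketch below, built from merlin.systems operators (exact signatures may differ across versions; feature_store, nvt_wkflow, query_tower, and faiss_index_path are assumed to exist from earlier steps):

```python
from merlin.systems.dag.ops.faiss import QueryFaiss
from merlin.systems.dag.ops.feast import QueryFeast
from merlin.systems.dag.ops.tensorflow import PredictTensorflow
from merlin.systems.dag.ops.workflow import TransformWorkflow

# Raw user features from Feast -> user subworkflow -> user embedding ->
# similarity search over the item embedding index.
retrieval = (
    ["user_id"]
    >> QueryFeast.from_feature_view(
        store=feature_store, view="user_features", column="user_id"
    )
    >> TransformWorkflow(nvt_wkflow.get_subworkflow("user"))
    >> PredictTensorflow(query_tower)
    >> QueryFaiss(faiss_index_path, topk=100)
)

# The candidate IDs are raw item IDs, so they can be looked up in Feast and
# then preprocessed with the item subworkflow before ranking.
item_features = retrieval["candidate_ids"] >> QueryFeast.from_feature_view(
    store=feature_store, view="item_features", column="candidate_ids"
)
ranking_inputs = item_features >> TransformWorkflow(nvt_wkflow.get_subworkflow("item"))
```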

Contributor replied:

@jperez999 as long as retrieval["candidate_ids"] returns raw item_ids, then all good.

Contributor replied:

This would be a lot easier to sort through if the dataset generation returned string ids for categorical columns, because we’d then be able to visually tell whether we were looking at raw or encoded values 🤔

Contributor replied:

@karlhigley agreed :)

Contributor replied:

I took a stab at adding string id columns to the dataset generation this morning, and quickly realized there are some bigger issues to sort out re: synthetic data and how our libraries relate to it. @oliverholworthy is going to take a look and try to figure out where we should go with that, but for now I think we're going to have to continue with integer IDs. I wonder if we could make them more visually distinct from the Categorified IDs if we added a large offset to the raw IDs (e.g. 1_000_000)?

@rnyak (Contributor) replied Jun 30, 2023

> I took a stab at adding string id columns to the dataset generation this morning, and quickly realized there are some bigger issues to sort out re: synthetic data and how our libraries relate to it. @oliverholworthy is going to take a look and try to figure out where we should go with that, but for now I think we're going to have to continue with integer IDs. I wonder if we could make them more visually distinct from the Categorified IDs if we added a large offset to the raw IDs (e.g. 1_000_000)?

@karlhigley if this is a nice-to-have, we don't have to prioritize it now. Your call; I am OK with moving ahead with integer IDs :)

@@ -7,7 +7,7 @@
"metadata": {},
@rnyak (Contributor) commented Jun 27, 2023

Line #6.    schema = train.schema.without(['user_id_raw', 'item_id_raw'])

If we remove user_id_raw and item_id_raw from the NVT pipeline above, we can remove the .without here as well.



@jperez999 jperez999 linked an issue Jun 27, 2023 that may be closed by this pull request
@@ -7,7 +7,7 @@
"metadata": {},
@rnyak (Contributor) commented Jun 27, 2023

Let's rephrase this as:

"We start with the offline candidate retrieval stage. We are going to train a Two-Tower model for item retrieval using only positive interactions, which is why we apply the Filter operator below. To learn ..."



@@ -7,7 +7,7 @@
"metadata": {},
@rnyak (Contributor) commented Jun 27, 2023

We already have a "Feature Engineering with NVTabular" section above, so it would be better to remove this title here.



@@ -7,7 +7,7 @@
"metadata": {},
@rnyak (Contributor) commented Jun 27, 2023

Let's move this paragraph up, right under the Feature Engineering with NVTabular section above.



@rnyak (Contributor) commented Jun 27, 2023

@jperez999 we need to update the unit test as well. I can push it to this PR.

@@ -7,7 +7,7 @@
"metadata": {},
@rnyak (Contributor) commented Jun 28, 2023

Line #3.    )

So we can save this as user_attributes if you want, like below:

```python
user_features.to_parquet(
    os.path.join(feature_repo_path, "data", "user_attributes.parquet")
)
```



@@ -7,7 +7,7 @@
"metadata": {},
@rnyak (Contributor) commented Jun 28, 2023

So we can save this as item_attributes if you want, like below:

```python
item_features.to_parquet(
    os.path.join(feature_repo_path, "data", "item_attributes.parquet")
)
```



@karlhigley (Contributor) left a comment:

Looks good from the perspective of code changes. @rnyak, let's update the text in a second PR.

@@ -74,7 +74,7 @@ def test_func(tmpdir):
df_lib = get_lib()
train = df_lib.read_parquet(
os.path.join(f"{tmpdir / 'data'}/processed_nvt/", "train", "part_0.parquet"),
Contributor commented:

Is that sending a request with the transformed dataset instead of the raw user_id?

@jperez999 jperez999 merged commit 237fe4b into NVIDIA-Merlin:main Jul 5, 2023
@@ -7,7 +7,7 @@
"metadata": {},
@rnyak (Contributor) commented Jul 5, 2023

Line #4.

```python
workflow = nvt.Workflow(
    ["item_id"]
    + (
        ["item_id", "item_brand", "item_category", "item_shop"]
        >> TransformWorkflow(nvt_wkflow.get_subworkflow("item"))
        >> PredictTensorflow(model_tt.first.item_block())
    )
)
```

This step is not really intuitive from a user perspective. We should explain why item_id appears twice, why we need fit_transform rather than just transform, and why we are fitting again.



Contributor replied:

This line and comment have us way down a rabbit-hole fixing underlying issues that make this awkward 😅
