Address virtual dev review comments

This commit also includes items flagged by flake8_nb.
NVIDIA-Merlin · Sep 19, 2022 · a203db2
1 parent e348633

Showing 15 changed files with 44 additions and 51 deletions.
diff --git a/README.md b/README.md
@@ -31,7 +31,7 @@ build recommender systems from end to end. With NVIDIA Merlin, you can:
 NVIDIA Merlin consists of the following open source libraries:
 
 **[NVTabular](https://github.com/NVIDIA-Merlin/NVTabular)**
-[![PyPI version shields.io](https://img.shields.io/pypi/v/nvtabular.svg)](https://pypi.python.org/pypi/nvtabular/)
+[![PyPI version shields.io](https://img.shields.io/pypi/v/nvtabular.svg)](https://pypi.org/project/nvtabular/)
 [![ Documentation](https://img.shields.io/badge/documentation-blue.svg)](https://nvidia-merlin.github.io/NVTabular/main/Introduction.html)
 <br> NVTabular is a feature engineering and preprocessing library for tabular
 data. The library can quickly and easily manipulate terabyte-size datasets that
@@ -58,7 +58,7 @@ HugeCTR, you can:
   manner during the training stage.
 
 **[Merlin Models](https://github.com/NVIDIA-Merlin/models)**
-[![PyPI version shields.io](https://img.shields.io/pypi/v/merlin-models.svg)](https://pypi.python.org/pypi/merlin-models/)<br>
+[![PyPI version shields.io](https://img.shields.io/pypi/v/merlin-models.svg)](https://pypi.org/project/merlin-models/)<br>
 The Merlin Models library provides standard models for recommender systems with
 an aim for high-quality implementations that range from classic machine learning
 models to highly-advanced deep learning models. With Merlin Models, you can:
@@ -72,7 +72,7 @@ models to highly-advanced deep learning models. With Merlin Models, you can:
   you can create of new models quickly and easily.
 
 **[Merlin Systems](https://github.com/NVIDIA-Merlin/systems)**
-[![PyPI version shields.io](https://img.shields.io/pypi/v/merlin-systems.svg)](https://pypi.python.org/pypi/merlin-systems/)<br>
+[![PyPI version shields.io](https://img.shields.io/pypi/v/merlin-systems.svg)](https://pypi.org/project/merlin-systems/)<br>
 Merlin Systems provides tools for combining recommendation models with other
 elements of production recommender systems like feature stores, nearest neighbor
 search, and exploration strategies into end-to-end recommendation pipelines that
@@ -86,7 +86,7 @@ can be served with Triton Inference Server. With Merlin Systems, you can:
   in recommender system pipelines.
 
 **[Merlin Core](https://github.com/NVIDIA-Merlin/core)**
-[![PyPI version shields.io](https://img.shields.io/pypi/v/merlin-core.svg)](https://pypi.python.org/pypi/merlin-core/)<br>
+[![PyPI version shields.io](https://img.shields.io/pypi/v/merlin-core.svg)](https://pypi.org/project/merlin-core/)<br>
 Merlin Core provides functionality that is used throughout the Merlin ecosystem.
 With Merlin Core, you can:
 
@@ -99,7 +99,7 @@ With Merlin Core, you can:
 
 ## Installation
 
-The simplest way to use Merlin is to run a docker container. NVIDIA GPU Cloud (NCG) provides containers that include all the Merlin component libraries, dependencies, and receive unit and integration testing. For more information, see the [Containers](https://nvidia-merlin.github.io/Merlin/main/containers.html) page.
+The simplest way to use Merlin is to run a docker container. NVIDIA GPU Cloud (NGC) provides containers that include all the Merlin component libraries, dependencies, and receive unit and integration testing. For more information, see the [Containers](https://nvidia-merlin.github.io/Merlin/main/containers.html) page.
 
 To develop and contribute to Merlin, review the installation documentation for each component library. The development environment for each Merlin component is easily set up with `conda` or `pip`:
 
@@ -130,7 +130,7 @@ real-world use cases.
 **[cuDF](https://github.com/rapidsai/cudf)**<br> Merlin relies on cuDF for
 GPU-accelerated DataFrame operations used in feature engineering.
 
-**[Dask](https://dask.org/)**<br> Merlin relies on Dask to distribute and scale
+**[Dask](https://www.dask.org/)**<br> Merlin relies on Dask to distribute and scale
 feature engineering and preprocessing within NVTabular and to accelerate
 dataloading in Merlin Models and HugeCTR.
 

diff --git a/docs/source/containers.rst b/docs/source/containers.rst
@@ -2,7 +2,7 @@ Merlin Containers
 =================
 
 Merlin and the Merlin component libraries are available in Docker containers from the NVIDIA GPU Cloud (NCG) catalog.
-Access the catalog of containers at http://ngc.nvidia.com/catalog/containers.
+Access the catalog of containers at https://catalog.ngc.nvidia.com/containers.
 
 The following table identifies the container names, catalog URL, and key Merlin components.
 

diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -6,7 +6,7 @@ systems on NVIDIA GPUs.
 The library brings together several component libraries that simplifies
 the process of building high-performing recommenders at scale.
 
-For more information, see the `Introduction <README.html>`_.
+For more information, see the `Introduction <README.md>`_.
 
 Related Resources
 -----------------

diff --git a/docs/source/toc.yaml b/docs/source/toc.yaml
@@ -38,15 +38,13 @@ subtrees:
                     title: Criteo Download and Convert
                   - file: examples/scaling-criteo/02-ETL-with-NVTabular.ipynb
                     title: Feature Engineering with NVTabular
-                  - file: examples/scaling-criteo/03-Training-with-FastAI.ipynb
-                    title: Training with FastAI
                   - file: examples/scaling-criteo/03-Training-with-HugeCTR.ipynb
                     title: Training with HugeCTR
-                  - file: examples/scaling-criteo/03-Training-with-TF.ipynb
-                    title: Training with TensorFlow
+                  - file: examples/scaling-criteo/03-Training-with-Merlin-Models-TensorFlow.ipynb
+                    title: Training with Merlin Models and TensorFlow
                   - file: examples/scaling-criteo/04-Triton-Inference-with-HugeCTR.ipynb
-                    title: Serving the HugeCTR Model with Triton
-                  - file: examples/scaling-criteo/04-Triton-Inference-with-TF.ipynb
-                    title: Serving the TensorFlow Model with Triton
+                    title: Deploy the HugeCTR Model with Triton
+                  - file: examples/scaling-criteo/04-Triton-Inference-with-Merlin-Models-TensorFlow.ipynb
+                    title: Deploy the TensorFlow Model with Triton
       - file: containers.rst
       - file: support_matrix/index.rst
diff --git a/...ilding-and-deploying-multi-stage-RecSys/01-Building-Recommender-Systems-with-Merlin.ipynb b/...ilding-and-deploying-multi-stage-RecSys/01-Building-Recommender-Systems-with-Merlin.ipynb
@@ -50,7 +50,7 @@
    "id": "405280b0-3d48-43b6-ab95-d29be7a43e9e",
    "metadata": {},
    "source": [
-    "The figure below represents a four-stage recommender systems. This is more complex process than only training a single model and deploying it, and it is much more realistic and closer to what's happening in the real-world recommender production systems."
+    "The following figure represents a four-stage recommender system. This is a more complex process than only training a single model and deploying it. The figure shows a more realistic process and closer to what's happening in the real-world recommender production systems."
    ]
   },
   {
@@ -169,7 +169,6 @@
     "    TagAsUserFeatures,\n",
     "    AddMetadata,\n",
     "    Filter,\n",
-    "    Rename\n",
     ")\n",
     "\n",
     "from merlin.schema.tags import Tags\n",

diff --git a/examples/Building-and-deploying-multi-stage-RecSys/README.md b/examples/Building-and-deploying-multi-stage-RecSys/README.md
@@ -6,19 +6,19 @@ The notebooks demonstrate how to use the NVTabular, Merlin Models, and Merlin Sy
 
 The two example notebooks are structured as follows:
 
-- [Building the Recommender System](01-Building-Recommender-Systems-with-Merlin.ipynb): 
+- [Building the Recommender System](01-Building-Recommender-Systems-with-Merlin.ipynb):
   - Execute the preprocessing and feature engineering pipeline (ETL) with NVTabular on the GPU/CPU.
   - Train a ranking and retrieval model with TensorFlow based on the ETL output.
   - Export the saved models, user and item features, and item embeddings.
 
-- [Deploying the Recommender System with Triton](02-Deploying-multi-stage-RecSys-with-Merlin-Systems.ipynb): 
+- [Deploying the Recommender System with Triton](02-Deploying-multi-stage-RecSys-with-Merlin-Systems.ipynb):
   - Set up a Feast feature store for feature storing and a Faiss index for similarity search.
   - Build a multi-stage recommender system ensemble pipeline with Merlin Systems operators.
   - Perform inference with the Triton Inference Server using the Merlin Systems library.
 
 ## Running the Example Notebooks
 
-Merlin docker containers are available on http://ngc.nvidia.com/catalog/containers/ with pre-installed versions. For `Building-and-deploying-multi-stage-RecSys` example notebooks we used `merlin-tensorflow-inference` container that has NVTabular with TensorFlow and Triton Inference support.
+Merlin docker containers are available on <https://catalog.ngc.nvidia.com/containers> with pre-installed versions. For `Building-and-deploying-multi-stage-RecSys` example notebooks we used `merlin-tensorflow-inference` container that has NVTabular with TensorFlow and Triton Inference support.
 
 To run the example notebooks using Docker containers, do the following:
 
@@ -32,7 +32,7 @@ The container will open a shell when the run command execution is completed. You
    ```
    pip install jupyterlab
    ```
-   
+
    For more information, see [Installation Guide](https://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html).
 
 2. Start the jupyter-lab server by running the following command:

diff --git a/examples/README.md b/examples/README.md
@@ -1,7 +1,7 @@
 # NVIDIA Merlin Example Notebooks
 
 We have a collection of Jupyter example notebooks that are based on different datasets to provide end-to-end examples for NVIDIA Merlin.
-These example notebooks demonstrate how to use NVTabular with TensorFlow, PyTorch, [HugeCTR](https://github.com/NVIDIA/HugeCTR) and [Merlin Models](https://github.com/NVIDIA-Merlin/models).
+These example notebooks demonstrate how to use NVTabular with TensorFlow, PyTorch, [HugeCTR](https://github.com/NVIDIA-Merlin/HugeCTR) and [Merlin Models](https://github.com/NVIDIA-Merlin/models).
 Each example provides additional details about the end-to-end workflow, such as includes ETL, training, and inference.
 
 ## Inventory
@@ -31,20 +31,19 @@ Many users are familiar with this dataset, so the notebooks focus primarily on t
 ### [Scaling Large Datasets with Criteo](./scaling-criteo)
 
 [Criteo](https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset/) provides the largest publicly available dataset for recommender systems with a size of 1TB of uncompressed click logs that contain 4 billion examples.
-These notebooks demonstrate how to scale NVTabular:
 
-- Use multiple GPUs and nodes with NVTabular for ETL.
-- Train recommender system models with the NVTabular dataloader for PyTorch.
-- Train recommender system models with the NVTabular dataloader for TensorFlow.
-- Train recommender system models with the NVTabular dataloader using Merlin Models.
-- Train recommender system models with HugeCTR using a multi-GPU.
-- Inference with the Triton Inference Server and TensorFlow or HugeCTR.
+These notebooks demonstrate how to scale NVTabular as well as the following:
+
+- Use multiple GPUs and nodes with NVTabular for feature engineering.
+- Train recommender system models with the Merlin Models for TensorFlow.
+- Train recommender system models with HugeCTR using multiple GPUs.
+- Inference with the Triton Inference Server and Merlin Models for TensorFlow or HugeCTR.
 
 ## Running the Example Notebooks
 
 You can run the examples with Docker containers.
 Docker containers are available from the NVIDIA GPU Cloud catalog.
-Access the catalog of containers at <http://ngc.nvidia.com/catalog/containers>.
+Access the catalog of containers at <https://catalog.ngc.nvidia.com/containers>.
 
 Depending on which example you want to run, you should use any one of these Docker containers:
 

diff --git a/examples/getting-started-movielens/02-ETL-with-NVTabular.ipynb b/examples/getting-started-movielens/02-ETL-with-NVTabular.ipynb
@@ -576,7 +576,7 @@
    "source": [
     "In general, the `Op`s in our `Workflow` will require measurements of statistical properties of our data in order to be leveraged. For example, the `Normalize` op requires measurements of the dataset mean and standard deviation, and the `Categorify` op requires an accounting of all the categories a particular feature can manifest. However, we frequently need to measure these properties across datasets which are too large to fit into GPU memory (or CPU memory for that matter) at once.\n",
     "\n",
-    "NVTabular solves this by providing the `Dataset` class, which breaks a set of parquet or csv files into into a collection of `cudf.DataFrame` chunks that can fit in device memory. The main purpose of this class is to abstract away the raw format of the data, and to allow other NVTabular classes to reliably materialize a `dask_cudf.DataFrame` collection (and/or collection-based iterator) on demand. Under the hood, the data decomposition corresponds to the construction of a [dask_cudf.DataFrame](https://docs.rapids.ai/api/cudf/stable/) object.  By representing our dataset as a lazily-evaluated [Dask](https://dask.org/) collection, we can handle the calculation of complex global statistics (and later, can also iterate over the partitions while feeding data into a neural network). `part_size` defines the size read into GPU-memory at once."
+    "NVTabular solves this by providing the `Dataset` class, which breaks a set of parquet or csv files into into a collection of `cudf.DataFrame` chunks that can fit in device memory. The main purpose of this class is to abstract away the raw format of the data, and to allow other NVTabular classes to reliably materialize a `dask_cudf.DataFrame` collection (and/or collection-based iterator) on demand. Under the hood, the data decomposition corresponds to the construction of a [dask_cudf.DataFrame](https://docs.rapids.ai/api/cudf/stable/) object.  By representing our dataset as a lazily-evaluated [Dask](https://www.dask.org/) collection, we can handle the calculation of complex global statistics (and later, can also iterate over the partitions while feeding data into a neural network). `part_size` defines the size read into GPU-memory at once."
    ]
   },
   {

diff --git a/examples/getting-started-movielens/03-Training-with-HugeCTR.ipynb b/examples/getting-started-movielens/03-Training-with-HugeCTR.ipynb
@@ -49,7 +49,7 @@
    "id": "16956c69",
    "metadata": {},
    "source": [
-    "### Why using HugeCTR?\n",
+    "### Why use HugeCTR?\n",
     "\n",
     "HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-Through Rates (CTRs).<br>\n",
     "\n",

diff --git a/examples/getting-started-movielens/03-Training-with-PyTorch.ipynb b/examples/getting-started-movielens/03-Training-with-PyTorch.ipynb
@@ -204,7 +204,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "First, we take a look on our dataloader and how the data is represented as tensors. The NVTabular dataloader are initialized as usually and we specify both single-hot and multi-hot categorical features as cats. The dataloader will automatically recognize the single/multi-hot columns and represent them accordingly."
+    "First, we take a look on our dataloader and how the data is represented as tensors. The NVTabular dataloaders are initialized as usual and we specify both single-hot and multi-hot categorical features as cats. The dataloader can automatically recognize the single/multi-hot columns and represent them accordingly."
    ]
   },
   {

diff --git a/examples/getting-started-movielens/03-Training-with-TF.ipynb b/examples/getting-started-movielens/03-Training-with-TF.ipynb
@@ -716,7 +716,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Now, we can move to the next notebook, [04-Triton-Inference-with-TF.ipynb](https://github.com/NVIDIA/NVTabular/blob/main/examples/getting-started-movielens/04-Triton-Inference-with-TF.ipynb), to send inference request to the Triton IS."
+    "Now, we can move to the next notebook, [04-Triton-Inference-with-TF.ipynb](https://github.com/NVIDIA-Merlin/NVTabular/blob/main/examples/getting-started-movielens/04-Triton-Inference-with-TF.ipynb), to send inference request to the Triton IS."
    ]
   }
  ],

diff --git a/examples/scaling-criteo/02-ETL-with-NVTabular.ipynb b/examples/scaling-criteo/02-ETL-with-NVTabular.ipynb
@@ -40,7 +40,7 @@
     "\n",
     "NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems. It provides a high level abstraction to simplify code and accelerates computation on the GPU using the RAPIDS cuDF library.<br><br>\n",
     "\n",
-    "**In this notebook, we will show how to scale NVTabular to multi-GPUs and multiple nodes.** Prerequisite is to be familiar with NVTabular and its API. You can read more NVTabular and its API in our [Getting Started with Movielens notebooks](https://github.com/NVIDIA/NVTabular/tree/main/examples/getting-started-movielens).<br><br>\n",
+    "**In this notebook, we will show how to scale NVTabular to multi-GPUs and multiple nodes.** Prerequisite is to be familiar with NVTabular and its API. You can read more NVTabular and its API in our [Getting Started with Movielens notebooks](https://github.com/NVIDIA-Merlin/NVTabular/tree/main/examples/getting-started-movielens).<br><br>\n",
     "\n",
     "The full [Criteo 1TB Click Logs dataset](https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset/) contains ~1.3 TB of uncompressed click logs containing over four billion samples spanning 24 days. In our benchmarks, we are able to preprocess and engineer features in **13.8min with 1x NVIDIA A100 GPU and 1.9min with 8x NVIDIA A100 GPUs**. This is a **speed-up of 100x-10000x** in comparison to different CPU versions, You can read more in our [blog](https://developer.nvidia.com/blog/announcing-the-nvtabular-open-beta-with-multi-gpu-support-and-new-data-loaders/).\n",
     "\n",
@@ -215,7 +215,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Now we configure and deploy a Dask Cluster. Please, [read this document](https://github.com/NVIDIA/NVTabular/blob/d419a4da29cf372f1547edc536729b0733560a44/bench/examples/MultiGPUBench.md) to know how to set the parameters."
+    "Now we configure and deploy a Dask Cluster. Please, [read this document](https://github.com/NVIDIA-Merlin/NVTabular/blob/d419a4da29cf372f1547edc536729b0733560a44/bench/examples/MultiGPUBench.md) to know how to set the parameters."
    ]
   },
   {

diff --git a/examples/scaling-criteo/03-Training-with-HugeCTR.ipynb b/examples/scaling-criteo/03-Training-with-HugeCTR.ipynb
@@ -43,7 +43,7 @@
     "\n",
     "HugeCTR is able to train recommender system models with larger-than-memory embedding tables by leveraging a parameter server. \n",
     "\n",
-    "You can find more information about HugeCTR [here](https://github.com/NVIDIA/HugeCTR).\n",
+    "You can find more information about HugeCTR from the [GitHub repository](https://github.com/NVIDIA-Merlin/HugeCTR).\n",
     "\n",
     "### Learning objectives\n",
     "\n",

diff --git a/examples/scaling-criteo/03-Training-with-Merlin-Models-TensorFlow.ipynb b/examples/scaling-criteo/03-Training-with-Merlin-Models-TensorFlow.ipynb
@@ -70,10 +70,9 @@
    "outputs": [],
    "source": [
     "import os\n",
-    "os.environ[\"TF_GPU_ALLOCATOR\"]=\"cuda_malloc_async\"\n",
+    "os.environ[\"TF_GPU_ALLOCATOR\"] = \"cuda_malloc_async\"\n",
     "\n",
     "import glob\n",
-    "import time\n",
     "import merlin.models.tf as mm\n",
     "from merlin.io.dataset import Dataset\n",
     "\n",
@@ -134,7 +133,7 @@
    "id": "97375902",
    "metadata": {},
    "source": [
-    "We will use Merlin Dataset object to initalize the dataloaders. It provides a dataset schema to initialize the model architectures. The [Merlin Models examples](https://github.com/NVIDIA-Merlin/models/tree/main/examples) will explain more details."
+    "We will use Merlin Dataset object to initialize the dataloaders. It provides a dataset schema to initialize the model architectures. The [Merlin Models examples](https://github.com/NVIDIA-Merlin/models/tree/main/examples) will explain more details."
    ]
   },
   {
@@ -153,7 +152,7 @@
    "id": "edf40eb9",
    "metadata": {},
    "source": [
-    "We initalize the DLRM architecture with Merlin Models."
+    "We initialize the DLRM architecture with Merlin Models."
    ]
   },
   {
@@ -192,11 +191,11 @@
     "%%time\n",
     "\n",
     "model.compile(optimizer=OPTIMIZER, run_eagerly=False)\n",
-    "model.fit(train, \n",
-    "          validation_data=valid, \n",
-    "          batch_size=BATCH_SIZE, \n",
+    "model.fit(train,\n",
+    "          validation_data=valid,\n",
+    "          batch_size=BATCH_SIZE,\n",
     "          epochs=EPOCHS\n",
-    ")"
+    "          )"
    ]
   },
   {
@@ -257,7 +256,7 @@
     "\n",
     "## Next steps\n",
     "\n",
-    "[The next step](04-Triton-Inference-with-Merlin-Models-TensorFlow) is to deploy the NVTabular workflow and DLRM model to production.\n",
+    "The next step  is to [deploy the NVTabular workflow and DLRM model](04-Triton-Inference-with-Merlin-Models-TensorFlow.ipynb) to production.\n",
     "\n",
     "If you are interested more in different architecture and training models with Merlin Models, we recommend to check out our [Merlin Models examples](https://github.com/NVIDIA-Merlin/models/)"
    ]