Commit c50fa27

Authored by geoffreyQiu, junyiq-nv, shijieliu, and JacoCheung
Code reorganization for hstu training and inference (#202)
* Reorganize files for hstu training and inference
* Update README files
* Move gin files into dedicated configs folder
* Fix python import error. Add consistency check doc
* Update hstu example readme
* Refactor training folder and docs (#203)
  * Refactor training folder and docs
  * Move root RM env setting up to training
  * Move root ReadMe env setting up to training
  Co-authored-by: JacoCheung <junzhang@nvidia.com>

Co-authored-by: Junyi Qiu <junyiq@nvidia.com>
Co-authored-by: aleliu <aleliu@nvidia.com>
Co-authored-by: Junzhang <32166257+JacoCheung@users.noreply.github.com>
Co-authored-by: JacoCheung <junzhang@nvidia.com>
1 parent 15c79e9 commit c50fa27

38 files changed: +885 additions, -638 deletions

README.md

Lines changed: 0 additions & 38 deletions
````diff
@@ -35,44 +35,6 @@ The project includes:
 </details>
 For more detailed release notes, please refer our [releases](https://github.com/NVIDIA/recsys-examples/releases).
 
-# Environment Setup
-## Start from dockerfile
-
-We provide [dockerfile](./docker/Dockerfile) for users to build environment.
-```
-docker build -f docker/Dockerfile --platform linux/amd64 -t recsys-examples:latest .
-```
-If you want to build image for Grace, you can use
-```
-docker build -f docker/Dockerfile --platform linux/arm64 -t recsys-examples:latest .
-```
-You can also set your own base image with args `--build-arg <BASE_IMAGE>`.
-
-## Start from source file
-Before running examples, build and install libs under corelib following instruction in documentation:
-- [HSTU attention documentation](./corelib/hstu/README.md)
-- [Dynamic Embeddings documentation](./corelib/dynamicemb/README.md)
-
-On top of those two core libs, Megatron-Core along with other libs are required. You can install them via pypi package:
-
-```bash
-pip install torchx gin-config torchmetrics==1.0.3 typing-extensions iopath megatron-core==0.9.0
-```
-
-If you fail to install the megatron-core package, usually due to the python version incompatibility, please try to clone and then install the source code.
-
-```bash
-git clone -b core_r0.9.0 https://github.com/NVIDIA/Megatron-LM.git megatron-lm && \
-pip install -e ./megatron-lm
-```
-
-We provide our custom HSTU CUDA operators for enhanced performance. You need to install these operators using the following command:
-
-```bash
-cd /workspace/recsys-examples/examples/hstu && \
-python setup.py install
-```
-
 # Get Started
 The examples we supported:
 - [HSTU recommender examples](./examples/hstu/README.md)
````

examples/commons/utils/logger.py

Lines changed: 17 additions & 3 deletions
```diff
@@ -12,16 +12,30 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from datetime import datetime
+import logging
 
 import torch
+from rich.console import Console
+from rich.logging import RichHandler
+
+# Set up logger with RichHandler if not already configured
+
+console = Console()
+_LOGGER = logging.getLogger("rich_rank0")
+
+if not _LOGGER.hasHandlers():
+    handler = RichHandler(
+        console=console, show_time=True, show_path=False, rich_tracebacks=True
+    )
+    _LOGGER.addHandler(handler)
+    _LOGGER.propagate = False
+    _LOGGER.setLevel(logging.INFO)
 
 
 def print_rank_0(message):
    """If distributed is initialized, print only on rank 0."""
     if torch.distributed.is_initialized():
-        now = datetime.now()
         if torch.distributed.get_rank() == 0:
-            print(f"[{now}] " + message, flush=True)
+            _LOGGER.info(message)
     else:
         print(message, flush=True)
```
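As a usage note: with this change, `print_rank_0` routes rank-0 messages through `logging` with a `RichHandler`, which supplies the timestamp that the removed manual `datetime` prefix used to add. A minimal standalone sketch of the same pattern (the `"rich_rank0_demo"` logger name is illustrative, not from the commit; assumes the `rich` package is installed):

```python
# Minimal sketch of the rank-0 logging pattern introduced above.
import logging

from rich.console import Console
from rich.logging import RichHandler

logger = logging.getLogger("rich_rank0_demo")
if not logger.hasHandlers():  # guard keeps re-imports from stacking handlers
    handler = RichHandler(
        console=Console(),
        show_time=True,       # RichHandler now renders the timestamp itself
        show_path=False,
        rich_tracebacks=True,
    )
    logger.addHandler(handler)
    logger.propagate = False  # don't also emit through the root logger
    logger.setLevel(logging.INFO)

logger.info("iteration 100, loss: 0.250000")
```

In a `torchrun` job only rank 0 reaches the `_LOGGER.info` call, so worker output stays deduplicated.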

examples/commons/utils/stringify.py

Lines changed: 3 additions & 3 deletions
```diff
@@ -34,11 +34,11 @@ def stringify_dict(input_dict, prefix="", sep=","):
         value.float()
         assert value.dim() == 0
         value = value.cpu().item()
-        output += key + ":" + f"{value:6f}{sep}"
+        output += key + ": " + f"{value:6f}{sep}"
     elif isinstance(value, float):
-        output += key + ":" + f"{value:6f}{sep}"
+        output += key + ": " + f"{value:6f}{sep}"
     elif isinstance(value, int):
-        output += key + ":" + f"{value}{sep}"
+        output += key + ": " + f"{value}{sep}"
     else:
         assert RuntimeError(f"stringify dict not supports type {type(value)}")
     # remove the ending sep
```

examples/hstu/README.md

Lines changed: 5 additions & 107 deletions
````diff
@@ -1,8 +1,9 @@
-# Examples: to demonstrate how to train generative recommendation models
+# Examples: to demonstrate how to do training and inference generative recommendation models
 
 ## Generative Recommender Introduction
 Meta's paper ["Actions Speak Louder Than Words"](https://arxiv.org/abs/2402.17152) introduces a novel paradigm for recommendation systems called **Generative Recommenders(GRs)**, which reformulates recommendation tasks as generative modeling problems. The work introduced Hierarchical Sequential Transduction Units (HSTU), a novel architecture designed to handle high-cardinality, non-stationary data streams in large-scale recommendation systems. HSTU enables both retrieval and ranking tasks. As noted in the paper, “HSTU-based GRs, with 1.5 trillion parameters, improve metrics in online A/B tests by 12.4% and have been deployed on multiple surfaces of a large internet platform with billions of users.”
-While **distributed-recommender** supports both retrieval and ranking use cases, in the following sections, we will guide you through the process of building a generative recommender for ranking tasks.
+
+In this example, we introduce the model architecture, training, and inference processes of HSTU. For more details, refer to the [training](./training/) and [inference](./inference/) entry folders, which include comprehensive guides and benchmark results.
 
 ## Ranking Model Introduction
 The model structure of the generative ranking model can be depicted by the following picture.
@@ -31,113 +32,10 @@ The HSTU block is a core component of the architecture, which modifies tradition
 ### Prediction Head
 The prediction head of the HSTU model employs a MLP network structure, enabling multi-task predictions.
 
-## Parallelism for HSTU-based Generative Recommender
-Scaling is a crucial factor for HSTU-based GRs due to their demonstrated superior scalability compared to traditional Deep Learning Recommendation Models (DLRMs). According to the paper, while DLRMs plateau at around 200 billion parameters, GRs can scale up to 1.5 trillion parameters, resulting in improved model accuracy.
-
-However, achieving efficient scaling for GRs presents unique challenges. Existing libraries designed for large-scale training in LLMs or recommendation systems often fail to meet the specific needs of GRs:
-* **[Megatron-LM](https://github.com/NVIDIA/Megatron-LM)**, which supports advanced parallelism (e.g Data, Tensor, Sequence, Pipeline, and Context parallelism), is not well-suited for recommendation systems due to their reliance on massive embedding tables that cannot be effectively handled by existing parallelism.
-* **[TorchRec](https://github.com/pytorch/torchrec)**, while providing solutions for sharding large embedding tables across GPUs, lacks robust support for dense model parallelism. This makes it difficult for users to combine embedding and dense parallelism without significant design effort
-
-To address these limitations, a hybrid approach combining sparse and dense parallelism is introduced as the pic shows.
-**TorchRec** is employed to shard large embedding tables effectively.
-**Megatron-Core** is used to support data and context parallelism for the dense components of the model. Please note that context parallelism is planned as part of future development.
-This integration ensures efficient training by coordinating sparse (embedding) and dense (context/data) parallelisms within a single model.
-![parallelism](./figs/parallelism.png)
-
-
-## Dataset Introduction
-
-We have supported several datasets as listed in the following sections:
-
-### Dataset Information
-#### **MovieLens**
-refer to [MovieLens 1M](https://grouplens.org/datasets/movielens/1m/) and [MovieLens 20M](https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset) for details.
-#### **KuaiRand**
-
-| dataset       | # users | seqlen max | seqlen min | seqlen mean | seqlen median | # items  |
-|---------------|---------|------------|------------|-------------|---------------|----------|
-| kuairand_pure | 27285   | 910        | 1          | 1           | 39            | 7551     |
-| kuairand_1k   | 1000    | 49332      | 10         | 5038        | 3379          | 4369953  |
-| kuairand_27k  | 27285   | 228000     | 100        | 11796       | 8591          | 32038725 |
-
-refer to [KuaiRand](https://kuairand.com/) for details.
-
 ## Running the examples
 
-Before getting started, please make sure that all pre-requisites are fulfilled. You can refer to [Get Started][../../README] section in the root directory of the repo to set up the environment.
-
-### Dataset Preprocessing
-We provides preprocessor scripts to assist in downloading raw data if it is not already present. It processes the raw data into csv files.
-```bash
-mkdir -p ./tmp_data && python3 preprocessor.py --dataset_name <dataset-name>
-```
-The following dataset-name is supported:
-* ml-1m
-* ml-20m
-* kuairand-pure
-* kuairand-1k
-* kuairand-27k
-* all: preprocess all above datasets
-
-
-### Start training
-The entrypoint for training are `pretrain_gr_retrieval.py` or `pretrain_gr_ranking.py`. We use gin-config to specify the model structure, training arguments, hyper-params etc.
-To run retrieval task with `MovieLens 20m` dataset:
-
-```bash
-# Before running the `pretrain_gr_retrieval.py`, make sure that current working directory is `hstu`
-PYTHONPATH=${PYTHONPATH}:$(realpath ../) torchrun --nproc_per_node 1 --master_addr localhost --master_port 6000 pretrain_gr_retrieval.py --gin-config-file movielen_retrieval.gin
-```
-
-To run ranking task with `MovieLens 20m` dataset:
-```bash
-# Before running the `pretrain_gr_ranking.py`, make sure that current working directory is `hstu`
-PYTHONPATH=${PYTHONPATH}:$(realpath ../) torchrun --nproc_per_node 1 --master_addr localhost --master_port 6000 pretrain_gr_ranking.py --gin-config-file movielen_ranking.gin
-```
-
-## KVCache Manager for Inference
-
-### KVCache Usage
-
-1. KVCache Manager supports the following operations:
-* `get_user_kvdata_info`: to get current cached length and index of the first cached tokens in the history sequence
-* `prepare_kv_cache`: to allocate the required cache pages. The input history sequence need to be
-* `paged_kvcache_ops.append_kvcache`: the cuda kernel to copy the `K, V` values into the allocated cache pages
-* `offload_kv_cache`: to offload the KV data from GPU KVCache to Host KV storage.
-* `evict_kv_cache`: to evict all the KV data in the KVCache Manager.
-
-2. Currently, the KVCache manager need to be access from a single thread.
-
-3. For different requests, the call to `get_user_kvdata_info` and `prepare_kv_cache` need to be in order and cannot be interleaved. Since the allocation in `prepare_kv_cache` may evict the cached data of other users, which changes the user kvdata_info.
-
-4. The KVCache manager does not support uncontinuous user history sequence as input from the same user. The overlapping tokens need to be removed before sending the sequence to the inference model. Doing the overrlapping removal in the upstream stage should be more performant than in the inference model.
-
-```
-[current KV data in cache] userId: 0, starting position: 0, cached length: 10
-[next input] {userId: 0, starting position: 10, length: 10}
-# Acceptable input
-
-[current KV data in cache] userId: 0, starting position: 0, cached length: 10
-[next input] {userId: 0, starting position: 20, length: 10}
-             ^^^^^^^^^^^^^^^^^^^^^
-ERROR: The input sequence has missing tokens from 10 to 19 (both inclusive).
-
-[current KV data in cache] userId: 0, starting position: 0, cached length: 10
-[next input] {userId: 0, starting position: 5, length: 20}
-             ^^^^^^^^^^^^^^^^^^^^^
-ERROR: The input sequence has overlapping tokens from 5 to 9 (both inclusive).
-```
-
-### Example: Kuairand-1K
-
-```
-~$ # Proprocess the dataset for inference:
-~$ python3 ./preprocessor.py --dataset_name "kuairand-1k" --inference
-~$
-~$ # Run the inference example
-~$ python3 ./inference_gr_ranking.py --gin_config_file ./kuairand_1k_inference_ranking.gin --checkpoint_dir ${PATH_TO_CHECKPOINT} --mode eval
-```
-
+* [HSTU training example](./training/)
+* [HSTU inference example](./inference/)
 
 # Acknowledgements
 
````
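The KVCache notes removed above (now maintained under the inference folder) impose a strict per-request call order. A hypothetical sketch of a conforming driver, using the operation names from the removed text; the `kv_mgr` object and the exact signatures are illustrative assumptions, not the actual inference API:

```python
# Hypothetical driver obeying the removed docs' KVCache rules; the manager
# object and method signatures are illustrative, not the real API.
def serve_request(kv_mgr, user_id, tokens, start_pos):
    # Rule 3: get_user_kvdata_info and prepare_kv_cache must run back to back
    # for one request, because prepare_kv_cache may evict other users' data
    # and invalidate their kvdata_info.
    cached_start, cached_len = kv_mgr.get_user_kvdata_info(user_id)

    # Rule 4: the input must continue the cached history exactly -- no gaps,
    # no overlaps -- so overlapping tokens are deduplicated upstream.
    expected_pos = cached_start + cached_len
    if start_pos != expected_pos:
        raise ValueError(
            f"user {user_id}: input starts at {start_pos}, "
            f"but cache ends at {expected_pos}"
        )

    # Allocate pages for the new tokens; the append_kvcache CUDA kernel then
    # copies the computed K/V values into the allocated pages.
    kv_mgr.prepare_kv_cache(user_id, len(tokens))
```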
