38 changes: 0 additions & 38 deletions README.md
@@ -35,44 +35,6 @@ The project includes:
</details>
For more detailed release notes, please refer to our [releases](https://github.com/NVIDIA/recsys-examples/releases).

# Environment Setup
## Start from Dockerfile

We provide a [Dockerfile](./docker/Dockerfile) for users to build the environment.
```bash
docker build -f docker/Dockerfile --platform linux/amd64 -t recsys-examples:latest .
```
If you want to build an image for Grace (ARM), use
```bash
docker build -f docker/Dockerfile --platform linux/arm64 -t recsys-examples:latest .
```
You can also set your own base image with `--build-arg BASE_IMAGE=<image>`.

## Start from source file
Before running the examples, build and install the libraries under `corelib`, following the instructions in their documentation:
- [HSTU attention documentation](./corelib/hstu/README.md)
- [Dynamic Embeddings documentation](./corelib/dynamicemb/README.md)

On top of those two core libraries, Megatron-Core and a few other dependencies are required. You can install them via pip:

```bash
pip install torchx gin-config torchmetrics==1.0.3 typing-extensions iopath megatron-core==0.9.0
```

If installing the megatron-core package fails, usually due to a Python version incompatibility, try cloning the source and installing from it:

```bash
git clone -b core_r0.9.0 https://github.com/NVIDIA/Megatron-LM.git megatron-lm && \
pip install -e ./megatron-lm
```

We provide custom HSTU CUDA operators for enhanced performance. Install them with the following command:

```bash
cd /workspace/recsys-examples/examples/hstu && \
python setup.py install
```

# Get Started
The supported examples:
- [HSTU recommender examples](./examples/hstu/README.md)
20 changes: 17 additions & 3 deletions examples/commons/utils/logger.py
@@ -12,16 +12,30 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from datetime import datetime
import logging

import torch
from rich.console import Console
from rich.logging import RichHandler

# Set up logger with RichHandler if not already configured

console = Console()
_LOGGER = logging.getLogger("rich_rank0")

if not _LOGGER.hasHandlers():
    handler = RichHandler(
        console=console, show_time=True, show_path=False, rich_tracebacks=True
    )
    _LOGGER.addHandler(handler)
    _LOGGER.propagate = False
    _LOGGER.setLevel(logging.INFO)


def print_rank_0(message):
    """If distributed is initialized, print only on rank 0."""
    if torch.distributed.is_initialized():
        now = datetime.now()
        if torch.distributed.get_rank() == 0:
            print(f"[{now}] " + message, flush=True)
            _LOGGER.info(message)
    else:
        print(message, flush=True)
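
A minimal usage sketch of the new rank-0 logger (assuming `examples/` is on `PYTHONPATH`, as in the torchrun commands in the training README; the message text is hypothetical):

```python
# Minimal usage sketch; assumes `examples/` is on PYTHONPATH.
from commons.utils.logger import print_rank_0

# Without torch.distributed initialized, this falls back to a plain print();
# under torchrun with multiple ranks, only rank 0 logs via RichHandler.
print_rank_0("starting retrieval pretraining")  # hypothetical message
```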
6 changes: 3 additions & 3 deletions examples/commons/utils/stringify.py
@@ -34,11 +34,11 @@ def stringify_dict(input_dict, prefix="", sep=","):
        value.float()
        assert value.dim() == 0
        value = value.cpu().item()
        output += key + ":" + f"{value:6f}{sep}"
        output += key + ": " + f"{value:6f}{sep}"
    elif isinstance(value, float):
        output += key + ":" + f"{value:6f}{sep}"
        output += key + ": " + f"{value:6f}{sep}"
    elif isinstance(value, int):
        output += key + ":" + f"{value}{sep}"
        output += key + ": " + f"{value}{sep}"
    else:
        raise RuntimeError(f"stringify dict does not support type {type(value)}")
    # remove the ending sep
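
A quick usage sketch of `stringify_dict` with the new `key: value` spacing (values are hypothetical; prefix handling follows the implementation outside this hunk):

```python
# Usage sketch with hypothetical values; assumes `examples/` is on PYTHONPATH.
from commons.utils.stringify import stringify_dict

metrics = {"loss": 0.2534, "iters": 10}
# Each entry now renders as "key: value", e.g. "loss: 0.253400,iters: 10"
# after the trailing separator is removed.
print(stringify_dict(metrics, prefix="train", sep=","))
```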
2 changes: 1 addition & 1 deletion examples/hstu/README.md
@@ -1,4 +1,4 @@
# Examples: to demonstrate how to train generative recommendation models
# Examples: to demonstrate how to train and run inference with generative recommendation models

## Generative Recommender Introduction
Meta's paper ["Actions Speak Louder Than Words"](https://arxiv.org/abs/2402.17152) introduces a novel paradigm for recommendation systems called **Generative Recommenders (GRs)**, which reformulates recommendation tasks as generative modeling problems. The work introduced Hierarchical Sequential Transduction Units (HSTU), a novel architecture designed to handle high-cardinality, non-stationary data streams in large-scale recommendation systems. HSTU enables both retrieval and ranking tasks. As noted in the paper, “HSTU-based GRs, with 1.5 trillion parameters, improve metrics in online A/B tests by 12.4% and have been deployed on multiple surfaces of a large internet platform with billions of users.”
99 changes: 99 additions & 0 deletions examples/hstu/training/README.md
@@ -0,0 +1,99 @@
# HSTU Training example

We support both retrieval and ranking models whose backbones are HSTU layers. In this example collection, users specify the model structure via a gin-config file; a short sketch of the gin mechanism follows. Supported datasets are listed below. For the gin-config interface, please refer to the [inline comments](../utils/gin_config_args.py).
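
For readers unfamiliar with gin, here is a minimal sketch of how such gin-configurable argument classes are bound; the class and field names below are hypothetical, and the real definitions live in [gin_config_args.py](../utils/gin_config_args.py):

```python
import gin


@gin.configurable
class TrainerArgs:
    # Hypothetical fields for illustration only; see utils/gin_config_args.py
    # for the real gin-configurable argument classes.
    def __init__(self, train_batch_size: int = 32, max_train_iters: int = 100):
        self.train_batch_size = train_batch_size
        self.max_train_iters = max_train_iters


# A .gin file binds constructor parameters by name, e.g. the line
# "TrainerArgs.train_batch_size = 128" in a config file:
gin.parse_config("TrainerArgs.train_batch_size = 128")
args = TrainerArgs()
assert args.train_batch_size == 128  # picks up the gin-bound value
```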

## Parallelism Introduction
To accommodate large embedding tables and the scaling laws of the dense HSTU part, this example integrates **[TorchRec](https://github.com/pytorch/torchrec)**, which shards the embedding tables, and **[Megatron-LM](https://github.com/NVIDIA/Megatron-LM)**, which enables dense parallelism (e.g., Data, Tensor, Sequence, Pipeline, and Context parallelism).
This integration ensures efficient training by coordinating sparse (embedding) and dense (context/data) parallelisms within a single model, as sketched after the figure below.
![parallelism](../figs/parallelism.png)
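
In code, the training entrypoints in this example set up the dense side through the helpers in `commons.utils.initialize` (see `pretrain_gr_ranking.py` later in this PR); a minimal sketch:

```python
# Minimal sketch of the dense-parallel setup used by the training entrypoints.
import commons.utils.initialize as init

# Initialize torch.distributed plus Megatron model parallelism (dense side).
init.initialize_distributed()
init.initialize_model_parallel(tensor_model_parallel_size=1)

# The sparse side (embedding tables) is sharded across ranks by TorchRec;
# the model constructors in this example wire the two together.
```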

## Environment Setup
### Start from Dockerfile

We provide a [Dockerfile](../../../docker/Dockerfile) for users to build the environment.
```bash
git clone https://github.com/NVIDIA/recsys-examples.git && cd recsys-examples
docker build -f docker/Dockerfile --platform linux/amd64 -t recsys-examples:latest .
```
If you want to build an image for Grace (ARM), use
```bash
git clone https://github.com/NVIDIA/recsys-examples.git && cd recsys-examples
docker build -f docker/Dockerfile --platform linux/arm64 -t recsys-examples:latest .
```
You can also set your own base image with `--build-arg BASE_IMAGE=<image>`, as shown in the sketch below.
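
For example (the base image tag below is hypothetical; substitute one that matches your CUDA/PyTorch stack):

```bash
# The BASE_IMAGE value here is only an illustration.
docker build -f docker/Dockerfile --platform linux/amd64 \
  --build-arg BASE_IMAGE=nvcr.io/nvidia/pytorch:24.05-py3 \
  -t recsys-examples:latest .
```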

### Start from source file
Before running the examples, build and install the libraries under `corelib`, following the instructions in their documentation:
- [HSTU attention documentation](../../../corelib/hstu/README.md)
- [Dynamic Embeddings documentation](../../../corelib/dynamicemb/README.md)

On top of those two core libraries, Megatron-Core and a few other dependencies are required. You can install them via pip:

```bash
pip install torchx gin-config torchmetrics==1.0.3 typing-extensions iopath megatron-core==0.9.0
```

If installing the megatron-core package fails, usually due to a Python version incompatibility, try cloning the source and installing from it:

```bash
git clone -b core_r0.9.0 https://github.com/NVIDIA/Megatron-LM.git megatron-lm && \
pip install -e ./megatron-lm
```
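
As an optional sanity check, confirm that `megatron.core` now resolves to the cloned source tree:

```bash
# Should print a path inside ./megatron-lm when the editable install worked.
python -c "import megatron.core as mc; print(mc.__file__)"
```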

We provide custom HSTU CUDA operators for enhanced performance. Install them with the following command:

```bash
cd /workspace/recsys-examples/examples/hstu && \
python setup.py install
```
### Dataset Introduction

We support several datasets, as described in the following sections:

#### **MovieLens**
Refer to [MovieLens 1M](https://grouplens.org/datasets/movielens/1m/) and [MovieLens 20M](https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset) for details.
#### **KuaiRand**

| dataset | # users | seqlen max | seqlen min | seqlen mean | seqlen median | # items |
|---------------|---------|------------|------------|-------------|---------------|------------|
| kuairand_pure | 27285 | 910 | 1 | 1 | 39 | 7551 |
| kuairand_1k | 1000 | 49332 | 10 | 5038 | 3379 | 4369953 |
| kuairand_27k | 27285 | 228000 | 100 | 11796 | 8591 | 32038725 |

Refer to [KuaiRand](https://kuairand.com/) for details.

## Running the examples

Before getting started, please make sure all prerequisites are fulfilled. You can refer to the [Get Started](../../../README.md) section in the root directory of the repo to set up the environment.


### Dataset preprocessing

To prepare a dataset for training, use `preprocessor.py` under the hstu example folder of the project.

```bash
cd <root-to-repo>/examples/hstu &&
mkdir -p ./tmp_data && python3 ./preprocessor.py --dataset_name <"ml-1m"|"ml-20m"|"kuairand-pure"|"kuairand-1k"|"kuairand-27k">
```

### Start training
The entrypoints for training are `pretrain_gr_retrieval.py` and `pretrain_gr_ranking.py`. We use gin-config to specify the model structure, training arguments, hyperparameters, etc.

Command to run the retrieval task with the `MovieLens 20M` dataset:

```bash
# Before running the `pretrain_gr_retrieval.py`, make sure that current working directory is `hstu`
cd <root-to-project>/examples/hstu
PYTHONPATH=${PYTHONPATH}:$(realpath ../) torchrun --nproc_per_node 1 --master_addr localhost --master_port 6000 ./training/pretrain_gr_retrieval.py --gin-config-file ./training/configs/movielen_retrieval.gin
```

To run the ranking task with the `MovieLens 20M` dataset:
```bash
# Before running the `pretrain_gr_ranking.py`, make sure that current working directory is `hstu`
cd <root-to-project>/examples/hstu
PYTHONPATH=${PYTHONPATH}:$(realpath ../) torchrun --nproc_per_node 1 --master_addr localhost --master_port 6000 ./training/pretrain_gr_ranking.py --gin-config-file ./training/configs/movielen_ranking.gin
```


2 changes: 0 additions & 2 deletions examples/hstu/training/__init__.py

This file was deleted.

9 changes: 6 additions & 3 deletions examples/hstu/training/benchmark/README.md
@@ -13,7 +13,7 @@ You can run script `run_hstu_benchmark.sh` to see the performance over the base

## How to run

The test entry is `python ./benchmark/hstu_layer_benchmark.py run`, you can type `python ./benchmark/hstu_layer_benchmark.py run --help` to get the input arguments. 4 important arguments are :
The test entry is `python ./training/benchmark/hstu_layer_benchmark.py run`; you can run `python ./training/benchmark/hstu_layer_benchmark.py run --help` to list the input arguments. Four important arguments are:

1. `--kernel-backend`: selects the HSTU MHA backend; either `triton` or `cutlass`.
2. `--fuse-norm-mul-dropout`: toggles the `layer norm + multiplication + dropout` fusion; `True` or `False`.
@@ -23,7 +23,9 @@ The test entry is `python ./benchmark/hstu_layer_benchmark.py run`, you can type
Our baseline command example (1K):

```bash
python ./benchmark/hstu_layer_benchmark.py run \

cd recsys-examples/examples/hstu
python ./training/benchmark/hstu_layer_benchmark.py run \
--iters 100 \
--warmup-iters 50 \
--layer-type native \
@@ -40,7 +42,8 @@ python ./benchmark/hstu_layer_benchmark.py run \
You can also run a set of arguments with `run_hstu_layer_benchmark.sh`:

```bash
bash run_hstu_layer_benchmark.sh <num_layers>
cd recsys-examples/examples/hstu
bash ./training/benchmark/run_hstu_layer_benchmark.sh <num_layers>
```

After a run completes, a memory snapshot file is generated in the current working directory; you can trace the memory usage with that file. Please refer to the [PyTorch docs](https://docs.pytorch.org/docs/stable/torch_cuda_memory.html) on how to visualize the memory trace.
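
If you want to produce a comparable snapshot from your own script, a minimal sketch using PyTorch's built-in recorder follows (the benchmark may wire this up differently; the file name is hypothetical):

```python
# Minimal sketch of recording and dumping a CUDA memory snapshot.
import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)
# ... run a few benchmark iterations of CUDA work here ...
torch.cuda.memory._dump_snapshot("hstu_layer_mem_snapshot.pickle")
# Load the .pickle file at https://pytorch.org/memory_viz to visualize.
```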
2 changes: 1 addition & 1 deletion examples/hstu/training/benchmark/hstu_layer_benchmark.py
@@ -47,7 +47,7 @@
from modules.jagged_data import JaggedData
from modules.native_hstu_layer import HSTULayer as NativeHSTULayer
from ops.length_to_offsets import length_to_complete_offsets
from training.utils import cal_flops_single_rank
from training.trainer.utils import cal_flops_single_rank

_backend_str_to_type = {
"cutlass": KernelBackend.CUTLASS,
8 changes: 4 additions & 4 deletions examples/hstu/training/benchmark/run_hstu_layer_benchmark.sh
@@ -32,7 +32,7 @@ for dim_per_head in "${dim_per_heads[@]}"; do
fi
echo -e "\n\033[32mbaseline hstu layer \033[0m:"
${nsys_profile_cmd/<placeholder>/${baseline_profile_name}} \
python ./benchmark/hstu_layer_benchmark.py run \
python ./training/benchmark/hstu_layer_benchmark.py run \
--iters 100 \
--warmup-iters 50 \
--kernel-backend triton \
@@ -53,7 +53,7 @@

echo -e "\n\033[32m +cutlass\033[0m:"
${nsys_profile_cmd/<placeholder>/${cutlass_profile_name}} \
python ./benchmark/hstu_layer_benchmark.py run \
python ./training/benchmark/hstu_layer_benchmark.py run \
--iters 100 \
--warmup-iters 50 \
--kernel-backend cutlass \
@@ -73,7 +73,7 @@

echo -e "\n\033[32m +fused\033[0m:"
${nsys_profile_cmd/<placeholder>/${fused_profile_name}} \
python ./benchmark/hstu_layer_benchmark.py run \
python ./training/benchmark/hstu_layer_benchmark.py run \
--iters 100 \
--warmup-iters 50 \
--kernel-backend cutlass \
@@ -93,7 +93,7 @@

echo -e "\n\033[32m + recompute\033[0m:"
${nsys_profile_cmd/<placeholder>/${recompute_profile_name}} \
python ./benchmark/hstu_layer_benchmark.py run \
python ./training/benchmark/hstu_layer_benchmark.py run \
--iters 100 \
--warmup-iters 50 \
--kernel-backend cutlass \
44 changes: 25 additions & 19 deletions examples/hstu/training/pretrain_gr_ranking.py
@@ -18,7 +18,7 @@
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=SyntaxWarning)
import argparse
from functools import partial # pylint: disable-unused-import
from typing import List, Union

import commons.utils.initialize as init
import gin
@@ -34,39 +34,33 @@
    JaggedMegatronTrainNonePipeline,
    JaggedMegatronTrainPipelineSparseDist,
)
from training import (
from trainer.training import maybe_load_ckpts, train_with_pipeline
from trainer.utils import (
    create_dynamic_optitons_dict,
    create_embedding_configs,
    create_hstu_config,
    create_optimizer_params,
    get_data_loader,
    get_dataset_and_embedding_args,
    get_embedding_vector_storage_multiplier,
    maybe_load_ckpts,
    train_with_pipeline,
)
from utils import (
from utils import (  # from hstu.utils
    BenchmarkDatasetArgs,
    DatasetArgs,
    EmbeddingArgs,
    NetworkArgs,
    OptimizerArgs,
    RankingArgs,
    TensorModelParallelArgs,
    TrainerArgs,
)

parser = argparse.ArgumentParser(
    description="Distributed GR Arguments", allow_abbrev=False
)
parser.add_argument("--gin-config-file", type=str)
args = parser.parse_args()
gin.parse_config_file(args.gin_config_file)
trainer_args = TrainerArgs()
dataset_args, embedding_args = get_dataset_and_embedding_args()
network_args = NetworkArgs()
optimizer_args = OptimizerArgs()
tp_args = TensorModelParallelArgs()


def create_ranking_config() -> RankingConfig:
def create_ranking_config(
    dataset_args: Union[DatasetArgs, BenchmarkDatasetArgs],
    network_args: NetworkArgs,
    embedding_args: List[EmbeddingArgs],
) -> RankingConfig:
    ranking_args = RankingArgs()

    return RankingConfig(
@@ -82,6 +76,18 @@ def create_ranking_config() -> RankingConfig:


def main():
    parser = argparse.ArgumentParser(
        description="HSTU Example Arguments", allow_abbrev=False
    )
    parser.add_argument("--gin-config-file", type=str)
    args = parser.parse_args()
    gin.parse_config_file(args.gin_config_file)
    trainer_args = TrainerArgs()
    dataset_args, embedding_args = get_dataset_and_embedding_args()
    network_args = NetworkArgs()
    optimizer_args = OptimizerArgs()
    tp_args = TensorModelParallelArgs()

    init.initialize_distributed()
    init.initialize_model_parallel(
        tensor_model_parallel_size=tp_args.tensor_model_parallel_size
@@ -92,7 +98,7 @@ def main():
f"distributed env initialization done. Free cuda memory: {free_memory / (1024 ** 2):.2f} MB"
)
hstu_config = create_hstu_config(network_args, tp_args)
task_config = create_ranking_config()
task_config = create_ranking_config(dataset_args, network_args, embedding_args)
model = get_ranking_model(hstu_config=hstu_config, task_config=task_config)

dynamic_options_dict = create_dynamic_optitons_dict(