38 changes: 0 additions & 38 deletions README.md
@@ -35,44 +35,6 @@ The project includes:
</details>
For more detailed release notes, please refer to our [releases](https://github.com/NVIDIA/recsys-examples/releases).

# Environment Setup
## Start from Dockerfile

We provide a [Dockerfile](./docker/Dockerfile) for users to build the environment.
```bash
docker build -f docker/Dockerfile --platform linux/amd64 -t recsys-examples:latest .
```
If you want to build an image for Grace (ARM), use
```bash
docker build -f docker/Dockerfile --platform linux/arm64 -t recsys-examples:latest .
```
You can also set your own base image with `--build-arg BASE_IMAGE=<image>`.

## Start from source file
Before running the examples, build and install the libraries under `corelib`, following the instructions in their documentation:
- [HSTU attention documentation](./corelib/hstu/README.md)
- [Dynamic Embeddings documentation](./corelib/dynamicemb/README.md)

On top of those two core libraries, Megatron-Core and a few other dependencies are required. You can install them via pip:

```bash
pip install torchx gin-config torchmetrics==1.0.3 typing-extensions iopath megatron-core==0.9.0
```

If installing the megatron-core package fails, usually due to a Python version incompatibility, try cloning the source and installing from it:

```bash
git clone -b core_r0.9.0 https://github.com/NVIDIA/Megatron-LM.git megatron-lm && \
pip install -e ./megatron-lm
```

We provide custom HSTU CUDA operators for enhanced performance. Install them with the following command:

```bash
cd /workspace/recsys-examples/examples/hstu && \
python setup.py install
```

# Get Started
The supported examples:
- [HSTU recommender examples](./examples/hstu/README.md)
20 changes: 17 additions & 3 deletions examples/commons/utils/logger.py
@@ -12,16 +12,30 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from datetime import datetime
import logging

import torch
from rich.console import Console
from rich.logging import RichHandler

# Set up logger with RichHandler if not already configured

console = Console()
_LOGGER = logging.getLogger("rich_rank0")

if not _LOGGER.hasHandlers():
    handler = RichHandler(
        console=console, show_time=True, show_path=False, rich_tracebacks=True
    )
    _LOGGER.addHandler(handler)
    _LOGGER.propagate = False
    _LOGGER.setLevel(logging.INFO)


def print_rank_0(message):
    """If distributed is initialized, print only on rank 0."""
    if torch.distributed.is_initialized():
        now = datetime.now()
        if torch.distributed.get_rank() == 0:
            print(f"[{now}] " + message, flush=True)
            _LOGGER.info(message)
    else:
        print(message, flush=True)
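
A minimal usage sketch of the new rank-0 logger (assuming `examples/` is on `PYTHONPATH`, as in the torchrun commands in the training README; the message text is hypothetical):

```python
# Minimal usage sketch; assumes `examples/` is on PYTHONPATH.
from commons.utils.logger import print_rank_0

# Without torch.distributed initialized, this falls back to a plain print();
# under torchrun with multiple ranks, only rank 0 logs via RichHandler.
print_rank_0("starting retrieval pretraining")  # hypothetical message
```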
6 changes: 3 additions & 3 deletions examples/commons/utils/stringify.py
@@ -34,11 +34,11 @@ def stringify_dict(input_dict, prefix="", sep=","):
        value.float()
        assert value.dim() == 0
        value = value.cpu().item()
        output += key + ":" + f"{value:6f}{sep}"
        output += key + ": " + f"{value:6f}{sep}"
    elif isinstance(value, float):
        output += key + ":" + f"{value:6f}{sep}"
        output += key + ": " + f"{value:6f}{sep}"
    elif isinstance(value, int):
        output += key + ":" + f"{value}{sep}"
        output += key + ": " + f"{value}{sep}"
    else:
        raise RuntimeError(f"stringify dict does not support type {type(value)}")
    # remove the ending sep
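
A quick usage sketch of `stringify_dict` with the new `key: value` spacing (values are hypothetical; prefix handling follows the implementation outside this hunk):

```python
# Usage sketch with hypothetical values; assumes `examples/` is on PYTHONPATH.
from commons.utils.stringify import stringify_dict

metrics = {"loss": 0.2534, "iters": 10}
# Each entry now renders as "key: value", e.g. "loss: 0.253400,iters: 10"
# after the trailing separator is removed.
print(stringify_dict(metrics, prefix="train", sep=","))
```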
2 changes: 1 addition & 1 deletion examples/hstu/README.md
@@ -1,4 +1,4 @@
# Examples: to demonstrate how to train generative recommendation models
# Examples: to demonstrate how to train and run inference with generative recommendation models

## Generative Recommender Introduction
Meta's paper ["Actions Speak Louder Than Words"](https://arxiv.org/abs/2402.17152) introduces a novel paradigm for recommendation systems called **Generative Recommenders (GRs)**, which reformulates recommendation tasks as generative modeling problems. The work introduced Hierarchical Sequential Transduction Units (HSTU), a novel architecture designed to handle high-cardinality, non-stationary data streams in large-scale recommendation systems. HSTU enables both retrieval and ranking tasks. As noted in the paper, “HSTU-based GRs, with 1.5 trillion parameters, improve metrics in online A/B tests by 12.4% and have been deployed on multiple surfaces of a large internet platform with billions of users.”
99 changes: 99 additions & 0 deletions examples/hstu/training/README.md
@@ -0,0 +1,99 @@
# HSTU Training example

We support both retrieval and ranking models whose backbones are HSTU layers. In this example collection, users specify the model structure via a gin-config file; a short sketch of the gin mechanism follows. Supported datasets are listed below. For the gin-config interface, please refer to the [inline comments](../utils/gin_config_args.py).
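
For readers unfamiliar with gin, here is a minimal sketch of how such gin-configurable argument classes are bound; the class and field names below are hypothetical, and the real definitions live in [gin_config_args.py](../utils/gin_config_args.py):

```python
import gin


@gin.configurable
class TrainerArgs:
    # Hypothetical fields for illustration only; see utils/gin_config_args.py
    # for the real gin-configurable argument classes.
    def __init__(self, train_batch_size: int = 32, max_train_iters: int = 100):
        self.train_batch_size = train_batch_size
        self.max_train_iters = max_train_iters


# A .gin file binds constructor parameters by name, e.g. the line
# "TrainerArgs.train_batch_size = 128" in a config file:
gin.parse_config("TrainerArgs.train_batch_size = 128")
args = TrainerArgs()
assert args.train_batch_size == 128  # picks up the gin-bound value
```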

## Parallelism Introduction
To accommodate large embedding tables and the scaling laws of the dense HSTU part, this example integrates **[TorchRec](https://github.com/pytorch/torchrec)**, which shards the embedding tables, and **[Megatron-LM](https://github.com/NVIDIA/Megatron-LM)**, which enables dense parallelism (e.g., Data, Tensor, Sequence, Pipeline, and Context parallelism).
This integration ensures efficient training by coordinating sparse (embedding) and dense (context/data) parallelisms within a single model, as sketched after the figure below.
![parallelism](../figs/parallelism.png)
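
In code, the training entrypoints in this example set up the dense side through the helpers in `commons.utils.initialize` (see `pretrain_gr_ranking.py` later in this PR); a minimal sketch:

```python
# Minimal sketch of the dense-parallel setup used by the training entrypoints.
import commons.utils.initialize as init

# Initialize torch.distributed plus Megatron model parallelism (dense side).
init.initialize_distributed()
init.initialize_model_parallel(tensor_model_parallel_size=1)

# The sparse side (embedding tables) is sharded across ranks by TorchRec;
# the model constructors in this example wire the two together.
```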

## Environment Setup
### Start from Dockerfile

We provide a [Dockerfile](../../../docker/Dockerfile) for users to build the environment.
```bash
git clone https://github.com/NVIDIA/recsys-examples.git && cd recsys-examples
docker build -f docker/Dockerfile --platform linux/amd64 -t recsys-examples:latest .
```
If you want to build an image for Grace (ARM), use
```bash
git clone https://github.com/NVIDIA/recsys-examples.git && cd recsys-examples
docker build -f docker/Dockerfile --platform linux/arm64 -t recsys-examples:latest .
```
You can also set your own base image with `--build-arg BASE_IMAGE=<image>`, as shown in the sketch below.
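
For example (the base image tag below is hypothetical; substitute one that matches your CUDA/PyTorch stack):

```bash
# The BASE_IMAGE value here is only an illustration.
docker build -f docker/Dockerfile --platform linux/amd64 \
  --build-arg BASE_IMAGE=nvcr.io/nvidia/pytorch:24.05-py3 \
  -t recsys-examples:latest .
```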

### Start from source file
Before running the examples, build and install the libraries under `corelib`, following the instructions in their documentation:
- [HSTU attention documentation](../../../corelib/hstu/README.md)
- [Dynamic Embeddings documentation](../../../corelib/dynamicemb/README.md)

On top of those two core libraries, Megatron-Core and a few other dependencies are required. You can install them via pip:

```bash
pip install torchx gin-config torchmetrics==1.0.3 typing-extensions iopath megatron-core==0.9.0
```

If installing the megatron-core package fails, usually due to a Python version incompatibility, try cloning the source and installing from it:

```bash
git clone -b core_r0.9.0 https://github.com/NVIDIA/Megatron-LM.git megatron-lm && \
pip install -e ./megatron-lm
```
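
As an optional sanity check, confirm that `megatron.core` now resolves to the cloned source tree:

```bash
# Should print a path inside ./megatron-lm when the editable install worked.
python -c "import megatron.core as mc; print(mc.__file__)"
```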

We provide custom HSTU CUDA operators for enhanced performance. Install them with the following command:

```bash
cd /workspace/recsys-examples/examples/hstu && \
python setup.py install
```
### Dataset Introduction

We support several datasets, as described in the following sections:

#### **MovieLens**
Refer to [MovieLens 1M](https://grouplens.org/datasets/movielens/1m/) and [MovieLens 20M](https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset) for details.
#### **KuaiRand**

| dataset | # users | seqlen max | seqlen min | seqlen mean | seqlen median | # items |
|---------------|---------|------------|------------|-------------|---------------|------------|
| kuairand_pure | 27285 | 910 | 1 | 1 | 39 | 7551 |
| kuairand_1k | 1000 | 49332 | 10 | 5038 | 3379 | 4369953 |
| kuairand_27k | 27285 | 228000 | 100 | 11796 | 8591 | 32038725 |

Refer to [KuaiRand](https://kuairand.com/) for details.

## Running the examples

Before getting started, please make sure all prerequisites are fulfilled. You can refer to the [Get Started](../../../README.md) section in the root directory of the repo to set up the environment.


### Dataset preprocessing

To prepare a dataset for training, use `preprocessor.py` under the hstu example folder of the project.

```bash
cd <root-to-repo>/examples/hstu &&
mkdir -p ./tmp_data && python3 ./preprocessor.py --dataset_name <"ml-1m"|"ml-20m"|"kuairand-pure"|"kuairand-1k"|"kuairand-27k">
```

### Start training
The entrypoints for training are `pretrain_gr_retrieval.py` and `pretrain_gr_ranking.py`. We use gin-config to specify the model structure, training arguments, hyperparameters, etc.

Command to run the retrieval task with the `MovieLens 20M` dataset:

```bash
# Before running the `pretrain_gr_retrieval.py`, make sure that current working directory is `hstu`
cd <root-to-project>/examples/hstu
PYTHONPATH=${PYTHONPATH}:$(realpath ../) torchrun --nproc_per_node 1 --master_addr localhost --master_port 6000 ./training/pretrain_gr_retrieval.py --gin-config-file ./training/configs/movielen_retrieval.gin
```

To run the ranking task with the `MovieLens 20M` dataset:
```bash
# Before running the `pretrain_gr_ranking.py`, make sure that current working directory is `hstu`
cd <root-to-project>/examples/hstu
PYTHONPATH=${PYTHONPATH}:$(realpath ../) torchrun --nproc_per_node 1 --master_addr localhost --master_port 6000 ./training/pretrain_gr_ranking.py --gin-config-file ./training/configs/movielen_ranking.gin
```


2 changes: 0 additions & 2 deletions examples/hstu/training/__init__.py

This file was deleted.

9 changes: 6 additions & 3 deletions examples/hstu/training/benchmark/README.md
@@ -13,7 +13,7 @@ You can run script `run_hstu_benchmark.sh` to see the performance over the base

## How to run

The test entry is `python ./benchmark/hstu_layer_benchmark.py run`, you can type `python ./benchmark/hstu_layer_benchmark.py run --help` to get the input arguments. 4 important arguments are :
The test entry is `python ./training/benchmark/hstu_layer_benchmark.py run`; you can run `python ./training/benchmark/hstu_layer_benchmark.py run --help` to list the input arguments. Four important arguments are:

1. `--kernel-backend`: selects the HSTU MHA backend; either `triton` or `cutlass`.
2. `--fuse-norm-mul-dropout`: toggles the `layer norm + multiplication + dropout` fusion; `True` or `False`.
@@ -23,7 +23,9 @@ The test entry is `python ./benchmark/hstu_layer_benchmark.py run`, you can type
Our baseline command example (1K):

```bash
python ./benchmark/hstu_layer_benchmark.py run \

cd recsys-examples/examples/hstu
python ./training/benchmark/hstu_layer_benchmark.py run \
--iters 100 \
--warmup-iters 50 \
--layer-type native \
@@ -40,7 +42,8 @@ python ./benchmark/hstu_layer_benchmark.py run \
You can also run a set of arguments with `run_hstu_layer_benchmark.sh`:

```bash
bash run_hstu_layer_benchmark.sh <num_layers>
cd recsys-examples/examples/hstu
bash ./training/benchmark/run_hstu_layer_benchmark.sh <num_layers>
```

After a run completes, a memory snapshot file is generated in the current working directory; you can trace the memory usage with that file. Please refer to the [PyTorch docs](https://docs.pytorch.org/docs/stable/torch_cuda_memory.html) on how to visualize the memory trace.
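
If you want to produce a comparable snapshot from your own script, a minimal sketch using PyTorch's built-in recorder follows (the benchmark may wire this up differently; the file name is hypothetical):

```python
# Minimal sketch of recording and dumping a CUDA memory snapshot.
import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)
# ... run a few benchmark iterations of CUDA work here ...
torch.cuda.memory._dump_snapshot("hstu_layer_mem_snapshot.pickle")
# Load the .pickle file at https://pytorch.org/memory_viz to visualize.
```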
2 changes: 1 addition & 1 deletion examples/hstu/training/benchmark/hstu_layer_benchmark.py
@@ -47,7 +47,7 @@
from modules.jagged_data import JaggedData
from modules.native_hstu_layer import HSTULayer as NativeHSTULayer
from ops.length_to_offsets import length_to_complete_offsets
from training.utils import cal_flops_single_rank
from training.trainer.utils import cal_flops_single_rank

_backend_str_to_type = {
"cutlass": KernelBackend.CUTLASS,
8 changes: 4 additions & 4 deletions examples/hstu/training/benchmark/run_hstu_layer_benchmark.sh
@@ -32,7 +32,7 @@ for dim_per_head in "${dim_per_heads[@]}"; do
fi
echo -e "\n\033[32mbaseline hstu layer \033[0m:"
${nsys_profile_cmd/<placeholder>/${baseline_profile_name}} \
python ./benchmark/hstu_layer_benchmark.py run \
python ./training/benchmark/hstu_layer_benchmark.py run \
--iters 100 \
--warmup-iters 50 \
--kernel-backend triton \
@@ -53,7 +53,7 @@

echo -e "\n\033[32m +cutlass\033[0m:"
${nsys_profile_cmd/<placeholder>/${cutlass_profile_name}} \
python ./benchmark/hstu_layer_benchmark.py run \
python ./training/benchmark/hstu_layer_benchmark.py run \
--iters 100 \
--warmup-iters 50 \
--kernel-backend cutlass \
@@ -73,7 +73,7 @@

echo -e "\n\033[32m +fused\033[0m:"
${nsys_profile_cmd/<placeholder>/${fused_profile_name}} \
python ./benchmark/hstu_layer_benchmark.py run \
python ./training/benchmark/hstu_layer_benchmark.py run \
--iters 100 \
--warmup-iters 50 \
--kernel-backend cutlass \
@@ -93,7 +93,7 @@

echo -e "\n\033[32m + recompute\033[0m:"
${nsys_profile_cmd/<placeholder>/${recompute_profile_name}} \
python ./benchmark/hstu_layer_benchmark.py run \
python ./training/benchmark/hstu_layer_benchmark.py run \
--iters 100 \
--warmup-iters 50 \
--kernel-backend cutlass \
44 changes: 25 additions & 19 deletions examples/hstu/training/pretrain_gr_ranking.py
@@ -18,7 +18,7 @@
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=SyntaxWarning)
import argparse
from functools import partial # pylint: disable-unused-import
from typing import List, Union

import commons.utils.initialize as init
import gin
@@ -34,39 +34,33 @@
    JaggedMegatronTrainNonePipeline,
    JaggedMegatronTrainPipelineSparseDist,
)
from training import (
from trainer.training import maybe_load_ckpts, train_with_pipeline
from trainer.utils import (
    create_dynamic_optitons_dict,
    create_embedding_configs,
    create_hstu_config,
    create_optimizer_params,
    get_data_loader,
    get_dataset_and_embedding_args,
    get_embedding_vector_storage_multiplier,
    maybe_load_ckpts,
    train_with_pipeline,
)
from utils import (
from utils import (  # from hstu.utils
    BenchmarkDatasetArgs,
    DatasetArgs,
    EmbeddingArgs,
    NetworkArgs,
    OptimizerArgs,
    RankingArgs,
    TensorModelParallelArgs,
    TrainerArgs,
)

parser = argparse.ArgumentParser(
    description="Distributed GR Arguments", allow_abbrev=False
)
parser.add_argument("--gin-config-file", type=str)
args = parser.parse_args()
gin.parse_config_file(args.gin_config_file)
trainer_args = TrainerArgs()
dataset_args, embedding_args = get_dataset_and_embedding_args()
network_args = NetworkArgs()
optimizer_args = OptimizerArgs()
tp_args = TensorModelParallelArgs()


def create_ranking_config() -> RankingConfig:
def create_ranking_config(
    dataset_args: Union[DatasetArgs, BenchmarkDatasetArgs],
    network_args: NetworkArgs,
    embedding_args: List[EmbeddingArgs],
) -> RankingConfig:
    ranking_args = RankingArgs()

    return RankingConfig(
@@ -82,6 +76,18 @@ def create_ranking_config() -> RankingConfig:


def main():
    parser = argparse.ArgumentParser(
        description="HSTU Example Arguments", allow_abbrev=False
    )
    parser.add_argument("--gin-config-file", type=str)
    args = parser.parse_args()
    gin.parse_config_file(args.gin_config_file)
    trainer_args = TrainerArgs()
    dataset_args, embedding_args = get_dataset_and_embedding_args()
    network_args = NetworkArgs()
    optimizer_args = OptimizerArgs()
    tp_args = TensorModelParallelArgs()

    init.initialize_distributed()
    init.initialize_model_parallel(
        tensor_model_parallel_size=tp_args.tensor_model_parallel_size
@@ -92,7 +98,7 @@ def main():
f"distributed env initialization done. Free cuda memory: {free_memory / (1024 ** 2):.2f} MB"
)
hstu_config = create_hstu_config(network_args, tp_args)
task_config = create_ranking_config()
task_config = create_ranking_config(dataset_args, network_args, embedding_args)
model = get_ranking_model(hstu_config=hstu_config, task_config=task_config)

dynamic_options_dict = create_dynamic_optitons_dict(