
Commit 1b02037

JacoCheung authored and committed
Refactor training folder and docs (#203)
* Refactor training folder and docs
* Move root RM env setting up to training
* Move root ReadMe env setting up to training

---------

Co-authored-by: JacoCheung <junzhang@nvidia.com>
1 parent c50bead commit 1b02037

File tree

16 files changed: +385 -100 lines

README.md

Lines changed: 0 additions & 38 deletions
@@ -35,44 +35,6 @@ The project includes:
 </details>

 For more detailed release notes, please refer our [releases](https://github.com/NVIDIA/recsys-examples/releases).

-# Environment Setup
-## Start from dockerfile
-
-We provide [dockerfile](./docker/Dockerfile) for users to build environment.
-```
-docker build -f docker/Dockerfile --platform linux/amd64 -t recsys-examples:latest .
-```
-If you want to build image for Grace, you can use
-```
-docker build -f docker/Dockerfile --platform linux/arm64 -t recsys-examples:latest .
-```
-You can also set your own base image with args `--build-arg <BASE_IMAGE>`.
-
-## Start from source file
-Before running examples, build and install libs under corelib following instruction in documentation:
-- [HSTU attention documentation](./corelib/hstu/README.md)
-- [Dynamic Embeddings documentation](./corelib/dynamicemb/README.md)
-
-On top of those two core libs, Megatron-Core along with other libs are required. You can install them via pypi package:
-
-```bash
-pip install torchx gin-config torchmetrics==1.0.3 typing-extensions iopath megatron-core==0.9.0
-```
-
-If you fail to install the megatron-core package, usually due to the python version incompatibility, please try to clone and then install the source code.
-
-```bash
-git clone -b core_r0.9.0 https://github.com/NVIDIA/Megatron-LM.git megatron-lm && \
-pip install -e ./megatron-lm
-```
-
-We provide our custom HSTU CUDA operators for enhanced performance. You need to install these operators using the following command:
-
-```bash
-cd /workspace/recsys-examples/examples/hstu && \
-python setup.py install
-```
-
 # Get Started
 The examples we supported:
 - [HSTU recommender examples](./examples/hstu/README.md)

examples/commons/utils/logger.py

Lines changed: 17 additions & 3 deletions
@@ -12,16 +12,30 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from datetime import datetime
+import logging

 import torch
+from rich.console import Console
+from rich.logging import RichHandler
+
+# Set up logger with RichHandler if not already configured
+
+console = Console()
+_LOGGER = logging.getLogger("rich_rank0")
+
+if not _LOGGER.hasHandlers():
+    handler = RichHandler(
+        console=console, show_time=True, show_path=False, rich_tracebacks=True
+    )
+    _LOGGER.addHandler(handler)
+    _LOGGER.propagate = False
+    _LOGGER.setLevel(logging.INFO)


 def print_rank_0(message):
     """If distributed is initialized, print only on rank 0."""
     if torch.distributed.is_initialized():
-        now = datetime.now()
         if torch.distributed.get_rank() == 0:
-            print(f"[{now}] " + message, flush=True)
+            _LOGGER.info(message)
     else:
         print(message, flush=True)
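For reference, the refactored helper can be exercised as below — a minimal sketch assuming the `examples/` directory is on `PYTHONPATH` (as the training README sets up) and `rich` is installed:

```python
# Minimal usage sketch for the refactored logger (assumptions noted above).
from commons.utils.logger import print_rank_0

# Without torch.distributed initialized, this takes the plain print() branch;
# under torchrun, only rank 0 logs the message through the RichHandler logger.
print_rank_0("training started")
```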

examples/commons/utils/stringify.py

Lines changed: 3 additions & 3 deletions
@@ -34,11 +34,11 @@ def stringify_dict(input_dict, prefix="", sep=","):
             value.float()
             assert value.dim() == 0
             value = value.cpu().item()
-            output += key + ":" + f"{value:6f}{sep}"
+            output += key + ": " + f"{value:6f}{sep}"
         elif isinstance(value, float):
-            output += key + ":" + f"{value:6f}{sep}"
+            output += key + ": " + f"{value:6f}{sep}"
         elif isinstance(value, int):
-            output += key + ":" + f"{value}{sep}"
+            output += key + ": " + f"{value}{sep}"
         else:
             assert RuntimeError(f"stringify dict not supports type {type(value)}")
     # remove the ending sep
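A quick illustration of the new formatting — a hypothetical call, assuming the same `PYTHONPATH` setup:

```python
# stringify_dict now renders "key: value" (with a space) instead of "key:value".
from commons.utils.stringify import stringify_dict

# Floats keep the {value:6f} formatting, so 0.5 renders as 0.500000, and the
# trailing separator is stripped. Expected output along the lines of:
#   loss: 0.500000, step: 10
print(stringify_dict({"loss": 0.5, "step": 10}, sep=", "))
```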

examples/hstu/README.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# Examples: to demonstrate how to train generative recommendation models
+# Examples: to demonstrate how to do training and inference with generative recommendation models

 ## Generative Recommender Introduction
 Meta's paper ["Actions Speak Louder Than Words"](https://arxiv.org/abs/2402.17152) introduces a novel paradigm for recommendation systems called **Generative Recommenders (GRs)**, which reformulates recommendation tasks as generative modeling problems. The work introduced Hierarchical Sequential Transduction Units (HSTU), a novel architecture designed to handle high-cardinality, non-stationary data streams in large-scale recommendation systems. HSTU enables both retrieval and ranking tasks. As noted in the paper, "HSTU-based GRs, with 1.5 trillion parameters, improve metrics in online A/B tests by 12.4% and have been deployed on multiple surfaces of a large internet platform with billions of users."

examples/hstu/training/README.md

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@ (new file; reproduced as plain text below)

# HSTU Training Example

We support both retrieval and ranking models whose backbones are HSTU layers. In this example collection, the model structure is specified via a gin-config file. Supported datasets are listed below. For the gin-config interface, please refer to the [inline comments](../utils/gin_config_args.py).

## Parallelism Introduction
To accommodate large embedding tables and the scaling laws of the dense HSTU layers, this example integrates **[TorchRec](https://github.com/pytorch/torchrec)**, which shards the embedding tables, and **[Megatron-LM](https://github.com/NVIDIA/Megatron-LM)**, which enables dense parallelism (e.g., Data, Tensor, Sequence, Pipeline, and Context parallelism).
This integration ensures efficient training by coordinating sparse (embedding) and dense (context/data) parallelisms within a single model.
![parallelism](../figs/parallelism.png)
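As a rough illustration of the sparse half of this split — a minimal sketch, not the example's actual wiring; it assumes a CUDA device, a process group already initialized (e.g. via `torchrun`), and illustrative table names:

```python
import torch
from torchrec import EmbeddingBagCollection, EmbeddingBagConfig
from torchrec.distributed.model_parallel import DistributedModelParallel

# TorchRec plans and applies a sharding of the embedding tables across ranks;
# the dense HSTU stack is parallelized separately (here: by Megatron-LM).
ebc = EmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(
            name="item_table",  # illustrative name
            embedding_dim=128,
            num_embeddings=1_000_000,
            feature_names=["item_id"],
        )
    ],
    device=torch.device("meta"),  # materialized when sharded
)
# Requires torch.distributed to be initialized before this call.
sharded_ebc = DistributedModelParallel(module=ebc, device=torch.device("cuda"))
```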
## Environment Setup
### Start from dockerfile

We provide a [dockerfile](../../../docker/Dockerfile) for users to build the environment.
```
git clone https://github.com/NVIDIA/recsys-examples.git && cd recsys-examples
docker build -f docker/Dockerfile --platform linux/amd64 -t recsys-examples:latest .
```
If you want to build an image for Grace, you can use
```
git clone https://github.com/NVIDIA/recsys-examples.git && cd recsys-examples
docker build -f docker/Dockerfile --platform linux/arm64 -t recsys-examples:latest .
```
You can also set your own base image with the arg `--build-arg <BASE_IMAGE>`.

### Start from source file
Before running the examples, build and install the libs under corelib following the instructions in their documentation:
- [HSTU attention documentation](../../../corelib/hstu/README.md)
- [Dynamic Embeddings documentation](../../../corelib/dynamicemb/README.md)

On top of those two core libs, Megatron-Core along with other libs is required. You can install them via pypi:

```bash
pip install torchx gin-config torchmetrics==1.0.3 typing-extensions iopath megatron-core==0.9.0
```

If you fail to install the megatron-core package, usually due to a Python version incompatibility, please try to clone and then install from source:

```bash
git clone -b core_r0.9.0 https://github.com/NVIDIA/Megatron-LM.git megatron-lm && \
pip install -e ./megatron-lm
```

We provide custom HSTU CUDA operators for enhanced performance. You need to install these operators using the following command:

```bash
cd /workspace/recsys-examples/examples/hstu && \
python setup.py install
```
### Dataset Introduction

We support several datasets, as listed in the following sections.

### Dataset Information
#### **MovieLens**
Refer to [MovieLens 1M](https://grouplens.org/datasets/movielens/1m/) and [MovieLens 20M](https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset) for details.
#### **KuaiRand**

| dataset       | # users | seqlen max | seqlen min | seqlen mean | seqlen median | # items  |
|---------------|---------|------------|------------|-------------|---------------|----------|
| kuairand_pure | 27285   | 910        | 1          | 1           | 39            | 7551     |
| kuairand_1k   | 1000    | 49332      | 10         | 5038        | 3379          | 4369953  |
| kuairand_27k  | 27285   | 228000     | 100        | 11796       | 8591          | 32038725 |

Refer to [KuaiRand](https://kuairand.com/) for details.
## Running the examples

Before getting started, please make sure that all prerequisites are fulfilled. You can refer to the [Get Started](../../../README.md) section in the root directory of the repo to set up the environment.

### Dataset preprocessing

To prepare a dataset for training, use `preprocessor.py` under the hstu example folder of the project:

```bash
cd <root-to-repo>/examples/hstu &&
mkdir -p ./tmp_data && python3 ./preprocessor.py --dataset_name <"ml-1m"|"ml-20m"|"kuairand-pure"|"kuairand-1k"|"kuairand-27k">
```
### Start training
The entrypoints for training are `pretrain_gr_retrieval.py` and `pretrain_gr_ranking.py`. We use gin-config to specify the model structure, training arguments, hyper-parameters, etc.; a sketch of the gin interface is shown below.
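The gin file binds values to the argument classes consumed by the entrypoints (`TrainerArgs`, `NetworkArgs`, `OptimizerArgs`, ...). The snippet below is a hypothetical sketch — the field names are assumptions, not taken from the shipped configs; see the inline comments in `utils/gin_config_args.py` for the real interface:

```
# hypothetical.gin -- binding names are illustrative only
TrainerArgs.max_train_iters = 1000
NetworkArgs.num_layers = 4
OptimizerArgs.learning_rate = 1e-3
```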
Command to run the retrieval task with the `MovieLens 20m` dataset:

```bash
# Before running `pretrain_gr_retrieval.py`, make sure that the current working directory is `hstu`
cd <root-to-project>/examples/hstu
PYTHONPATH=${PYTHONPATH}:$(realpath ../) torchrun --nproc_per_node 1 --master_addr localhost --master_port 6000 ./training/pretrain_gr_retrieval.py --gin-config-file ./training/configs/movielen_retrieval.gin
```

To run the ranking task with the `MovieLens 20m` dataset:
```bash
# Before running `pretrain_gr_ranking.py`, make sure that the current working directory is `hstu`
cd <root-to-project>/examples/hstu
PYTHONPATH=${PYTHONPATH}:$(realpath ../) torchrun --nproc_per_node 1 --master_addr localhost --master_port 6000 ./training/pretrain_gr_ranking.py --gin-config-file ./training/configs/movielen_ranking.gin
```

examples/hstu/training/__init__.py

Lines changed: 0 additions & 2 deletions
This file was deleted.

examples/hstu/training/benchmark/README.md

Lines changed: 6 additions & 3 deletions
@@ -13,7 +13,7 @@ You can run script `run_hstu_benchmark.sh` to see the performance over the base

-The test entry is `python ./benchmark/hstu_layer_benchmark.py run`, you can type `python ./benchmark/hstu_layer_benchmark.py run --help` to get the input arguments. 4 important arguments are :
+The test entry is `python ./training/benchmark/hstu_layer_benchmark.py run`, you can type `python ./training/benchmark/hstu_layer_benchmark.py run --help` to get the input arguments. 4 important arguments are :

 1. --kernel-backend: select the hstu mha backend. Could be `triton` or `cutlass`.
 2. --fuse-norm-mul-dropout: knob of `layer norm + multiplication + dropout ` fusion. Could be `False` or `True`
@@ -23,7 +23,9 @@ The test entry is `python ./benchmark/hstu_layer_benchmark.py run`, you can type
 Our baseline cmd example (1K):

 ```bash
-python ./benchmark/hstu_layer_benchmark.py run \
+
+cd recsys-examples/examples/hstu
+python ./training/benchmark/hstu_layer_benchmark.py run \
     --iters 100 \
     --warmup-iters 50 \
     --layer-type native \
@@ -40,7 +42,8 @@ python ./benchmark/hstu_layer_benchmark.py run \
 You can also run a set of arguments with run.sh:

 ```bash
-bash run_hstu_layer_benchmark.sh <num_layers>
+cd recsys-examples/examples/hstu
+bash ./training/benchmark/run_hstu_layer_benchmark.sh <num_layers>
 ```

 After one run is done, a memory snapshot file in current working directory is generated, you can trace the memory usage with the file. Please refer to [PyTorch docs](https://docs.pytorch.org/docs/stable/torch_cuda_memory.html) on how to visualize the memory trace.
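For a quick programmatic peek at such a snapshot — a hedged sketch: snapshots of this kind are pickles written via `torch.cuda.memory._dump_snapshot()`, and the filename below is illustrative, not the benchmark's actual output name:

```python
import pickle

# Illustrative filename; use whatever file appeared in your working directory.
with open("memory_snapshot.pickle", "rb") as f:
    snapshot = pickle.load(f)

# Such snapshots are plain dicts holding allocator segments and, optionally,
# per-device allocation/free traces; the drag-and-drop viewer at
# https://pytorch.org/memory_viz is the supported way to visualize them.
print(snapshot.keys())
```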

examples/hstu/training/benchmark/hstu_layer_benchmark.py

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@
 from modules.jagged_data import JaggedData
 from modules.native_hstu_layer import HSTULayer as NativeHSTULayer
 from ops.length_to_offsets import length_to_complete_offsets
-from training.utils import cal_flops_single_rank
+from training.trainer.utils import cal_flops_single_rank

 _backend_str_to_type = {
     "cutlass": KernelBackend.CUTLASS,

examples/hstu/training/benchmark/run_hstu_layer_benchmark.sh

Lines changed: 4 additions & 4 deletions
@@ -32,7 +32,7 @@ for dim_per_head in "${dim_per_heads[@]}"; do
     fi
     echo -e "\n\033[32mbaseline hstu layer \033[0m:"
     ${nsys_profile_cmd/<placeholder>/${baseline_profile_name}} \
-    python ./benchmark/hstu_layer_benchmark.py run \
+    python ./training/benchmark/hstu_layer_benchmark.py run \
     --iters 100 \
     --warmup-iters 50 \
     --kernel-backend triton \
@@ -53,7 +53,7 @@ for dim_per_head in "${dim_per_heads[@]}"; do

     echo -e "\n\033[32m +cutlass\033[0m:"
     ${nsys_profile_cmd/<placeholder>/${cutlass_profile_name}} \
-    python ./benchmark/hstu_layer_benchmark.py run \
+    python ./training/benchmark/hstu_layer_benchmark.py run \
     --iters 100 \
     --warmup-iters 50 \
     --kernel-backend cutlass \
@@ -73,7 +73,7 @@ for dim_per_head in "${dim_per_heads[@]}"; do

     echo -e "\n\033[32m +fused\033[0m:"
     ${nsys_profile_cmd/<placeholder>/${fused_profile_name}} \
-    python ./benchmark/hstu_layer_benchmark.py run \
+    python ./training/benchmark/hstu_layer_benchmark.py run \
     --iters 100 \
     --warmup-iters 50 \
     --kernel-backend cutlass \
@@ -93,7 +93,7 @@ for dim_per_head in "${dim_per_heads[@]}"; do

     echo -e "\n\033[32m + recompute\033[0m:"
     ${nsys_profile_cmd/<placeholder>/${recompute_profile_name}} \
-    python ./benchmark/hstu_layer_benchmark.py run \
+    python ./training/benchmark/hstu_layer_benchmark.py run \
     --iters 100 \
     --warmup-iters 50 \
     --kernel-backend cutlass \

examples/hstu/training/pretrain_gr_ranking.py

Lines changed: 25 additions & 19 deletions
@@ -18,7 +18,7 @@
 warnings.filterwarnings("ignore", category=FutureWarning)
 warnings.filterwarnings("ignore", category=SyntaxWarning)
 import argparse
-from functools import partial  # pylint: disable-unused-import
+from typing import List, Union

 import commons.utils.initialize as init
 import gin
@@ -34,39 +34,33 @@
     JaggedMegatronTrainNonePipeline,
     JaggedMegatronTrainPipelineSparseDist,
 )
-from training import (
+from trainer.training import maybe_load_ckpts, train_with_pipeline
+from trainer.utils import (
     create_dynamic_optitons_dict,
     create_embedding_configs,
     create_hstu_config,
     create_optimizer_params,
     get_data_loader,
     get_dataset_and_embedding_args,
     get_embedding_vector_storage_multiplier,
-    maybe_load_ckpts,
-    train_with_pipeline,
 )
-from utils import (
+from utils import (  # from hstu.utils
+    BenchmarkDatasetArgs,
+    DatasetArgs,
+    EmbeddingArgs,
     NetworkArgs,
     OptimizerArgs,
     RankingArgs,
     TensorModelParallelArgs,
     TrainerArgs,
 )

-parser = argparse.ArgumentParser(
-    description="Distributed GR Arguments", allow_abbrev=False
-)
-parser.add_argument("--gin-config-file", type=str)
-args = parser.parse_args()
-gin.parse_config_file(args.gin_config_file)
-trainer_args = TrainerArgs()
-dataset_args, embedding_args = get_dataset_and_embedding_args()
-network_args = NetworkArgs()
-optimizer_args = OptimizerArgs()
-tp_args = TensorModelParallelArgs()
-

-def create_ranking_config() -> RankingConfig:
+def create_ranking_config(
+    dataset_args: Union[DatasetArgs, BenchmarkDatasetArgs],
+    network_args: NetworkArgs,
+    embedding_args: List[EmbeddingArgs],
+) -> RankingConfig:
     ranking_args = RankingArgs()

     return RankingConfig(
@@ -82,6 +76,18 @@ def create_ranking_config() -> RankingConfig:


 def main():
+    parser = argparse.ArgumentParser(
+        description="HSTU Example Arguments", allow_abbrev=False
+    )
+    parser.add_argument("--gin-config-file", type=str)
+    args = parser.parse_args()
+    gin.parse_config_file(args.gin_config_file)
+    trainer_args = TrainerArgs()
+    dataset_args, embedding_args = get_dataset_and_embedding_args()
+    network_args = NetworkArgs()
+    optimizer_args = OptimizerArgs()
+    tp_args = TensorModelParallelArgs()
+
     init.initialize_distributed()
     init.initialize_model_parallel(
         tensor_model_parallel_size=tp_args.tensor_model_parallel_size
@@ -92,7 +98,7 @@ def main():
         f"distributed env initialization done. Free cuda memory: {free_memory / (1024 ** 2):.2f} MB"
     )
     hstu_config = create_hstu_config(network_args, tp_args)
-    task_config = create_ranking_config()
+    task_config = create_ranking_config(dataset_args, network_args, embedding_args)
     model = get_ranking_model(hstu_config=hstu_config, task_config=task_config)

     dynamic_options_dict = create_dynamic_optitons_dict(
