Commit c50fa27

Authored by geoffreyQiu, junyiq-nv, shijieliu, and JacoCheung
Code reorganization for hstu training and inference (#202)
* Reorganize files for hstu training and inference
* Update README files
* Move gin files into dedicated configs folder
* Fix python import error. Add consistency check doc
* Update hstu example readme
* Refactor training folder and docs (#203)
  * Refactor training folder and docs
  * Move root RM env setting up to training
  * Move root ReadMe env setting up to training
  Co-authored-by: JacoCheung <junzhang@nvidia.com>

Co-authored-by: Junyi Qiu <junyiq@nvidia.com>
Co-authored-by: aleliu <aleliu@nvidia.com>
Co-authored-by: Junzhang <32166257+JacoCheung@users.noreply.github.com>
Co-authored-by: JacoCheung <junzhang@nvidia.com>
1 parent 15c79e9 commit c50fa27

38 files changed: +885 additions, -638 deletions

README.md

Lines changed: 0 additions & 38 deletions
````diff
@@ -35,44 +35,6 @@ The project includes:
 </details>
 For more detailed release notes, please refer our [releases](https://github.com/NVIDIA/recsys-examples/releases).
 
-# Environment Setup
-## Start from dockerfile
-
-We provide [dockerfile](./docker/Dockerfile) for users to build environment.
-```
-docker build -f docker/Dockerfile --platform linux/amd64 -t recsys-examples:latest .
-```
-If you want to build image for Grace, you can use
-```
-docker build -f docker/Dockerfile --platform linux/arm64 -t recsys-examples:latest .
-```
-You can also set your own base image with args `--build-arg <BASE_IMAGE>`.
-
-## Start from source file
-Before running examples, build and install libs under corelib following instruction in documentation:
-- [HSTU attention documentation](./corelib/hstu/README.md)
-- [Dynamic Embeddings documentation](./corelib/dynamicemb/README.md)
-
-On top of those two core libs, Megatron-Core along with other libs are required. You can install them via pypi package:
-
-```bash
-pip install torchx gin-config torchmetrics==1.0.3 typing-extensions iopath megatron-core==0.9.0
-```
-
-If you fail to install the megatron-core package, usually due to the python version incompatibility, please try to clone and then install the source code.
-
-```bash
-git clone -b core_r0.9.0 https://github.com/NVIDIA/Megatron-LM.git megatron-lm && \
-pip install -e ./megatron-lm
-```
-
-We provide our custom HSTU CUDA operators for enhanced performance. You need to install these operators using the following command:
-
-```bash
-cd /workspace/recsys-examples/examples/hstu && \
-python setup.py install
-```
-
 # Get Started
 The examples we supported:
 - [HSTU recommender examples](./examples/hstu/README.md)
````

examples/commons/utils/logger.py

Lines changed: 17 additions & 3 deletions
```diff
@@ -12,16 +12,30 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from datetime import datetime
+import logging
 
 import torch
+from rich.console import Console
+from rich.logging import RichHandler
+
+# Set up logger with RichHandler if not already configured
+
+console = Console()
+_LOGGER = logging.getLogger("rich_rank0")
+
+if not _LOGGER.hasHandlers():
+    handler = RichHandler(
+        console=console, show_time=True, show_path=False, rich_tracebacks=True
+    )
+    _LOGGER.addHandler(handler)
+    _LOGGER.propagate = False
+    _LOGGER.setLevel(logging.INFO)
 
 
 def print_rank_0(message):
    """If distributed is initialized, print only on rank 0."""
     if torch.distributed.is_initialized():
-        now = datetime.now()
         if torch.distributed.get_rank() == 0:
-            print(f"[{now}] " + message, flush=True)
+            _LOGGER.info(message)
     else:
         print(message, flush=True)
```
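As a usage note: with this change, `print_rank_0` routes rank-0 messages through `logging` with a `RichHandler`, which supplies the timestamp that the removed manual `datetime` prefix used to add. A minimal standalone sketch of the same pattern (the `"rich_rank0_demo"` logger name is illustrative, not from the commit; assumes the `rich` package is installed):

```python
# Minimal sketch of the rank-0 logging pattern introduced above.
import logging

from rich.console import Console
from rich.logging import RichHandler

logger = logging.getLogger("rich_rank0_demo")
if not logger.hasHandlers():  # guard keeps re-imports from stacking handlers
    handler = RichHandler(
        console=Console(),
        show_time=True,       # RichHandler now renders the timestamp itself
        show_path=False,
        rich_tracebacks=True,
    )
    logger.addHandler(handler)
    logger.propagate = False  # don't also emit through the root logger
    logger.setLevel(logging.INFO)

logger.info("iteration 100, loss: 0.250000")
```

In a `torchrun` job only rank 0 reaches the `_LOGGER.info` call, so worker output stays deduplicated.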

examples/commons/utils/stringify.py

Lines changed: 3 additions & 3 deletions
```diff
@@ -34,11 +34,11 @@ def stringify_dict(input_dict, prefix="", sep=","):
         value.float()
         assert value.dim() == 0
         value = value.cpu().item()
-        output += key + ":" + f"{value:6f}{sep}"
+        output += key + ": " + f"{value:6f}{sep}"
     elif isinstance(value, float):
-        output += key + ":" + f"{value:6f}{sep}"
+        output += key + ": " + f"{value:6f}{sep}"
     elif isinstance(value, int):
-        output += key + ":" + f"{value}{sep}"
+        output += key + ": " + f"{value}{sep}"
     else:
         assert RuntimeError(f"stringify dict not supports type {type(value)}")
     # remove the ending sep
```

examples/hstu/README.md

Lines changed: 5 additions & 107 deletions
````diff
@@ -1,8 +1,9 @@
-# Examples: to demonstrate how to train generative recommendation models
+# Examples: to demonstrate how to do training and inference generative recommendation models
 
 ## Generative Recommender Introduction
 Meta's paper ["Actions Speak Louder Than Words"](https://arxiv.org/abs/2402.17152) introduces a novel paradigm for recommendation systems called **Generative Recommenders(GRs)**, which reformulates recommendation tasks as generative modeling problems. The work introduced Hierarchical Sequential Transduction Units (HSTU), a novel architecture designed to handle high-cardinality, non-stationary data streams in large-scale recommendation systems. HSTU enables both retrieval and ranking tasks. As noted in the paper, “HSTU-based GRs, with 1.5 trillion parameters, improve metrics in online A/B tests by 12.4% and have been deployed on multiple surfaces of a large internet platform with billions of users.”
-While **distributed-recommender** supports both retrieval and ranking use cases, in the following sections, we will guide you through the process of building a generative recommender for ranking tasks.
+
+In this example, we introduce the model architecture, training, and inference processes of HSTU. For more details, refer to the [training](./training/) and [inference](./inference/) entry folders, which include comprehensive guides and benchmark results.
 
 ## Ranking Model Introduction
 The model structure of the generative ranking model can be depicted by the following picture.
@@ -31,113 +32,10 @@ The HSTU block is a core component of the architecture, which modifies tradition
 ### Prediction Head
 The prediction head of the HSTU model employs a MLP network structure, enabling multi-task predictions.
 
-## Parallelism for HSTU-based Generative Recommender
-Scaling is a crucial factor for HSTU-based GRs due to their demonstrated superior scalability compared to traditional Deep Learning Recommendation Models (DLRMs). According to the paper, while DLRMs plateau at around 200 billion parameters, GRs can scale up to 1.5 trillion parameters, resulting in improved model accuracy.
-
-However, achieving efficient scaling for GRs presents unique challenges. Existing libraries designed for large-scale training in LLMs or recommendation systems often fail to meet the specific needs of GRs:
-* **[Megatron-LM](https://github.com/NVIDIA/Megatron-LM)**, which supports advanced parallelism (e.g Data, Tensor, Sequence, Pipeline, and Context parallelism), is not well-suited for recommendation systems due to their reliance on massive embedding tables that cannot be effectively handled by existing parallelism.
-* **[TorchRec](https://github.com/pytorch/torchrec)**, while providing solutions for sharding large embedding tables across GPUs, lacks robust support for dense model parallelism. This makes it difficult for users to combine embedding and dense parallelism without significant design effort
-
-To address these limitations, a hybrid approach combining sparse and dense parallelism is introduced as the pic shows.
-**TorchRec** is employed to shard large embedding tables effectively.
-**Megatron-Core** is used to support data and context parallelism for the dense components of the model. Please note that context parallelism is planned as part of future development.
-This integration ensures efficient training by coordinating sparse (embedding) and dense (context/data) parallelisms within a single model.
-![parallelism](./figs/parallelism.png)
-
-
-## Dataset Introduction
-
-We have supported several datasets as listed in the following sections:
-
-### Dataset Information
-#### **MovieLens**
-refer to [MovieLens 1M](https://grouplens.org/datasets/movielens/1m/) and [MovieLens 20M](https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset) for details.
-#### **KuaiRand**
-
-| dataset       | # users | seqlen max | seqlen min | seqlen mean | seqlen median | # items  |
-|---------------|---------|------------|------------|-------------|---------------|----------|
-| kuairand_pure | 27285   | 910        | 1          | 1           | 39            | 7551     |
-| kuairand_1k   | 1000    | 49332      | 10         | 5038        | 3379          | 4369953  |
-| kuairand_27k  | 27285   | 228000     | 100        | 11796       | 8591          | 32038725 |
-
-refer to [KuaiRand](https://kuairand.com/) for details.
-
 ## Running the examples
 
-Before getting started, please make sure that all pre-requisites are fulfilled. You can refer to [Get Started][../../README] section in the root directory of the repo to set up the environment.
-
-### Dataset Preprocessing
-We provides preprocessor scripts to assist in downloading raw data if it is not already present. It processes the raw data into csv files.
-```bash
-mkdir -p ./tmp_data && python3 preprocessor.py --dataset_name <dataset-name>
-```
-The following dataset-name is supported:
-* ml-1m
-* ml-20m
-* kuairand-pure
-* kuairand-1k
-* kuairand-27k
-* all: preprocess all above datasets
-
-
-### Start training
-The entrypoint for training are `pretrain_gr_retrieval.py` or `pretrain_gr_ranking.py`. We use gin-config to specify the model structure, training arguments, hyper-params etc.
-To run retrieval task with `MovieLens 20m` dataset:
-
-```bash
-# Before running the `pretrain_gr_retrieval.py`, make sure that current working directory is `hstu`
-PYTHONPATH=${PYTHONPATH}:$(realpath ../) torchrun --nproc_per_node 1 --master_addr localhost --master_port 6000 pretrain_gr_retrieval.py --gin-config-file movielen_retrieval.gin
-```
-
-To run ranking task with `MovieLens 20m` dataset:
-```bash
-# Before running the `pretrain_gr_ranking.py`, make sure that current working directory is `hstu`
-PYTHONPATH=${PYTHONPATH}:$(realpath ../) torchrun --nproc_per_node 1 --master_addr localhost --master_port 6000 pretrain_gr_ranking.py --gin-config-file movielen_ranking.gin
-```
-
-## KVCache Manager for Inference
-
-### KVCache Usage
-
-1. KVCache Manager supports the following operations:
-* `get_user_kvdata_info`: to get current cached length and index of the first cached tokens in the history sequence
-* `prepare_kv_cache`: to allocate the required cache pages. The input history sequence need to be
-* `paged_kvcache_ops.append_kvcache`: the cuda kernel to copy the `K, V` values into the allocated cache pages
-* `offload_kv_cache`: to offload the KV data from GPU KVCache to Host KV storage.
-* `evict_kv_cache`: to evict all the KV data in the KVCache Manager.
-
-2. Currently, the KVCache manager need to be access from a single thread.
-
-3. For different requests, the call to `get_user_kvdata_info` and `prepare_kv_cache` need to be in order and cannot be interleaved. Since the allocation in `prepare_kv_cache` may evict the cached data of other users, which changes the user kvdata_info.
-
-4. The KVCache manager does not support uncontinuous user history sequence as input from the same user. The overlapping tokens need to be removed before sending the sequence to the inference model. Doing the overrlapping removal in the upstream stage should be more performant than in the inference model.
-
-```
-[current KV data in cache] userId: 0, starting position: 0, cached length: 10
-[next input] {userId: 0, starting position: 10, length: 10}
-# Acceptable input
-
-[current KV data in cache] userId: 0, starting position: 0, cached length: 10
-[next input] {userId: 0, starting position: 20, length: 10}
-             ^^^^^^^^^^^^^^^^^^^^^
-ERROR: The input sequence has missing tokens from 10 to 19 (both inclusive).
-
-[current KV data in cache] userId: 0, starting position: 0, cached length: 10
-[next input] {userId: 0, starting position: 5, length: 20}
-             ^^^^^^^^^^^^^^^^^^^^^
-ERROR: The input sequence has overlapping tokens from 5 to 9 (both inclusive).
-```
-
-### Example: Kuairand-1K
-
-```
-~$ # Proprocess the dataset for inference:
-~$ python3 ./preprocessor.py --dataset_name "kuairand-1k" --inference
-~$
-~$ # Run the inference example
-~$ python3 ./inference_gr_ranking.py --gin_config_file ./kuairand_1k_inference_ranking.gin --checkpoint_dir ${PATH_TO_CHECKPOINT} --mode eval
-```
-
+* [HSTU training example](./training/)
+* [HSTU inference example](./inference/)
 
 # Acknowledgements
 
````
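The KVCache notes removed above (now maintained under the inference folder) impose a strict per-request call order. A hypothetical sketch of a conforming driver, using the operation names from the removed text; the `kv_mgr` object and the exact signatures are illustrative assumptions, not the actual inference API:

```python
# Hypothetical driver obeying the removed docs' KVCache rules; the manager
# object and method signatures are illustrative, not the real API.
def serve_request(kv_mgr, user_id, tokens, start_pos):
    # Rule 3: get_user_kvdata_info and prepare_kv_cache must run back to back
    # for one request, because prepare_kv_cache may evict other users' data
    # and invalidate their kvdata_info.
    cached_start, cached_len = kv_mgr.get_user_kvdata_info(user_id)

    # Rule 4: the input must continue the cached history exactly -- no gaps,
    # no overlaps -- so overlapping tokens are deduplicated upstream.
    expected_pos = cached_start + cached_len
    if start_pos != expected_pos:
        raise ValueError(
            f"user {user_id}: input starts at {start_pos}, "
            f"but cache ends at {expected_pos}"
        )

    # Allocate pages for the new tokens; the append_kvcache CUDA kernel then
    # copies the computed K/V values into the allocated pages.
    kv_mgr.prepare_kv_cache(user_id, len(tokens))
```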
