Commit 0964889

Merge pull request #53 from AI-Hypercomputer/nemo-a4x-llama405
Add llama3-1-405b 16node recipe on A4x
2 parents 074a722 + abc0e44 commit 0964889

9 files changed, +967 -0 lines changed

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v2
name: a4_jobset_workload
description: a4_jobset_workload
type: application
version: 0.1.0
appVersion: "1.16.0"
Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
<!-- mdformat global-off -->
# Pretrain llama3-1-405b workloads on A4X GKE node pools with NVIDIA NeMo Framework

This recipe outlines the steps for running a llama3-1-405b pretraining
workload on [A4X GKE node pools](https://cloud.google.com/kubernetes-engine) by using the
[NVIDIA NeMo framework](https://github.com/NVIDIA/nemo).

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to
  configure and deploy the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset)
  resource, which manages the execution of the
  [NeMo pretraining workload](https://github.com/NVIDIA/nemo).

## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster
  Please follow the Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4x)
  to create your A4X GKE cluster.

> [!NOTE]
> **GKE version and workload placement**
>
> For GKE cluster versions `1.34.0-gke.1502000` and later, workload placement is mandatory. You must provide your own placement policy name. You can do this by editing `values.yaml` to set `workload.nodeSelector.cloud.google.com/placement-policy-name`, or by overriding the value at install time as sketched below.
>
> For GKE cluster versions before `1.34.0-gke.1502000`, you can remove the `nodeSelector` section in `values.yaml`.
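If you prefer not to edit `values.yaml`, the same value can be supplied on the Helm command line during the submission step. This is a minimal sketch and assumes the chart copies `workload.nodeSelector` entries verbatim into the Pod spec; `<PLACEMENT_POLICY_NAME>` is a placeholder for your own policy name.

```bash
# Sketch only: capture the placement-policy override so it can be appended to the
# helm install command shown in the submission step (assumes the chart honors
# workload.nodeSelector values as-is). Dots in the key are escaped for Helm's --set parser.
export PLACEMENT_POLICY_OVERRIDE="workload.nodeSelector.cloud\.google\.com/placement-policy-name=<PLACEMENT_POLICY_NAME>"
# Later: helm install ... --set "${PLACEMENT_POLICY_OVERRIDE}"
```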
## Training dataset

This recipe uses a mock pretraining dataset provided by the NeMo framework.

## Docker container images

This recipe uses the following Docker container images:

- `nvcr.io/nvidia/nemo:25.07`
- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-arm64:v1.0.7`
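Optionally, you can pull the NeMo training image ahead of time to confirm registry access from your workstation. This is not required for the recipe, since GKE pulls the images directly on the nodes.

```bash
# Optional sanity check from a workstation with Docker installed.
docker pull nvcr.io/nvidia/nemo:25.07
```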
## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET> # Note: the bucket name should not be prefixed with gs://
export KUEUE_NAME=<KUEUE_NAME>
```

Replace the following values (a filled-in example follows this list):

- `<PROJECT_ID>`: your Google Cloud project ID.
- `<CLUSTER_REGION>`: the region where your cluster is located.
- `<CLUSTER_NAME>`: the name of your GKE cluster.
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4x`. Make sure to verify the name of the local queue in your cluster.
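For illustration, here is the same block with hypothetical values filled in, plus a command to list the Kueue local queues so you can confirm the queue name. All concrete values shown are examples only.

```bash
# Example values only; substitute your own project, cluster, bucket, and queue names.
export PROJECT_ID=my-gcp-project
export CLUSTER_REGION=us-central1
export CLUSTER_NAME=my-a4x-cluster
export GCS_BUCKET=my-training-logs       # no gs:// prefix
export KUEUE_NAME=a4x

# Verify the local queue name in your cluster
# (requires cluster credentials, which are configured in a later step).
kubectl get localqueues
```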
Set the default project:

```bash
gcloud config set project $PROJECT_ID
```

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4x/llama3-1-405b
cd $RECIPE_ROOT
```
### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```
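To confirm that `kubectl` now points at the intended cluster, an optional quick check such as the following can help.

```bash
# Confirm the active kubectl context and that the cluster's nodes are reachable.
kubectl config current-context
kubectl get nodes
```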
### Configure and submit a pretraining job

#### Using 16 nodes (64 GPUs) with FP8 precision

To execute the job with the default settings, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4x-llama3-1-405b
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set-file workload_config=llama3-1-405b-fp8cs-gbs2048-gpus64.py \
--set workload.image=nvcr.io/nvidia/nemo:25.07 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME}
```
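After the release is installed, you can confirm that the chart created the workload before moving on to monitoring. The exact resource names depend on the chart's templates, so treat the JobSet query as a best-effort check.

```bash
# Confirm the Helm release was created.
helm list | grep $WORKLOAD_NAME

# Best-effort check that the JobSet for the workload exists
# (resource naming depends on the chart's templates).
kubectl get jobsets | grep $WORKLOAD_NAME
```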
**Examples**

- To set the number of training steps to 100, run the following command from
  your client:

  ```bash
  cd $RECIPE_ROOT
  export WORKLOAD_NAME=$USER-a4x-llama3-1-405b
  helm install $WORKLOAD_NAME . -f values.yaml \
  --set-file workload_launcher=launcher.sh \
  --set-file workload_config=llama3-1-405b-fp8cs-gbs2048-gpus64.py \
  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
  --set volumes.gcsMounts[0].mountPath=/job-logs \
  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
  --set queue=${KUEUE_NAME} \
  --set workload.arguments[0]="trainer.max_steps=100"
  ```
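Multiple NeMo overrides can in principle be passed the same way, assuming the chart forwards each `workload.arguments` list entry to `launcher.sh` as a separate argument (the launcher rejects any argument containing more than one `=`). The second override key below is illustrative only.

```bash
# Illustrative only: pass several overrides as separate workload.arguments entries,
# assuming the chart forwards each list entry to launcher.sh as its own argument.
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set-file workload_config=llama3-1-405b-fp8cs-gbs2048-gpus64.py \
--set workload.image=nvcr.io/nvidia/nemo:25.07 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME} \
--set workload.arguments[0]="trainer.max_steps=100" \
--set workload.arguments[1]="trainer.log_every_n_steps=10"
```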
### Monitor the job

To check the status of the pods in your job, run the following command:

```bash
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace the following:

- `JOB_NAME_PREFIX`: your job name prefix, that is, the workload name used at install time. For example: `$USER-a4x-llama3-1-405b`.

To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Replace `POD_NAME` with the name of one of the pods listed by the previous command.

Information about the training job's progress, including crucial details such as
loss, step count, and step time, is generated by the rank 0 process.
This process runs on the pod whose name begins with
`JOB_NAME_PREFIX-workload-0-0`.
For example: `$USER-a4x-llama3-1-405b-workload-0-0-s9zrv`.
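As a convenience, the rank 0 logs can be streamed with a one-liner. This sketch assumes the pod-naming convention described above and picks the first matching pod.

```bash
# Stream logs from the rank 0 pod (assumes the naming convention described above).
kubectl logs -f "$(kubectl get pods -o name | grep "$USER-a4x-llama3-1-405b-workload-0-0" | head -n 1)"
```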
### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To
uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $USER-a4x-llama3-1-405b
```
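The training logs and artifacts copied by the launcher persist in the Cloud Storage bucket after the release is uninstalled. A sketch for listing them, assuming the default settings from the submission command (`mountPath=/job-logs` and `workload.envs[0].value=/job-logs/$WORKLOAD_NAME`, so artifacts land under the workload name in the bucket):

```bash
# List the artifacts copied for this workload; the path layout assumes
# the default GCS mount settings from the submission command above.
gcloud storage ls gs://${GCS_BUCKET}/${WORKLOAD_NAME}/
```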
Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
usage()
{
cat << EOF
usage: bash ./launcher.sh [config-override [config-override ...]]
       config-override  (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000.
EOF
}

# Collect "key=value" NeMo overrides from the command line; anything else is rejected.
parse_args() {
  while [ "$1" != "" ]; do
    case $(grep -o "=" <<< "$1" | wc -l) in
      1 )
        config_overrides+=("$1")
        ;;
      * )
        echo "Invalid config override: $1"
        usage
        exit 1
    esac
    shift
  done
  config_overrides="${config_overrides[*]}"
}

config_overrides=()
parse_args "$@"

if [ -z "${config_overrides}" ]; then
  echo "No NeMo config overrides specified"
else
  echo "NeMo config overrides:"
  echo " ${config_overrides}"
fi

# Make the NCCL plugin libraries visible to the dynamic linker.
export LD_LIBRARY_PATH="$NCCL_PLUGIN_PATH"
ldconfig $LD_LIBRARY_PATH
echo "Added $LD_LIBRARY_PATH to ldconfig:"
ldconfig -p | grep libcuda | sed 's/^/ /'
echo ""

if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then
  explicit_log_dir=${EXPLICIT_LOG_DIR}
else
  explicit_log_dir=workload_logs
fi
echo "Logging to ${explicit_log_dir}"

if [[ -n "${TOKENIZER_PATH}" ]]; then
  echo "Getting tokenizer files"
  cp ${TOKENIZER_PATH}/* .
  echo ""
fi

echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes"

pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger

# Create the nsys directory.
mkdir -p ${explicit_log_dir}/nsys

# Launch one training process per GPU; numactl binds each local rank's CPU and
# memory allocation to a NUMA node (two ranks per NUMA node).
torchrun --no-python \
  --nproc-per-node="${GPUS_PER_NODE}" \
  --nnodes="${NNODES}" \
  --node_rank="${JOB_COMPLETION_INDEX}" \
  --rdzv_id="${JOB_IDENTIFIER}" \
  --master_addr="${MASTER_ADDR}" \
  --master_port="${MASTER_PORT}" \
  bash -c "numactl --cpunodebind=\$((LOCAL_RANK/2)) --membind=\$((LOCAL_RANK/2)) python ${NEMO_LAUNCH_SCRIPT} ${config_overrides}"

# Only the rank 0 node copies logs and run metadata to the artifact directory.
if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
  mkdir -p ${ARTIFACT_DIR}
  cp -r ${explicit_log_dir}/* ${ARTIFACT_DIR}/
  cp ${NEMO_LAUNCH_SCRIPT} ${ARTIFACT_DIR}/run-cli.py
  cp dllogger.json ${ARTIFACT_DIR}/dllogger.json
  env > ${ARTIFACT_DIR}/environ.txt
  ls ${ARTIFACT_DIR}
fi
echo "Training completed"
echo "Pod on $(hostname --fqdn) is exiting"
