Commit: Merge branch 'main' into memory_stats
Showing 96 changed files with 4,282 additions and 711 deletions.
CODEOWNERS
@@ -1,2 +1,2 @@
 # Changes in this file should match with requiredReviewers in file .github/workflows/AddLabel.yml
-* @gobbleturk @jonb377 @khatwanimohit @bvandermoon @vipannalla
+* @gobbleturk @khatwanimohit @bvandermoon @vipannalla
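The comment in CODEOWNERS asks that this owner list stay in sync with the requiredReviewers list in .github/workflows/AddLabel.yml. A rough, hypothetical spot check (the grep context width is a guess, since that workflow's layout is not shown in this diff):

# Print both lists so a reviewer can compare them manually.
grep -v '^#' CODEOWNERS
grep -A 5 'requiredReviewers' .github/workflows/AddLabel.yml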
New file: pull request description template (28 added lines)
@@ -0,0 +1,28 @@
+# Description
+
+Start with a short description of what the PR does and how this is a change from
+the past.
+
+The rest of the description includes relevant details and context, examples:
+- why is this change being made,
+- the problem being solved and any relevant context,
+- why this is a good solution,
+- some information about the specific implementation,
+- shortcomings of the solution and possible future improvements.
+
+If the change fixes a bug or a GitHub issue, please include a link, e.g.:
+FIXES: b/123456
+FIXES: #123456
+
+# Tests
+
+Please describe how you tested this change, and include any instructions and/or
+commands to reproduce.
+
+# Checklist
+
+Before submitting this PR, please make sure (put X in square brackets):
+- [ ] I have performed a self-review of my code.
+- [ ] I have necessary comments in my code, particularly in hard-to-understand areas.
+- [ ] I have run end-to-end tests and provided workload links above if applicable.
+- [ ] I have made or will make corresponding changes to the doc if needed.
New workflow: Require Checklist (11 added lines)
@@ -0,0 +1,11 @@
+name: Require Checklist
+on:
+  pull_request:
+    types: [opened, edited, synchronize]
+jobs:
+  check_pr_body:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: mheap/require-checklist-action@v2
+        with:
+          requireChecklist: true # If this is true and there are no checklists detected, the action will fail
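With requireChecklist: true, the check fails whenever a pull request body contains no Markdown checklist items. A hypothetical way to open a PR whose body would satisfy the new workflow, assuming the GitHub CLI (gh) is installed; the title and body text are placeholders, not values from this commit:

# Sketch only: creates a PR whose body contains a completed checklist item.
gh pr create \
  --title "Add memory stats" \
  --body $'# Checklist\n- [x] I have performed a self-review of my code.'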
New file: MaxText/configs/a3/llama_3.1_405b/128vm.sh (56 added lines)
@@ -0,0 +1,56 @@
+echo "Running 128vm.sh"
+# Example command to invoke this script via XPK, assuming you've installed xpk
+# COMMAND="bash MaxText/configs/a3/llama_3.1_405b/128vm.sh"
+# COMMAND='export LD_LIBRARY_PATH=/usr/local/cuda-12.6/compat:$LD_LIBRARY_PATH;'"${COMMAND}";
+#
+# xpk workload create --project=${PROJECT} --cluster=${CLUSTER_NAME} --zone=${ZONE} \
+#   --workload=${WORKLOAD_NAME} --docker-image=gcr.io/supercomputer-testing/${LOCAL_IMAGE_NAME} \
+#   --device-type=${DEVICE_TYPE} --num-nodes=2 --priority=high \
+#   --command="$COMMAND" --env=XLA_FLAGS=$XLA_FLAGS
+
+# Stop execution if any command exits with an error
+set -e
+
+export OUTPUT_PATH="gs://maxtext-experiments-multipod"
+export RUN_NAME="llama-31-128vm-$(date +%Y-%m-%d-%H-%M)"
+export EXECUTABLE="train.py"
+
+# Export every KEY=VALUE argument passed to the script as an environment variable
+for ARGUMENT in "$@"; do
+    IFS='=' read -r KEY VALUE <<< "$ARGUMENT"
+    export "$KEY"="$VALUE"
+done
+
+export XLA_FLAGS="--xla_dump_to=$OUTPUT_PATH/$RUN_NAME/HLO_dumps/
+--xla_gpu_enable_latency_hiding_scheduler=true
+--xla_gpu_enable_triton_gemm=false --xla_gpu_graph_level=0
+--xla_gpu_enable_highest_priority_async_stream=true
+--xla_gpu_all_reduce_combine_threshold_bytes=1073741824 --xla_gpu_all_gather_combine_threshold_bytes=134217728
+--xla_gpu_reduce_scatter_combine_threshold_bytes=134217728 --xla_gpu_enable_pipelined_all_gather=true
+--xla_gpu_enable_pipelined_reduce_scatter=true --xla_gpu_enable_pipelined_all_reduce=true
+--xla_gpu_enable_while_loop_double_buffering=true --xla_gpu_enable_triton_softmax_fusion=false
+--xla_gpu_enable_all_gather_combine_by_dim=false --xla_gpu_enable_reduce_scatter_combine_by_dim=false
+--xla_disable_hlo_passes=rematerialization"
+
+# 128 nodes
+python MaxText/$EXECUTABLE MaxText/configs/models/llama3.1_405b.yml run_name=$RUN_NAME \
+  base_config=base.yml \
+  run_name=gpu_train_test \
+  hardware=gpu \
+  steps=10 \
+  model_name=llama3.1-405b \
+  enable_checkpointing=False \
+  attention=cudnn_flash_te \
+  remat_policy=full \
+  use_iota_embed=True \
+  scan_layers=True \
+  dataset_type=synthetic \
+  async_checkpointing=False \
+  logits_dot_in_fp32=False \
+  per_device_batch_size=1.0 \
+  max_target_length=8192 \
+  dcn_fsdp_parallelism=128 \
+  ici_fsdp_parallelism=8 \
+  base_output_directory=$OUTPUT_PATH \
+  profiler=xplane
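Because the script exports every KEY=VALUE argument it receives, the defaults above (for example OUTPUT_PATH and RUN_NAME) can be overridden at invocation time. A minimal sketch; the bucket and run name below are placeholders, not values from this commit:

# Override defaults via KEY=VALUE arguments; the script's for-loop exports each one.
bash MaxText/configs/a3/llama_3.1_405b/128vm.sh \
  OUTPUT_PATH=gs://my-bucket/maxtext-runs \
  RUN_NAME=llama31-405b-smoke-test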