WIP: Adding TransformerEngine support for Communication GeMM overlap for Tensor Parallelism #1194

Draft · wants to merge 1 commit into main

Conversation

abhinavgoel95 (Contributor)

Description

This PR adds an API for TransformerEngine's communication-GEMM overlap (a.k.a. collective matmul) to MaxText.

  • Improves the performance of LLM training by ~10% when using Tensor Parallelism (a conceptual sketch of the overlap follows the list below).
  • More stable than relying on XLA's pattern matcher; the custom call generates good schedules.
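
For intuition, here is a minimal, self-contained sketch (NumPy only; this is not the TransformerEngine implementation, and all names and shapes are illustrative assumptions) of why splitting an all-gather + GEMM into per-shard chunks creates the room to overlap communication with compute:

# Conceptual sketch only: chunking an all-gather + GEMM so that, on real hardware,
# the communication fetching shard i+1 can run concurrently with the GEMM on shard i.
import numpy as np

tp = 4                                   # tensor-parallel size (assumed)
per_shard, hidden, out = 64, 256, 512    # toy dimensions (assumed)

# Each TP rank holds one activation shard; the weight is replicated here for simplicity.
shards = [np.random.randn(per_shard, hidden) for _ in range(tp)]
weight = np.random.randn(hidden, out)

# Baseline: all-gather the full activation, then run one large GEMM.
gathered = np.concatenate(shards, axis=0)
baseline = gathered @ weight

# Collective matmul: one per-shard GEMM at a time; results are concatenated at the end.
chunks = [shard @ weight for shard in shards]
overlapped = np.concatenate(chunks, axis=0)

assert np.allclose(baseline, overlapped)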

Tests

I have tested this change on Llama2 and Llama3 workloads so far.

python3 /opt/workspace/maxtext_fork/MaxText/train.py \
    /opt/workspace/maxtext_fork/MaxText/configs/base.yml \
    model_name=${MODEL} \
    per_device_batch_size=0.25 \
    steps=15 \
    scan_layers=true \
    monitor_goodput=false \
    enable_goodput_recording=false \
    remat_policy=minimal_flash \
    attention=cudnn_flash_te \
    max_target_length=4096 \
    use_iota_embed=true \
    logits_dot_in_fp32=false \
    enable_checkpointing=false \
    ici_data_parallelism=1 \
    ici_fsdp_parallelism=2 \
    ici_tensor_parallelism=1 \
    ici_tensor_sequence_parallelism=4 \
    base_output_directory=local_train \
    dataset_path=local \
    dataset_type=synthetic \
    hardware=gpu_mpi \
    comm_gemm_overlap=true

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

"atomic_gemm": False, # more performant when not using CUDA Graphs
"use_ce": True, # ignored (always False) for "pipeline" method
},
"fc2_fprop": {
abhinavgoel95 (Contributor, Author):
Avoid nested configs. Make it editable through the command line.

Collaborator:
Apologies if you've already started to refactor. I think the nested config is fine if we expect the default settings to be fixed, i.e. there is almost never a need to use anything other than the defaults. If we expect to often want something different from the defaults, then I would strongly prefer an easy way to override them via the CLI.

abhinavgoel95 (Contributor, Author):
These settings would not need to be tweaked after the model is set up. Thanks for the input; I will not make that change for now.
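
For reference, here is a hedged sketch of the kind of nested per-GEMM overlap config being discussed, plus one possible flat CLI-override scheme. Only the keys visible in the diff above ("atomic_gemm", "use_ce", "fc2_fprop", and the "pipeline" method) come from this PR; the dict name, the "fc1_fprop" entry, and the apply_cli_overrides helper are hypothetical illustrations, not part of the change:

# Hypothetical sketch; not the PR's actual config structure.
DEFAULT_OVERLAP_CONFIG = {
    "fc1_fprop": {                 # hypothetical sibling entry
        "method": "pipeline",
        "atomic_gemm": False,      # more performant when not using CUDA Graphs
        "use_ce": True,            # ignored (always False) for "pipeline" method
    },
    "fc2_fprop": {
        "method": "pipeline",
        "atomic_gemm": False,
        "use_ce": True,
    },
}

def apply_cli_overrides(config, overrides):
    """Hypothetical helper: apply flat 'gemm.key=value' strings to the nested dict."""
    for item in overrides:
        path, value = item.split("=", 1)
        gemm, key = path.split(".", 1)
        config[gemm][key] = {"true": True, "false": False}.get(value.lower(), value)
    return config

# Example: flip a single field from the command line without editing the nested config.
apply_cli_overrides(DEFAULT_OVERLAP_CONFIG, ["fc2_fprop.atomic_gemm=true"])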
