-
Notifications
You must be signed in to change notification settings - Fork 308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PGLE doesn't work for Tensor Parallelism #1005
Comments
@Tixxx do you know what the issue is? I'm trying to reproduce this issue myself still. |
I cannot access the screenshot above, it says page not found. Just a preliminary guess, the combiner threshold might introduce more data dependencies, so we usually tune it down if the collective is a combined one with a lot of data dependencies. I have tried reproing using your command on maxtext main, but the yaml file doesnt exist for me. Would you be able to share a smaller model that can be easily repro'd on a single node? Thanks |
Syncing in the chat.
…On Wed, Nov 6, 2024 at 4:46 PM TJ Xu ***@***.***> wrote:
I cannot access the screenshot above, it says page not found. Just a
preliminary guess, the combiner threshold might introduce more data
dependencies, so we usually tune it down if the collective is a combined
one with a lot of data dependencies.
I have tried reproing using your command on maxtext main, but the yaml
file doesnt exist for me. Would you be able to share a smaller model that
can be easily repro'd on a single node? Thanks
—
Reply to this email directly, view it on GitHub
<#1005 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADEGX4AXCCKXWBAI5DKQXCLZ7KZ67AVCNFSM6AAAAABQ7L5FSCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINRRGA4TGMZTHE>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
--
Cheers,
Lance
|
We observed good overlap with FSDP + PGLE:
. Turning on and off PGLE makes a big difference here.
However, with TP + PGLE:
There is no performance improvements. Computation and communications are completely exposed.
Here is the command:
switch to lance-405b-clean branch
python3
MaxText/train.py MaxText/configs/models/gpu/llama3.1_405b.yml hardware=gpu
run_name=maxtext-llama3.1-405b steps=10 max_target_length=4096 model_name=llama3.1-405b
enable_checkpointing=false attention=cudnn_flash_te dataset_type=synthetic
async_checkpointing=false base_output_directory=gs://lancewang-dev-supercomputer-testing/maxtext_gpu
logits_dot_in_fp32=false use_iota_embed=true ici_tensor_parallelism=8 dcn_fsdp_parallelism=32
dcn_pipeline_parallelism=1 per_device_batch_size=1 num_layers_per_pipeline_stage=16 weight_dtype=bfloat16
remat_policy=save_qkv_proj profiler=xplane skip_first_n_steps_for_profiler=5
base_num_decoder_layers=126
Here are the xla flags:
--xla_gpu_graph_level=0 --xla_gpu_enable_latency_hiding_scheduler=true --xla_gpu_enable_triton_gemm=false
--xla_gpu_enable_highest_priority_async_stream=true --xla_gpu_all_reduce_combine_threshold_bytes=536870912
--xla_gpu_all_gather_combine_threshold_bytes=536870912 --xla_gpu_reduce_scatter_combine_threshold_bytes=536870912
--xla_gpu_enable_pipelined_all_gather=true --xla_gpu_enable_pipelined_reduce_scatter=true
--xla_gpu_enable_pipelined_all_reduce=true --xla_gpu_enable_while_loop_double_buffering=true
--xla_disable_hlo_passes=rematerialization --xla_gpu_enable_pgle_accuracy_checker=false
--xla_gpu_enable_triton_softmax_fusion=false --xla_gpu_enable_all_gather_combine_by_dim=false
--xla_gpu_enable_reduce_scatter_combine_by_dim=false
Here are the env variable:
NCCL_SHIMNET_GUEST_CONFIG_CHECKER_CONFIG_FILE=/usr/local/nvidia/lib64/a3plus_guest_config.textproto
NCCL_FASTRAK_PLUGIN_ACCEPT_TIMEOUT_MS=600000
JAX_ENABLE_PGLE=true
JAX_REMOVE_CUSTOM_PARTITIONING_PTR_FROM_CACHE_KEY=true
JAX_DEBUG_LOG_MODULES=compiler
The image we built on Oct 22nd.
The text was updated successfully, but these errors were encountered: