
[DeepSpeed] Support TP=nGPU and PP=DP=1 #56

Merged 1 commit from tp-n into awslabs:main on Feb 14, 2023
Conversation

comaniac (Contributor)

Description

For experimental purposes, this PR supports the case where TP equals the number of GPUs (TP=#GPUs, PP=DP=1). Specifically,

  1. After scheduling the model, set enable_pipeline=False so that we use the DeepSpeed engine with ZeRO-0 instead of the pipeline engine.
  2. In the DeepSpeed dialect, make sure the mpu with the device topology is passed to the DeepSpeed engine even when the pipeline is disabled, so that the DeepSpeed runtime can correctly configure the data parallel groups (see the sketch after this list).
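
For reference, here is a minimal sketch of what such an mpu can look like for TP=#GPUs and DP=1, handed to `deepspeed.initialize` with a ZeRO-0 config. This is an illustration under stated assumptions, not the actual Slapo internals: the class name `TensorParallelMPU` and the config values are hypothetical, while `deepspeed.initialize`'s `mpu` argument and the `get_*_parallel_*` accessor methods it expects are standard DeepSpeed.

```python
# Hedged sketch: an mpu describing TP=#GPUs, DP=1 that the plain DeepSpeed
# engine (ZeRO-0, pipeline disabled) can use to set up its parallel groups.
# "TensorParallelMPU" and the config values are illustrative, not Slapo code.
import deepspeed
import torch.distributed as dist


class TensorParallelMPU:
    """All ranks form one tensor-parallel group; each DP group has one rank."""

    def __init__(self):
        world_size = dist.get_world_size()
        # One model(tensor)-parallel group spanning every rank (TP = nGPU).
        self._mp_group = dist.new_group(ranks=list(range(world_size)))
        # new_group() is collective, so every rank creates all DP groups,
        # then keeps the one it belongs to (DP = 1 per rank).
        self._dp_groups = [dist.new_group(ranks=[r]) for r in range(world_size)]

    def get_model_parallel_group(self):
        return self._mp_group

    def get_model_parallel_world_size(self):
        return dist.get_world_size()

    def get_model_parallel_rank(self):
        return dist.get_rank()

    def get_data_parallel_group(self):
        return self._dp_groups[dist.get_rank()]

    def get_data_parallel_world_size(self):
        return 1

    def get_data_parallel_rank(self):
        return 0


# With the pipeline disabled, passing the mpu lets DeepSpeed derive the
# data-parallel groups from the device topology instead of assuming DP=world.
ds_config = {"train_batch_size": 8, "zero_optimization": {"stage": 0}}
engine, _, _, _ = deepspeed.initialize(
    model=model,  # the scheduled model (assumed defined elsewhere)
    model_parameters=model.parameters(),
    config=ds_config,
    mpu=TensorParallelMPU(),
)
```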

Checklist

  • PR's title starts with a category (e.g. [Bugfix], [Model], [Tutorial], etc.)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

cc @zarzen

@comaniac comaniac merged commit eaf17e1 into awslabs:main Feb 14, 2023
@comaniac comaniac deleted the tp-n branch February 14, 2023 18:09