
[DeepSpeed] Support TP=nGPU and PP=DP=1 #56

Merged 1 commit from tp-n into awslabs:main on Feb 14, 2023
Conversation

comaniac (Contributor)

Description

For experimental purposes, this PR supports the case where TP equals the number of GPUs (TP=#GPUs, PP=DP=1). Specifically,

  1. After scheduling the model, set enable_pipeline=False so that we use the DeepSpeed engine with ZeRO-0 instead of the pipeline engine.
  2. In the DeepSpeed dialect, make sure the mpu with the device topology is passed to the DeepSpeed engine even when the pipeline is disabled, so that the DeepSpeed runtime can correctly configure the data parallel groups (see the sketch after this list).
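
For reference, here is a minimal sketch of what such an mpu can look like for TP=#GPUs and DP=1, handed to `deepspeed.initialize` with a ZeRO-0 config. This is an illustration under stated assumptions, not the actual Slapo internals: the class name `TensorParallelMPU` and the config values are hypothetical, while `deepspeed.initialize`'s `mpu` argument and the `get_*_parallel_*` accessor methods it expects are standard DeepSpeed.

```python
# Hedged sketch: an mpu describing TP=#GPUs, DP=1 that the plain DeepSpeed
# engine (ZeRO-0, pipeline disabled) can use to set up its parallel groups.
# "TensorParallelMPU" and the config values are illustrative, not Slapo code.
import deepspeed
import torch.distributed as dist


class TensorParallelMPU:
    """All ranks form one tensor-parallel group; each DP group has one rank."""

    def __init__(self):
        world_size = dist.get_world_size()
        # One model(tensor)-parallel group spanning every rank (TP = nGPU).
        self._mp_group = dist.new_group(ranks=list(range(world_size)))
        # new_group() is collective, so every rank creates all DP groups,
        # then keeps the one it belongs to (DP = 1 per rank).
        self._dp_groups = [dist.new_group(ranks=[r]) for r in range(world_size)]

    def get_model_parallel_group(self):
        return self._mp_group

    def get_model_parallel_world_size(self):
        return dist.get_world_size()

    def get_model_parallel_rank(self):
        return dist.get_rank()

    def get_data_parallel_group(self):
        return self._dp_groups[dist.get_rank()]

    def get_data_parallel_world_size(self):
        return 1

    def get_data_parallel_rank(self):
        return 0


# With the pipeline disabled, passing the mpu lets DeepSpeed derive the
# data-parallel groups from the device topology instead of assuming DP=world.
ds_config = {"train_batch_size": 8, "zero_optimization": {"stage": 0}}
engine, _, _, _ = deepspeed.initialize(
    model=model,  # the scheduled model (assumed defined elsewhere)
    model_parameters=model.parameters(),
    config=ds_config,
    mpu=TensorParallelMPU(),
)
```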

Checklist

  • PR's title starts with a category (e.g. [Bugfix], [Model], [Tutorial], etc.)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

cc @zarzen

@comaniac comaniac merged commit eaf17e1 into awslabs:main Feb 14, 2023
@comaniac comaniac deleted the tp-n branch February 14, 2023 18:09