[BUG] ZeRO optimizer with MoE Expert Parallelism #5618

@Jack47

Description

Describe the bug
Just like PR #5259, the ZeRO optimizer also needs to be fixed in two places:

  1. the partition logic for expert params;
  2. average_tensor, used for gradient reduction in ZeRO stage 2 (see the sketch after this list).
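For context, a minimal sketch of what the corrected averaging might look like. The helper names (`expert_dp_group`, `is_expert`) are hypothetical stand-ins for DeepSpeed's process-group bookkeeping, not its actual API, and the normalization shown simply encodes the equality this issue expects:

```python
# Minimal sketch (hypothetical helper names, not DeepSpeed's actual API).
# Expert params are replicated only across the expert data-parallel group
# (world_size // ep_size ranks), so their gradients are reduced over that
# group -- but must still be normalized so the result matches an ep=1 run.
import torch.distributed as dist

def average_gradient(param, dp_group, expert_dp_group):
    if getattr(param, "is_expert", False):
        # Reduce over the smaller group that actually holds replicas...
        dist.all_reduce(param.grad, group=expert_dp_group)
        # ...but divide by the full data-parallel world size; dividing
        # only by the expert group size would leave the gradient
        # ep_size times too large, which is the bug reported here.
        param.grad.div_(dist.get_world_size(group=dp_group))
    else:
        dist.all_reduce(param.grad, group=dp_group)
        param.grad.div_(dist.get_world_size(group=dp_group))
```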

To Reproduce
Steps to reproduce the behavior:

Use ep=4 (expert parallelism) with the AdamW optimizer to train an LLM.

Expected behavior
Expert gradients should be equal under ep=4 and ep=1, but currently they come out 4 times bigger under ep=4. A numeric illustration follows.
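One plausible way the 4x factor arises, assuming (for illustration only) 8 data-parallel ranks and an averaging divisor of the expert data-parallel group size instead of the full world size:

```python
world_size = 8                            # data-parallel ranks (assumed)
ep_size = 4
expert_dp_size = world_size // ep_size    # 2 ranks replicate each expert

buggy_divisor = expert_dp_size            # divides by the expert group size
expected_divisor = world_size             # matches the ep=1 baseline
print(expected_divisor / buggy_divisor)   # 4.0 -> "4 times bigger"
```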

Labels

bug (Something isn't working), training
