
Distributed optimizer support for multiple dtypes #1721

Merged

Conversation

timmoon10 (Contributor) commented:

This PR adds logic so that parameters can be configured with different dtypes for the grad reduce-scatters and param all-gathers. I have two NeMo use-cases in mind (a configuration sketch follows the list):

  • For GPT, most grads can be reduced in BF16 but embedding grads need to be reduced in FP32 to avoid learning issues.
  • For FP8 support, weight matrices can be stored in FP8 while most other parameters (e.g. biases, layernorm params, embeddings) are in BF16. We would like to handle FP8 and BF16 param all-gathers in the same optimizer.
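
As a rough illustration of the use-cases above, here is a minimal sketch of how a per-parameter-group dtype configuration might look with apex's DistributedFusedAdam. The per-group keys `grad_sync_dtype` and `param_sync_dtype` (and their acceptance inside param groups) are assumptions inferred from this PR's description rather than a confirmed API, and the toy model, learning rate, and launch setup are illustrative only.

```python
# Sketch only: assumes per-group "grad_sync_dtype"/"param_sync_dtype" overrides
# as described in this PR; the exact option names may differ in the merged API.
import torch
import torch.distributed as dist
from apex.contrib.optimizers.distributed_fused_adam import DistributedFusedAdam

# Requires a distributed launcher, e.g.
#   torchrun --nproc_per_node=<N> this_script.py
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Toy model: an embedding plus a linear layer, on GPU for NCCL comms.
model = torch.nn.Sequential(
    torch.nn.Embedding(1024, 256),
    torch.nn.Linear(256, 256),
).cuda()

embedding_params = list(model[0].parameters())
other_params = list(model[1].parameters())

optimizer = DistributedFusedAdam(
    [
        # Most params: BF16 grad reduce-scatter and BF16 param all-gather.
        {
            "params": other_params,
            "grad_sync_dtype": torch.bfloat16,
            "param_sync_dtype": torch.bfloat16,
        },
        # Embedding grads: reduce in FP32 to avoid learning issues.
        {
            "params": embedding_params,
            "grad_sync_dtype": torch.float32,
        },
    ],
    lr=1e-4,
)
```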

This also includes changes from #1719, which returns the state dict on all ranks and not just rank 0. We can either merge that first and rebase, or merge this and close #1719.

Rough draft.

…checkpoint-allgather
Handle case where we load old checkpoints without multi-dtype support
Signed-off-by: Tim Moon <tmoon@nvidia.com>
crcrpar (Collaborator) left a comment:

lgtm

Review thread on apex/contrib/test/optimizers/test_dist_adam.py (outdated, resolved)
@crcrpar crcrpar merged commit 52e18c8 into NVIDIA:master Sep 6, 2023
@timmoon10 timmoon10 deleted the distopt-multi-dtype-checkpoint-allgather branch September 11, 2023 19:47