AttributeError: module 'megatron.core.parallel_state' has no attribute 'get_amax_reduction_group' #6625

Closed
yen-shi opened this issue May 10, 2023 · 4 comments · Fixed by #6791
Labels
bug Something isn't working

Comments

@yen-shi
Contributor

yen-shi commented May 10, 2023

Describe the bug

When running megatron_gpt_eval.py with an FP8 model, the FP8 code path is taken but fails because get_amax_reduction_group() cannot be found in parallel_state.

The error is triggered at this line:
https://github.com/NVIDIA/NeMo/blob/21048627b3923c9268842990aafdef141bd14bd1/nemo/collections/nlp/modules/common/megatron/transformer.py#L1432
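
For context, the FP8 path is roughly of the following shape (a minimal sketch, not the exact NeMo code; the recipe settings are placeholders, while fp8_autocast and DelayedScaling come from Transformer Engine's public API):

import transformer_engine.pytorch as te
from transformer_engine.common import recipe
from megatron.core import parallel_state

# Placeholder recipe; the real values come from the model config.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

# The FP8 path asks parallel_state for the amax reduction group. With
# Megatron-Core (unlike Apex) the attribute does not exist, so this call
# raises the AttributeError reported above.
fp8_group = parallel_state.get_amax_reduction_group()

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe, fp8_group=fp8_group):
    ...  # forward pass through the TE transformer layers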

Steps/Code to reproduce bug

Get a model trained with TE FP8

Run:
python megatron_gpt_eval.py gpt_model_file=models/5b_fp8_tp1.nemo

Expected behavior

The script is expected to finish and generate outputs.

Environment overview (please complete the following information)

  • Environment location:
    Container: nvcr.io/nvidia/pytorch:23.03-py3
  • Method of NeMo install:
    Call ./reinstall.sh on main branch commit c3deeac
  • If method of install is [Docker], provide docker pull & docker run commands used
    docker run --gpus '"device=0"' -it --ipc=host --ulimit memlock=-1 -v /home/scratch.yenshiw_sw/NeMo:/workspace/local-nemo --ulimit stack=67108864 nvcr.io/nvidia/pytorch:23.03-py3

Environment details

Additional context

I cannot find name get_amax_reduction_group in megatron source code (parallel_state):
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/parallel_state.py

@yen-shi added the bug label on May 10, 2023
@timmoon10
Collaborator

timmoon10 commented May 10, 2023

Looks like this is a bug from switching from Apex to Megatron-core (#6393). There was some recent MLPerf-related development in Apex that optimized AMAX reductions in Transformer Engine (NVIDIA/apex#1585, NVIDIA/apex#1597).

Perhaps this belongs as its own issue, but I see more recent changes in Apex that haven't made their way to Megatron-core yet.

Pinging @erhoo82 @Aidyn-A @ksivaman

@aklife97
Collaborator

This doesn't look like it'll need any NeMo-side changes.
Megatron-Core does not have the _AMAX_REDUCTION_GROUP that Apex provides, which we need in order to make FP8 work. We'd need to add it to Core, which should directly enable it in NeMo.

That said, this is obviously a regression, since Apex supported it and we don't yet have it in Core, but it should be a fairly straightforward fix as soon as we have it there.
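
For reference, the kind of addition described above could look roughly like this in a parallel_state-style module (a minimal sketch; initialize_amax_reduction_group and its rank-list argument are hypothetical, and the actual Megatron-Core implementation may differ):

import torch.distributed as dist

# Module-level handle, mirroring how other groups are stored in parallel_state.
_AMAX_REDUCTION_GROUP = None

def initialize_amax_reduction_group(amax_rank_groups):
    # amax_rank_groups: hypothetical list of rank lists, each covering the
    # tensor-parallel and data-parallel ranks that share FP8 amax state.
    global _AMAX_REDUCTION_GROUP
    world_rank = dist.get_rank()
    for ranks in amax_rank_groups:
        # new_group must be called by all ranks; only members keep the handle.
        group = dist.new_group(ranks)
        if world_rank in ranks:
            _AMAX_REDUCTION_GROUP = group

def get_amax_reduction_group():
    # Return the group used for FP8 amax reductions.
    assert _AMAX_REDUCTION_GROUP is not None, "amax reduction group is not initialized"
    return _AMAX_REDUCTION_GROUP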

@erhoo82
Collaborator

erhoo82 commented May 25, 2023

_AMAX_REDUCTION_GROUP was added to merge the TP and DP reductions into a single communication call. It is only needed for FP8 training and should be added to Megatron-Core for compatibility with TE.
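
To illustrate the saving (a minimal sketch; tp_group, dp_group, and tp_dp_group are hypothetical group handles, not Megatron-Core names):

import torch
import torch.distributed as dist

def reduce_amax_separately(amax: torch.Tensor, tp_group, dp_group):
    # Two communication calls: one across tensor-parallel ranks,
    # then one across data-parallel ranks.
    dist.all_reduce(amax, op=dist.ReduceOp.MAX, group=tp_group)
    dist.all_reduce(amax, op=dist.ReduceOp.MAX, group=dp_group)

def reduce_amax_merged(amax: torch.Tensor, tp_dp_group):
    # One collective over a group spanning both TP and DP ranks;
    # the result is the same global max with a single communication call.
    dist.all_reduce(amax, op=dist.ReduceOp.MAX, group=tp_dp_group)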

@aklife97
Collaborator

@erhoo82, thanks to @timmoon10, this is already in Core now.
I'm waiting for #6627 to merge, after which we should be able to close this issue.
