AttributeError: module 'megatron.core.parallel_state' has no attribute 'get_amax_reduction_group' #6625
Comments
Looks like this is a bug from switching from Apex to Megatron-core (#6393). There was some recent MLPerf-related development in Apex that optimized AMAX reductions in Transformer Engine (NVIDIA/apex#1585, NVIDIA/apex#1597). Perhaps this belongs as its own issue, but I see more recent changes in Apex that haven't made their way to Megatron-core:

This doesn't look like it'll need any NeMo-side changes. That said, this is obviously a regression since Apex supported it and we don't yet have it in Core, but it should be a fairly straightforward resolution as soon as we have it there.

@erhoo82, thanks to @timmoon10 this is already in core now.
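For context, here is a minimal, hypothetical sketch of what a get_amax_reduction_group-style helper usually amounts to: a torch.distributed process group created during parallel-state initialization and returned by a module-level getter. This is not the actual Apex or Megatron-core implementation, and which ranks belong in the group depends on the parallelism layout, so treat both as assumptions.

```python
# Hypothetical sketch (not the actual Apex/Megatron-core code): the AMAX
# reduction group is a torch.distributed process group cached at
# initialization time and returned by a module-level getter.
import torch.distributed as dist

_AMAX_REDUCTION_GROUP = None

def initialize_amax_reduction_group(ranks):
    """Create and cache the process group used for FP8 amax all-reduces.

    `ranks` is the list of global ranks that share FP8 scaling state; the
    exact membership depends on the parallelism layout and is an assumption
    in this sketch.
    """
    global _AMAX_REDUCTION_GROUP
    _AMAX_REDUCTION_GROUP = dist.new_group(ranks=ranks)

def get_amax_reduction_group():
    """Return the cached group; fail loudly if parallel state was never set up."""
    assert _AMAX_REDUCTION_GROUP is not None, "amax reduction group is not initialized"
    return _AMAX_REDUCTION_GROUP
```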
Describe the bug
When running megatron_gpt_eval.py with an FP8 model, the FP8 code path is taken but cannot find get_amax_reduction_group() in parallel_state. The error is triggered at this line:
https://github.com/NVIDIA/NeMo/blob/21048627b3923c9268842990aafdef141bd14bd1/nemo/collections/nlp/modules/common/megatron/transformer.py#LL1432C70-L1432C70
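For reference, this is the rough shape of the failing path, simplified: NeMo looks up the amax reduction group from parallel_state and hands it to Transformer Engine's fp8_autocast as fp8_group, so the AttributeError fires before FP8 execution even starts. The exact code at the linked line may differ, and the recipe arguments below are illustrative only.

```python
# Simplified illustration of the failing path; the real call site is in
# nemo/collections/nlp/modules/common/megatron/transformer.py and may differ.
import transformer_engine.pytorch as te
from transformer_engine.common import recipe
from megatron.core import parallel_state

# Illustrative FP8 recipe; the arguments NeMo actually uses may differ.
fp8_recipe = recipe.DelayedScaling(margin=0, amax_history_len=1024)

# This lookup raises AttributeError because the helper that existed in
# apex.transformer.parallel_state is not present in this megatron.core:
fp8_group = parallel_state.get_amax_reduction_group()

# The group is then passed to Transformer Engine so FP8 amax values are
# all-reduced across the right ranks:
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe, fp8_group=fp8_group):
    ...  # forward pass of the transformer layer goes here
```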
Steps/Code to reproduce bug
Get a model trained with TE FP8
Run:
python megatron_gpt_eval.py gpt_model_file=models/5b_fp8_tp1.nemo
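The failure should also be reproducible without the trained checkpoint, since it happens at attribute-lookup time rather than during generation; a quick check under that assumption:

```python
# Quick check that needs no .nemo file: see whether the helper exists in the
# Megatron-core version installed in the container.
from megatron.core import parallel_state

try:
    parallel_state.get_amax_reduction_group
except AttributeError as err:
    print(f"Missing helper: {err}")
```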
Expected behavior
The script is expected to finish and generate outputs.
Environment overview (please complete the following information)
Container: nvcr.io/nvidia/pytorch:23.03-py3
NeMo installed by calling ./reinstall.sh on main branch commit c3deeac
docker pull & docker run commands used:
docker run --gpus '"device=0"' -it --ipc=host --ulimit memlock=-1 -v /home/scratch.yenshiw_sw/NeMo:/workspace/local-nemo --ulimit stack=67108864 nvcr.io/nvidia/pytorch:23.03-py3
Environment details
Additional context
I cannot find the name get_amax_reduction_group in the Megatron source code (parallel_state): https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/parallel_state.py
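To confirm which group helpers the installed megatron.core.parallel_state actually exposes (assuming the copy installed in the container matches the linked source), something like this can list them:

```python
# Enumerate the process-group getters exposed by the installed
# megatron.core.parallel_state to confirm get_amax_reduction_group is absent.
from megatron.core import parallel_state

getters = sorted(
    name for name in dir(parallel_state)
    if name.startswith("get_") and name.endswith("_group")
)
print("\n".join(getters))
print("get_amax_reduction_group present:",
      hasattr(parallel_state, "get_amax_reduction_group"))
```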