Start using ModelParallelConfig from Megatron Core #6885
Conversation
CodeQL found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.
LGTM! Very useful to collect configs into model parallel configs. See minor comments.
        # hidden size is needed for pipeline schedules but is not currently in ModelParallelConfig
        setattr(model_parallel_config, 'hidden_size', self.cfg.hidden_size)
    except AttributeError:
        logging.warning(
Why not also fail here? If it's missing and will fail later, wouldn't here be a good place to stop?
I found this was too brittle. Maybe we can add a `strict` argument?
What do you think about the suggestion?
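A sketch of what the suggested `strict` switch could look like (a hypothetical helper, not code from this PR): the flag escalates the warning to an error so the failure happens at config-build time.

    import logging

    def set_hidden_size(model_parallel_config, cfg, strict: bool = False):
        """Sketch of the suggested 'strict' behavior (hypothetical, not in the PR)."""
        try:
            # cfg attribute access raises AttributeError when hidden_size is absent
            model_parallel_config.hidden_size = cfg.hidden_size
        except AttributeError:
            if strict:
                # fail fast here instead of later in the pipeline schedule
                raise
            logging.warning('hidden_size not found in cfg; pipeline schedules may fail later')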
""" Hidden size needs to be set from the cfg.encoder for the pipeline schedule. | ||
""" | ||
|
||
model_parallel_config = super().build_model_parallel_config() |
Wouldn't the parent class return a warning if `hidden_size` is not in `cfg.model.hidden_size`? Perhaps this argument can be passed to the parent method?
Maybe you could expand more on your suggestion? I was adding this because the parent class didn't have hidden_size for this model.
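For context, a sketch of the override under discussion (method name follows the diff above; the `cfg.encoder` path is the author's stated source of hidden_size, and the return statement is assumed):

    def build_model_parallel_config(self):
        """ Hidden size needs to be set from the cfg.encoder for the pipeline schedule.
        """
        model_parallel_config = super().build_model_parallel_config()
        # this model keeps hidden_size under cfg.encoder rather than at the top level,
        # so the parent class cannot pick it up on its own
        setattr(model_parallel_config, 'hidden_size', self.cfg.encoder.hidden_size)
        return model_parallel_config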
LGTM, thank you!
The main concern I have is MPConfig vs TransformerConfig; we probably need to discuss more how we should structure the usages. Apart from that, this looks like it covers everything.
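For readers following the MPConfig vs TransformerConfig question: in Megatron Core, TransformerConfig extends ModelParallelConfig, roughly as in the simplified sketch below (field lists trimmed and defaults illustrative; not the actual class definitions).

    from dataclasses import dataclass

    @dataclass
    class ModelParallelConfig:
        # parallelism-only knobs: how the model is split across devices
        tensor_model_parallel_size: int = 1
        pipeline_model_parallel_size: int = 1
        sequence_parallel: bool = False

    @dataclass
    class TransformerConfig(ModelParallelConfig):
        # architecture knobs layered on top of the parallel settings
        hidden_size: int = 1024
        num_attention_heads: int = 16
        num_layers: int = 24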
LGTM, thank you!! Just one potential issue with the sequence length setting.
LGTM! I think we should merge this in now
@michalivne: let us know what your feedback is on Eric's response, and we can send fixes in later PRs accordingly!
* start adding gpt from megatron core path
* set model parallel config
* use model parallel config object
* update args
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* set vp size to none if it is 1
* set vp size to none if it is 1
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* add TransformerConfig
* start updating to TransformerConfig
* add todo
* revert to model parallel config
* add hidden_size to model_parallel_config
* remove imports
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* remove import
* small clean up
* update hidden size in peft base model, add mcore commit to jenkins
* update module args
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* add config obj to flash attention tests
* remove args
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* remove sequence parallel arg
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* update args
* add config to self
* update args
* update args
* update args
* add config to test
* get hidden_size from config
* add try except
* use default
* update config with hidden size
* remove arg
* comment out jenkins test
* revert import
* remove optimizer_idx
* prefetch num microbatches
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* remove import
* temporarily comment jenkins test
* update seq_length
* remove commented code
* update arg
* update mbs and gbs of test
* update batch size in test
* fix precision in test
* update precision
* move hidden_size out of conditional
* [pre-commit.ci] auto fixes from pre-commit.com hooks

Signed-off-by: ericharper <complex451@gmail.com>
Signed-off-by: eharper <eharper@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    @@ -16,6 +16,7 @@
     import pytest
     import torch
    +from megatron.core import ModelParallelConfig
This is breaking `pytest --cpu` when doing a basic setup without all the fluff.
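One possible fix, sketched under the assumption that the hard top-level import is the problem: guard the import so test collection still works when megatron.core is not installed (the HAVE_MEGATRON_CORE flag name is illustrative).

    # guard the optional dependency so `pytest --cpu` can collect tests
    # in environments where megatron.core is not installed
    try:
        from megatron.core import ModelParallelConfig

        HAVE_MEGATRON_CORE = True
    except (ImportError, ModuleNotFoundError):
        HAVE_MEGATRON_CORE = False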
What does this PR do?
This PR adds the ModelParallelConfig arguments to be used with the next release of Megatron Core.
Collection: NLP
Changelog
Usage
# Add a code snippet demonstrating how to use this
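A minimal sketch of the intended usage, assuming Megatron Core's ModelParallelConfig dataclass with illustrative values (this snippet is not taken from the PR itself):

    from megatron.core import ModelParallelConfig

    # collect the parallelism-related settings into a single config object
    config = ModelParallelConfig(
        tensor_model_parallel_size=2,
        pipeline_model_parallel_size=1,
        sequence_parallel=False,
    )

    # NeMo megatron models build this object from their cfg and pass it to
    # Megatron Core modules, e.g. via model.build_model_parallel_config()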
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.
Additional Information