Move Parallelism usage from Apex -> Megatron Core #6393
Conversation
CodeQL found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.
LGTM. Thank you!
This PR was a massive effort. Thanks to all for their contributions and especially @aklife97 for putting it all together here.
LGTM. Thanks!
Squashed commit message:

* import parallel_state and tensor_parallel from megatron.core
* update column parallel async allreduce arg
* typos
* play stash + some changes
* make grad scaler callable
* Fixed formatting
* Make sure RETRO integrates well with the core (NVIDIA#6207): fix tests; [pre-commit.ci] auto fixes from pre-commit.com hooks
* [NLP] Support T5 with Megatron Core (NVIDIA#6222): Support T5 with Megatron Core; Remove comment; Update prediction step; Further changes to fix fine-tuning; Bug fixes from runs; Revert changes to batch sampler, swap to pretrained sampler; Address feedback
* GPT P-tuning core (max_len pad -> slow)
* add GPT p-tuning w/ global batch based passing
* add T5 p-tuning support
* add megatron core install to Jenkinsfile
* fix command
* add guard default for arg
* shift bert, retro, adapter + other namespace changes
* build_model merge into one
* Ensure fine-tuning/prompt learning work for T5 (NVIDIA#6385)
* rm extra split impl
* fix for CI
* temp change for tests
* add bs=1 for log
* fix
* iter changes NMT
* NMT partial fix
* move on_train_batch_end to base_model
* rm on_train_batch_end
* temp remove NMT test
* add training_step logic for T5 derived dynamic len models
* add NMT test back
* style fix
* change no_async_tensor_model_parallel_allreduce
* sequence_parallel_enabled -> sequence_parallel
* fix T5 FT batch size
* seq enabled
* T5 sequence length fix
* NMT mp fork to spawn
* make function signatures consistent across models
* make print log
* rm unused import
* update Dockerfile to install core
* keep core path in workspace

Signed-off-by: ericharper <complex451@gmail.com>
Signed-off-by: Abhinav Khattar <aklife97@gmail.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>
Co-authored-by: ericharper <complex451@gmail.com>
Co-authored-by: SeanNaren <snarenthiran@nvidia.com>
Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
What does this PR do?
This PR moves model parallelism in NeMo to use Megatron Core instead of Apex.
We still use Apex for the microbatch calculator, some enums/utils (both soon to be moved to core), and LayerNorm and other kernel-backed components.
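Concretely, `parallel_state` and `tensor_parallel` are now imported from `megatron.core` rather than Apex. A minimal sketch of what that looks like at the Megatron Core level (the parallel sizes are illustrative, and NeMo's trainer normally performs this initialization for you):

```python
import torch
from megatron.core import parallel_state, tensor_parallel

# torch.distributed must be initialized first (e.g. launched via torchrun).
torch.distributed.init_process_group(backend="nccl")

# Build the tensor-/pipeline-model-parallel process groups via Megatron Core.
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=2,
    pipeline_model_parallel_size=1,
)

# Seed the model-parallel RNG regions (also a megatron.core API now).
tensor_parallel.model_parallel_cuda_manual_seed(1234)

print("TP rank:", parallel_state.get_tensor_model_parallel_rank())
print("TP world size:", parallel_state.get_tensor_model_parallel_world_size())
```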
Collection: NLP
Changelog
Usage
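User-facing configuration is unchanged: parallelism is still set through the model config. A hypothetical launch command (script path and override names follow the NeMo examples tree; the sizes are illustrative):

```bash
# 4-way tensor parallel x 2-way pipeline parallel across 8 GPUs; the
# parallel groups are now created by megatron.core rather than Apex.
python examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    trainer.devices=8 \
    trainer.num_nodes=1 \
    model.tensor_model_parallel_size=4 \
    model.pipeline_model_parallel_size=2
```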
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines name specific people who can review PRs in various areas.
Additional Information