Support tp pp conversion #6218
Signed-off-by: smajumdar <titu1994@gmail.com>
pretty cool PR. just have one minor comment.
examples/nlp/language_modeling/megatron_change_num_partitions_pp.py
LGTM. Thanks!
LGTM! just a couple of minor comments, everything else looks great!
Also, can we add a PP change CI test as well? It would be helpful to keep testing that, since the PR brings in global overrides that may cause issues if someone changes them.
Good point about the Jenkins test. I updated the old one, which reduced and increased TP, so that it also jointly increases PP by an even or odd number.
LGTM, thanks!
* Add required flags to partially load model
* Add cleaned up script for tp pp change
* Add cleaned up script for tp pp change
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Add support to change parameter dtypes during conversion
* Add Debug Prints flag
* Improve error logs
* Fix issues with TP > 1 for Megatron T5
* Finalize splitting of T5 models
* Update docstrings
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Finalize pp tp change for T5 models
* Fix CodeQL issue
* Fix dtype cast of num_gpu_per_node
* Update config
* Remove block for config checks
* Reduce shared embedding check for older configs
* Add support for extracted directory path
* Force CPU init for TP 1 PP 1 temp model
* Patch T5 models to init fully on CPU
* Update docstring
* Update docstring
* Update prints to logging
* Patch apex code
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Patch typo
* Fix import test of ModelType
* Add docstring comment for nlp override
* Merge new file with old file
* Update script call signature
* Remove comments
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Update jenkins test
* Fix formatting
* Add open_dict hooks
* Fix unit test
* Fix unit test
* Retry in another directory
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Revert second test cause of shutil.rename error on CI

---------

Signed-off-by: smajumdar <titu1994@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adi Renduchintala <108822655+arendu@users.noreply.github.com>
Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>
What does this PR do?
Adds support for changing the pipeline parallel size of GPT models after construction.
Collection: [Core, NLP]
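For readers unfamiliar with what a TP/PP conversion involves, here is a minimal conceptual sketch in plain Python. It is not the code from this PR: the function names (`split_rows`, `merge_rows`, `assign_layers`) are hypothetical, weights are modeled as nested lists, and it assumes every parameter is split along its first dimension and that layers divide evenly across pipeline stages. In a real model, the split dimension depends on the layer type.

```python
from typing import List

Matrix = List[List[float]]


def split_rows(weight: Matrix, tp_size: int) -> List[Matrix]:
    """Split a weight matrix row-wise into tp_size equal shards (one per TP rank)."""
    if len(weight) % tp_size != 0:
        raise ValueError("row count must be divisible by tp_size")
    rows_per_shard = len(weight) // tp_size
    return [
        weight[rank * rows_per_shard:(rank + 1) * rows_per_shard]
        for rank in range(tp_size)
    ]


def merge_rows(shards: List[Matrix]) -> Matrix:
    """Inverse of split_rows: concatenate the shards back into one matrix."""
    return [row for shard in shards for row in shard]


def assign_layers(num_layers: int, pp_size: int) -> List[List[int]]:
    """Assign contiguous, equal-sized blocks of layer indices to pipeline stages."""
    if num_layers % pp_size != 0:
        raise ValueError("num_layers must be divisible by pp_size")
    per_stage = num_layers // pp_size
    return [
        list(range(stage * per_stage, (stage + 1) * per_stage))
        for stage in range(pp_size)
    ]


if __name__ == "__main__":
    # Hypothetical 4x2 weight currently stored as two TP shards.
    full = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    old_shards = split_rows(full, tp_size=2)

    # Changing TP 2 -> 4: merge to the full tensor, then re-split.
    new_shards = split_rows(merge_rows(old_shards), tp_size=4)
    print(len(new_shards), "TP shards after conversion")

    # Changing PP 1 -> 2 for an 8-layer model: each stage owns a block of layers.
    print(assign_layers(num_layers=8, pp_size=2))
```

Merging to a full (TP 1, PP 1) representation before re-splitting keeps the logic simple at the cost of holding the whole tensor in memory, which is presumably why the commit list mentions forcing CPU initialization for the temporary TP 1 PP 1 model.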
Changelog
Usage