Fix manager import with older pytorch (< 2.4.0) #905
Merged
Conversation
Wrap DeviceMesh in quotes for typing hint, to protect older torch versions from compatibility issues.
/blossom-ci
peterdsharpe approved these changes on May 22, 2025
LGTM
coreyjadams added a commit that referenced this pull request on May 29, 2025

Bumps torch version to >=2.4.0 to minimize support surface for distributed applications (#906)

Squashed commit message:
* Wrap DeviceMesh in quotes for typing hint, to protect older torch versions (#905) from compatibility issues.
* Bumps torch version to >=2.4.0 to minimize support surface for distributed applications.
* Adds changelog note.
* Merge SongUNetPosLtEmb with SongUNetPosEmb and add support for batch>1 (#901): multi-GPU training for the CorrDiff optimization, mixed precision for validation, a cleaned-up codebase, an amp_mode-aware model architecture, None checking for params, a revised datatype casting schema, and test cases for the CorrDiff optimizations.
* Revised from_checkpoint; updated tests, configs, and CHANGELOG; linted and formatted code.
* Merged ResidualLoss and refactored layer and UNet init based on PR review; updated layers.py with a robust apex import.
* Addressed the incompatibility between dynamo and patching while retaining the same optimization performance with torch.compile.
* Initialized global_index directly on device; fixed loss arguments in train.py.
* Merged SongUNetPosEmbd with SongUNetPosLtEmbd using index slicing; revised positional_embedding_indexing to avoid recompiles/graph breaks, with a faster backward pass than the old version.
* Added back the SongUNetPosLtEmbd class (with its own forward) for better checkpoint loading; updated train.py, the tests and doctests for the lt model, and the embedding_selector test comments.
* Added a small detail to the CorrDiff readme; minor updates to arguments and docstrings.
* Update CHANGELOG.md to fix a lint error.

Signed-off-by: Neal Pan <nuochengp@nvidia.com>, jialusui1102 <jialusui1102@gmail.com>, Charlelie Laurent <claurent@nvidia.com>
Co-authored-by: Corey Adams <coreyjadams@gmail.com>, Jialu (Alicia) Sui <125910753+jialusui1102@users.noreply.github.com>, Alicia Sui <asui@cw-pdx-cs-001-vscode-01.cm.cluster>, Neal Pan <nuochengp@nvidia.com>, Charlelie Laurent <84199758+CharlelieLrt@users.noreply.github.com>, Charlelie Laurent <claurent@nvidia.com>
ktangsali pushed a commit that referenced this pull request on May 29, 2025

Wrap DeviceMesh in quotes for typing hint, to protect older torch versions (#905) from compatibility issues.
ktangsali pushed a commit that referenced this pull request on Jun 10, 2025

Bumps torch version to >=2.4.0 to minimize support surface for distributed applications (#906) — same squashed commit message as above.

ktangsali pushed a commit that referenced this pull request on Jun 10, 2025

Bumps torch version to >=2.4.0 to minimize support surface for distributed applications (#906) — same squashed commit message as above.
coreyjadams added a commit to coreyjadams/physicsnemo that referenced this pull request on Aug 1, 2025

Bumps torch version to >=2.4.0 to minimize support surface for distributed applications (NVIDIA#906) — same squashed commit message as above, with NVIDIA#905 and NVIDIA#901 cross-repository references.
Wrap DeviceMesh in quotes for typing hint, to protect older torch versions from compatibility issues.
(The function itself is already protected, but its type annotation used a type that does not exist in older torch releases, so the annotation alone broke the import.)
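For illustration, a minimal sketch of the quoted-annotation pattern described above. The class and method names here are hypothetical stand-ins, not the actual PhysicsNeMo manager code; the point is that the DeviceMesh import only runs for static type checkers, and the quoted annotations stay plain strings at import time, so the module still imports cleanly on torch < 2.4.0.

```python
# Minimal sketch (hypothetical names, not the real PhysicsNeMo module):
# defer the DeviceMesh import to type-checking time and quote the
# annotations, so importing this module never touches
# torch.distributed.device_mesh, which is missing in older torch builds.
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Evaluated only by static type checkers (mypy, pyright), never at runtime.
    from torch.distributed.device_mesh import DeviceMesh


class Manager:
    """Hypothetical stand-in for a distributed manager class."""

    def __init__(self) -> None:
        # The quoted annotation is just a string at runtime.
        self._mesh: "Optional[DeviceMesh]" = None

    def set_mesh(self, mesh: "Optional[DeviceMesh]" = None) -> None:
        # No DeviceMesh lookup happens here unless a type checker resolves it.
        self._mesh = mesh
```

An alternative with the same effect would be `from __future__ import annotations`, which makes every annotation in the module a lazy string; quoting only the affected annotations is the narrower, per-annotation version of that.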
PhysicsNeMo Pull Request
Description
closes #904
Checklist
Dependencies