
Fix manager import with older pytorch (< 2.4.0) #905


Merged
merged 1 commit into NVIDIA:main from manager-typing-hotfix
May 22, 2025

Conversation

coreyjadams
Collaborator

Wraps DeviceMesh in quotes in the type hint, to protect older torch versions from compatibility issues.

(The function body was already guarded at runtime, but the type annotation referenced a type that doesn't exist in older torch, so the import itself failed.)
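This is the standard pattern for version-dependent type hints: quote the annotation so it is stored as an unevaluated string, and import the type only under `typing.TYPE_CHECKING` so type checkers still resolve it. A minimal sketch of the pattern, not the actual physicsnemo code (the function name and guard are illustrative):

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Evaluated only by static type checkers (mypy, pyright), never at
    # runtime -- so a torch build without DeviceMesh cannot break this import.
    from torch.distributed.device_mesh import DeviceMesh


def initialize_mesh(mesh_shape=None) -> "Optional[DeviceMesh]":
    """Hypothetical helper. The quoted return annotation is kept as a plain
    string, so torch.distributed.device_mesh is never touched when this
    module is imported on older torch (< 2.4.0)."""
    # Runtime guard, analogous to the protection the PR says the function
    # already had: fail with a clear error only when the function is called.
    try:
        from torch.distributed.device_mesh import DeviceMesh
    except ImportError as e:
        raise RuntimeError("DeviceMesh requires torch >= 2.4.0") from e
    ...
```

With the bare name `DeviceMesh` in the annotation, Python evaluates it at function-definition time and raises `NameError`/`AttributeError` on older torch; quoting defers that evaluation indefinitely for runtime while keeping the hint useful to tooling.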

PhysicsNeMo Pull Request

Description

closes #904

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

@coreyjadams
Collaborator Author

/blossom-ci

Collaborator

@peterdsharpe peterdsharpe left a comment


LGTM

@coreyjadams coreyjadams merged commit be4f507 into NVIDIA:main May 22, 2025
1 check passed
coreyjadams added a commit that referenced this pull request May 29, 2025
…buted applications (#906)

* Wrap DeviceMesh in quotes for typing hint, to protect older torch versions from compatibility issues. (#905)

* Bumps torch version to >=2.4.0 to minimize support surface for distributed applications.

* Adds changelog note

* Merge SongUNetPosLtEmb with SongUNetPosEmb and add support for batch>1 (#901)

* mult-gpu training supported corrdiff optimization

* enable mixed precision for val

* clean codebase for opt

* add amp_mode aware model architecture

* add None checking for params

* revise datatype casting schema

* Add test cases for corrdiff optimizations

Signed-off-by: Neal Pan <nuochengp@nvidia.com>

* revised from_checkpoint, update tests and CHANGELOG

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* Lint and format code properly

Signed-off-by: Neal Pan <nuochengp@nvidia.com>

* add multi-gpu optimization

* rebase changes and update tests and configs

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* merge ResidualLoss and refactored layer and Unet init based on PR review

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* Update layers.py with robust apex import

* address incompatibility between dynamo and patching, retain same optimization perf w torch.compile

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* update tests

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* update changelog

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* initialize global_index directly on device

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* formatting

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* fix loss arguments in train.py

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* merge songunetposembd with songuneyposltembd with index slicing (recompile issue persists)

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* fix small errors in songunet

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* revise positional_embedding_indexing to avoid recompile/graph break and with faster bw comparing to old version

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* update changelog

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* add back SongUNetPosLtEmbd class for better ckp loading

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* add forward in SongUnetLtPosEmbd and update train.py

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* update test for lt model

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* update comments for embedding_selector test for lt model

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* update doctest

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* Added tiny detail in corrdiff readme

Signed-off-by: Charlelie Laurent <claurent@nvidia.com>

* minor update to arguments and docstring

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

---------

Signed-off-by: Neal Pan <nuochengp@nvidia.com>
Signed-off-by: jialusui1102 <jialusui1102@gmail.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Co-authored-by: Alicia Sui <asui@cw-pdx-cs-001-vscode-01.cm.cluster>
Co-authored-by: Neal Pan <nuochengp@nvidia.com>
Co-authored-by: Charlelie Laurent <84199758+CharlelieLrt@users.noreply.github.com>
Co-authored-by: Charlelie Laurent <claurent@nvidia.com>

* Update CHANGELOG.md

Fix lint error

---------

Signed-off-by: Neal Pan <nuochengp@nvidia.com>
Signed-off-by: jialusui1102 <jialusui1102@gmail.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Co-authored-by: Corey adams <coreyjadams@gmail.com>
Co-authored-by: Jialu (Alicia) Sui <125910753+jialusui1102@users.noreply.github.com>
Co-authored-by: Alicia Sui <asui@cw-pdx-cs-001-vscode-01.cm.cluster>
Co-authored-by: Neal Pan <nuochengp@nvidia.com>
Co-authored-by: Charlelie Laurent <84199758+CharlelieLrt@users.noreply.github.com>
Co-authored-by: Charlelie Laurent <claurent@nvidia.com>
ktangsali pushed a commit that referenced this pull request May 29, 2025
@coreyjadams coreyjadams deleted the manager-typing-hotfix branch June 9, 2025 13:24
ktangsali pushed a commit that referenced this pull request Jun 10, 2025
ktangsali pushed a commit that referenced this pull request Jun 10, 2025
coreyjadams added a commit to coreyjadams/physicsnemo that referenced this pull request Aug 1, 2025

Successfully merging this pull request may close these issues.

🐛[BUG]: physicsnemo.distributed requires PyTorch >=2.2.0, but pyproject.toml indicates torch>=2.0.0.