
Distributed import: no bare cuda imports #914


Open: wants to merge 6 commits into base branch 1.1.0-rc

coreyjadams
Collaborator

This is meant to prevent torch.cuda imports when CUDA is not available.
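The PR text does not show the actual diff, so the following is only a minimal sketch of one common way to avoid a bare CUDA import: check availability first, then import the CUDA-dependent module lazily. The helper names guarded_import and torch_cuda_available are hypothetical and are not PhysicsNeMo APIs. The only real calls used are importlib.util.find_spec, importlib.import_module, and torch.cuda.is_available.

```python
import importlib
import importlib.util
from types import ModuleType
from typing import Callable, Optional


def guarded_import(
    module_name: str, is_available: Callable[[], bool]
) -> Optional[ModuleType]:
    """Hypothetical helper: import module_name only if is_available() is True.

    Returns None instead of raising when the backend is unavailable or the
    module (or one of its parent packages) cannot be found.
    """
    if not is_available():
        return None
    try:
        # find_spec imports the parent packages of dotted names, which can raise.
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        return None
    if spec is None:
        return None
    return importlib.import_module(module_name)


def torch_cuda_available() -> bool:
    """Hypothetical check: True only if torch is installed and sees a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # deferred so environments without torch never import it

    return torch.cuda.is_available()
```

On a CPU-only machine, a call such as guarded_import("torch.cuda.nvtx", torch_cuda_available) returns None, so the caller can skip GPU-only code paths instead of failing at import time. Passing the availability check as a callable keeps the guard testable without a GPU.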

PhysicsNeMo Pull Request

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

coreyjadams and others added 6 commits May 22, 2025 10:10
NVIDIA#901)

* multi-GPU training support for corrdiff optimization

* enable mixed precision for val

* clean codebase for opt

* add amp_mode aware model architecture

* add None checking for params

* revise datatype casting schema

* Add test cases for corrdiff optimizations

Signed-off-by: Neal Pan <nuochengp@nvidia.com>

* revised from_checkpoint, update tests and CHANGELOG

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* Lint and format code properly

Signed-off-by: Neal Pan <nuochengp@nvidia.com>

* add multi-gpu optimization

* rebase changes and update tests and configs

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* merge ResidualLoss and refactored layer and Unet init based on PR review

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* Update layers.py with robust apex import

* address incompatibility between dynamo and patching, retain the same optimization performance with torch.compile

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* update tests

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* update changelog

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* initialize global_index directly on device

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* formatting

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* fix loss arguments in train.py

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* merge songunetposembd with songunetposltembd with index slicing (recompile issue persists)

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* fix small errors in songunet

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* revise positional_embedding_indexing to avoid recompile/graph break, with faster bw compared to the old version

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* update changelog

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* add back SongUNetPosLtEmbd class for better checkpoint loading

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* add forward in SongUnetLtPosEmbd and update train.py

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* update test for lt model

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* update comments for embedding_selector test for lt model

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* update doctest

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* Added tiny detail in corrdiff readme

Signed-off-by: Charlelie Laurent <claurent@nvidia.com>

* minor update to arguments and docstring

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

---------

Signed-off-by: Neal Pan <nuochengp@nvidia.com>
Signed-off-by: jialusui1102 <jialusui1102@gmail.com>
Signed-off-by: Charlelie Laurent <claurent@nvidia.com>
Co-authored-by: Alicia Sui <asui@cw-pdx-cs-001-vscode-01.cm.cluster>
Co-authored-by: Neal Pan <nuochengp@nvidia.com>
Co-authored-by: Charlelie Laurent <84199758+CharlelieLrt@users.noreply.github.com>
Co-authored-by: Charlelie Laurent <claurent@nvidia.com>

* update lr_decay_rate to be configurable

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

* update lr_decay_rate comment

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>

---------

Signed-off-by: jialusui1102 <jialusui1102@gmail.com>