Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix breaking changes for ONNX Runtime Training #122000

Closed

Conversation

ajindal1
Copy link
Contributor

@ajindal1 ajindal1 commented Mar 15, 2024

Fixes breaking changes for ONNX Runtime Training.

PR #121102 introduced incompatibility with ORT training because of change in parameter type. Creating a PR to add previous parameter types and verified that it works with ORT training.

Error with current scenario:

site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/aten_op_executor/aten_op_executor.cc:60:40: error: invalid conversion from ‘const DLManagedTensor*’ to ‘DLManagedTensor*’ [-fpermissive]
at::Tensor tensor = at::fromDLPack(dlpack);

site-packages/torch/include/ATen/DLConvertor.h:15:46: note:   initializing argument 1 of ‘at::Tensor at::fromDLPack(DLManagedTensor*)’
TORCH_API Tensor fromDLPack(DLManagedTensor* src);

Copy link

pytorch-bot bot commented Mar 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122000

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 22aef51 with merge base dfc5e93 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@ajindal1
Copy link
Contributor Author

@cyyever @Skylion007 any thoughts on this?

@cyyever
Copy link
Collaborator

cyyever commented Mar 16, 2024

@ajindal1 LGTM, hope that tests pass.

Copy link
Contributor

@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do the need a duplicate implementation rather than do a const_cast in the header
(But also mark them deprecated so that there will be no use of such functions in Torch

Comment on lines 302 to 308
auto deleter = [src](void* self) {
if (src->deleter) {
// NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
src->deleter(const_cast<DLManagedTensor*>(src));
}
};
return fromDLPack(src, std::move(deleter));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not copy the code, just cost cast a top level wrapper

Suggested change
auto deleter = [src](void* self) {
if (src->deleter) {
// NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
src->deleter(const_cast<DLManagedTensor*>(src));
}
};
return fromDLPack(src, std::move(deleter));
// NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
return from DLPack(const_cast<DLManagedTensor*>(src));

@@ -13,8 +13,11 @@ namespace at {
TORCH_API ScalarType toScalarType(const DLDataType& dtype);
TORCH_API DLManagedTensor* toDLPack(const Tensor& src);
TORCH_API Tensor fromDLPack(DLManagedTensor* src);
TORCH_API Tensor fromDLPack(const DLManagedTensor* src);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As those functions are semantically incorrect, let's mark them as deprecated an implement as headers to casts to a const variants

Suggested change
TORCH_API Tensor fromDLPack(const DLManagedTensor* src);
C10_DEPRECATED_MESSAGE("Please migrate to a non-const variant")
inline Tensor fromDLPack(const DLManagedTensor* src) { return fromDLPack(const_cast< DLManagedTensor*>(src); }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I have modified just the header file as you suggested, which is minimal changes and makes sense to me. I have started a run on my end to verify it works. Will confirm it by EOD.

@bdhirsh bdhirsh added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Mar 18, 2024
@cyyever
Copy link
Collaborator

cyyever commented Mar 19, 2024

@ajindal1 The parameter is const_casted in the function body so that it shouldn't be taken as const. Is it possible to patch aten_op_executor.cc to use the non-const version?

@ajindal1
Copy link
Contributor Author

@cyyever I don't completely understand your comment, can you please add more information. I am also unable to locate aten_op_executor.cc file you mentioned.

@wschin
Copy link
Collaborator

wschin commented Mar 19, 2024

@cyyever I don't completely understand your comment, can you please add more information. I am also unable to locate aten_op_executor.cc file you mentioned.

I guess this mean to call const_cast inside aten_op_executor.cc to match whatever the new behavior in PyTorch C++. Not sure if it's doable. On the other hand, I'd prefer PyTorch to maintain BC whenever possible.

@ajindal1 ajindal1 requested a review from malfet March 20, 2024 01:05
@ajindal1
Copy link
Contributor Author

ajindal1 commented Mar 20, 2024

@cyyever we have also added the const cast change in onnxruntime repository. However, for backward compatibility we will need this change here as well.

@cyyever
Copy link
Collaborator

cyyever commented Mar 20, 2024

@ajindal1 Thank you!

@ajindal1
Copy link
Contributor Author

@malfet does everything look good to you now?

@malfet malfet added this to the 2.3.0 milestone Mar 21, 2024
@malfet
Copy link
Contributor

malfet commented Mar 21, 2024

does everything look good to you now?

@ajindal1 if CI is green, yes, but I guess you'll need to add the clang-tidy suppression comment on top of the change, but otherwise LGTM

aten/src/ATen/DLConvertor.h Outdated Show resolved Hide resolved
@malfet
Copy link
Contributor

malfet commented Mar 21, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 21, 2024
@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team Raised by workflow job

@malfet malfet added the topic: not user facing topic category label Mar 21, 2024
@malfet
Copy link
Contributor

malfet commented Mar 21, 2024

@pytorchbot merge -i

@malfet malfet added topic: bug fixes topic category release notes: cpp release notes category and removed topic: not user facing topic category labels Mar 21, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: trunk / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-stable)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: .github/workflows/trunk.yml / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-stable)

Details for Dev Infra team Raised by workflow job

@malfet
Copy link
Contributor

malfet commented Mar 21, 2024

@pytorchbbot merge -if "I've asked to ignore that failure, don't I?"

@malfet
Copy link
Contributor

malfet commented Mar 21, 2024

@pytorchbot merge -f "I've asked to ignore that failure, don't I?"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@malfet
Copy link
Contributor

malfet commented Apr 3, 2024

@pytorchbot cherry-pick

Copy link

pytorch-bot bot commented Apr 3, 2024

❌ 🤖 pytorchbot command failed:

@pytorchbot cherry-pick: error: the following arguments are required: --onto, -c/--classification

usage: @pytorchbot cherry-pick --onto ONTO [--fixes FIXES] -c
                               {regression,critical,fixnewfeature,docs,release}

Try @pytorchbot --help for more info.

@malfet
Copy link
Contributor

malfet commented Apr 3, 2024

@pytorchbot cherry-pick --onto release/2.3 --fixes "foobar" -c regression

pytorchbot pushed a commit that referenced this pull request Apr 3, 2024
Fixes breaking changes for ONNX Runtime Training.

PR #121102 introduced incompatibility with ORT training because of change in parameter type. Creating a PR to add previous parameter types and verified that it works with ORT training.

Error with current scenario:

```
site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/aten_op_executor/aten_op_executor.cc:60:40: error: invalid conversion from ‘const DLManagedTensor*’ to ‘DLManagedTensor*’ [-fpermissive]
at::Tensor tensor = at::fromDLPack(dlpack);

site-packages/torch/include/ATen/DLConvertor.h:15:46: note:   initializing argument 1 of ‘at::Tensor at::fromDLPack(DLManagedTensor*)’
TORCH_API Tensor fromDLPack(DLManagedTensor* src);
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: #122000
Approved by: https://github.com/malfet

(cherry picked from commit 765c3fc)
@pytorchbot
Copy link
Collaborator

Cherry picking #122000

The cherry pick PR is at #123271 and it is linked with issue foobar

Details for Dev Infra team Raised by workflow job

atalman pushed a commit that referenced this pull request Apr 3, 2024
Fixes breaking changes for ONNX Runtime Training.

PR #121102 introduced incompatibility with ORT training because of change in parameter type. Creating a PR to add previous parameter types and verified that it works with ORT training.

Error with current scenario:

```
site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/aten_op_executor/aten_op_executor.cc:60:40: error: invalid conversion from ‘const DLManagedTensor*’ to ‘DLManagedTensor*’ [-fpermissive]
at::Tensor tensor = at::fromDLPack(dlpack);

site-packages/torch/include/ATen/DLConvertor.h:15:46: note:   initializing argument 1 of ‘at::Tensor at::fromDLPack(DLManagedTensor*)’
TORCH_API Tensor fromDLPack(DLManagedTensor* src);
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: #122000
Approved by: https://github.com/malfet

(cherry picked from commit 765c3fc)

Co-authored-by: Abhishek Jindal <abjindal@microsoft.com>
pytorch-bot bot pushed a commit that referenced this pull request Apr 22, 2024
Fixes breaking changes for ONNX Runtime Training.

PR #121102 introduced incompatibility with ORT training because of change in parameter type. Creating a PR to add previous parameter types and verified that it works with ORT training.

Error with current scenario:

```
site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/aten_op_executor/aten_op_executor.cc:60:40: error: invalid conversion from ‘const DLManagedTensor*’ to ‘DLManagedTensor*’ [-fpermissive]
at::Tensor tensor = at::fromDLPack(dlpack);

site-packages/torch/include/ATen/DLConvertor.h:15:46: note:   initializing argument 1 of ‘at::Tensor at::fromDLPack(DLManagedTensor*)’
TORCH_API Tensor fromDLPack(DLManagedTensor* src);
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: #122000
Approved by: https://github.com/malfet
@ajindal1
Copy link
Contributor Author

Validated with 2.3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged open source release notes: cpp release notes category topic: bug fixes topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants