Fix crash on unload torch cpu dll #67632

Closed. @DBraun wants to merge 8 commits.

@DBraun DBraun commented Nov 1, 2021

Trying to rebase #61290 into latest pytorch:master

cc @EikanWang @jgong5 @wenzhe-nrv @sanchitintel

@pytorch-probot

pytorch-probot bot commented Nov 1, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/DBraun/pytorch/blob/ce15d35f55edb1be7f1f224bee0c7fbaea12bbab/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

| Workflows | Labels (bold enabled) | Status |
| --- | --- | --- |
| **Triggered Workflows** | | |
| linux-bionic-py3.6-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla | ✅ triggered |
| linux-vulkan-bionic-py3.6-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan | ✅ triggered |
| linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3-clang5-mobile-build | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile | ✅ triggered |
| linux-xenial-py3-clang5-mobile-custom-build-dynamic | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile | ✅ triggered |
| linux-xenial-py3-clang5-mobile-custom-build-static | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile | ✅ triggered |
| linux-xenial-py3.6-clang7-asan | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers | ✅ triggered |
| linux-xenial-py3.6-clang7-onnx | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx | ✅ triggered |
| linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-gcc7 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/win | ✅ triggered |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/win | ✅ triggered |
| **Skipped Workflows** | | |
| caffe2-linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux | 🚫 skipped |
| docker-builds | ciflow/all | 🚫 skipped |
| libtorch-linux-xenial-cuda10.2-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | 🚫 skipped |
| libtorch-linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | 🚫 skipped |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow | 🚫 skipped |
| linux-xenial-py3-clang5-mobile-code-analysis | ciflow/all, ciflow/linux, ciflow/mobile | 🚫 skipped |
| parallelnative-linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux | 🚫 skipped |
| periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck | 🚫 skipped |
| periodic-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-win-vs2019-cuda11.1-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@facebook-github-bot
Contributor

Hi @DBraun!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot
Contributor

facebook-github-bot commented Nov 1, 2021

✅ No Failures (0 Pending)

As of commit 61f3d60 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@facebook-github-bot facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Nov 1, 2021
@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@DBraun
Author

DBraun commented Nov 2, 2021

Hi @ezyang, could you please help me get the Windows libtorch binaries? It looks like that kind of job wasn't triggered: https://app.circleci.com/pipelines/github/pytorch/pytorch?branch=pull%2F67632

@DBraun
Author

DBraun commented Nov 4, 2021

Hi @jamesr66a, since you reviewed a related PR, can you take a look at this too?

Collaborator

@jamesr66a jamesr66a left a comment

Custom class related changes look fine to me. Adding @ilia-cher and @ZolotukhinM as reviewers for the profiler and tensorexpr changes respectively

@ilia-cher ilia-cher requested review from gdankel and removed request for ilia-cher November 5, 2021 02:14
@DBraun
Author

DBraun commented Nov 12, 2021

Hi @ZolotukhinM and @gdankel, would it be possible to get the libtorch binaries for this PR, the same kind I would otherwise get by selecting libtorch on pytorch.org? For me, trying out the binaries is more urgent than getting this PR merged. Thank you.

@ZolotukhinM

Fuser and tensorexpr changes look good to me as well!

> would it be possible to get the libtorch binaries for this PR

You should be able to build the PyTorch libraries from source: https://github.com/pytorch/pytorch#from-source

@DBraun
Author

DBraun commented Nov 12, 2021

Thanks for reviewing and your suggestion. I was hoping to rely on cloud infrastructure to get the binaries but I'll weigh this against compiling it locally.

@ZolotukhinM

Ah, I see what you meant. For each PR we run lots of CI jobs and build pytorch in different configurations in the process. Maybe we can download build artifacts somehow, but I'm not sure. @malfet do you know if it's possible?

@malfet
Contributor

malfet commented Nov 12, 2021

@DBraun binary artifacts should be available for this PR from hud.pytorch.org/pr/67632

@@ -402,6 +402,23 @@ class class_ : public ::torch::detail::class_base {
    registerCustomClassMethod(std::move(method));
    return method_val;
  }

  // Wrapper function to force method deregistration on shutdown.
  static void registerCustomClassMethod(std::unique_ptr<jit::Function> method) {
Contributor

Author

Just letting you know, I'm not the original author of this code, so feel free to edit as necessary.
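
For readers unfamiliar with the idea in the hunk above, here is a minimal sketch of "deregistration on shutdown". It is not the code from this PR: `Method`, `methodRegistry`, `ScopedMethodRegistrations`, and `registerMethodWithCleanup` are hypothetical names invented for illustration. The sketch assumes the registry is a process-wide map, and that entries added by a module should be erased when that module's static objects are destroyed.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a registered method object.
struct Method {
  std::string name;
};

// Hypothetical process-wide registry, standing in for state owned by a core library.
inline std::map<std::string, std::unique_ptr<Method>>& methodRegistry() {
  static std::map<std::string, std::unique_ptr<Method>> registry;
  return registry;
}

// Remembers every name registered through it and erases those entries
// from the registry when it is destroyed.
class ScopedMethodRegistrations {
 public:
  ScopedMethodRegistrations() = default;
  ScopedMethodRegistrations(const ScopedMethodRegistrations&) = delete;
  ScopedMethodRegistrations& operator=(const ScopedMethodRegistrations&) = delete;

  void add(std::unique_ptr<Method> method) {
    assert(method != nullptr);
    names_.push_back(method->name);
    methodRegistry()[names_.back()] = std::move(method);
  }

  ~ScopedMethodRegistrations() {
    for (const auto& name : names_) {
      methodRegistry().erase(name);
    }
  }

 private:
  std::vector<std::string> names_;
};

// Wrapper: registrations go through a function-local static whose
// destructor runs at shutdown (or when the owning library is unloaded).
inline void registerMethodWithCleanup(std::unique_ptr<Method> method) {
  // Construct the registry first so that it outlives `registrations`:
  // function-local statics are destroyed in reverse order of construction.
  methodRegistry();
  static ScopedMethodRegistrations registrations;
  registrations.add(std::move(method));
}
```

The ordering comment matters: if the tracker were constructed before the registry, its destructor would touch a registry that had already been destroyed.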

@malfet
Contributor

malfet commented Nov 12, 2021

Can you please add a unit test that would crash before this change is introduced, but work fine afterwards?
Also, please note that during DLL unloading the Windows runtime does not call destructors; see the comment in

// Do not wait for termination of global threads on Windows
// Because CRT terminates DLL threads before calling
// global object destructors
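
As an illustration of the concern in the quoted comment, here is a minimal sketch of a destructor that waits for its thread everywhere except on Windows. It is not PyTorch's code: `BackgroundWorker` and its members are hypothetical, and the Windows branch encodes an assumption taken from the comment above, namely that the thread may already have been terminated by the time a global object's destructor runs.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <memory>
#include <thread>

// Hypothetical background worker that could be owned by a global object
// inside a shared library.
class BackgroundWorker {
 public:
  BackgroundWorker() : state_(std::make_shared<State>()) {
    // The thread holds its own reference to the shared state, so it never
    // touches members of a BackgroundWorker that has been destroyed.
    thread_ = std::thread([s = state_] {
      while (!s->stop.load()) {
        s->ticks.fetch_add(1);
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
      }
    });
  }

  BackgroundWorker(const BackgroundWorker&) = delete;
  BackgroundWorker& operator=(const BackgroundWorker&) = delete;

  ~BackgroundWorker() {
    assert(thread_.joinable());
    state_->stop.store(true);
#if defined(_WIN32)
    // Assumption for this sketch: during DLL unload the thread may already
    // be gone, so do not wait for it.
    thread_.detach();
#else
    thread_.join();
#endif
  }

  int ticks() const { return state_->ticks.load(); }

 private:
  struct State {
    std::atomic<bool> stop{false};
    std::atomic<int> ticks{0};
  };
  std::shared_ptr<State> state_;
  std::thread thread_;
};
```

Keeping the flag and counter in a `shared_ptr` is a design choice for the sketch: a detached thread that is still running can finish its loop without reading freed memory.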

@DBraun
Author

DBraun commented Nov 12, 2021

It's pretty far out of my expertise, so I'm afraid I can't help much at this point.

Update: The binaries that I downloaded fixed the crash I was experiencing. Thanks again! I hope it can be merged into the main branch sometime.

@albanD albanD added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Dec 10, 2021
@ezyang ezyang removed the request for review from gdankel January 6, 2022 02:42
@osalpekar
Member

I'm seeing a large number of test binaries crashing after this change. They seem to be primarily those using ASAN/TSAN, but also a handful that don't use such sanitizers:

Test appears to have passed but the binary exited with non-zero exit code -11.
This usually means something has crashed after the test was done.
Unfiltered output from the test binary:

Running main() from gmock_main.cc
Note: Google Test filter = ...
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from .../1, where TypeParam = float
[ RUN      ] .../1.NoIntersection
[       OK ] .../1.NoIntersection (0 ms)
[----------] 1 test from .../1 (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[  PASSED  ] 1 test.

Can somebody take a closer look? Meta link here: D38306879. cc @malfet @ezyang @jamesr66a

@ezyang
Contributor

ezyang commented Aug 2, 2022

@pytorchbot revert -c ghfirst "crashing in fbcode"

@pytorch-bot

pytorch-bot bot commented Aug 2, 2022

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -m/--message

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@ezyang
Contributor

ezyang commented Aug 2, 2022

@pytorchbot revert -c ghfirst -m "crashing in fbcode"

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here

@pytorchmergebot
Collaborator

@DBraun your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Aug 2, 2022
This reverts commit a54c9a4.

Reverted #67632 on behalf of https://github.com/ezyang due to crashing in fbcode
@malfet
Contributor

malfet commented Aug 2, 2022

I propose we revert and investigate in a follow-up PR... Also, a few suggestions for the follow-up PR:

@ezyang
Contributor

ezyang commented Aug 2, 2022

I thought the original PR's description was pretty good

@malfet
Contributor

malfet commented Aug 2, 2022

@ezyang true, but the original PR was never landed, was it?

jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Trying to rebase pytorch/pytorch#61290 into latest pytorch:master
Pull Request resolved: pytorch/pytorch#67632
Approved by: https://github.com/ezyang
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
Trying to rebase pytorch/pytorch#61290 into latest pytorch:master
Pull Request resolved: pytorch/pytorch#67632
Approved by: https://github.com/ezyang
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
@omid-ek

omid-ek commented Feb 24, 2023

This issue still persists in the latest release.
The crash can be reproduced with this simple snippet:

HMODULE h1 = LoadLibrary(L"torch_cpu.dll");
HMODULE h2 = LoadLibrary(L"torch_cuda_cu.dll");
FreeLibrary(h2);
FreeLibrary(h1);
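
A slightly fuller version of the snippet above, as a hedged sketch: it adds error checks so that a DLL that simply fails to load is not mistaken for the crash being reported. The function name `loadAndUnloadTorch` and its return codes are made up for illustration; the DLL names are the ones from the snippet, and the non-Windows branch exists only so the file compiles elsewhere.

```cpp
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

// Returns 0 if both libraries were loaded and then unloaded, 1 if a load
// failed, and -1 on platforms other than Windows.
int loadAndUnloadTorch() {
#ifdef _WIN32
  HMODULE cpu = LoadLibraryW(L"torch_cpu.dll");
  if (cpu == nullptr) {
    std::fprintf(stderr, "torch_cpu.dll failed to load (error %lu)\n", GetLastError());
    return 1;
  }
  HMODULE cuda = LoadLibraryW(L"torch_cuda_cu.dll");
  if (cuda == nullptr) {
    std::fprintf(stderr, "torch_cuda_cu.dll failed to load (error %lu)\n", GetLastError());
    FreeLibrary(cpu);
    return 1;
  }
  // Unload in reverse order of loading, as in the snippet above.
  FreeLibrary(cuda);
  FreeLibrary(cpu);
  return 0;
#else
  std::fprintf(stderr, "This repro only applies to Windows.\n");
  return -1;
#endif
}
```

Calling `loadAndUnloadTorch()` from `main` with the libtorch DLLs on the search path should exercise the same load and unload sequence as the original four lines.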

@ezyang ezyang reopened this Feb 24, 2023
@linux-foundation-easycla

CLA Missing ID CLA Not Signed

@github-actions github-actions bot added the NNC label Feb 24, 2023
@pytorch-bot pytorch-bot bot added the release notes: jit release notes category label Feb 24, 2023
@ezyang ezyang added the ezyang's list Stuff ezyang doesn't want to lose label Feb 25, 2023
@omid-ek

omid-ek commented Feb 27, 2023

Update: I built PyTorch with these changes applied locally, and the issue was still not resolved.
How can I download the check builds for this PR to test them? When I click on one of the checks all I see is "The logs for this run have expired and are no longer available."

@ezyang
Contributor

ezyang commented Feb 27, 2023

rebase the PR onto master and open a new PR

@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Jun 16, 2023
@ezyang ezyang removed the Stale label Jun 16, 2023
@github-actions github-actions bot added the Stale label Aug 15, 2023
@github-actions github-actions bot closed this Sep 14, 2023
Labels
cla signed ezyang's list Stuff ezyang doesn't want to lose Merged NNC oncall: jit Add this issue/PR to JIT oncall triage queue open source release notes: jit release notes category Reverted Stale triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module