
[c10d] switch ProcessGroup::Work to be managed by intrusive_ptr #44046

Closed · wants to merge 40 commits
Conversation

@wanchaol (Contributor) commented Sep 2, 2020

Stack from ghstack:

Differential Revision: D23632280

dr-ci bot commented Sep 2, 2020

💊 CI failures summary and remediations

As of commit 056232f (more details on the Dr. CI page):


  • 3/5 failures possibly introduced in this PR
    • 1/3 non-CircleCI failure(s)
  • 2/5 broken upstream at merge base 69532c4 since Nov 10

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1 (1/2)

Step: "Run tests"

Nov 11 03:37:36 ======================================================================
Nov 11 03:37:36 ERROR [2.722s]: test_reduce_sum_cuda_twice (__main__.TestDistBackendWithSpawn)
Nov 11 03:37:36 ----------------------------------------------------------------------
Nov 11 03:37:36 Traceback (most recent call last):
Nov 11 03:37:36   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 278, in wrapper
Nov 11 03:37:36     self._join_processes(fn)
Nov 11 03:37:36   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 395, in _join_processes
Nov 11 03:37:36     self._check_return_codes(elapsed_time)
Nov 11 03:37:36   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 431, in _check_return_codes
Nov 11 03:37:36     raise RuntimeError(error)
Nov 11 03:37:36 RuntimeError: Processes 0 1 exited with error code 10

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2 (2/2)

Step: "Run tests"

Nov 11 03:04:20 ERROR 2020-11-11T02:56:38Z: sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n ^\n" }
Nov 11 03:04:20 Traceback (most recent call last): 
Nov 11 03:04:20   File "test/run_test.py", line 867, in <module> 
Nov 11 03:04:20     main() 
Nov 11 03:04:20   File "test/run_test.py", line 850, in main 
Nov 11 03:04:20     raise RuntimeError(err_message) 
Nov 11 03:04:20 RuntimeError: distributed/test_distributed_fork failed! 
Nov 11 03:04:20 + cleanup 
Nov 11 03:04:20 + retcode=1 
Nov 11 03:04:20 + set +x 
Nov 11 03:04:20 =================== sccache compilation log =================== 
Nov 11 03:04:20 ERROR 2020-11-11T02:56:38Z: sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n                       ^\n" } 
Nov 11 03:04:20  
Nov 11 03:04:20 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Nov 11 03:04:20 Compile requests                    151 
Nov 11 03:04:20 Compile requests executed            60 
Nov 11 03:04:20 Cache hits                           34 
Nov 11 03:04:20 Cache hits (C/C++)                   34 
Nov 11 03:04:20 Cache misses                         25 
Nov 11 03:04:20 Cache misses (C/C++)                 25 
Nov 11 03:04:20 Cache timeouts                        0 
Nov 11 03:04:20 Cache read errors                     0 

🚧 2 ongoing upstream failures:

These were probably caused by upstream breakages that are not fixed yet:


ci.pytorch.org: 1 failed



wanchaol requested a review from mingzhe09088 as a code owner October 5, 2020 22:19
@@ -119,33 +119,33 @@ class ProcessGroup {
     return size_;
   }

-  virtual std::shared_ptr<ProcessGroup::Work> broadcast(
+  virtual c10::intrusive_ptr<ProcessGroup::Work> broadcast(
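In practice, the change swaps which smart pointer owns each in-flight Work object: c10::intrusive_ptr keeps its reference count inside the object (which requires the pointee to derive from c10::intrusive_ptr_target), rather than in a separate control block as std::shared_ptr does. A minimal sketch of the new ownership pattern follows; this Work is a stripped-down stand-in for the real ProcessGroup::Work, not the actual c10d class:

#include <c10/util/intrusive_ptr.h>

// Stand-in for ProcessGroup::Work. To be held by c10::intrusive_ptr,
// a class must derive from c10::intrusive_ptr_target, which embeds the
// reference count in the object itself.
class Work : public c10::intrusive_ptr_target {
 public:
  virtual bool isCompleted() {
    return completed_;
  }

 private:
  bool completed_ = false;
};

// Construction sites switch from std::make_shared to c10::make_intrusive.
c10::intrusive_ptr<Work> startWork() {
  return c10::make_intrusive<Work>();
}

Call sites are largely unaffected, since operator->, copying, and comparison behave the same for both pointer types; it is return types, member declarations, and construction sites that need updating.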
Contributor
This change is probably going to break other third-party backends like: https://github.com/intel/torch-ccl/blob/master/src/ProcessGroupCCL.hpp#L136 and https://github.com/openucx/torch-ucc/blob/master/include/torch_ucc.hpp#L77.

I'm guessing this is necessary for TorchScript and there is no way around it, so should we ask the third-party libraries to make this change as well? (we can probably file issues on those repos).

cc @agolynski Since this affects the c10d extension.
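For context on the TorchScript guess: TorchScript custom classes must derive from torch::CustomClassHolder (itself a c10::intrusive_ptr_target) and cross the script/C++ boundary as c10::intrusive_ptr, so exposing Work to TorchScript would indeed force this ownership change. A rough sketch with a hypothetical MyWork class, assuming the standard custom-class registration API:

#include <torch/custom_class.h>

// Hypothetical class for illustration only. TorchScript custom classes
// must derive from torch::CustomClassHolder, so registered instances are
// managed by c10::intrusive_ptr, never std::shared_ptr.
struct MyWork : torch::CustomClassHolder {
  bool completed = false;
  bool isCompleted() {
    return completed;
  }
};

// Registration: script code sees this class as an object held by
// c10::intrusive_ptr<MyWork>.
static auto registerMyWork =
    torch::class_<MyWork>("myns", "MyWork")
        .def(torch::init<>())
        .def("is_completed", &MyWork::isCompleted);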

Contributor Author
Yeah, agreed that we should ask them to make the changes. Do you have a list of third-party backends, or are these two the only ones currently using the c10d extension?

Contributor
The ones I am aware of are Intel and UCX. See:

  1. https://github.com/openucx/torch-ucc
  2. add c10d dynamic loading mechanism and unit test #28068

Could you please check with @agolynski? He might have more context.

Contributor
Hello @chengjunlu @mshiryaev @Sergei-Lebedev @srinivas212

This PR will break the master -> master dependency in torch-ccl and torch-ucc.

In the near future we'll be changing the ProcessGroup API, which will break these repos as well. Would it be okay if you depend on 1.6 (upgrading to 1.7 when it is released) rather than on master in the meantime?

@Sergei-Lebedev @srinivas212: How does torch-ucc depend on PyTorch? Do you require users to install PyTorch from the master branch?


Hi @agolynski @mrshenli @pritamdamania87 - after discussions with both the Torch-UCC and Torch-CCL teams, the near-term plan is to fix the issue in the third-party repos once this change lands.

In general, it is best that we keep third-party plugins in sync with master.

Contributor Author

Going to land the stack soon. I created openucx/torch-ucc#23 and intel/torch-ccl#11 so that torch-ucc and torch-ccl can do the API migration.
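For reference, the migration tracked in those issues should be largely mechanical. A hedged before/after sketch: ProcessGroupFoo and WorkFoo are made-up names standing in for the torch-ucc/torch-ccl classes, and a real backend overrides many more collectives than shown here.

#include <vector>

#include <c10/util/intrusive_ptr.h>
#include <c10d/ProcessGroup.hpp>

// Hypothetical third-party Work subclass (assumes Work stays
// default-constructible, as it was at the time of this PR).
class WorkFoo : public c10d::ProcessGroup::Work {};

class ProcessGroupFoo : public c10d::ProcessGroup {
 public:
  using c10d::ProcessGroup::ProcessGroup;

  // Before: std::shared_ptr<ProcessGroup::Work> broadcast(...) override
  c10::intrusive_ptr<ProcessGroup::Work> broadcast(
      std::vector<at::Tensor>& tensors,
      const c10d::BroadcastOptions& opts) override {
    // Before: return std::make_shared<WorkFoo>();
    return c10::make_intrusive<WorkFoo>();
  }

  // The remaining pure-virtual collectives (allreduce, allgather, reduce,
  // send/recv, etc.) need the same return-type change; they are omitted,
  // so this class stays abstract in the sketch.
};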

mrshenli requested a review from agolynski October 8, 2020 16:23
@facebook-github-bot (Contributor)

@wanchaol merged this pull request in 0650a61.

@mrshenli (Contributor)

Hey @wanchaol, looks like this PR breaks master, shall we revert?

Nov 11 03:04:20 ======================================================================
Nov 11 03:04:20 ERROR [2.020s]: test_all_gather_multigpu_complex (__main__.TestDistBackendWithFork)
Nov 11 03:04:20 ----------------------------------------------------------------------
Nov 11 03:04:20 Traceback (most recent call last):
Nov 11 03:04:20   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 278, in wrapper
Nov 11 03:04:20     self._join_processes(fn)
Nov 11 03:04:20   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 395, in _join_processes
Nov 11 03:04:20     self._check_return_codes(elapsed_time)
Nov 11 03:04:20   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 431, in _check_return_codes
Nov 11 03:04:20     raise RuntimeError(error)
Nov 11 03:04:20 RuntimeError: Processes 0 1 exited with error code 10
Nov 11 03:04:20 
Nov 11 03:04:20 ======================================================================
Nov 11 03:04:20 ERROR [1.920s]: test_reduce_sum_cuda (__main__.TestDistBackendWithFork)
Nov 11 03:04:20 ----------------------------------------------------------------------
Nov 11 03:04:20 Traceback (most recent call last):
Nov 11 03:04:20   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 278, in wrapper
Nov 11 03:04:20     self._join_processes(fn)
Nov 11 03:04:20   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 395, in _join_processes
Nov 11 03:04:20     self._check_return_codes(elapsed_time)
Nov 11 03:04:20   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 431, in _check_return_codes
Nov 11 03:04:20     raise RuntimeError(error)
Nov 11 03:04:20 RuntimeError: Processes 0 1 exited with error code 10

@wanchaol (Contributor Author)

> Hey @wanchaol, looks like this PR breaks master, shall we revert?

Yes, just reverted it. I will investigate and resubmit.

facebook-github-bot pushed a commit that referenced this pull request Nov 12, 2020
[c10d] switch ProcessGroup::Work to be managed by intrusive_ptr (#47806)

Summary:
Pull Request resolved: #47806

reland #44046

Test Plan: wait for ci

Reviewed By: gmagogsfm

Differential Revision: D24905245

fbshipit-source-id: ad75ace5432fcfd22d513878f5a73c4bb017324e
facebook-github-bot deleted the gh/wanchaol/128/head branch November 14, 2020 15:17