heter for collective #37613
Conversation
Thanks for your contribution!
This all looks like dygraph code; it can't support static graph mode, right?
@@ -176,6 +176,11 @@ void GLOOParallelContext::AllReduce(const framework::SelectedRows &src,
  }
}

void GLOOParallelContext::BroadCast(framework::Variable *src, int ring_id) {
Broadcast? "Broadcast" is a single word. Also, this interface has no implementation, so why add it?
@@ -47,6 +47,8 @@ class GLOOParallelContext : public ParallelContext {
      framework::Variable* dst, int ring_id,
      bool use_calc_stream) override;

  void BroadCast(framework::Variable* src, int ring_id) override;
- Same as above.
- Why does the gloo interface need a ring_id argument?
@@ -158,6 +158,29 @@ void HCCLParallelContext::AllReduceByStream(const framework::Variable &src,
  }
}

void HCCLParallelContext::BroadCast(framework::Variable *src, int ring_id) {
BroadCast -> Broadcast?
@@ -127,6 +135,20 @@ void NCCLParallelContext::AllReduceByStream(const framework::Variable &src,
  AllReduce(src, dst, strategy_, ring_id, use_calc_stream);
}

void NCCLParallelContext::BroadCast(framework::Variable *src, int ring_id) {
BroadCast -> Broadcast?
@@ -60,6 +60,8 @@ class NCCLParallelContext : public ParallelContext {
      framework::Variable* dst, int ring_id,
      bool use_calc_stream) override;

  void BroadCast(framework::Variable* src, int ring_id) override;
Same as above.
@@ -56,6 +56,8 @@ class ParallelContext {
      framework::Variable* dst, int ring_id,
      bool use_calc_stream) = 0;

  virtual void BroadCast(framework::Variable* src, int ring_id) = 0;
Same as above.
@@ -41,6 +42,9 @@ void Group::DivNRanks(const platform::DeviceContext &context, int64_t nranks) {
#if defined(PADDLE_WITH_NCCL) || defined(PADDLE_WITH_RCCL)
    DivNRanks(tensor, nranks, context);
#endif
  } else if (platform::is_npu_place(tensor->place())) {
    // TODO(kuizhiqing)
    VLOG(4) << "divnrank for npu not support yet";
Abort here instead of just logging?
LGTM
LGTM for const_cast
PR types
New features
PR changes
Others
Describe
Heterogeneous mixed training refers to training a model on heterogeneous hardware. Only dygraph mode is supported for now, and GPU/NPU/XPU are the target devices for this prototype work.
The basic idea is very similar to a hierarchical communication topology: the lower layer reduces data within each node, while the upper layer reduces across all nodes globally.
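To make the hierarchical idea concrete, below is a minimal, self-contained C++ sketch of the two-level reduce (illustrative only, not Paddle code; the node/device layout and all names are made up): values are reduced within each node first, then across the per-node results, and the global result is written back to every device. In the actual design, the intra-node step would presumably map to NCCL/HCCL and the inter-node step to gloo, per the description above.

```cpp
// Minimal illustration of the two-level (hierarchical) allreduce idea.
// Plain C++ only; not Paddle/NCCL/HCCL/gloo code. All values are made up.
#include <iostream>
#include <numeric>
#include <vector>

int main() {
  // 2 nodes, each with 3 devices, one value per device.
  std::vector<std::vector<double>> nodes = {{1, 2, 3}, {4, 5, 6}};

  // Level 1: reduce within each node (the intra-node collective).
  std::vector<double> node_sums;
  for (const auto& devices : nodes) {
    node_sums.push_back(std::accumulate(devices.begin(), devices.end(), 0.0));
  }

  // Level 2: reduce across the per-node results (the inter-node collective).
  double global_sum = std::accumulate(node_sums.begin(), node_sums.end(), 0.0);

  // Broadcast the global result back to every device within each node.
  for (auto& devices : nodes) {
    for (auto& value : devices) value = global_sum;
  }

  std::cout << "global sum = " << global_sum << std::endl;  // prints 21
  return 0;
}
```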