Add no_sync in data parallel for dynamic graph #34740
Conversation
Thanks for your contribution!
@@ -576,9 +578,19 @@ def _find_varbase(self, obj):
            return itertools.chain(*map(self._find_varbase, obj.values()))
        return []

    @contextmanager
    def no_sync(self):
Please add documentation: a description of the feature, the API interface, and usage.
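A minimal sketch of how such a context manager is usually completed (only the decorator and the signature appear in the diff above; the class, the internal flag name grad_need_sync, and the save/restore pattern are assumptions, not the PR's exact code):

```python
from contextlib import contextmanager

class SyncSwitchSketch:
    def __init__(self):
        self.grad_need_sync = True  # assumed internal flag name

    @contextmanager
    def no_sync(self):
        old_flag = self.grad_need_sync
        self.grad_need_sync = False  # gradient hooks skip allreduce while False
        try:
            yield  # user code runs without inter-card gradient sync
        finally:
            self.grad_need_sync = old_flag  # restored even if the block raises
```

The try/finally guarantees synchronization is re-enabled even when the user's code raises inside the with block.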
@@ -527,6 +527,7 @@ void Reducer::TraverseBackwardGraph(
void Reducer::PrepareForBackward(
    const std::vector<std::shared_ptr<imperative::VarBase>> &outputs) {
  VLOG(3) << "after forward, then reset count for backward.";
  grad_need_hooks_ = true;
Add a note to explain the role of this parameter
Thanks. Notes have already been added at lines 212~215 of paddle/fluid/imperative/reducer.h.
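For readers without that header at hand, here is a hedged pseudocode rendering of the flag's role (names follow the C++ diff above; the hook-side check is an assumption consistent with this reply, not the literal implementation):

```python
class ReducerSketch:
    """Illustrates the assumed role of grad_need_hooks_ (not the real C++)."""

    def __init__(self):
        self.grad_need_hooks = False  # mirrors grad_need_hooks_ in reducer.h

    def prepare_for_backward(self, outputs):
        # Armed only when a real backward pass is about to run
        # (the C++ diff above sets grad_need_hooks_ = true here).
        self.grad_need_hooks = True

    def add_dist_hook(self, var_index):
        # Gradient hooks fired while the flag is False (e.g. a backward
        # executed under no_sync) return early, so no allreduce is launched.
        if not self.grad_need_hooks:
            return
        self.mark_var_ready(var_index)

    def mark_var_ready(self, var_index):
        pass  # placeholder for the real bucketed allreduce logic
```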
@@ -907,6 +912,7 @@ void Reducer::ProcessUnusedDenseVars() {

    // 3. create grad var base or get grad var base
    auto grad_var_base_tmp = dest_var_base->MutableGradVarBase();
    grad_var_base_tmp->SharedVar()->SetIsEmpty(false);
Explain the reason for this modification
@@ -0,0 +1,175 @@
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
2018? This should be 2021.
@@ -0,0 +1,176 @@
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
@@ -0,0 +1,179 @@
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Should be 2021.
@@ -0,0 +1,100 @@
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Should be 2021.
LGTM
LGTM for API
batch_num = 1000


class SimpleNet(fluid.Layer):
In the unit tests, please also use paddle.nn.Layer, not the APIs under fluid.
OK, will do.
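A minimal sketch of the requested change, moving the network onto paddle.nn.Layer (the layer sizes here are illustrative placeholders, not the PR's actual values):

```python
import paddle
import paddle.nn as nn

class SimpleNet(nn.Layer):
    def __init__(self):
        super(SimpleNet, self).__init__()
        # Placeholder sizes; the real test wires up its own shapes.
        self.net_a = nn.Linear(10, 20)
        self.net_b = nn.Linear(20, 5)

    def forward(self, x):
        x = self.net_a(x)
        x = self.net_b(x)
        return x
```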
        return x


class TestNoSyncControlFlow(TestParallelDyGraphRunnerBase):
Do the unit tests here check gradient correctness with no_sync enabled through TestParallelDyGraphRunnerBase? I don't see how correctness is checked in the code below.
The no_sync unit tests are implemented by overriding four methods of TestParallelDyGraphRunnerBase: get_model, run_one_loop, run_trainer, and run_trainer_with_spawn. The check of gradient correctness with no_sync enabled is already implemented by the test framework, so it does not need to be implemented here.
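A hedged sketch of that override structure (the method names come from the reply above; the signatures and helper names such as fake_sample_reader are assumptions for illustration, not the exact test code):

```python
import paddle

class TestNoSyncControlFlow(TestParallelDyGraphRunnerBase):
    def get_model(self):
        # Build model, data reader, and optimizer for the runner framework.
        model = SimpleNet()  # as sketched above
        train_reader = paddle.batch(fake_sample_reader(), batch_size=16)
        optimizer = paddle.optimizer.SGD(
            learning_rate=0.001, parameters=model.parameters())
        return model, train_reader, optimizer

    def run_one_loop(self, model, optimizer, batch):
        # One forward pass; the shared runner compares losses/gradients
        # across single-card and multi-card executions.
        x = paddle.to_tensor(batch, dtype='float32')
        loss = model(x).mean()
        return loss
```

The run_trainer and run_trainer_with_spawn overrides wrap these in the distributed launch path; the correctness check itself lives in the shared runner base, as noted above.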
LGTM for API
LGTM
PR types
New features
PR changes
APIs
Describe
Add no_sync in data parallel for dynamic graph
1. API form:
2. Documentation:
Chinese:
English version:
3. Functionality:
no_sync pauses gradient synchronization in dynamic-graph data parallel training and supports accum_gradient;
it removes unnecessary synchronization inside gradient-accumulation loops, improving performance to some degree without affecting accuracy (see the usage sketch after this list).
4. Test plan:
Given the complexity of possible network structures, a unit test is provided for every case once no_sync is implemented, comparing accuracy single-card vs. multi-card and multi-card vs. multi-card.
The covered network structures include unused_params, complex control flow, etc., following PR #32826.
Self-test results: no_sync covers all of the above cases, with no accuracy difference from single-card runs.
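The usage sketch referenced in item 3: pausing synchronization during gradient accumulation. This assumes the semantics described above, where synchronization resumes on the first backward executed outside no_sync; loader and the model are placeholders.

```python
import paddle
import paddle.distributed as dist

dist.init_parallel_env()
model = paddle.DataParallel(SimpleNet())  # SimpleNet as sketched earlier
optimizer = paddle.optimizer.SGD(
    learning_rate=0.001, parameters=model.parameters())

accum_steps = 4
for step, data in enumerate(loader()):  # loader is a placeholder
    loss = model(data).mean()
    if (step + 1) % accum_steps != 0:
        # Accumulate local gradients; no inter-card allreduce here.
        with model.no_sync():
            loss.backward()
    else:
        # The first backward outside no_sync synchronizes the
        # accumulated gradients across cards, then we update.
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
```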