[3D-parallel] Reformat pipeline parallel #31786
Conversation
Thanks for your contribution!
LGTM
outputs={'Out': [sync_var]},
attrs={
    'ring_id': global_ring_id,
    'use_calc_stream': True,
A calc-stream sync is needed here.
Will add it in the next PR.
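For reference, a minimal sketch of what that sync could look like, assuming Paddle's fluid block API and the c_sync_calc_stream op; `block`, `insert_idx`, and the reuse of `sync_var` are hypothetical stand-ins for the surrounding code:

```python
# Sketch only: insert a c_sync_calc_stream op right after the op that
# writes sync_var, so later collective ops observe a finished calc stream.
block._insert_op(
    insert_idx + 1,
    type='c_sync_calc_stream',
    inputs={'X': sync_var},
    outputs={'Out': sync_var})
```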
origin_param = origin_block.vars[op_role_var[i]]
if origin_param.is_distributed:
    continue
if offset == idx:
    offset += 1
if not add_sync_calc_stream:
If c_allreduce_sum uses the calc stream, this sync op is unnecessary.
Yes, I'll remove it in the next PR.
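To illustrate the point, a hedged sketch (assuming the c_allreduce_sum op; `block`, `grad_var`, and `ring_id` are hypothetical names):

```python
# Sketch: a c_allreduce_sum launched with use_calc_stream=True runs on
# the same stream as the preceding compute kernels, so it is already
# ordered after them and no separate c_sync_calc_stream op is needed.
block.append_op(
    type='c_allreduce_sum',
    inputs={'X': grad_var},
    outputs={'Out': grad_var},
    attrs={
        'ring_id': ring_id,
        'use_calc_stream': True,
    })
```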
@@ -123,7 +123,8 @@ def _insert_cast_op(block, op, idx, src_dtype, dest_dtype):
    outputs={"Out": out_var},
    attrs={
        "in_dtype": in_var.dtype,
-       "out_dtype": out_var.dtype
+       "out_dtype": out_var.dtype,
+       "op_device": op.attr("op_device")
If the cast is fp32->fp16, consider setting the cast's "op_device" to prev_op's "op_device" attribute. That way, if a (send, recv) op pair is added, it is inserted as cast -- (send, recv) --> op, so (send, recv) transfers the fp16 output.
Of course, if the cast is fp16->fp32, the current setting is fine.
Designing it as the comment suggests would reduce communication overhead. We can keep this as an optimization to consider later.
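A sketch of the suggested placement rule, hedged: `choose_cast_device` is a hypothetical helper, and `prev_op` stands for the producer of the cast's input; only the fp32->fp16 branch changes placement.

```python
from paddle.fluid import core

# Hypothetical helper sketching the suggested rule: keep an fp32->fp16
# cast on the producer's device so an inserted (send, recv) pair moves
# the smaller fp16 tensor across devices; fp16->fp32 casts keep the
# current placement on the consuming op's device.
def choose_cast_device(op, prev_op, in_dtype, out_dtype):
    if (prev_op is not None
            and in_dtype == core.VarDesc.VarType.FP32
            and out_dtype == core.VarDesc.VarType.FP16):
        return prev_op.attr("op_device")
    return op.attr("op_device")
```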
attrs={
    "in_dtype": target_var.dtype,
    "out_dtype": cast_var.dtype,
    "op_device": op.attr("op_device")
same as above
@@ -3937,6 +4030,11 @@ def _find_post_op(self, ops, cur_op, var_name):
        var_name as output.
    var_name (string): Variable name.
    """
# To skip the cast op added by amp which has no op_device set |
op_device is already added in the amp cast; is this still needed?
We'll verify this on a large model; if this code path turns out to be unused, it will be removed in the next PR.
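For context, a hedged sketch of the skip that the comment describes (a hypothetical helper, not the actual _find_post_op):

```python
# Hypothetical helper: find the first op after position cur_idx that
# consumes var_name, skipping cast ops that amp inserted without setting
# an op_device attribute.
def find_post_op(ops, cur_idx, var_name):
    for op in ops[cur_idx + 1:]:
        if op.type == 'cast' and not op.attr('op_device'):
            continue  # amp-added cast with no device assignment
        if var_name in op.input_arg_names:
            return op
    return None
```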
attrs={
    self._op_device_key: prev_device,
    self._op_role_key: op_role,
    'use_calc_stream': True,
Possible optimization: switch this to the comm stream; for the forward pass, a sync can be added at some point in the backward pass.
Fix it in the next PR.
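A hedged sketch of what that optimization could look like (send_v2 is assumed as the op type; `block`, `var`, `ring_id`, and `peer` are hypothetical names):

```python
# Launch the forward-side send on the comm stream instead of the calc
# stream, then synchronize that stream once at a chosen point in the
# backward pass rather than blocking on every send.
block.append_op(
    type='send_v2',
    inputs={'X': var},
    attrs={
        'ring_id': ring_id,
        'peer': peer,
        'use_calc_stream': False,  # run on the comm stream
    })

# ... later, at the chosen point in the backward pass:
block.append_op(
    type='c_sync_comm_stream',
    inputs={'X': var},
    outputs={'Out': var},
    attrs={'ring_id': ring_id})
```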
'dtype': var.dtype,
self._op_device_key: cur_device,
self._op_role_key: op_role,
'use_calc_stream': True,
Possible future optimization: move it earlier so it overlaps with computation.
'shape': merged_param_grad_var.shape,
'dtype': merged_param_grad_var.dtype,
'value': float(0),
# a trick to run this op once per mini-batch
A more detailed comment is required
Fix it in the next PR.
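A hedged sketch of the kind of detail the comment could carry, under the assumption that the trick relies on the op-role scheduling (LRSched-role ops running once per mini-batch rather than once per micro-batch); `block` and `merged_param_grad_var` are hypothetical names:

```python
from paddle.fluid import core

# Sketch only: zero the merged gradient with a fill_constant tagged with
# the LRSched role, so the pipeline scheduler executes it once per
# mini-batch, before micro-batch gradient accumulation starts.
block.append_op(
    type='fill_constant',
    outputs={'Out': merged_param_grad_var},
    attrs={
        'shape': merged_param_grad_var.shape,
        'dtype': merged_param_grad_var.dtype,
        'value': float(0),
        # assumption: LRSched-role ops are scheduled once per mini-batch
        'op_role': int(core.op_proto_and_checker_maker.OpRole.LRSched),
    })
```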
@@ -4444,36 +4647,68 @@ def _process_persistable_vars_in_multi_sections(self, main_program,
    'out_shape': read_block.var(var_name).shape,
    'dtype': read_block.var(var_name).dtype,
    self._op_device_key: read_device,
-   'use_calc_stream': True,
+   'use_calc_stream': False,
The following op is a sync_comm, so this can simply be changed to use_calc_stream=True.
Fix it in the next PR.
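A hedged sketch of the suggestion (recv_v2 is assumed as the op type; `block`, `read_block`, `var_name`, `ring_id`, and `peer` are hypothetical names):

```python
# If the receive runs on the calc stream it is already ordered with the
# subsequent compute ops, so the trailing c_sync_comm_stream can be
# dropped.
block.append_op(
    type='recv_v2',
    outputs={'Out': read_block.var(var_name)},
    attrs={
        'out_shape': read_block.var(var_name).shape,
        'dtype': read_block.var(var_name).dtype,
        'ring_id': ring_id,
        'peer': peer,
        'use_calc_stream': True,
    })
```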
place_list.append(core.CUDAPlace(local_rank))
for dev in device_list:
    dev_index = int(dev.split(":")[1])
    place_list.append(core.CUDAPlace(dev_index % 8))
Why the % 8?
Will fix it in the next PR by using the fixed value 0.
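A sketch of the agreed fix, under the assumption that each pipeline worker is launched with CUDA_VISIBLE_DEVICES restricted to a single GPU so the local device index is always 0; `device_list` is a hypothetical name:

```python
from paddle.fluid import core

place_list = []
for _ in device_list:
    place_list.append(core.CUDAPlace(0))  # fixed index instead of dev_index % 8
```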
PR types
Others
PR changes
Others
Describe