
Add multi_tensor for momentum optimizer and clear_grads #37564

Merged: 30 commits merged into PaddlePaddle:develop from dev/approve_momentum_py on Dec 20, 2021

Conversation

@zhangbo9674 zhangbo9674 commented Nov 25, 2021

PR types

New features

PR changes

APIs

Describe

1. Main contents of this PR:

  • Add the multi_tensor_apply optimization strategy to the Momentum optimizer in dygraph mode (depends on the merged_momentum op PR).

  • Add the multi_tensor_apply optimization strategy to the optimizer's clear_grad in dygraph mode (depends on the VarBase::ClearGradient optimization PR). A minimal usage sketch follows below.
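
For reference, here is a minimal dygraph usage sketch of the new use_multi_tensor flag added by this PR (the toy network and data are purely illustrative):

```python
import paddle

# A tiny model just to exercise the optimizer; any dygraph model works.
linear = paddle.nn.Linear(10, 10)
inp = paddle.rand([4, 10], dtype="float32")

opt = paddle.optimizer.Momentum(
    learning_rate=0.1,
    momentum=0.9,
    parameters=linear.parameters(),
    use_multi_tensor=True)  # new flag introduced by this PR

out = linear(inp)
loss = paddle.mean(out)
loss.backward()
opt.step()        # all parameters updated through the fused multi-tensor path
opt.clear_grad()  # grads cleared in one batched call in dygraph mode
```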

2. The multi_tensor_apply strategy:

2.1 Original optimizer execution logic:

Loop over all parameters and call the optimizer kernel once for each parameter to update it. Taking the resnet50 model as an example, profiling the optimizer's execution logic and time cost gives the following results (a sketch of this per-parameter loop follows the figure below):

  • The optimizer takes about 9 ms in total, of which 6 ms (66.7%) is spent iterating over the network parameters one by one and calling the momentum op on each of them.

  • The dygraph branch also contains some steps that are useless for the parameter update, e.g. update_param_device_map(params_grads).

[image]
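
Schematically, the original per-parameter path looks like the loop below (the helper names and variables are illustrative, not the actual Paddle internals):

```python
# Illustration only: one momentum op launch per parameter means many small
# kernel calls plus repeated Python-side bookkeeping such as
# update_param_device_map. params_grads and learning_rate stand in for the
# surrounding optimizer state.
for param, grad in params_grads:
    velocity = get_velocity_accumulator(param)          # hypothetical helper
    momentum_update(param, grad, velocity, learning_rate)  # hypothetical per-parameter op call
```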

2.2 Optimizer execution logic with the multi_tensor_apply strategy:

The optimizer execution logic with multi_tensor_apply is shown in the figure below and consists of two parts (a sketch follows the figure):

  • The yellow part is data initialization: during the first training iteration the network parameters are traversed once, and global_lr, parameter, velocity, regularization, etc. are grouped into lists for the later optimizer op call. This step is relatively time-consuming, but it only runs once; later iterations do not need to call it again.

  • The green part runs in every training iteration: all of the network's grads and lrs are grouped into lists, and a single call to the [merged_momentum](https://github.com/PaddlePaddle/Paddle/pull/37527) op updates all of the network's parameters.

[image]
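
As a rough illustration of these two phases (the function and helper names below are hypothetical, not the actual Paddle internals):

```python
def multi_tensor_init(parameters):
    # "Yellow" phase: runs only once, on the first iteration. Traverse the
    # network parameters and group params / velocities / learning rates into
    # flat lists so that later steps need no per-parameter Python loop.
    param_list, velocity_list, lr_list = [], [], []
    for p in parameters:
        param_list.append(p)
        velocity_list.append(get_velocity_accumulator(p))  # hypothetical helper
        lr_list.append(get_param_lr(p))                     # hypothetical helper
    return param_list, velocity_list, lr_list


def multi_tensor_step(param_list, grad_list, velocity_list, lr_list):
    # "Green" phase: runs every iteration. A single fused merged_momentum
    # call updates all parameters at once instead of launching one momentum
    # op per parameter.
    merged_momentum_update(param_list, grad_list, velocity_list, lr_list)  # stand-in for the fused op
```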

2.3 The multi_tensor_apply logic for clear_grad:

In line with the optimizer change, the original clear_grad loops over all grads and calls VarBase::ClearGradient(set_to_zero=True) for each of them. The main cost comes from:

  • repeated Python/C++ interaction;

  • the poor performance and high time cost of the set_to_zero mode.

With the multi_tensor_apply strategy, all grads are passed in at once and VarBase::ClearGradient(set_to_zero=False) is called on the C++ side during training, which reduces both the Python/C++ interaction time and the cost of the set_to_zero mode.
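
A simplified sketch of the batched path (the exact signature of core.clear_gradients is assumed here, following the review discussion further down):

```python
from paddle.fluid import core


def clear_grad(self, set_to_zero=True):
    # Method-body sketch: collect all trainable parameters once on the
    # Python side ...
    param_list = [p for p in self._parameter_list if not p.stop_gradient]
    # ... then clear every grad in a single Python -> C++ call, instead of
    # calling VarBase::ClearGradient once per parameter.
    core.clear_gradients(param_list, set_to_zero)
```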

3. Performance test:

Taking resnet50 with batch_size=256 as an example, the time cost of the optimizer plus clear_grad before and after the optimization is:

  • Before the optimization: about 11 ms:
    [image]

  • After the optimization: about 6 ms:
    [image]

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI result first. See the Paddle CI Manual for details.

@zhangbo9674 zhangbo9674 changed the title [Opt]add multi_tensor for momentum and clear_grads for optimizer Add multi_tensor for momentum and clear_grads for optimizer Dec 1, 2021
@zhangbo9674 zhangbo9674 changed the title Add multi_tensor for momentum and clear_grads for optimizer Add multi_tensor for momentum optimizer and clear_grads Dec 1, 2021
None

Examples:
.. code-block:: python
Contributor

There should be a blank line between the code-block directive line and the code body.
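
For reference, the convention being pointed out looks roughly like this (the example content itself is illustrative):

```python
# Note the blank line between the ".. code-block:: python" directive
# and the code body inside the docstring.
"""
Examples:
    .. code-block:: python

        import paddle

        opt = paddle.optimizer.Momentum(
            learning_rate=0.1,
            parameters=paddle.nn.Linear(10, 10).parameters(),
            use_multi_tensor=True)
"""
```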

Contributor Author

Done, tks!

@@ -129,7 +131,8 @@ def __init__(self,
                 grad_clip=None,
                 multi_precision=False,
                 rescale_grad=1.0,
-                name=None):
+                name=None,
+                use_multi_tensor=False):
Contributor

Is it better to put use_multi_tensor before name?

Contributor Author

Done, tks!

self.helper = LayerHelper(self.__class__.__name__)

self._create_global_learning_rate()
if framework.in_dygraph_mode():
Contributor

what about static mode?

Contributor Author

Tks, Multi Tensor has been added to static mode.

"""
self._create_accumulators(target_block, parameters)
for param in parameters:
if param.stop_gradient is False:
Contributor

Is this if needed?

Contributor Author

This is not needed, done, tks!

param)
self.velocity_dict['FP16_LODTensor'].append(velocity_acc)
# master weight
# master weight
Contributor

duplicated.

Contributor Author

Done, tks!

Comment on lines 333 to 344
# regularization
regularization_method = self._regularization_method
regularization_coeff = self._regularization_coeff
if hasattr(param, 'regularizer'):
    # we skip param's l2decay before, so fuse it with momentum here.
    if isinstance(param.regularizer, L2DecayRegularizer):
        regularization_method = "l2_decay"
        regularization_coeff = param.regularizer._regularization_coeff
    # the param's regularization has been done before, we avoid do l2decay in momentum.
    else:
        regularization_method = ""
        regularization_coeff = 0
Contributor

Same as fp32; the code can be reused.

Contributor Author

Done, tks!
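
A sketch of the kind of reuse being suggested here, using a hypothetical shared helper rather than the code that actually landed:

```python
def _get_regularization(self, param):
    # Shared by the FP32 and FP16 branches: decide whether the parameter's
    # L2 decay should be fused into the momentum op.
    regularization_method = self._regularization_method
    regularization_coeff = self._regularization_coeff
    if hasattr(param, 'regularizer'):
        if isinstance(param.regularizer, L2DecayRegularizer):
            # The param's L2 decay was skipped earlier, so fuse it here.
            regularization_method = "l2_decay"
            regularization_coeff = param.regularizer._regularization_coeff
        else:
            # Regularization was already applied; avoid doing L2 decay again.
            regularization_method = ""
            regularization_coeff = 0
    return regularization_method, regularization_coeff
```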

self.grad_dict = {'FP32_LODTensor': [], 'FP16_LODTensor': []}
self.lr_dict = {'FP32_LODTensor': [], 'FP16_LODTensor': []}

if framework.in_dygraph_mode():
Contributor

Same as above: what about static mode?

Contributor Author

Tks, Multi Tensor has been added to static mode.

Comment on lines 492 to 493
self.grad_dict = {'FP32_LODTensor': [], 'FP16_LODTensor': []}
self.lr_dict = {'FP32_LODTensor': [], 'FP16_LODTensor': []}
Contributor

These do not need to be attributes of self; temporary variables are fine.

Contributor Author

Done, tks!

# NOTE: Multi Tensor: Pass in all parameters and gradients to the op kernel of the Optimizer at one time for updating for dygraph mode.
# Optimizer support list: [ paddle.optimizer.Momentum ].
self._use_multi_tensor = None
self.param_dict = {'FP32_LODTensor': [], 'FP16_LODTensor': []}
Contributor

Suggested change:
-        self.param_dict = {'FP32_LODTensor': [], 'FP16_LODTensor': []}
+        self._param_dict = {'FP32_LODTensor': [], 'FP16_LODTensor': []}

Contributor Author

Done, tks.

Comment on lines 1056 to 1072
        param_list = []
        if self._parameter_list is None or not isinstance(
                self._parameter_list[0], dict):
            for p in self._parameter_list:
                if not p.stop_gradient:
-                   p.clear_gradient()
+                   if set_to_zero:
+                       p.clear_gradient()
+                   else:
+                       param_list.append(p)
        else:
            for param_group in self._param_groups:
                for p in param_group['params']:
                    if not p.stop_gradient:
-                       p.clear_gradient()
+                       if set_to_zero:
+                           p.clear_gradient()
+                       else:
+                           param_list.append(p)
Contributor

I think we can use core.clear_gradients even if set_to_zero is true

Contributor Author

Done, tks.

Comment on lines 135 to 136
use_multi_tensor=False,
name=None, ):
Contributor

Suggested change:
-                 use_multi_tensor=False,
-                 name=None, ):
+                 use_multi_tensor=False,
+                 name=None):

Contributor Author

Done, tks!

@@ -72,6 +73,7 @@ class Momentum(Optimizer):
( :ref:`api_fluid_clip_GradientClipByGlobalNorm` , :ref:`api_fluid_clip_GradientClipByNorm` ,
:ref:`api_fluid_clip_GradientClipByValue` ). Default None, meaning there is no gradient clipping.
multi_precision (bool, optional): Whether to use multi-precision during weight updating. Default is false.
use_multi_tensor (bool, optional): Whether to use multi-tensor strategy to update all parameters at once . Default is false.
Contributor

it should be listed after rescale_grad

Contributor Author

Done, tks!

There are two methods to clear grad: set_to_zero or delete grad.

Args:
set_to_zero (bool): If set grads to zero or not, default is True.
Contributor

bool -> bool, optional

Contributor Author

Done, tks!

lanxianghit previously approved these changes Dec 17, 2021
TCChenlong previously approved these changes Dec 17, 2021
@zhangbo9674 zhangbo9674 dismissed stale reviews from TCChenlong and lanxianghit via 5423606 December 17, 2021 10:19
@TCChenlong (Contributor) left a comment

LGTM

@zhiqiu (Contributor) left a comment

LGTM

@zhiqiu zhiqiu merged commit 0cc5e22 into PaddlePaddle:develop Dec 20, 2021
@zhangbo9674 zhangbo9674 deleted the dev/approve_momentum_py branch March 2, 2023 02:57