Add multi_tensor for momentum optimizer and clear_grads #37564
Conversation
Thanks for your contribution!
python/paddle/optimizer/optimizer.py
Outdated
        None

        Examples:
            .. code-block:: python
code-block: there should be a blank line between the directive line and the code body.
Done, tks!
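For reference, a minimal docstring sketch of the convention the reviewer is pointing at (the surrounding function is hypothetical): reStructuredText requires a blank line between the `.. code-block:: python` directive and the indented code body.

```python
def clear_grad_example():
    """
    Examples:
        .. code-block:: python

            import paddle

            linear = paddle.nn.Linear(2, 2)
            # ... build the optimizer and clear gradients here
    """
```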
python/paddle/optimizer/momentum.py
Outdated
@@ -129,7 +131,8 @@ def __init__(self,
                  grad_clip=None,
                  multi_precision=False,
                  rescale_grad=1.0,
-                 name=None):
+                 name=None,
+                 use_multi_tensor=False):
Is it better to put `use_multi_tensor` before `name`?
Done, tks!
python/paddle/optimizer/momentum.py
Outdated
        self.helper = LayerHelper(self.__class__.__name__)

        self._create_global_learning_rate()
        if framework.in_dygraph_mode():
what about static mode?
Tks, Multi Tensor has been added to static mode as well.
python/paddle/optimizer/momentum.py
Outdated
""" | ||
self._create_accumulators(target_block, parameters) | ||
for param in parameters: | ||
if param.stop_gradient is False: |
Is this if needed?
This is not needed, done, tks!
python/paddle/optimizer/momentum.py
Outdated
                param)
            self.velocity_dict['FP16_LODTensor'].append(velocity_acc)
            # master weight
            # master weight
duplicated.
Done, tks!
python/paddle/optimizer/momentum.py
Outdated
            # regularization
            regularization_method = self._regularization_method
            regularization_coeff = self._regularization_coeff
            if hasattr(param, 'regularizer'):
                # we skip param's l2decay before, so fuse it with momentum here.
                if isinstance(param.regularizer, L2DecayRegularizer):
                    regularization_method = "l2_decay"
                    regularization_coeff = param.regularizer._regularization_coeff
                # the param's regularization has been done before, we avoid do l2decay in momentum.
                else:
                    regularization_method = ""
                    regularization_coeff = 0
Same as the fp32 branch, the code can be reused.
Done, tks!
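A rough sketch of how the duplicated block could be shared between the FP32 and FP16 branches. The helper name `_get_regularization_for_fused_op` is an assumption for illustration only, not necessarily the name used in the final code; the logic simply mirrors the excerpt above.

```python
def _get_regularization_for_fused_op(self, param):
    # Hypothetical helper: decide which regularization the fused momentum op
    # should apply for this parameter, shared by the FP32 and FP16 code paths
    # instead of duplicating the same block in each branch.
    regularization_method = self._regularization_method
    regularization_coeff = self._regularization_coeff
    if hasattr(param, 'regularizer'):
        if isinstance(param.regularizer, L2DecayRegularizer):
            # the param's l2decay was skipped earlier, so fuse it into momentum here
            regularization_method = "l2_decay"
            regularization_coeff = param.regularizer._regularization_coeff
        else:
            # the param's regularization was already applied, so skip l2decay in momentum
            regularization_method = ""
            regularization_coeff = 0
    return regularization_method, regularization_coeff
```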
python/paddle/optimizer/momentum.py
Outdated
        self.grad_dict = {'FP32_LODTensor': [], 'FP16_LODTensor': []}
        self.lr_dict = {'FP32_LODTensor': [], 'FP16_LODTensor': []}

        if framework.in_dygraph_mode():
Same as above, what about static mode?
Tks, Multi Tensor has been added to static mode as well.
python/paddle/optimizer/momentum.py
Outdated
        self.grad_dict = {'FP32_LODTensor': [], 'FP16_LODTensor': []}
        self.lr_dict = {'FP32_LODTensor': [], 'FP16_LODTensor': []}
There is no need for these to be attributes of `self`; temporary variables are OK.
Done, tks!
python/paddle/optimizer/optimizer.py
Outdated
        # NOTE: Multi Tensor: Pass in all parameters and gradients to the op kernel of the Optimizer at one time for updating for dygraph mode.
        # Optimizer support list: [ paddle.optimizer.Momentum ].
        self._use_multi_tensor = None
        self.param_dict = {'FP32_LODTensor': [], 'FP16_LODTensor': []}
Suggested change:
-        self.param_dict = {'FP32_LODTensor': [], 'FP16_LODTensor': []}
+        self._param_dict = {'FP32_LODTensor': [], 'FP16_LODTensor': []}
Done, tks.
python/paddle/optimizer/optimizer.py
Outdated
        param_list = []
        if self._parameter_list is None or not isinstance(
                self._parameter_list[0], dict):
            for p in self._parameter_list:
                if not p.stop_gradient:
                    if set_to_zero:
                        p.clear_gradient()
                    else:
                        param_list.append(p)
        else:
            for param_group in self._param_groups:
                for p in param_group['params']:
                    if not p.stop_gradient:
                        if set_to_zero:
                            p.clear_gradient()
                        else:
                            param_list.append(p)
I think we can use `core.clear_gradients` even if `set_to_zero` is true.
Done, tks.
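A minimal sketch of what the revised `clear_grad` could look like after this suggestion: always collect the trainable parameters and hand the whole list to `core.clear_gradients` together with the flag. The exact `core.clear_gradients(param_list, set_to_zero)` signature is inferred from the comment and is an assumption here.

```python
def clear_grad(self, set_to_zero=True):
    # Collect every trainable parameter, whether or not set_to_zero is used.
    param_list = []
    if self._parameter_list is None or not isinstance(
            self._parameter_list[0], dict):
        for p in self._parameter_list:
            if not p.stop_gradient:
                param_list.append(p)
    else:
        for param_group in self._param_groups:
            for p in param_group['params']:
                if not p.stop_gradient:
                    param_list.append(p)
    # One C++ call clears (or zeroes) all gradients at once.
    core.clear_gradients(param_list, set_to_zero)
```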
python/paddle/optimizer/momentum.py
Outdated
                 use_multi_tensor=False,
                 name=None, ):
Suggested change:
-                 use_multi_tensor=False,
-                 name=None, ):
+                 use_multi_tensor=False,
+                 name=None):
Done, tks!
python/paddle/optimizer/momentum.py
Outdated
@@ -72,6 +73,7 @@ class Momentum(Optimizer):
             ( :ref:`api_fluid_clip_GradientClipByGlobalNorm` , :ref:`api_fluid_clip_GradientClipByNorm` ,
             :ref:`api_fluid_clip_GradientClipByValue` ). Default None, meaning there is no gradient clipping.
         multi_precision (bool, optional): Whether to use multi-precision during weight updating. Default is false.
+        use_multi_tensor (bool, optional): Whether to use multi-tensor strategy to update all parameters at once . Default is false.
it should be listed after rescale_grad
Done, tks!
python/paddle/optimizer/optimizer.py
Outdated
        There are two method to clear grad: set_to_zero or delete grad.

        Args:
            set_to_zero (bool): If set grads to zero or not, default is True.
bool -> bool, optional
Done, tks!
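For context, a short usage sketch combining the two flags documented in this PR, assuming a Paddle build that includes these changes (`use_multi_tensor` on the Momentum constructor, `set_to_zero` on `clear_grad`):

```python
import paddle

linear = paddle.nn.Linear(10, 10)
inp = paddle.rand([4, 10], dtype="float32")

opt = paddle.optimizer.Momentum(
    learning_rate=0.1,
    momentum=0.9,
    parameters=linear.parameters(),
    use_multi_tensor=True)  # update all parameters with one fused op call

out = linear(inp)
loss = paddle.mean(out)
loss.backward()
opt.step()
opt.clear_grad(set_to_zero=False)  # delete gradients instead of zeroing them
```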
LGTM
LGTM
PR types
New features
PR changes
APIs
Describe
1. Main contents of this PR:

- Add the multi_tensor_apply optimization strategy to the momentum optimizer in dygraph mode. (Depends on the merged_momentum op PR.)
- Add the multi_tensor_apply optimization strategy to the optimizer's clear_grad in dygraph mode. (Depends on the VarBase::ClearGradient optimization PR.)

2. The multi_tensor_apply strategy:
2.1 Original optimizer execution logic:

Loop over all parameters and call the optimizer kernel on each one to update it. Taking the resnet50 model as an example, a profiling analysis of the optimizer's execution logic and time cost gives the following results:
The optimizer takes 9ms in total, of which 6ms (66.7%) is spent iterating over every network parameter and calling the momentum op on each one.
The dygraph branch also contains some steps that are useless for the parameter update itself, such as update_param_device_map(params_grads).
2.2 Optimizer execution logic with the multi_tensor_apply strategy:
The execution logic of the optimizer with multi_tensor_apply is shown in the figure below and is divided into two parts:

- The yellow part is the data-initialization stage: during the first training iteration it traverses the network parameters and groups global_lr, parameter, velocity, regularization, etc. into lists for the subsequent optimizer op calls. This stage is relatively time-consuming, but after the first iteration it does not need to run again.
- The green part runs in every training iteration: it collects all of the network's grad and lr into lists and calls the [merged_momentum](https://github.com/PaddlePaddle/Paddle/pull/37527) op once to update all network parameters.
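A schematic pure-Python sketch of the two stages described above, for illustration only; it is not the Paddle implementation, and the helper names are made up. The point is that parameters are grouped by dtype once, after which every iteration makes a single fused call per group instead of one kernel call per parameter.

```python
import numpy as np

def merged_momentum_update(params, grads, velocities, lr, mu=0.9):
    # Stand-in for the fused merged_momentum op: one call updates a whole list.
    for p, g, v in zip(params, grads, velocities):
        v *= mu
        v += g
        p -= lr * v

# "Yellow" stage (first iteration only): group parameters/velocities by dtype.
params = [np.ones(4, dtype='float32'), np.ones(2, dtype='float16')]
velocities = [np.zeros_like(p) for p in params]
groups = {'FP32_LODTensor': [], 'FP16_LODTensor': []}
for i, p in enumerate(params):
    key = 'FP16_LODTensor' if p.dtype == np.float16 else 'FP32_LODTensor'
    groups[key].append(i)

# "Green" stage (every iteration): gather grads and make one fused call per group.
grads = [np.full_like(p, 0.1) for p in params]
for key, idx in groups.items():
    if idx:
        merged_momentum_update([params[i] for i in idx],
                               [grads[i] for i in idx],
                               [velocities[i] for i in idx],
                               lr=0.01)
```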
2.3 The multi_tensor_apply logic for clear_grad:
Consistent with the optimizer change above: originally, clear_grad looped over all grads and called VarBase::ClearGradient(set_to_zero=True) for each one. The main cost comes from the repeated Python/C++ interactions and from the set_to_zero mode, which performs poorly and is slow. With the multi_tensor_apply strategy, all grads are passed in at once and VarBase::ClearGradient(set_to_zero=False) is called on the C++ side during training, which reduces both the Python/C++ interaction time and the cost of the set_to_zero mode.
3. Performance test:
Taking resnet50 as an example with batch_size=256, the time cost of the optimizer and clear_grad before and after the optimization is compared below:
Before the optimization it takes about 11ms:
![image](https://user-images.githubusercontent.com/82555433/144167475-8816fcc4-698d-4532-a816-dc0a8eef2c48.png)
After the optimization it takes about 6ms:
![image](https://user-images.githubusercontent.com/82555433/144167638-6433ef4f-6453-4387-8796-7ff6206527a8.png)