This problem came to light when I was investigating how we could move regularization to the pserver (#7432). The current distribute transpiler splits the params and grads and sends a different slice to each pserver. Hence, when we create optimize ops on the pserver, they operate on the sliced parameters and gradients. However, the distribute transpiler currently identifies these ops through a hack: it checks whether the op has inputs named `Param` and `Grad`. This works because the optimizers have their own dedicated ops such as `sgd_op`, `adam_op`, etc.
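For concreteness, the check the transpiler relies on looks roughly like this (a minimal sketch, assuming an op exposes its input slot names via something like `op.input_names`; the actual transpiler code may differ):

```python
def _is_optimize_op(op):
    # Optimizer ops such as sgd_op and adam_op declare dedicated input
    # slots named "Param" and "Grad"; their presence identifies the op.
    return "Param" in op.input_names and "Grad" in op.input_names
```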
However, for regularization and gradient clipping, we rely on generic tensor ops like `scale` and `elementwise_add`. These ops take parameters as inputs, so on the pserver they should receive the sliced parameters. We therefore need a way to identify these ops in the distribute transpiler so that we can pass the sliced params and grads as their inputs. The above-mentioned hack does not work here because these are generic ops whose input and output names are `X`, `Y`, etc.
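To illustrate why the check fails, here is a rough sketch of the ops the regularizer appends today (hypothetical helper and variable names; assumes the `Block.append_op`/`create_var` interfaces and the `scale`/`elementwise_add` op protos):

```python
def _append_l2_decay_ops(block, param, grad, coeff):
    # Weight decay as currently emitted in Python (sketch):
    # grad += coeff * param, built from two generic ops.
    scaled = block.create_var(name=param.name + "_decay",
                              dtype=param.dtype, shape=param.shape)
    block.append_op(type="scale",
                    inputs={"X": param},                 # parameter enters through slot "X"
                    outputs={"Out": scaled},
                    attrs={"scale": coeff})
    block.append_op(type="elementwise_add",
                    inputs={"X": grad, "Y": scaled},     # gradient also arrives as "X"
                    outputs={"Out": grad})
    # Nothing in either op signature says "Param" or "Grad", so the
    # transpiler's hack cannot find these ops.
```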
A hacky solution would be to create dedicated ops for regularization. Currently, the regularization layer appends a `scale` op and an `elementwise_add` op in Python. Instead, we could create a separate op that composes these two ops in C++.
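Under that approach, the Python side would shrink to a single op with dedicated slots (a sketch using a hypothetical `l2_decay` op type that would still need a C++ kernel):

```python
def _append_l2_decay_op(block, param, grad, coeff):
    # One fused op with dedicated Param/Grad slots, so the existing hack
    # in the transpiler would match it just like sgd_op or adam_op.
    block.append_op(type="l2_decay",                     # hypothetical fused op
                    inputs={"Param": param, "Grad": grad},
                    outputs={"Out": grad},
                    attrs={"coeff": coeff})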
A better and more sustainable solution would be to support adding tags to Python ops. Tags would let us group related ops: every op added for regularization would carry a regularization tag, and gradient clipping ops would carry their own tag. The distribute transpiler could then look up ops by tag and apply whatever slicing logic it needs to them. These tags are similar to the concept of Collections in TensorFlow.
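A rough sketch of how the tag mechanism could look on the Python side (the `tag` attribute, the `_set_attr`-style call, and both tag names are assumptions, not an existing API):

```python
REGULARIZATION_TAG = "regularization"
GRAD_CLIP_TAG = "gradient_clip"

def append_tagged_op(block, tag, **op_args):
    # Layers that build regularization or clipping ops would go through this
    # helper so every emitted op carries its group tag.
    op = block.append_op(**op_args)
    op._set_attr("tag", tag)          # assumed attribute-setting call
    return op

def ops_with_tag(block, tag):
    # The distribute transpiler can then collect a whole group at once and
    # rewrite its inputs to the sliced params/grads living on each pserver.
    return [op for op in block.ops
            if op.has_attr("tag") and op.attr("tag") == tag]
```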
Maybe we don't need to implement the hacky solution ("create dedicated ops for regularization") if putting regularization on the pserver is not a hard requirement for the February deadline? Otherwise that code will just be removed once the "correct" solution is in place.
Hello, this issue has not been updated in the past month, so we will close it today for the sake of other users' experience. If you still need to follow up after it is closed, please feel free to reopen it and we will get back to you within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!