[PaddlePaddle Hackathon 3 No.47] Add fp16 support for the logsumexp op in Paddle #45817
Conversation
Your PR was submitted successfully. Thank you for your contribution to the open-source project!
Hi, this PR only adds an fp32 unit test. So if I want to improve the data-type support of the logsumexp op, I probably need to improve the computation precision for both fp32 and fp16 at the same time. Assuming the fp32 unit test above is fine, then both fp32 and fp16 need precision fixes.
It looks like the fp64 precision check passes, but fp32 does not. Could you first try implementing a custom gradient with numpy and compare how much the expected gradient computed by numpy differs from the operator's result?
Why was this closed?
Sorry, I was on vacation last week and didn't have time to keep working on this before the 19th.
The deadline for submitting the PR is the 19th, and the deadline for merging is the 29th, so there is still time~
With inputs of shape = [2, 3, 4, 5] and x = np.random.uniform(-1, 1, shape).astype(dtype), the default check_grad still reports an error.
Also, in op_test.py, what do numeric_grads and analytic_grads, which correspond to user_defined_grads and user_defined_grad_outputs respectively, represent?
The PR only appears to add one FP32 unit test. Do you have code changes locally? Could you push an update first?
PD_REGISTER_KERNEL(
-    logsumexp, GPU, ALL_LAYOUT, phi::LogsumexpKernel, float, double) {}
+    logsumexp, GPU, ALL_LAYOUT, phi::LogsumexpKernel, float, double, float16) {}
The LogsumexpFunctor implementation contains exp and log, which need float as the compute type. For how to make this change, you can refer to #45952.
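For intuition, here is a standalone numpy sketch (not the Paddle functor itself) of why exp should be evaluated in a wider type even when the tensor dtype is float16: with the plain log(sum(exp(x))) formula, exp overflows half precision long before the logsumexp result does, while computing in float32 and casting the result back stays accurate.

import numpy as np

# Four identical entries, so logsumexp(x) = 12 + log(4) ~= 13.39, well within fp16 range.
x = np.full([4], 12.0, dtype='float16')

# exp(12) > 65504 (the fp16 maximum), so the all-fp16 computation overflows to inf.
print(np.log(np.exp(x).sum()))

# Casting to float32 for exp/log and casting the result back stays finite and accurate.
print(np.log(np.exp(x.astype('float32')).sum()).astype('float16'))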
PD_REGISTER_KERNEL(logsumexp_grad,
                   GPU,
                   ALL_LAYOUT,
                   phi::LogsumexpGradKernel,
The LogsumexpGradFunctor implementation contains exp as well, so it also needs float as the compute type.
        self.dtype = 'float32'

class TestLogsumexp_FP16(TestLogsumexp): |
For the fp16 unit test, you can add the following decorator to skip execution on CPU:
@unittest.skipIf(not core.is_compiled_with_cuda(),
"core is not compiled with CUDA")
Also, if the unit-test precision check cannot pass, you can try adjusting atol, rtol, and max_relative_error in the test. float16 only has about 3 significant digits, so setting them to 1e-3 is also reasonable.
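As a quick standalone check of that precision limit (not part of the unit test itself):

import numpy as np

# float16 machine epsilon is about 9.77e-4, i.e. roughly 3 significant decimal
# digits, so tolerances on the order of 1e-3 match what fp16 can represent.
print(np.finfo(np.float16).eps)
print(np.float16(1.0) + np.float16(4e-4))  # the 4e-4 is lost to rounding: prints 1.0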
@Xreki Hi, the code conflict has been resolved. It currently looks like there are still precision issues in the fp32 and fp16 unit tests even with the Functor casting to double. Does the unit test need to be handled separately?
          dev_ctx, in, out, out_grad, in_grad, functor, axis32);
      break;
    case 4:
-     phi::funcs::ReduceGradFunctor<Context, T, 4, LogsumexpGradFunctor>(
+     phi::funcs::ReduceGradFunctor<Context, T, 4, LogsumexpGradFunctor<T>>(
          dev_ctx, in, out, out_grad, in_grad, functor, axis32);
      break;
  }
I suggest adding a default case here that uses PADDLE_THROW to report an error for inputs with more than 4 dimensions.
@@ -74,7 +79,7 @@ void LogsumexpKernel(const Context& dev_ctx,
     auto output = phi::EigenScalar<T>::From(*out);
     auto& place = *dev_ctx.eigen_device();
     auto reduce_dim = Eigen::array<int, 1>({{0}});
-    LogsumexpFunctor()(place, &input, &output, reduce_dim);
+    LogsumexpFunctor<T>()(place, &input, &output, reduce_dim);
   } else {
For unsupported dimensions, please add an error message here as well.
    def set_attrs(self):
        self.dtype = 'float16'
It currently looks like there are still precision issues in the fp32 and fp16 unit tests even with the Functor casting to double. Does the unit test need to be handled separately?
In that unit test, you can override the test_check_output and test_check_grad functions and specify larger atol and max_relative_error precision thresholds.
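For reference, a sketch of what such a class could look like, assuming OpTest's check_output_with_place / check_grad_with_place helpers; the concrete atol and max_relative_error values below are placeholders to tune, not values taken from this PR:

@unittest.skipIf(not core.is_compiled_with_cuda(),
                 "core is not compiled with CUDA")
class TestLogsumexp_FP16(TestLogsumexp):

    def set_attrs(self):
        self.dtype = 'float16'

    def test_check_output(self):
        # Run only on GPU and relax the absolute tolerance for fp16 outputs.
        place = core.CUDAPlace(0)
        self.check_output_with_place(place, atol=1e-3)

    def test_check_grad(self):
        # Allow a larger relative error for the fp16 gradient check.
        place = core.CUDAPlace(0)
        self.check_grad_with_place(
            place, ['X'], ['Out'], max_relative_error=1e-2)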
@Xreki The unit tests have been added. Please review the code when you have time; I will add the performance data today or tomorrow.
@Xreki Hi, the performance data has been updated. fp16 performance gradually improves as the data size grows. Because the exp and log computations in logsumexp need to be cast to float, fp16 and fp32 performance are on the same order of magnitude overall: at small sizes fp32 is slightly faster since it has no cast overhead, and as the data size grows fp16 becomes faster. The largest data size my local environment can test is [100, 1000, 250, 50], where the fp32:fp16 time ratio is 1.284. I tried several other ways of computing the logsumexp op, and their performance was almost identical to the current one. CI has passed; please review when you have time.
    x_grad = tensor_x.gradient()
    fluid.set_flags({"FLAGS_retain_grad_for_all_tensor": False})
    paddle.enable_static()
    return x_grad
The fluid interfaces are no longer recommended. I suggest referring to the following unit test here:
- to_variable -> paddle.to_tensor
- computing gradients: backward() -> paddle.grad
Paddle/python/paddle/fluid/tests/unittests/test_layer_norm_op.py, lines 346 to 388 in 97ec57f:
class TestFP16ScaleBiasLayerNorm(unittest.TestCase):

    def check_main(self, x_np, weight_np, bias_np, dtype):
        paddle.disable_static()

        weight_np = weight_np.astype(dtype)
        bias_np = bias_np.astype(dtype)

        x = paddle.to_tensor(x_np)
        weight = paddle.to_tensor(weight_np)
        bias = paddle.to_tensor(bias_np)
        x.stop_gradient = False
        weight.stop_gradient = False
        bias.stop_gradient = False
        y = F.layer_norm(x, x.shape[1:], weight, bias)
        x_g, w_g, b_g = paddle.grad(y, [x, weight, bias])
        y_np = y.numpy().astype('float32')
        x_g_np = x_g.numpy().astype('float32')
        w_g_np = w_g.numpy().astype('float16')
        b_g_np = b_g.numpy().astype('float32')
        paddle.enable_static()
        return y_np, x_g_np, w_g_np, b_g_np

    def test_main(self):
        if not paddle.is_compiled_with_cuda():
            return
        x_np = np.random.random([10, 20]).astype('float16')
        weight_np = np.random.random([20]).astype('float16')
        bias_np = np.random.random([20]).astype('float16')

        y_np_1, x_g_np_1, w_g_np_1, b_g_np_1 = self.check_main(
            x_np, weight_np, bias_np, 'float16')
        y_np_2, x_g_np_2, w_g_np_2, b_g_np_2 = self.check_main(
            x_np, weight_np, bias_np, 'float32')

        def assert_equal(x, y):
            np.testing.assert_array_equal(x, y)

        assert_equal(y_np_1, y_np_2)
        assert_equal(x_g_np_1, x_g_np_2)
        assert_equal(w_g_np_1, w_g_np_2)
        assert_equal(b_g_np_1, b_g_np_2)
def logsumexp_ref_grad(x):
    sum = np.exp(x).sum()
    return np.exp(x) / sum
If the input is fp16, doing this computation entirely in fp16 also loses precision, so this reference value is probably not accurate enough. I suggest using fp32 for this computation as well.
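A possible adjustment, as a sketch: do the reference computation in float32 and only cast the result back to the input's dtype at the end.

def logsumexp_ref_grad(x):
    # Compute the reference gradient in float32 even for fp16 inputs,
    # then cast back to the original dtype for the comparison.
    x32 = x.astype('float32')
    exp_x = np.exp(x32)
    return (exp_x / exp_x.sum()).astype(x.dtype)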
        self.__class__.dtype = self.dtype
        x_grad = logsumexp_op_grad(self.inputs['X'])
        ref_x_grad = logsumexp_ref_grad(self.inputs['X'])
        np.testing.assert_allclose(x_grad, ref_x_grad, rtol=1e-05, atol=1e-04)
Is there an issue with the atol setting here? The allclose rule is absolute(a - b) <= (atol + rtol * absolute(b)). The relative error of fp16 is 1e-3, so the criterion should be rtol=1e-3, with atol as close to 0 as possible.
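A small standalone numpy check of what those tolerances mean in practice: values that differ only by fp16 rounding pass with rtol=1e-3 and atol=0.

import numpy as np

b = np.random.uniform(0.1, 1.0, [8]).astype('float32')
a = b.astype('float16').astype('float32')  # introduce only fp16 rounding error

# absolute(a - b) <= atol + rtol * absolute(b) holds with rtol=1e-3, atol=0,
# because fp16 rounding keeps the relative error below about 5e-4.
np.testing.assert_allclose(a, b, rtol=1e-3, atol=0)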
@zhangting2020 Thanks, the review comments above have been addressed. Please take another look.
LGTM
PR types
New features
PR changes
OPs
Describe
logsumexp support fp16
performance