
【PaddlePaddle Hackathon 3 No.47】Add fp16 support for logsumexp in Paddle #45817

Merged: 1 commit merged into PaddlePaddle:develop on Oct 13, 2022

Conversation

@xiaohemaikoo (Contributor) commented on Sep 7, 2022

PR types

New features

PR changes

OPs

Describe

logsumexp support fp16

Performance

Case No. | input_shape          | FP32 Perf (us) | FP16 Perf (us) | FP32/FP16 ratio
0        | [1000, 130, 17]      | 173.735        | 206.377        | 0.842
1        | [1000, 100, 10, 10]  | 576.93         | 651.485        | 0.886
2        | [1000, 100, 200]     | 1089.02        | 1219.52        | 0.893
3        | [100, 1000, 25, 40]  | 5195           | 5766.75        | 0.901
4        | [100, 1000, 250, 40] | 40763          | 39548.1        | 1.031
5        | [100, 1000, 250, 50] | 64426          | 50173.5        | 1.284
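For context, calling the op on a float16 GPU tensor after this change would look roughly like the sketch below (the shape and the randn-then-cast pattern are illustrative, assuming a CUDA build of Paddle):

```python
import paddle

paddle.set_device('gpu')

# Build a float16 input; randn produces float32, so cast it down here.
x = paddle.randn([1000, 130, 17]).astype('float16')

# With this PR, the GPU logsumexp kernel is also registered for float16.
y = paddle.logsumexp(x, axis=-1)
print(y.dtype)  # paddle.float16
```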

@paddle-bot bot commented on Sep 7, 2022

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@paddle-bot bot added the contributor (External developers) and status: proposed labels on Sep 7, 2022
@xiaohemaikoo (Contributor, Author)

Hi, this PR currently only adds an fp32 unit test.
I noticed that test_logsumexp.py has no unit test that checks fp32 computation accuracy.
After running the fp32 test submitted in this PR, I found that fp32 accuracy does not meet the 1e-3 requirement either.

So if I am to extend the data types supported by the logsumexp operator, I may need to improve the computation accuracy of both fp32 and fp16 at the same time.
It may not be possible to also keep performance no worse than the original float32 code, because the original float32 accuracy already fails the requirement.
When float32 accuracy is improved, fp32 performance may become correspondingly worse than before.

If the fp32 unit test above is correct, then both fp32 and fp16 need accuracy fixes.

@zhangting2020 (Contributor)

It looks like the fp64 accuracy check passes but the fp32 one does not. Could you first try implementing a custom gradient with numpy, and compare the expected gradient computed by numpy against the operator's result to see how large the difference is?
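As an illustration of that suggestion, here is a minimal sketch of such a comparison in dynamic-graph mode; the full-reduce case, the helper name, and the float64 reference are assumptions for this example, not code from the PR:

```python
import numpy as np
import paddle

def ref_logsumexp_grad(x):
    # For y = logsumexp(x) over all elements, dy/dx = softmax(x).
    # Shift by the max for numerical stability.
    e = np.exp(x - x.max())
    return e / e.sum()

shape = [2, 3, 4, 5]
x_np = np.random.uniform(-1, 1, shape).astype('float32')

x = paddle.to_tensor(x_np, stop_gradient=False)
y = paddle.logsumexp(x)
(x_grad,) = paddle.grad(y, [x])

# Expected gradient from numpy, computed in float64 for a tighter reference.
expected = ref_logsumexp_grad(x_np.astype('float64'))
print(np.abs(x_grad.numpy() - expected).max())
```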

@Ligoml (Contributor) commented on Sep 14, 2022

Why was this closed?

@xiaohemaikoo (Contributor, Author)

> Why was this closed?

Sorry, I was on vacation last week and did not have time to keep working on this before the 19th.

@Ligoml (Contributor) commented on Sep 20, 2022

The PR submission deadline is the 19th and the merge deadline is the 29th, so there is still a chance~

@Ligoml reopened this on Sep 20, 2022
@xiaohemaikoo (Contributor, Author)

With inputs of shape = [2, 3, 4, 5] and x = np.random.uniform(-1, 1, shape).astype(dtype),
comparing my local numpy custom gradient against the operator's gradient shows no large difference for either fp32 or fp16; all results are within 1e-3.

However, the default check_grad still reports an error.
What exactly do the user_defined_grads and user_defined_grad_outputs you define refer to?
In fp32, the user_defined_grads and user_defined_grad_outputs computed by default show errors larger than 5e-3.

@xiaohemaikoo (Contributor, Author)

Also, what do the numeric_grads and analytic_grads in op_test.py, which correspond to user_defined_grads and user_defined_grad_outputs, represent?
The error between the numeric_grads and analytic_grads computed by default is relatively large.

@Xreki (Contributor) commented on Oct 8, 2022

> (quoting the two questions above about check_grad, user_defined_grads / user_defined_grad_outputs, and numeric_grads / analytic_grads)

So far the PR only adds a single FP32 unit test. Do you have code changes locally? Could you push them first?

Review comment (on the forward kernel registration diff):

 PD_REGISTER_KERNEL(
-    logsumexp, GPU, ALL_LAYOUT, phi::LogsumexpKernel, float, double) {}
+    logsumexp, GPU, ALL_LAYOUT, phi::LogsumexpKernel, float, double, float16) {}

Contributor: The LogsumexpFunctor implementation contains exp and log, so float needs to be used as the compute type. See #45952 for an example of how to change this.


Review comment (on the grad kernel registration diff):

 PD_REGISTER_KERNEL(logsumexp_grad,
                    GPU,
                    ALL_LAYOUT,
                    phi::LogsumexpGradKernel,

Contributor: The LogsumexpGradFunctor implementation contains exp, so it also needs to use float as the compute type.
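The numerical idea behind these two comments can be sketched in NumPy; this only illustrates the technique of keeping fp16 storage while running exp/log in float, and is not the phi functor itself:

```python
import numpy as np

def logsumexp_fp16_storage(x_fp16):
    # Storage dtype stays float16, but exp/log run in float32 so the
    # limited fp16 range and precision do not dominate the error.
    x = x_fp16.astype(np.float32)
    m = x.max()
    out = np.log(np.exp(x - m).sum()) + m
    return np.float16(out)
```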

Review comment (on the unit test):

        self.dtype = 'float32'


class TestLogsumexp_FP16(TestLogsumexp):

Contributor: For the fp16 unit test, you can add the following decorator to skip execution on CPU:

@unittest.skipIf(not core.is_compiled_with_cuda(),
                 "core is not compiled with CUDA")

Also, if the unit test accuracy cannot pass, you can try adjusting atol, rtol, and max_relative_error in the test. float16 only has about 3 significant digits, so setting them to 1e-3 is reasonable.

@xiaohemaikoo (Contributor, Author)

@Xreki Hi, the code conflict has been resolved. It looks like casting to double in the functor still leaves accuracy issues in the fp32 and fp16 unit tests. Does the unit test need to be handled separately as well?

Review comment (on the grad kernel's rank switch):

           dev_ctx, in, out, out_grad, in_grad, functor, axis32);
       break;
     case 4:
-      phi::funcs::ReduceGradFunctor<Context, T, 4, LogsumexpGradFunctor>(
+      phi::funcs::ReduceGradFunctor<Context, T, 4, LogsumexpGradFunctor<T>>(
           dev_ctx, in, out, out_grad, in_grad, functor, axis32);
       break;
   }

Contributor: Suggest adding a default branch here that uses PADDLE_THROW to report an error for inputs with more than 4 dimensions.

Review comment (on LogsumexpKernel):

@@ -74,7 +79,7 @@ void LogsumexpKernel(const Context& dev_ctx,
     auto output = phi::EigenScalar<T>::From(*out);
     auto& place = *dev_ctx.eigen_device();
     auto reduce_dim = Eigen::array<int, 1>({{0}});
-    LogsumexpFunctor()(place, &input, &output, reduce_dim);
+    LogsumexpFunctor<T>()(place, &input, &output, reduce_dim);
   } else {

Contributor: For unsupported ranks, please add an error message here as well.


Review comment (on the unit test):

    def set_attrs(self):
        self.dtype = 'float16'

@Xreki commented on Oct 10, 2022:

> It looks like casting to double in the functor still leaves accuracy issues in the fp32 and fp16 unit tests. Does the unit test need to be handled separately as well?

In this unit test you can override the test_check_output and test_check_grad functions and specify larger atol and max_relative_error thresholds.
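A possible shape for that test class, combining the earlier skipIf suggestion with the overridden checks; the exact thresholds here are placeholders to be tuned, and TestLogsumexp / OpTest come from the existing test_logsumexp.py:

```python
import unittest

import paddle.fluid.core as core


@unittest.skipIf(not core.is_compiled_with_cuda(),
                 "core is not compiled with CUDA")
class TestLogsumexp_FP16(TestLogsumexp):
    def set_attrs(self):
        self.dtype = 'float16'

    def test_check_output(self):
        # float16 has ~3 significant digits, so loosen the absolute tolerance.
        self.check_output(atol=1e-3)

    def test_check_grad(self):
        # Allow a larger relative error for the fp16 gradient check.
        self.check_grad(['X'], ['Out'], max_relative_error=1e-2)
```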

@xiaohemaikoo (Contributor, Author)

@Xreki The unit tests have been added. Please review the code when you have time; I will add the performance data today or tomorrow.

@xiaohemaikoo (Contributor, Author)

@Xreki Hi, the performance data has been updated. fp16 performance gradually improves as the data size grows. Because the exp and log operations in logsumexp need to be cast to float, fp16 and fp32 performance are in the same order of magnitude overall: for small sizes fp32 is slightly faster since it has no cast overhead, and as the data size grows fp16 becomes faster. The largest size I can test locally is [100, 1000, 250, 50], where the fp32:fp16 ratio is 1.284. I tried a few other ways of computing the logsumexp op and performance was almost unchanged from the current version. CI has passed; please review when you have time.

Review comment (on the test's gradient helper):

    x_grad = tensor_x.gradient()
    fluid.set_flags({"FLAGS_retain_grad_for_all_tensor": False})
    paddle.enable_static()
    return x_grad

Contributor: The fluid APIs are no longer recommended. Suggest following the unit test below:

  • to_variable -> paddle.to_tensor
  • computing gradients with backward() -> paddle.grad

class TestFP16ScaleBiasLayerNorm(unittest.TestCase):
    def check_main(self, x_np, weight_np, bias_np, dtype):
        paddle.disable_static()
        weight_np = weight_np.astype(dtype)
        bias_np = bias_np.astype(dtype)
        x = paddle.to_tensor(x_np)
        weight = paddle.to_tensor(weight_np)
        bias = paddle.to_tensor(bias_np)
        x.stop_gradient = False
        weight.stop_gradient = False
        bias.stop_gradient = False
        y = F.layer_norm(x, x.shape[1:], weight, bias)
        x_g, w_g, b_g = paddle.grad(y, [x, weight, bias])
        y_np = y.numpy().astype('float32')
        x_g_np = x_g.numpy().astype('float32')
        w_g_np = w_g.numpy().astype('float16')
        b_g_np = b_g.numpy().astype('float32')
        paddle.enable_static()
        return y_np, x_g_np, w_g_np, b_g_np

    def test_main(self):
        if not paddle.is_compiled_with_cuda():
            return
        x_np = np.random.random([10, 20]).astype('float16')
        weight_np = np.random.random([20]).astype('float16')
        bias_np = np.random.random([20]).astype('float16')

        y_np_1, x_g_np_1, w_g_np_1, b_g_np_1 = self.check_main(
            x_np, weight_np, bias_np, 'float16')
        y_np_2, x_g_np_2, w_g_np_2, b_g_np_2 = self.check_main(
            x_np, weight_np, bias_np, 'float32')

        def assert_equal(x, y):
            np.testing.assert_array_equal(x, y)

        assert_equal(y_np_1, y_np_2)
        assert_equal(x_g_np_1, x_g_np_2)
        assert_equal(w_g_np_1, w_g_np_2)
        assert_equal(b_g_np_1, b_g_np_2)


def logsumexp_ref_grad(x):
    sum = np.exp(x).sum()
    return np.exp(x) / sum
Contributor: If the input is fp16, the whole computation here also runs in fp16 and loses precision, so this reference value is probably not accurate enough. Suggest doing the computation here in fp32 as well.
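A float32 version of that reference, as a minimal sketch (same formula as above; only the compute dtype of the intermediate exp/sum changes, and the helper name is illustrative):

```python
def logsumexp_ref_grad_fp32(x):
    # Compute the reference in float32 even when x is float16, so the
    # reference itself is not the accuracy bottleneck of the comparison.
    x32 = x.astype(np.float32)
    e = np.exp(x32)
    return e / e.sum()
```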

Review comment (on the gradient check):

        self.__class__.dtype = self.dtype
        x_grad = logsumexp_op_grad(self.inputs['X'])
        ref_x_grad = logsumexp_ref_grad(self.inputs['X'])
        np.testing.assert_allclose(x_grad, ref_x_grad, rtol=1e-05, atol=1e-04)

Contributor: Is there an issue with the atol setting here? The allclose rule is: absolute(a - b) <= (atol + rtol * absolute(b)). The relative error of fp16 is 1e-3, so the criterion should be rtol=1e-3, with atol kept as close to 0 as possible.
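Under that rule, the fp16 check would look roughly like this (tolerances taken from the comment above):

```python
np.testing.assert_allclose(x_grad, ref_x_grad, rtol=1e-3, atol=0)
```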

@xiaohemaikoo (Contributor, Author)

@zhangting2020 Thanks, the review comments above have been addressed; please check again.

@zhangting2020 (Contributor) left a review:

LGTM

@zhangting2020 merged commit 910e1b6 into PaddlePaddle:develop on Oct 13, 2022