【Hackathon No.25】Add nanquantile to Paddle #55

Asthestarsfalll · 2022-03-20T10:19:21Z

Add nanquantile to Paddle.

paddle-bot-old · 2022-03-20T10:19:29Z

Your PR has been submitted. Thanks for your contribution!
Please check its format and content. For this, you can refer to Template and Demo.

luotao1 · 2022-03-21T03:38:51Z

rfcs/APIs/20220420_api_design_for_nanquantile.md

+
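+## 1. Background
+
+To enrich the Paddle API set and support scientific-computing APIs, Paddle needs to add the APIs `paddle.nanquantile` and `paddle.Tensor.nanquantile`.paddle.nanquantile is a variant of [paddle.quantile](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api/paddle/quantile_cn.html) that computes the quantiles of the non-NaN elements along the given axis.


In `以及paddle.Tensor.nanquantile.` ("and paddle.Tensor.nanquantile."), the English period should be replaced with a Chinese full stop. English punctuation appears in many places throughout the document and should all be changed to Chinese punctuation.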

luotao1 · 2022-03-21T03:40:29Z

rfcs/APIs/20220420_api_design_for_nanquantile.md

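+2. Use `paddle.sort` to obtain the sorted tensor.
+3. Map `q`:[0, 1] to `indice`:[0, numel_of_dim-1], then apply `paddle.floor` and `paddle.ceil` to `indice` to find the positions of the two neighbouring elements needed for the computation;
+4. Use `paddle.lerp` to compute the weighted interpolation of the two neighbouring elements as the final result;
+5. Adjust the result to the corresponding shape according to the `keepdim` parameter.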
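
The quoted steps 2 to 5 can be illustrated with a minimal NumPy sketch. NumPy stands in for the Paddle calls here, and `quantile_sketch` is a hypothetical name rather than the Paddle implementation. It deliberately contains no NaN handling, which is the gap raised in this review thread.

```python
import numpy as np


def quantile_sketch(x, q, axis=-1, keepdim=False):
    """Linear-interpolation quantile for a scalar q in [0, 1]; no NaN handling."""
    x = np.asarray(x, dtype=np.float64)
    sorted_x = np.sort(x, axis=axis)            # step 2: sort along the axis
    n = x.shape[axis]
    idx = q * (n - 1)                           # step 3: map q in [0, 1] to [0, n - 1]
    lo = int(np.floor(idx))                     # position of the lower neighbour
    hi = int(np.ceil(idx))                      # position of the upper neighbour
    below = np.take(sorted_x, lo, axis=axis)
    above = np.take(sorted_x, hi, axis=axis)
    out = below + (idx - lo) * (above - below)  # step 4: lerp between the neighbours
    if keepdim:                                 # step 5: restore the reduced axis
        out = np.expand_dims(out, axis)
    return out
```

For example, `quantile_sketch([1.0, 4.0, 2.0, 3.0], 0.5)` interpolates halfway between the two middle elements of the sorted input.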


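The referenced implementation logic is missing the handling of NaN.

NaN handling: check the original tensor for NaN values with paddle.isnan; where NaN is present, set the element at the corresponding position in step 4 to NaN.

You could cite the design document of paddle.quantile.

Thanks for the suggestions; I will revise and update shortly!
This section uses the implementation logic from the paddle.quantile design document, but when I checked the code I found no NaN handling. Testing shows that paddle.quantile does not align well with Numpy and Pytorch when the input contains NaN: when the specified dimension contains NaN, Numpy and Pytorch return NaN, whereas paddle.quantile treats NaN as a placeholder and still returns a number. Moreover, because of an issue in paddle.sort, that number is also incorrect.
To guarantee correct results in the current situation, my approach would bring its computation logic much closer to nanquantile and improve code reuse. If process and time permit, I can try to fix it as part of this work.

> If process and time permit, I can try to fix it as part of this work.

We will discuss the handling of paddle.quantile and paddle.sort internally and get back to you shortly.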

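Let me briefly describe my approach.

Both quantile and nanquantile use isnan to find NaN values and replace them with Inf.

For quantile, sum, any and logical_not are used to set positions that contain NaN to 0, so that when indices is computed that position becomes -1. Inf is sorted to the end, so the later lookup picks up Inf, which is replaced with NaN on output.

nanquantile works the same way: as described in the design document, positions that are entirely NaN are set to 0, indices becomes -1, and the lookup proceeds as above.
Another point is that in the original quantile, indices is a single number that is later broadcast with expand_dim into a matrix of the matching shape, which assumes that indices is the same at every position.
In nanquantile, the sum that precedes the indices computation is called with keepdim=True, so it is already a correctly shaped matrix. The modified quantile also needs an indices matrix, because indices differs from position to position.
With that, the only remaining difference between the two is how indices is computed.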
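
The approach described above can be sketched in NumPy as follows. This is an illustration under assumptions, not the Paddle code: `nanquantile_sketch` is a hypothetical name, NumPy functions stand in for their Paddle counterparts, and the clamping of negative indices plus the final `np.where` are my own simplifications in place of relying on a -1 lookup.

```python
import numpy as np


def nanquantile_sketch(x, q, axis=-1):
    """Quantile of the non-NaN elements along `axis` for a scalar q in [0, 1]."""
    x = np.asarray(x, dtype=np.float64)
    mask = np.isnan(x)
    # Number of valid (non-NaN) elements per slice; keepdims gives an index matrix.
    valid = np.sum(~mask, axis=axis, keepdims=True)
    # Replace NaN with +Inf so that sorting pushes the placeholders to the end.
    sorted_x = np.sort(np.where(mask, np.inf, x), axis=axis)
    # Per-slice fractional index; it is negative only for all-NaN slices.
    idx = q * (valid - 1)
    lo = np.floor(idx).astype(np.int64)
    hi = np.ceil(idx).astype(np.int64)
    # Assumption: clamp negative indices to 0 instead of looking up position -1.
    below = np.take_along_axis(sorted_x, np.maximum(lo, 0), axis=axis)
    above = np.take_along_axis(sorted_x, np.maximum(hi, 0), axis=axis)
    with np.errstate(invalid="ignore"):  # Inf - Inf appears in all-NaN slices
        out = below + (idx - lo) * (above - below)
    # Slices with no valid element yield NaN on output.
    out = np.where(valid == 0, np.nan, out)
    return np.squeeze(out, axis=axis)
```

Because `valid` differs per slice, `idx` is a matrix rather than a single number, which is the point made above about quantile and nanquantile differing only in how indices is computed.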

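@Asthestarsfalll I discussed this with @zoooo0820 and @wawltor:

Modifying quantile and nanquantile along the lines described above is fine. Please update the RFC document and modify the quantile code accordingly.

paddle.sort is harder to change: the 1-D case is implemented by calling the thrust library, so any change must also preserve performance. We have filed an issue with the thrust library and are following up on it: sort_by_key returned bad result when the tensor had NaN value NVIDIA/thrust#1637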

luotao1 · 2022-03-21T03:51:56Z

rfcs/APIs/20220420_api_design_for_nanquantile.md

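+# 4. Comparative analysis
+
+- Usage scenarios and functionality: in terms of dimension support, Pytorch supports only one dimension while Numpy supports multiple dimensions. Here we align with Numpy's implementation logic and support both single-dimension and multi-dimension scenarios.
+- Implementation comparison: `pytorch.gather` and `paddle.gather` behave differently when the rank is greater than 1. With multiple `q` values, pytorch can index across dimensions directly with the processed `indice`, whereas paddle has to index separately and then combine the results. Therefore `paddle.gather` is no longer used for indexing; the `paddle.take_along_axis` API is used instead.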


Line 129 is the comparison that the paddle.quantile document derived from analysing the following code in the torch.quantile implementation. This design document does not cite that code, so line 129 can be removed.

    Tensor ranks_below = ranks.toType(kLong);
    Tensor values_below = sorted.gather(-1, ranks_below);
    // Actual interpolation is only needed for the liner and midpoint modes
    if (interpolation == QUANTILE_INTERPOLATION_MODE::LINEAR ||
        interpolation == QUANTILE_INTERPOLATION_MODE::MIDPOINT) {
      // calculate weights for linear and midpoint
      Tensor weights = interpolation == QUANTILE_INTERPOLATION_MODE::MIDPOINT
          ? at::full_like(ranks, 0.5)
          : ranks - ranks_below;
      // Interpolate to compute quantiles and store in values_below
      Tensor ranks_above = ranks.ceil_().toType(kLong);
      Tensor values_above = sorted.gather(-1, ranks_above);
      values_below.lerp_(values_above, weights);

The comparison could focus more on whether to maximise reuse of the existing code logic. For example, torch.quantile and torch.nanquantile share code, and torch.nanquantile only adds the handling of NaN.

luotao1 · 2022-03-21T03:52:19Z

rfcs/APIs/20220420_api_design_for_nanquantile.md

+
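+The API is designed as `paddle.unquantile(x, q, axis=None, keepdim=False, name=None)` and `paddle.Tensor.unquantile(q, axis=None, keepdim=False, name=None)`
+Naming and parameter order: the parameter names change as `input`->`x` and `dim`->`axis`, for consistency with other paddle APIs; this does not affect actual use.
+Regarding parameter types, `axis` supports `int` and `1-D Tensor` input, so that both single-dimension and multi-dimension scenarios are supported.


In `输入,` ("input,"), the English comma should be changed to a Chinese one; there are several more instances throughout the document.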

luotao1 · 2022-03-21T03:55:02Z

rfcs/APIs/20220420_api_design_for_nanquantile.md

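+5. Decide, according to the `keepdim` parameter, whether the shape of the result needs to be adjusted;
+5. Replace `Inf` in the result with `NaN` again and output it.
+
+- If a later version of `paddle.sort` supports sorting NaN to the end, the two conversions between `NaN` and `Inf` can be removed.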


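Based on the comparative analysis above, state clearly whether the code logic should be shared with quantile as much as possible.

If so, distinguish which parts are existing quantile logic and which are new, so that it is clearer. In other words, spell out which logic exists specifically to handle NaN.

The Python-side code of quantile can be modified to be more modular, which makes it easier for nanquantile to reuse.

Thanks for discovering and reporting that paddle.sort cannot handle NaN!

> the two conversions between NaN and Inf can be removed.

Suggested wording: the two conversions between NaN and Inf in step 1 and step 10 can be removed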

luotao1 · 2022-03-21T03:55:34Z

rfcs/APIs/20220420_api_design_for_nanquantile.md

+
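+## API implementation plan
+
+The API is implemented mainly by composing the following steps. The implementation lives in `paddle/tensor/stat.py`, alongside methods such as `mean` and `median`:


Placing it together with quantile would be more reasonable.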

luotao1 · 2022-03-21T04:01:34Z

rfcs/APIs/20220420_api_design_for_nanquantile.md

+
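+# 8. Impact
+
+This is an independent new API and has no impact on other modules.


The Python-side code of quantile can be modified to be more modular, which makes it easier for nanquantile to reuse.

That could have an impact on quantile.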

paddle-bot-old · 2022-03-22T02:05:57Z

There’s the latest feedback about your PR. Please check.

Asthestarsfalll · 2022-03-22T13:29:24Z

@luotao1
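After modifying the code, I found during verification that take_along_axis returns 0.0 for positions where indices is -1, rather than the expected last element. For nanquantile, paddle.lerp outputs Inf/NaN whenever either of its inputs is Inf/NaN, so the nanquantile result is still correct.
The computation logic of quantile therefore needs to change. There are currently two options:

1. Replace NaN with -Inf so that it is sorted to the front, and use any and logical_not to set the index to 0. The lookup then picks up the NaN at position 0, which guarantees a correct result. However, if a later version of paddle.sort supports sorting NaN, it will be inconvenient to remove the conversion between NaN and Inf;
2. Keep replacing NaN with Inf, and when computing indices set the corresponding positions to the index of the last element along the specified axis. This is relatively cumbersome. It would be simpler with paddle.index_fill, but paddle.index_fill is itself a task in this hackathon and has not been implemented yet.

I implemented the first option; verification shows that it aligns with Numpy's results in all cases.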
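
A minimal NumPy sketch of the first option, for illustration only. `quantile_nan_propagating` is a hypothetical name and NumPy stands in for the Paddle calls. One deliberate deviation, flagged as an assumption: the final NaN is written from the `has_nan` mask rather than by converting -Inf back, so that a genuine -Inf in the input is not mistaken for a placeholder.

```python
import numpy as np


def quantile_nan_propagating(x, q, axis=-1):
    """Quantile for a scalar q in [0, 1] that returns NaN for slices containing NaN."""
    x = np.asarray(x, dtype=np.float64)
    mask = np.isnan(x)
    has_nan = mask.any(axis=axis, keepdims=True)
    # Replace NaN with -Inf so that the placeholders are sorted to the front.
    sorted_x = np.sort(np.where(mask, -np.inf, x), axis=axis)
    n = x.shape[axis]
    # Slices containing NaN get index 0, which points at the placeholder.
    idx = np.where(has_nan, 0.0, q * (n - 1))
    lo = np.floor(idx).astype(np.int64)
    hi = np.ceil(idx).astype(np.int64)
    below = np.take_along_axis(sorted_x, lo, axis=axis)
    above = np.take_along_axis(sorted_x, hi, axis=axis)
    with np.errstate(invalid="ignore"):  # -Inf - (-Inf) appears in NaN slices
        out = below + (idx - lo) * (above - below)
    # Assumption: mark NaN slices from the mask instead of converting -Inf back.
    out = np.where(has_nan, np.nan, out)
    return np.squeeze(out, axis=axis)
```

Because every index is non-negative here, the sketch never depends on how a -1 index is resolved, which is the behaviour that caused the problem reported above.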

luotao1 · 2022-03-24T03:48:31Z

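You can write the considerations behind each option into the RFC document, and then implement the first option, "replace NaN with -Inf so that it is sorted to the front".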

Asthestarsfalll · 2022-03-25T11:53:57Z

Revised. In the end I chose option 2, so the conversion between NaN and Inf can simply be removed later on.

luotao1 · 2022-03-28T08:41:57Z

rfcs/APIs/20220420_api_design_for_nanquantile.md

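+# 4. Comparative analysis
+
+- Usage scenarios and functionality: in terms of dimension support, Pytorch supports only one dimension while Numpy supports multiple dimensions. Here we align with Numpy's implementation logic and support both single-dimension and multi-dimension scenarios.
+- Code reuse: Pytorch and Numpy both target the input


Is line 129 unfinished?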

luotao1 · 2022-03-28T08:43:37Z

rfcs/APIs/20220420_api_design_for_nanquantile.md

+
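+1. Apply `paddle.logical_not` to `mask` to invert it, then sum along the specified dimension to get the number of valid values at each position; this is a matrix;
+
+2. Apply `paddle.sort` to the replaced tensor, because `sort` cannot correctly sort an input that contains `NaN`;


> because `sort` cannot correctly sort an input that contains NaN;

This sentence should go after line 150, where it explains why NaN is replaced with Inf. It does not read well after line 154.

"the replaced tensor" -> "the tensor replaced in step 1", because what is used here is not the output of step 2.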

luotao1 · 2022-03-28T08:56:24Z

rfcs/APIs/20220420_api_design_for_nanquantile.md

+
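+2. Apply `paddle.sort` to the replaced tensor, because `sort` cannot correctly sort an input that contains `NaN`;
+
+2. Use the **valid-count matrix** above -1 multiplied by `q`:[0, 1] to obtain `indices`:[-1, dim_wo_nan-1];


"the valid-count matrix" -> "the valid-count matrix from step 2"

What does the -1 refer to?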

luotao1 · 2022-03-28T08:57:56Z

rfcs/APIs/20220420_api_design_for_nanquantile.md

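+5. Decide, according to the `keepdim` parameter, whether the shape of the result needs to be adjusted;
+5. Replace `Inf` in the result with `NaN` again and output it.
+
+- If a later version of `paddle.sort` supports sorting NaN to the end, the two conversions between `NaN` and `Inf` can be removed.


> the two conversions between NaN and Inf can be removed.

Suggested wording: the two conversions between NaN and Inf in step 1 and step 10 can be removed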

Asthestarsfalll · 2022-03-28T09:18:14Z

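Everything has been revised.

luotao1

LGTM. The issue of paddle.sort not handling NaN is already being fixed in PaddlePaddle/Paddle#41070. Once that PR is merged (expected within this week), the implementation plan can be adjusted.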

luotao1 · 2022-03-30T01:55:49Z

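@Asthestarsfalll PaddlePaddle/Paddle#41070 has fixed the issue of paddle.sort not supporting NaN on CPU. You can pull develop to verify it and then update the design document. Feel free to contact us if you have any questions.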

Asthestarsfalll · 2022-03-30T03:35:26Z

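> @Asthestarsfalll PaddlePaddle/Paddle#41070 has fixed the issue of paddle.sort not supporting NaN on CPU. You can pull develop to verify it and then update the design document. Feel free to contact us if you have any questions.

Thanks for such a prompt fix! I will verify it and make the changes later.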

luotao1 · 2022-03-31T02:25:52Z

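> I will verify it and make the changes later

I have seen the revised RFC document. If the verification succeeds, please let us know in the comments.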

Asthestarsfalll · 2022-03-31T10:51:41Z

> If the verification succeeds, please let us know in the comments

Sorry, I only just saw this. The verification succeeded!

luotao1

LGTM

add rfc of nanquantile api

7b2198c

paddle-bot-old bot added contributor status: proposed labels Mar 20, 2022

Asthestarsfalll mentioned this pull request Mar 20, 2022

【PaddlePaddle Hackathon Phase 2】Task overview PaddlePaddle/Paddle#40234

Closed

dingjiaweiww assigned DDDivano and luotao1 Mar 21, 2022

luotao1 reviewed Mar 21, 2022

dingjiaweiww added status: revision and removed status: proposed labels Mar 22, 2022

Asthestarsfalll added 2 commits March 25, 2022 19:51

Update rfc of nanquantile api

016ad35

Merge branch 'master' of https://github.com/PaddlePaddle/community in…

83ad3a9

…to nanquantile

luotao1 reviewed Mar 28, 2022

Update rfc of nanquantile api

529ddc6

luotao1 previously approved these changes Mar 29, 2022

Asthestarsfalll added 2 commits March 30, 2022 21:37

update rfc of nanquantile

1569774

Merge branch 'master' of https://github.com/PaddlePaddle/community in…

2ad9d74

…to nanquantile

Asthestarsfalll dismissed luotao1’s stale review via 2ad9d74 March 30, 2022 13:57

luotao1 approved these changes Apr 1, 2022

luotao1 merged commit 6926edd into PaddlePaddle:master Apr 1, 2022

Asthestarsfalll mentioned this pull request Apr 2, 2022

【Hackathon No.25】Add the nanquantile math API to Paddle PaddlePaddle/Paddle#41343

Merged
