From 49dffed6da62fd64f7d4f685b573fac8b6c87a0c Mon Sep 17 00:00:00 2001
From: PommesPeter <434596665@qq.com>
Date: Sat, 9 Jul 2022 17:37:58 +0800
Subject: [PATCH 1/7] added bucketize rfc docs

---
 .../APIs/20220709_api_design_for_bucketize.md | 158 ++++++++++++++++++
 1 file changed, 158 insertions(+)
 create mode 100644 rfcs/APIs/20220709_api_design_for_bucketize.md
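
Note: the design section of this document describes `paddle.bucketize` as a one-dimensional check on `sorted_sequence` followed by a call to `searchsorted`. The intended semantics can be sketched in plain Python as below. This is only an illustration under assumptions: `bucketize_reference` is a hypothetical helper, it works on Python lists through the standard-library `bisect` module instead of Paddle tensors, and it is not the proposed implementation.

```python
from bisect import bisect_left, bisect_right


def bucketize_reference(x, sorted_sequence, right=False):
    """Return, for each value in x, the index of its bucket in sorted_sequence.

    Hypothetical pure-Python reference, not the Paddle implementation.
    """
    # Mirror the 1-D constraint from the design: nested sequences are rejected.
    if any(isinstance(item, (list, tuple)) for item in sorted_sequence):
        raise ValueError("sorted_sequence must be one-dimensional")
    # right=False -> lower bound: sorted_sequence[i-1] <  v <= sorted_sequence[i]
    # right=True  -> upper bound: sorted_sequence[i-1] <= v <  sorted_sequence[i]
    search = bisect_right if right else bisect_left
    return [search(sorted_sequence, value) for value in x]


boundaries = [1, 3, 5, 7, 9]
print(bucketize_reference([3, 6, 9], boundaries))              # [1, 3, 4]
print(bucketize_reference([3, 6, 9], boundaries, right=True))  # [2, 3, 5]
```

A value equal to a boundary is the only case where the two modes differ, which is why the example includes 3 and 9.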

diff --git a/rfcs/APIs/20220709_api_design_for_bucketize.md b/rfcs/APIs/20220709_api_design_for_bucketize.md
new file mode 100644
index 000000000..0afcb6c11
--- /dev/null
+++ b/rfcs/APIs/20220709_api_design_for_bucketize.md
@@ -0,0 +1,158 @@
+# paddle.bucketize Design Document
+
+| API name                | paddle.bucketize                         |
+| ----------------------- | ---------------------------------------- |
+| Author                  | PommesPeter                              |
+| Submission date         | 2022-07-09                               |
+| Version                 | V1.0                                     |
+| Required Paddle version | develop                                  |
+| File name               | 20220709_api_design_for_bucketize.md     |
+
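+# 1. Overview
+
+## 1.1 Background
+
+To broaden Paddle's API coverage and support scientific computing, Paddle needs a new API, `paddle.bucketize`.
+
+## 1.2 Functional Goals
+
+Add the API `paddle.bucketize`, which returns, for each element of `x`, the index of the interval of `sorted_sequence` that the element falls into.
+
+## 1.3 Significance
+
+It adds a bucketing function to Paddle and enriches the scientific computing APIs in `paddle`.
+
+# 2. Current State of Paddle
+
+- Paddle currently has no `bucketize` API, although it does have `searchsorted`. Unlike other frameworks, Paddle offers no API dedicated to a one-dimensional `sorted_sequence`, so calling `searchsorted` directly spends extra time on dimension checks.
+- The implementation and tests of this API mainly follow the existing `paddle.searchsorted`.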
+
+# 3. Survey of Existing Solutions
+
+## PyTorch
+
+PyTorch provides `torch.bucketize`, with the signature `torch.bucketize(input, boundaries, *, out_int32=False, right=False, out=None) → Tensor`.
+
+The PyTorch documentation describes it as:
+
+> Returns the indices of the buckets to which each value in the `input` belongs, where the boundaries of the buckets are set by `boundaries`. Return a new tensor with the same size as `input`. If `right` is False (default), then the left boundary is closed. More formally, the returned index satisfies the following rules:
+>
+> | `right` | *returned index satisfies*                                |
+> | ------- | --------------------------------------------------------- |
+> | False   | `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]` |
+> | True    | `boundaries[i-1] <= input[m][n]...[l][x] < boundaries[i]` |
+
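+PyTorch implements it by composing C++ APIs: [source](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Bucketization.cpp)
+
+Parameters:
+
+- input (Tensor or Scalar): an N-D Tensor or a Scalar containing the search values.
+
+- boundaries (Tensor): a 1-D Tensor that must contain a monotonically increasing sequence.
+
+- out_int32 (bool, optional): the output data type. If True, the output is torch.int32; if False, torch.int64. Defaults to False.
+
+- right (bool, optional): if False, return the first suitable location found; if True, return the last such index. If no suitable index is found, return 0 for non-numerical values (e.g. NaN, Inf) or the size of `boundaries` (one past the last index).
+
+  In other words, if False, get the lower-bound index in `boundaries` for each value in `input`; if True, get the upper-bound index instead. Defaults to False.
+
+- out (Tensor, optional): the output Tensor; if provided, it must have the same size as `input`.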
+
+## TensorFlow
+
+TensorFlow Transform provides `tft.bucketize`, with the signature `tft.bucketize( x: common_types.ConsistentTensorType, num_buckets: int, epsilon: Optional[float] = None, weights: Optional[tf.Tensor] = None, elementwise: bool = False, name: Optional[str] = None) -> common_types.ConsistentTensorType`
+
+TensorFlow Transform implements it by composing Python APIs: [source](https://github.com/tensorflow/transform/blob/d0c3349403120a2cf1177c111b674c07e9b38398/tensorflow_transform/mappers.py#L1690-L1770)
+
+Parameters:
+
+| Args          |                                                              |
+| :------------ | ------------------------------------------------------------ |
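+| `x`           | A numeric input `Tensor` or `CompositeTensor` whose values should be mapped to buckets. For a `CompositeTensor`, only non-missing values are included in the quantiles computation, and the result of `bucketize` is a `CompositeTensor` with non-missing values mapped to buckets. If `elementwise=True`, `x` must be dense. |
+| `num_buckets` | Values in the input `x` are divided into approximately equal-sized buckets, where the number of buckets is `num_buckets`. |
+| `epsilon`     | (Optional) Error tolerance, typically a small fraction close to zero. If the caller does not specify a value, a suitable one is computed based on experimental results. For `num_buckets` less than 100, the value 0.01 is chosen to handle a dataset of up to about 1 trillion input values. If `num_buckets` is larger, epsilon is set to (1 / `num_buckets`) to enforce a stricter error tolerance, because more buckets give each bucket a smaller range, so the boundaries should be less fuzzy. See analyzers.quantiles() for details. |
+| `weights`     | (Optional) Weights tensor for the quantiles. Must have the same shape as `x`. |
+| `elementwise` | (Optional) If true, bucketize each element of the tensor independently. |
+| `name`        | (Optional) A name for this operation. |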
+
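+# 4. Comparison
+
+## Similarities
+
+- Both map each element of the input `x` to the index of the bucket (interval) it falls into.
+
+## Differences
+
+- PyTorch implements the operator in C++ and exposes the corresponding interface to Python.
+- PyTorch takes few parameters and offers few options.
+- TensorFlow Transform implements the function directly with Python APIs.
+- TensorFlow Transform exposes `num_buckets`, `epsilon`, `weights` and similar parameters, so it is more configurable; it derives the bucket boundaries from quantiles of `x` instead of taking them as an argument.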
+
+
+# 5. Design and Implementation
+
+## Naming and Parameter Design
+
+Add the API:
+
+```python
+paddle.bucketize(
+    x: Tensor,
+    sorted_sequence: Tensor,
+    out_int32: bool=False,
+    right: bool=False,
+    name: str=None
+)
+```
+
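+## Low-level OP Design
+
+The API is implemented by composing existing APIs; no dedicated OP is designed.
+
+## API Implementation
+
+The API is implemented in `python/paddle/tensor/search.py`.
+
+First, `bucketize` targets a one-dimensional `sorted_sequence`, so the number of dimensions of the input is checked with an assertion; an `AssertionError` is raised when `sorted_sequence` is not 1-D.
+
+Then, because the logic of `searchsorted` already exists in the `searchsorted` function in `python/paddle/tensor/search.py`, `bucketize` only needs to call that function.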
+
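+# 6. Testing and Acceptance
+
+The test cases to cover are:
+
+- Consistency of the numerical results, using numpy as the reference
+- Correct output when `right` is True and when it is False
+- Correct output dtype when `out_int32` is True and when it is False;
+- Correct output when `right` is not passed;
+- Correct output when `out_int32` is not passed;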
+
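+# 7. Feasibility and Schedule
+
+The design is composed mainly from existing Paddle APIs, and the `paddle.searchsorted` it depends on is already implemented in the Paddle repo at [python/paddle/tensor/search.py](https://github.com/PaddlePaddle/Paddle/blob/release/2.3/python/paddle/tensor/search.py#L910). The work can be completed within the current release cycle.
+
+# 8. Impact
+
+This is a new API built on existing ones; it is not expected to affect other modules.
+
+# Glossary
+
+None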
+
+# Attachments and References
+
+## PyTorch
+
+[torch.bucketize](https://pytorch.org/docs/stable/generated/torch.bucketize.html)
+
+[torch.searchsorted](https://pytorch.org/docs/stable/generated/torch.searchsorted.html?highlight=searchsorted#torch.searchsorted)
+
+## TensorFlow
+
+[tft.bucketize](https://www.tensorflow.org/tfx/transform/api_docs/python/tft/bucketize)
+
+[tf.searchsorted](https://www.tensorflow.org/api_docs/python/tf/searchsorted)
+
+## Paddle
+
+[paddle.searchsorted](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/searchsorted_cn.html)
\ No newline at end of file

From b604beb9fd5066226e1eaa6fdfd9ff790176bf12 Mon Sep 17 00:00:00 2001
From: PommesPeter <434596665@qq.com>
Date: Sun, 10 Jul 2022 16:52:54 +0800
Subject: [PATCH 2/7] updated bucketize rfc docs

---
 .../APIs/20220709_api_design_for_bucketize.md | 312 ++++++++++++++++++
 1 file changed, 312 insertions(+)
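
Note: the PyTorch source quoted in this patch uses hand-written lower-bound and upper-bound binary searches so that NaN inputs land at the end of the boundaries. The idea can be sketched in Python as below. This is a simplified illustration under assumptions: `lower_bound` and `upper_bound` are hypothetical names, they work on a flat Python list, and they leave out the `sorter` offset handling, dtype dispatch and parallel loop of the original.

```python
def lower_bound(boundaries, value):
    """First index i with boundaries[i] >= value (used when right=False)."""
    start, end = 0, len(boundaries)
    while start < end:
        mid = start + ((end - start) >> 1)
        # Written as "not (a >= b)" instead of "a < b": every comparison with
        # NaN is False, so a NaN value always moves right and ends at len().
        if not (boundaries[mid] >= value):
            start = mid + 1
        else:
            end = mid
    return start


def upper_bound(boundaries, value):
    """First index i with boundaries[i] > value (used when right=True)."""
    start, end = 0, len(boundaries)
    while start < end:
        mid = start + ((end - start) >> 1)
        if not (boundaries[mid] > value):
            start = mid + 1
        else:
            end = mid
    return start


bd = [1.0, 3.0, 5.0, 7.0, 9.0]
print(lower_bound(bd, 3.0), upper_bound(bd, 3.0))  # 1 2
print(lower_bound(bd, float("nan")))               # 5
```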

diff --git a/rfcs/APIs/20220709_api_design_for_bucketize.md b/rfcs/APIs/20220709_api_design_for_bucketize.md
index 0afcb6c11..f84ecca1f 100644
--- a/rfcs/APIs/20220709_api_design_for_bucketize.md
+++ b/rfcs/APIs/20220709_api_design_for_bucketize.md
@@ -44,6 +44,236 @@ PyTorch provides `torch.bucketize`, with the signature `torch.bucketize(input
 
 PyTorch implements it by composing C++ APIs: [source](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Bucketization.cpp)
 
+Implementation:
+```cpp
+#include <ATen/Dispatch.h>
+#include <ATen/Functions.h>
+#include <ATen/Parallel.h>
+#include <ATen/native/BucketizationUtils.h>
+#include <ATen/native/Resize.h>
+#include <c10/util/irange.h>
+
+/* Implement a numpy like searchsorted and a TF like bucketize function running on cpu
+ *
+ * - torch.searchsorted(sorted_sequence, values, right=False, side='left', out_int32=False, sorter=None)
+ *   sorted_sequence - N*D or 1D (apply to all values) tensor containing sorted sequences in last dimension
+ *   values          - N*D tensor or a Scalar (when sorted_sequence is 1D) containing the search values
+ *   right           - corresponding to lower bound if False and upper bound if True
+ *   side            - (preferred to right) corresponding to lower bound if 'left' and upper bound if 'right'
+ *   out_int32       - the output tensor is int64_t type if False and int(32bit normally) type if True.
+ *   sorter          - if provided, sorted_sequence may not be sorted and the sorted order is given by this tensor
+ *
+ * - torch.bucketize(values, boundaries, right=False, out_int32=False)
+ *   values     - N*D tensor or a Scalar containing the search value
+ *   boundaries - 1D tensor containing a sorted sequences
+ *   right      - corresponding to lower bound if False and upper bound if True
+ *   out_int32  - the output tensor is int64_t type if False and int(32bit normally) type if True.
+ *
+ * - Restrictions are defined in searchsorted_pre_check()
+ */
+
+namespace at {
+namespace native {
+
+namespace {
+
+// minimal size for searchsorted_cpu_contiguous to run parallel (multithread)
+constexpr int64_t SEARCHSORTED_GRAIN_SIZE = 200;
+
+// customized lower_bound func to ensure the low bound of 'nan', 'inf' etc. be the end of boundary
+// and we can properly handle a sorter argument
+// std::lower_bound can not be used here since its customized comparator need strict weak ordering
+// and the customized comparators require both arguments to have the same type, which wouldn't
+// happen when comparing val of input_t to an indexer value from sorter of int64
+template<typename input_t>
+int64_t cus_lower_bound(int64_t start, int64_t end, const input_t val, const input_t* bd, const int64_t* sort) {
+  // sorter gives relative ordering for ND tensors, so we need to save and add the non-updated start as an offset
+  // i.e. the second row of a 3x3 tensors starts at element 3 but sorter's second row only contains 0, 1, or 2
+  const int64_t orig_start = start;
+  while (start < end) {
+    const int64_t mid = start + ((end - start) >> 1);
+    const input_t mid_val = sort ? bd[sort[mid] + orig_start] : bd[mid];
+    if (!(mid_val >= val)) {
+      start = mid + 1;
+    }
+    else {
+      end = mid;
+    }
+  }
+  return start;
+}
+
+// customized upper_bound func to ensure we can properly handle a sorter argument
+// std::upper_bound can not be used here since its customized comparator requires both arguments to have the
+// same type, which wouldn't happen when comparing val of input_t to an indexer value from sorter of int64
+template<typename input_t>
+int64_t cus_upper_bound(int64_t start, int64_t end, const input_t val, const input_t* bd, const int64_t* sort) {
+  // sorter gives relative ordering for ND tensors, so we need to save and add the non-updated start as an offset
+  // i.e. the second row of a 3x3 tensors starts at element 3 but sorter's second row only contains 0, 1, or 2
+  const int64_t orig_start = start;
+  while (start < end) {
+    const int64_t mid = start + ((end - start) >> 1);
+    const input_t mid_val = sort ? bd[sort[mid] + orig_start] : bd[mid];
+    if (!(mid_val > val)) {
+      start = mid + 1;
+    }
+    else {
+      end = mid;
+    }
+  }
+  return start;
+}
+
+template<typename input_t, typename output_t>
+void searchsorted_cpu_contiguous(Tensor& result, const Tensor& input, const Tensor& boundaries, const bool& right, const Tensor& sorter) {
+  int64_t numel_in = input.numel();
+  bool is_scalar_input = input.dim() == 0 && numel_in == 1;
+  // inner most dim size of input and boundaries
+  int64_t idim_in = is_scalar_input ? 1 : input.sizes().back();
+  int64_t idim_bd = boundaries.sizes().back();
+
+  const input_t *data_in = input.data_ptr<input_t>();
+  const input_t *data_bd = boundaries.data_ptr<input_t>();
+  const int64_t *data_st = sorter.defined() ? sorter.data_ptr<int64_t>() : nullptr;
+  output_t *data_out = result.data_ptr<output_t>();
+
+  bool is_1d_boundaries = boundaries.dim() == 1;
+  at::parallel_for(0, numel_in, SEARCHSORTED_GRAIN_SIZE, [&](int64_t start, int64_t end) {
+    for (const auto i : c10::irange(start, end)) {
+      // If boundaries tensor is 1d, we always search the entire boundary tensor
+      int64_t start_bd = is_1d_boundaries ? 0 : i / idim_in * idim_bd;
+      int64_t end_bd = start_bd + idim_bd;
+
+      int64_t pos = !right ?
+        cus_lower_bound(start_bd, end_bd, data_in[i], data_bd, data_st) - start_bd :
+        cus_upper_bound(start_bd, end_bd, data_in[i], data_bd, data_st) - start_bd;
+
+      // type conversion might happen here
+      data_out[i] = pos;
+    }
+  });
+}
+
+void dispatch(Tensor& result, const Tensor& input, const Tensor& boundaries, bool out_int32, bool right, const Tensor& sorter) {
+  if (!out_int32) {
+    AT_DISPATCH_ALL_TYPES_AND2(
+        ScalarType::Half,
+        ScalarType::BFloat16,
+        input.scalar_type(),
+        "searchsorted_out_cpu",
+        [&] {
+          searchsorted_cpu_contiguous<scalar_t, int64_t>(
+              result, input, boundaries, right, sorter);
+        });
+  }
+  else {
+    AT_DISPATCH_ALL_TYPES_AND2(
+        ScalarType::Half,
+        ScalarType::BFloat16,
+        input.scalar_type(),
+        "searchsorted_out_cpu",
+        [&] {
+          searchsorted_cpu_contiguous<scalar_t, int>(
+              result, input, boundaries, right, sorter);
+        });
+  }
+}
+
+}
+
+Tensor& searchsorted_out_cpu(
+    const Tensor& sorted_sequence,
+    const Tensor& self,
+    bool out_int32,
+    bool right,
+    const c10::optional<c10::string_view> side_opt,
+    const c10::optional<Tensor>& sorter_opt,
+    Tensor& result) {
+  // See [Note: hacky wrapper removal for optional tensor]
+  c10::MaybeOwned<Tensor> sorter_maybe_owned = at::borrow_from_optional_tensor(sorter_opt);
+  const Tensor& sorter = *sorter_maybe_owned;
+  searchsorted_pre_check(sorted_sequence, self, result, out_int32, right, side_opt, sorter);
+  resize_output(result, self.sizes());
+
+  // we have two inputs to set right, pre_check checks that they aren't set to opposites
+  bool is_right = side_opt ? *side_opt == "right" : right;
+
+  if (self.numel() == 0) {
+    return result;
+  }
+
+  // for non-contiguous result tensors, we write the output to a contiguous copy so we can later copy back, maintaing the original result tensor
+  Tensor out = result;
+  if (!result.is_contiguous()) {
+    out = result.contiguous();
+  }
+  if (sorted_sequence.is_contiguous() && self.is_contiguous() && sorted_sequence.dtype() == self.dtype() && sorter.is_contiguous()) {
+    dispatch(out, self, sorted_sequence, out_int32, is_right, sorter);
+  }
+  else {
+    Tensor trimmed_input;
+    Tensor trimmed_boundaries;
+    Tensor trimmed_sorter;
+    searchsorted_maybe_trim_input_tensors(trimmed_input, trimmed_boundaries, trimmed_sorter, self, sorted_sequence, sorter);
+    const Tensor& final_input = trimmed_input.defined() ? trimmed_input : self;
+    const Tensor& final_boundaries = trimmed_boundaries.defined() ? trimmed_boundaries : sorted_sequence;
+    const Tensor& final_sorter = trimmed_sorter.defined() ? trimmed_sorter : sorter;
+    dispatch(out, final_input, final_boundaries, out_int32, is_right, final_sorter);
+  }
+
+  // if result is non-contiguous, we wrote the answer to a copied version, so we copy back to the original result tensor
+  if (!result.is_contiguous()) {
+    result.copy_(out);
+  }
+  return result;
+}
+
+Tensor searchsorted_cpu(
+      const Tensor& sorted_sequence,
+      const Tensor& self,
+      bool out_int32,
+      bool right,
+      const c10::optional<c10::string_view> side_opt,
+      const c10::optional<Tensor>& sorter_opt) {
+  ScalarType scalar_type = out_int32 ? ScalarType::Int : ScalarType::Long;
+  c10::TensorOptions options = TensorOptions().device(self.options().device()).dtype(scalar_type);
+  Tensor result = at::empty({0}, options, MemoryFormat::Contiguous);
+  at::native::searchsorted_out_cpu(sorted_sequence, self, out_int32, right, side_opt, sorter_opt, result);
+  return result;
+}
+
+Tensor searchsorted_cpu(
+    const Tensor& sorted_sequence,
+    const Scalar& self,
+    bool out_int32,
+    bool right,
+    const c10::optional<c10::string_view> side_opt,
+    const c10::optional<Tensor>& sorter_opt) {
+  const Tensor& scalar_tensor = searchsorted_scalar_tensor(self, sorted_sequence.device());
+  return searchsorted_cpu(sorted_sequence, scalar_tensor, out_int32, right, side_opt, sorter_opt);
+}
+
+Tensor& bucketize_out_cpu(const Tensor& self, const Tensor& boundaries, bool out_int32, bool right, Tensor& result) {
+  TORCH_CHECK(boundaries.dim() == 1, "boundaries tensor must be 1 dimension, but got dim(", boundaries.dim(), ")");
+  at::native::searchsorted_out_cpu(boundaries, self, out_int32, right, nullopt, nullopt, result);
+  return result;
+}
+
+Tensor bucketize_cpu(const Tensor& self, const Tensor& boundaries, bool out_int32, bool right) {
+  ScalarType scalar_type = out_int32 ? ScalarType::Int : ScalarType::Long;
+  c10::TensorOptions options = TensorOptions().device(self.options().device()).dtype(scalar_type);
+  Tensor result = at::empty({0}, options, MemoryFormat::Contiguous);
+  at::native::bucketize_out_cpu(self, boundaries, out_int32, right, result);
+  return result;
+}
+
+Tensor bucketize_cpu(const Scalar& self, const Tensor& boundaries, bool out_int32, bool right) {
+  return bucketize_cpu(searchsorted_scalar_tensor(self, boundaries.device()), boundaries, out_int32, right);
+}
+
+}} // namespace at::native
+```
+
 Parameters:
 
 - input (Tensor or Scalar): an N-D Tensor or a Scalar containing the search values.
@@ -64,6 +294,88 @@ TensorFlow Transform provides `tft.bucketize`, with the signature `tft.bucketize(
 
 TensorFlow Transform implements it by composing Python APIs: [source](https://github.com/tensorflow/transform/blob/d0c3349403120a2cf1177c111b674c07e9b38398/tensorflow_transform/mappers.py#L1690-L1770)
 
+Implementation:
+```python
+@common.log_api_use(common.MAPPER_COLLECTION)
+def bucketize(x: common_types.ConsistentTensorType,
+              num_buckets: int,
+              epsilon: Optional[float] = None,
+              weights: Optional[tf.Tensor] = None,
+              elementwise: bool = False,
+              name: Optional[str] = None) -> common_types.ConsistentTensorType:
+  """Returns a bucketized column, with a bucket index assigned to each input.
+  Args:
+    x: A numeric input `Tensor` or `CompositeTensor` whose values should be
+      mapped to buckets.  For a `CompositeTensor` only non-missing values will
+      be included in the quantiles computation, and the result of `bucketize`
+      will be a `CompositeTensor` with non-missing values mapped to buckets. If
+      elementwise=True then `x` must be dense.
+    num_buckets: Values in the input `x` are divided into approximately
+      equal-sized buckets, where the number of buckets is `num_buckets`.
+    epsilon: (Optional) Error tolerance, typically a small fraction close to
+      zero. If a value is not specified by the caller, a suitable value is
+      computed based on experimental results.  For `num_buckets` less than 100,
+      the value of 0.01 is chosen to handle a dataset of up to ~1 trillion input
+      data values.  If `num_buckets` is larger, then epsilon is set to
+      (1/`num_buckets`) to enforce a stricter error tolerance, because more
+      buckets will result in smaller range for each bucket, and so we want the
+      boundaries to be less fuzzy. See analyzers.quantiles() for details.
+    weights: (Optional) Weights tensor for the quantiles. Tensor must have the
+      same shape as x.
+    elementwise: (Optional) If true, bucketize each element of the tensor
+      independently.
+    name: (Optional) A name for this operation.
+  Returns:
+    A `Tensor` of the same shape as `x`, with each element in the
+    returned tensor representing the bucketized value. Bucketized value is
+    in the range [0, actual_num_buckets). Sometimes the actual number of buckets
+    can be different than num_buckets hint, for example in case the number of
+    distinct values is smaller than num_buckets, or in cases where the
+    input values are not uniformly distributed.
+    NaN values are mapped to the last bucket. Values with NaN weights are
+    ignored in bucket boundaries calculation.
+  Raises:
+    TypeError: If num_buckets is not an int.
+    ValueError: If value of num_buckets is not > 1.
+    ValueError: If elementwise=True and x is a `CompositeTensor`.
+  """
+  with tf.compat.v1.name_scope(name, 'bucketize'):
+    if not isinstance(num_buckets, int):
+      raise TypeError('num_buckets must be an int, got %s' % type(num_buckets))
+
+    if num_buckets < 1:
+      raise ValueError('Invalid num_buckets %d' % num_buckets)
+
+    if isinstance(x, (tf.SparseTensor, tf.RaggedTensor)) and elementwise:
+      raise ValueError(
+          'bucketize requires `x` to be dense if `elementwise=True`')
+
+    if epsilon is None:
+      # See explanation in args documentation for epsilon.
+      epsilon = min(1.0 / num_buckets, 0.01)
+
+    x_values = tf_utils.get_values(x)
+    bucket_boundaries = analyzers.quantiles(
+        x_values,
+        num_buckets,
+        epsilon,
+        weights,
+        reduce_instance_dims=not elementwise)
+
+    if not elementwise:
+      return apply_buckets(x, bucket_boundaries)
+
+    num_features = tf.math.reduce_prod(x.get_shape()[1:])
+    bucket_boundaries = tf.reshape(bucket_boundaries, [num_features, -1])
+    x_reshaped = tf.reshape(x, [-1, num_features])
+    bucketized = []
+    for idx, boundaries in enumerate(tf.unstack(bucket_boundaries, axis=0)):
+      bucketized.append(apply_buckets(x_reshaped[:, idx],
+                                      tf.expand_dims(boundaries, axis=0)))
+    return tf.reshape(tf.stack(bucketized, axis=1),
+                      [-1] + x.get_shape().as_list()[1:])
+```
+
 Parameters:
 
 | Args          |                                                              |

From 8413595cb21b7404c73cd333d17a555b31df8397 Mon Sep 17 00:00:00 2001
From: PommesPeter <434596665@qq.com>
Date: Mon, 11 Jul 2022 18:06:21 +0800
Subject: [PATCH 3/7] update: modified part 3 and part 6

---
 .../APIs/20220709_api_design_for_bucketize.md | 200 +-----------------
 1 file changed, 7 insertions(+), 193 deletions(-)
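
Note: the test cases listed in section 6 can be sketched as a small check routine. This is only an illustration under assumptions: `bucketize_stub` is a hypothetical stand-in that works on Python lists, `run_cases` is a hypothetical helper, and the dtype (`out_int32`) and Tensor-type cases are left out because they need real Paddle tensors.

```python
from bisect import bisect_left, bisect_right


def bucketize_stub(x, sorted_sequence, right=False):
    """Hypothetical stand-in for the API under test (plain lists, no Paddle)."""
    if any(isinstance(item, (list, tuple)) for item in sorted_sequence):
        raise ValueError("sorted_sequence must be one-dimensional")
    search = bisect_right if right else bisect_left
    return [search(sorted_sequence, value) for value in x]


def run_cases(impl):
    """Run the listed cases against impl; return the names of those that passed."""
    boundaries = [1, 3, 5, 7, 9]
    x = [3, 6, 9]
    passed = []

    # Omitting `right` must behave like right=False (lower-bound indices).
    assert impl(x, boundaries) == impl(x, boundaries, right=False) == [1, 3, 4]
    passed.append("right_default")

    # right=True must return upper-bound indices.
    assert impl(x, boundaries, right=True) == [2, 3, 5]
    passed.append("right_true")

    # A 2-D sorted_sequence must be rejected.
    try:
        impl(x, [[1, 3], [5, 7]])
    except ValueError:
        passed.append("rejects_2d")
    else:
        raise AssertionError("2-D sorted_sequence was accepted")

    return passed


print(run_cases(bucketize_stub))  # ['right_default', 'right_true', 'rejects_2d']
```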

diff --git a/rfcs/APIs/20220709_api_design_for_bucketize.md b/rfcs/APIs/20220709_api_design_for_bucketize.md
index f84ecca1f..85e5c202c 100644
--- a/rfcs/APIs/20220709_api_design_for_bucketize.md
+++ b/rfcs/APIs/20220709_api_design_for_bucketize.md
@@ -46,140 +46,16 @@ PyTorch provides `torch.bucketize`, with the signature `torch.bucketize(input
 
 Implementation:
 ```cpp
-#include <ATen/Dispatch.h>
-#include <ATen/Functions.h>
-#include <ATen/Parallel.h>
-#include <ATen/native/BucketizationUtils.h>
-#include <ATen/native/Resize.h>
-#include <c10/util/irange.h>
-
-/* Implement a numpy like searchsorted and a TF like bucketize function running on cpu
- *
- * - torch.searchsorted(sorted_sequence, values, right=False, side='left', out_int32=False, sorter=None)
- *   sorted_sequence - N*D or 1D (apply to all values) tensor containing sorted sequences in last dimension
- *   values          - N*D tensor or a Scalar (when sorted_sequence is 1D) containing the search values
- *   right           - corresponding to lower bound if False and upper bound if True
- *   side            - (preferred to right) corresponding to lower bound if 'left' and upper bound if 'right'
- *   out_int32       - the output tensor is int64_t type if False and int(32bit normally) type if True.
- *   sorter          - if provided, sorted_sequence may not be sorted and the sorted order is given by this tensor
- *
- * - torch.bucketize(values, boundaries, right=False, out_int32=False)
- *   values     - N*D tensor or a Scalar containing the search value
- *   boundaries - 1D tensor containing a sorted sequences
- *   right      - corresponding to lower bound if False and upper bound if True
- *   out_int32  - the output tensor is int64_t type if False and int(32bit normally) type if True.
- *
- * - Restrictions are defined in searchsorted_pre_check()
- */
-
 namespace at {
 namespace native {
 
 namespace {
 
-// minimal size for searchsorted_cpu_contiguous to run parallel (multithread)
-constexpr int64_t SEARCHSORTED_GRAIN_SIZE = 200;
-
-// customized lower_bound func to ensure the low bound of 'nan', 'inf' etc. be the end of boundary
-// and we can properly handle a sorter argument
-// std::lower_bound can not be used here since its customized comparator need strict weak ordering
-// and the customized comparators require both arguments to have the same type, which wouldn't
-// happen when comparing val of input_t to an indexer value from sorter of int64
-template<typename input_t>
-int64_t cus_lower_bound(int64_t start, int64_t end, const input_t val, const input_t* bd, const int64_t* sort) {
-  // sorter gives relative ordering for ND tensors, so we need to save and add the non-updated start as an offset
-  // i.e. the second row of a 3x3 tensors starts at element 3 but sorter's second row only contains 0, 1, or 2
-  const int64_t orig_start = start;
-  while (start < end) {
-    const int64_t mid = start + ((end - start) >> 1);
-    const input_t mid_val = sort ? bd[sort[mid] + orig_start] : bd[mid];
-    if (!(mid_val >= val)) {
-      start = mid + 1;
-    }
-    else {
-      end = mid;
-    }
-  }
-  return start;
-}
+// ...
 
-// customized upper_bound func to ensure we can properly handle a sorter argument
-// std::upper_bound can not be used here since its customized comparator requires both arguments to have the
-// same type, which wouldn't happen when comparing val of input_t to an indexer value from sorter of int64
-template<typename input_t>
-int64_t cus_upper_bound(int64_t start, int64_t end, const input_t val, const input_t* bd, const int64_t* sort) {
-  // sorter gives relative ordering for ND tensors, so we need to save and add the non-updated start as an offset
-  // i.e. the second row of a 3x3 tensors starts at element 3 but sorter's second row only contains 0, 1, or 2
-  const int64_t orig_start = start;
-  while (start < end) {
-    const int64_t mid = start + ((end - start) >> 1);
-    const input_t mid_val = sort ? bd[sort[mid] + orig_start] : bd[mid];
-    if (!(mid_val > val)) {
-      start = mid + 1;
-    }
-    else {
-      end = mid;
-    }
-  }
-  return start;
 }
 
-template<typename input_t, typename output_t>
-void searchsorted_cpu_contiguous(Tensor& result, const Tensor& input, const Tensor& boundaries, const bool& right, const Tensor& sorter) {
-  int64_t numel_in = input.numel();
-  bool is_scalar_input = input.dim() == 0 && numel_in == 1;
-  // inner most dim size of input and boundaries
-  int64_t idim_in = is_scalar_input ? 1 : input.sizes().back();
-  int64_t idim_bd = boundaries.sizes().back();
-
-  const input_t *data_in = input.data_ptr<input_t>();
-  const input_t *data_bd = boundaries.data_ptr<input_t>();
-  const int64_t *data_st = sorter.defined() ? sorter.data_ptr<int64_t>() : nullptr;
-  output_t *data_out = result.data_ptr<output_t>();
-
-  bool is_1d_boundaries = boundaries.dim() == 1;
-  at::parallel_for(0, numel_in, SEARCHSORTED_GRAIN_SIZE, [&](int64_t start, int64_t end) {
-    for (const auto i : c10::irange(start, end)) {
-      // If boundaries tensor is 1d, we always search the entire boundary tensor
-      int64_t start_bd = is_1d_boundaries ? 0 : i / idim_in * idim_bd;
-      int64_t end_bd = start_bd + idim_bd;
-
-      int64_t pos = !right ?
-        cus_lower_bound(start_bd, end_bd, data_in[i], data_bd, data_st) - start_bd :
-        cus_upper_bound(start_bd, end_bd, data_in[i], data_bd, data_st) - start_bd;
-
-      // type conversion might happen here
-      data_out[i] = pos;
-    }
-  });
-}
-
-void dispatch(Tensor& result, const Tensor& input, const Tensor& boundaries, bool out_int32, bool right, const Tensor& sorter) {
-  if (!out_int32) {
-    AT_DISPATCH_ALL_TYPES_AND2(
-        ScalarType::Half,
-        ScalarType::BFloat16,
-        input.scalar_type(),
-        "searchsorted_out_cpu",
-        [&] {
-          searchsorted_cpu_contiguous<scalar_t, int64_t>(
-              result, input, boundaries, right, sorter);
-        });
-  }
-  else {
-    AT_DISPATCH_ALL_TYPES_AND2(
-        ScalarType::Half,
-        ScalarType::BFloat16,
-        input.scalar_type(),
-        "searchsorted_out_cpu",
-        [&] {
-          searchsorted_cpu_contiguous<scalar_t, int>(
-              result, input, boundaries, right, sorter);
-        });
-  }
-}
-
-}
+// ...
 
 Tensor& searchsorted_out_cpu(
     const Tensor& sorted_sequence,
@@ -189,20 +65,18 @@ Tensor& searchsorted_out_cpu(
     const c10::optional<c10::string_view> side_opt,
     const c10::optional<Tensor>& sorter_opt,
     Tensor& result) {
-  // See [Note: hacky wrapper removal for optional tensor]
+
   c10::MaybeOwned<Tensor> sorter_maybe_owned = at::borrow_from_optional_tensor(sorter_opt);
   const Tensor& sorter = *sorter_maybe_owned;
   searchsorted_pre_check(sorted_sequence, self, result, out_int32, right, side_opt, sorter);
   resize_output(result, self.sizes());
 
-  // we have two inputs to set right, pre_check checks that they aren't set to opposites
   bool is_right = side_opt ? *side_opt == "right" : right;
 
   if (self.numel() == 0) {
     return result;
   }
 
-  // for non-contiguous result tensors, we write the output to a contiguous copy so we can later copy back, maintaing the original result tensor
   Tensor out = result;
   if (!result.is_contiguous()) {
     out = result.contiguous();
@@ -221,38 +95,12 @@ Tensor& searchsorted_out_cpu(
     dispatch(out, final_input, final_boundaries, out_int32, is_right, final_sorter);
   }
 
-  // if result is non-contiguous, we wrote the answer to a copied version, so we copy back to the original result tensor
   if (!result.is_contiguous()) {
     result.copy_(out);
   }
   return result;
 }
 
-Tensor searchsorted_cpu(
-      const Tensor& sorted_sequence,
-      const Tensor& self,
-      bool out_int32,
-      bool right,
-      const c10::optional<c10::string_view> side_opt,
-      const c10::optional<Tensor>& sorter_opt) {
-  ScalarType scalar_type = out_int32 ? ScalarType::Int : ScalarType::Long;
-  c10::TensorOptions options = TensorOptions().device(self.options().device()).dtype(scalar_type);
-  Tensor result = at::empty({0}, options, MemoryFormat::Contiguous);
-  at::native::searchsorted_out_cpu(sorted_sequence, self, out_int32, right, side_opt, sorter_opt, result);
-  return result;
-}
-
-Tensor searchsorted_cpu(
-    const Tensor& sorted_sequence,
-    const Scalar& self,
-    bool out_int32,
-    bool right,
-    const c10::optional<c10::string_view> side_opt,
-    const c10::optional<Tensor>& sorter_opt) {
-  const Tensor& scalar_tensor = searchsorted_scalar_tensor(self, sorted_sequence.device());
-  return searchsorted_cpu(sorted_sequence, scalar_tensor, out_int32, right, side_opt, sorter_opt);
-}
-
 Tensor& bucketize_out_cpu(const Tensor& self, const Tensor& boundaries, bool out_int32, bool right, Tensor& result) {
   TORCH_CHECK(boundaries.dim() == 1, "boundaries tensor must be 1 dimension, but got dim(", boundaries.dim(), ")");
   at::native::searchsorted_out_cpu(boundaries, self, out_int32, right, nullopt, nullopt, result);
@@ -303,42 +151,6 @@ def bucketize(x: common_types.ConsistentTensorType,
               weights: Optional[tf.Tensor] = None,
               elementwise: bool = False,
               name: Optional[str] = None) -> common_types.ConsistentTensorType:
-  """Returns a bucketized column, with a bucket index assigned to each input.
-  Args:
-    x: A numeric input `Tensor` or `CompositeTensor` whose values should be
-      mapped to buckets.  For a `CompositeTensor` only non-missing values will
-      be included in the quantiles computation, and the result of `bucketize`
-      will be a `CompositeTensor` with non-missing values mapped to buckets. If
-      elementwise=True then `x` must be dense.
-    num_buckets: Values in the input `x` are divided into approximately
-      equal-sized buckets, where the number of buckets is `num_buckets`.
-    epsilon: (Optional) Error tolerance, typically a small fraction close to
-      zero. If a value is not specified by the caller, a suitable value is
-      computed based on experimental results.  For `num_buckets` less than 100,
-      the value of 0.01 is chosen to handle a dataset of up to ~1 trillion input
-      data values.  If `num_buckets` is larger, then epsilon is set to
-      (1/`num_buckets`) to enforce a stricter error tolerance, because more
-      buckets will result in smaller range for each bucket, and so we want the
-      boundaries to be less fuzzy. See analyzers.quantiles() for details.
-    weights: (Optional) Weights tensor for the quantiles. Tensor must have the
-      same shape as x.
-    elementwise: (Optional) If true, bucketize each element of the tensor
-      independently.
-    name: (Optional) A name for this operation.
-  Returns:
-    A `Tensor` of the same shape as `x`, with each element in the
-    returned tensor representing the bucketized value. Bucketized value is
-    in the range [0, actual_num_buckets). Sometimes the actual number of buckets
-    can be different than num_buckets hint, for example in case the number of
-    distinct values is smaller than num_buckets, or in cases where the
-    input values are not uniformly distributed.
-    NaN values are mapped to the last bucket. Values with NaN weights are
-    ignored in bucket boundaries calculation.
-  Raises:
-    TypeError: If num_buckets is not an int.
-    ValueError: If value of num_buckets is not > 1.
-    ValueError: If elementwise=True and x is a `CompositeTensor`.
-  """
   with tf.compat.v1.name_scope(name, 'bucketize'):
     if not isinstance(num_buckets, int):
       raise TypeError('num_buckets must be an int, got %s' % type(num_buckets))
@@ -433,9 +245,11 @@ paddle.bucketize(
 
 测试需要考虑的 case 如下：
 
-- 数值结果的一致性，使用 numpy 作为参考标准
+- 输出数值结果的一致性，使用 numpy 作为参考标准
 - 参数 `right` 为 True 和 False 时输出的正确性
-- 参数 `out_int32` 为 True 和 False 时 dtype 输出的正确性；
+- 参数 `out_int32` 为 True 和 False 时 dtype 输出的正确性
+- 参数 `x` 类型的正确性，若类型不为 Tensor 则抛出异常
+- 参数 `sorted_sequence` 的维度正确性，该 API 只针对 `sorted_sequence` 是一维的情况，所以对于输入需要约束
 - 未输入 `right` 时的输出正确性；
 - 未输入 `out_int32` 时的输出正确性；
 

From 0e8bdfda8dc95ba274a2ead45a674b7c0ea3d7de Mon Sep 17 00:00:00 2001
From: PommesPeter <434596665@qq.com>
Date: Tue, 21 Feb 2023 15:11:02 +0800
Subject: [PATCH 4/7] [Doc] Added rfc design docs

---
 rfcs/APIs/20230221_api_design_for_polor.md | 226 +++++++++++++++++++++
 1 file changed, 226 insertions(+)
 create mode 100644 rfcs/APIs/20230221_api_design_for_polor.md
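
Reviewer note (ignored when the patch is applied): the RFC added below states
the formula out = abs*cos(angle) + abs*sin(angle)*j. The following is a minimal
scalar sketch of that formula using only the Python standard library. It is
not the Paddle implementation, and the helper names `polar_reference` and
`polar_elementwise` are hypothetical. In Paddle, the same composition would
presumably be `paddle.complex(abs * paddle.cos(angle), abs * paddle.sin(angle))`,
which is an assumption based on the RFC text rather than confirmed code.

```python
import cmath
import math


def polar_reference(abs_value: float, angle: float) -> complex:
    """Scalar reference for out = abs*cos(angle) + abs*sin(angle)*j."""
    real = abs_value * math.cos(angle)
    imag = abs_value * math.sin(angle)
    return complex(real, imag)


def polar_elementwise(abs_values, angles):
    """Apply the scalar formula pairwise to two equal-length sequences."""
    if len(abs_values) != len(angles):
        raise ValueError("abs and angle must have the same length")
    return [polar_reference(a, t) for a, t in zip(abs_values, angles)]


if __name__ == "__main__":
    # Cross-check against the standard library's polar-to-rectangular helper.
    for a, t in [(1.0, 0.0), (2.0, math.pi / 2), (0.5, -math.pi / 3)]:
        z = polar_reference(a, t)
        assert abs(z - cmath.rect(a, t)) < 1e-12
        print(a, t, z)
```

A scalar reference like this could serve as a numerical baseline in unit tests
alongside the PyTorch comparison mentioned in section 六.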

diff --git a/rfcs/APIs/20230221_api_design_for_polor.md b/rfcs/APIs/20230221_api_design_for_polor.md
new file mode 100644
index 000000000..417a8cc28
--- /dev/null
+++ b/rfcs/APIs/20230221_api_design_for_polor.md
@@ -0,0 +1,226 @@
+# paddle.polar 设计文档
+
+| API 名称     |                paddle.polar           |
+| ------------ | ---------------------------------------- |
+| 提交作者     | PommesPeter                               |
+| 提交时间     | 2023-02-21                                |
+| 版本号       | V1.0                                      |
+| 依赖飞桨版本  | develop                                   |
+| 文件名       | 20230221_api_design_for_polor.md      |
+
+# 一、概述
+
+## 1、相关背景
+
+为了提升飞桨 API 丰富度，支持科学计算相关 API，Paddle 需要扩充 API `paddle.polar`。
+
+## 2、功能目标
+
+增加 API `paddle.polar`，通过输入模和相位角，`elementwise` 构造复数 tensor。方便计算极坐标系下的运算。
+
+## 3、意义
+
+为 Paddle 增加极坐标和复数的计算函数，丰富 `paddle` 中科学计算相关的 API。
+
+# 二、飞桨现状
+
+- 目前 Paddle 缺少 `polar` API，但是存在 `paddle.complex`。参考其他框架可以发现，Paddle 没有专门针对极坐标系进行计算的 API，无法直接由模和相位角构造复数；直接使用 `paddle.complex` 组合实现，代码不够清晰易读。
+- 该 API 的实现及测试主要参考目前 Paddle 中含有的 `paddle.complex`。
+
+# 三、业内方案调研
+
+## PyTorch
+
+PyTorch 中有 `torch.polar` 的API，详细参数为 `torch.polar(abs, angle, *, out=None) → Tensor`。
+
+在 PyTorch 中的介绍为：
+
+> Constructs a complex tensor whose elements are Cartesian coordinates corresponding to the polar coordinates with absolute value `abs` and angle `angle`.
+
+在实现方法上，PyTorch 是通过 C++ API 组合实现的，[代码位置](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorFactories.cpp#L190-L251)
+
+实现代码：
+
+```cpp
+// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ complex / polar ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+void complex_check_floating(const Tensor& a, const Tensor& b) {
+  TORCH_CHECK((a.scalar_type() == kFloat || a.scalar_type() == kDouble || a.scalar_type() == kHalf) &&
+              (b.scalar_type() == kFloat || b.scalar_type() == kDouble || b.scalar_type() == kHalf),
+              "Expected both inputs to be Half, Float or Double tensors but got ",
+              a.scalar_type(), " and ", b.scalar_type());
+}
+
+void complex_check_dtype(
+    const Tensor& result,
+    const Tensor& a,
+    const Tensor& b) {
+  complex_check_floating(a, b);
+  TORCH_CHECK(a.scalar_type() == b.scalar_type(),
+              "Expected object of scalar type ", a.scalar_type(),
+              " but got scalar type ", b.scalar_type(), " for second argument");
+  TORCH_CHECK(result.scalar_type() == toComplexType(a.scalar_type()),
+              "Expected object of scalar type ", toComplexType(a.scalar_type()),
+              " but got scalar type ", result.scalar_type(),
+              " for argument 'out'");
+}
+
+Tensor& complex_out(const Tensor& real, const Tensor& imag, Tensor& result) {
+  complex_check_dtype(result, real, imag);
+  auto iter = TensorIteratorConfig()
+      .add_output(result)
+      .add_input(real)
+      .add_input(imag)
+      .check_all_same_dtype(false)
+      .build();
+  complex_stub(iter.device_type(), iter);
+  return result;
+}
+
+Tensor complex(const Tensor& real, const Tensor& imag) {
+  complex_check_floating(real, imag);
+  c10::TensorOptions options = real.options();
+  options = options.dtype(toComplexType(real.scalar_type()));
+  Tensor result = at::empty(0, options);
+  return at::complex_out(result, real, imag);
+}
+
+Tensor& polar_out(const Tensor& abs, const Tensor& angle, Tensor& result) {
+  complex_check_dtype(result, abs, angle);
+  auto iter = TensorIteratorConfig()
+      .add_output(result)
+      .add_input(abs)
+      .add_input(angle)
+      .check_all_same_dtype(false)
+      .build();
+  polar_stub(iter.device_type(), iter);
+  return result;
+}
+
+Tensor polar(const Tensor& abs, const Tensor& angle) {
+  complex_check_floating(abs, angle);
+  c10::TensorOptions options = abs.options();
+  options = options.dtype(toComplexType(abs.scalar_type()));
+  Tensor result = at::empty(0, options);
+  return at::polar_out(result, abs, angle);
+}
+}
+```
+
+参数表：
+
+- abs：复数张量的绝对值。必须为 float 或 double。
+- angle：复数张量的角度。数据类型必须与abs相同。
+- out：如果输入为 torch.float32，则必须为 torch.complex64。如果输入为 torch.float64，则必须为 torch.complex128。
+
+## SciPy
+
+实现方法上，Scipy 是通过 Python API 的方式组合实现的，[代码位置](https://github.com/scipy/scipy/blob/v1.10.1/scipy/linalg/_decomp_polar.py#L8-L111)
+
+代码实现：
+```python
+def polar(a, side="right"):
+    if side not in ['right', 'left']:
+        raise ValueError("`side` must be either 'right' or 'left'")
+    a = np.asarray(a)
+    if a.ndim != 2:
+        raise ValueError("`a` must be a 2-D array.")
+
+    w, s, vh = svd(a, full_matrices=False)
+    u = w.dot(vh)
+    if side == 'right':
+        # a = up
+        p = (vh.T.conj() * s).dot(vh)
+    else:
+        # a = pu
+        p = (w * s).dot(w.T.conj())
+    return u, p
+```
+
+参数表：
+
+- Parameters:
+    - a: (m, n) array_like
+        The array to be factored.
+    - side: {‘left’, ‘right’}, optional
+        Determines whether a right or left polar decomposition is computed. If side is “right”, then a = up. If side is “left”, then a = pu. The default is “right”.
+
+- Returns:
+    - u: (m, n) ndarray
+        If a is square, then u is unitary. If m > n, then the columns of a are orthonormal, and if m < n, then the rows of u are orthonormal.
+    - p: ndarray
+        p is Hermitian positive semidefinite. If a is nonsingular, p is positive definite. The shape of p is (n, n) or (m, m), depending on whether side is “right” or “left”, respectively.
+
+# 四、对比分析
+
+## 共同点
+
+- 都能通过输入模和相位角，`elementwise` 构造复数 tensor。方便计算极坐标系下的运算。
+
+## 不同点
+
+- PyTorch 是在 C++ API 基础上实现，使用 Python 调用 C++ 对应的接口。
+- Scipy 则是通过 Python API 直接实现其对应的功能。
+- Tensorflow 有 `a`、`side` 等参数的设置，可调整的程度更高。
+
+# 五、设计思路与实现方案
+
+## 命名与参数设计
+
+添加 API
+
+```python
+paddle.polar(
+    abs: Tensor,
+    angle: Tensor,
+    name: str=None
+)
+```
+
+## 底层OP设计
+
+使用已有的 API 组合实现，不再单独设计 OP。
+
+需要注意：如果输入是 paddle.float32，则输出必须是 paddle.complex64；如果输入是 paddle.float64，则输出必须是 paddle.complex128。
+
+## API实现方案
+
+该 API 实现于 `python/paddle/tensor/creation.py`
+
+通过调研发现，计算该极坐标可以使用复数计算，Paddle 本身已实现 `paddle.complex`，可利用已有 API 实现。代入公式：
+
+$$
+\text{out} = \text{abs}\cdot\cos(\text{angle}) + \text{abs}\cdot\sin(\text{angle})\cdot j
+$$
+
+即可得到由模 `abs` 和相位角 `angle` 所确定的复数，其实部和虚部即为对应的笛卡尔坐标。
+
+随后，Paddle 中已有 `complex` API 的具体实现逻辑，位于 `python/paddle/tensor/creation.py` 下的 `complex` 函数中，因此只需要调用其函数构造复数即可。
+
+# 六、测试和验收的考量
+
+测试需要考虑的 case 如下：
+
+- 输出数值结果的一致性和数据类型是否正确，使用 pytorch 或 scipy 作为参考标准
+- 参数 `abs` 的数据类型准确性判断
+- 参数 `angle` 的数据类型准确性判断、
+
+# 七、可行性分析和排期规划
+
+方案主要依赖现有 Paddle API 组合而成，且依赖的 `paddle.complex` 已经在 Paddle repo 的 [python/paddle/tensor/search.py](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/tensor/creation.py#L2160-L2209)。工期上可以满足在当前版本周期内开发完成。
+
+# 八、影响面
+
+新增 API，对其他模块无有影响
+
+# 名词解释
+
+无
+
+# 附件及参考资料
+
+[torch.polar](https://pytorch.org/docs/stable/generated/torch.polar.html)
+
+[scipy.linalg.polar](https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.polar.html)
+
+[paddle.complex](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/tensor/creation.py#L2160-L2209)
\ No newline at end of file

From c04d6576a7029e54b5a4da1193e002576b9a7cd6 Mon Sep 17 00:00:00 2001
From: PommesPeter <434596665@qq.com>
Date: Tue, 21 Feb 2023 15:55:25 +0800
Subject: [PATCH 5/7] [Doc] Deleted unused rfc design docs

---
 .../APIs/20220709_api_design_for_bucketize.md | 284 ------------------
 1 file changed, 284 deletions(-)
 delete mode 100644 rfcs/APIs/20220709_api_design_for_bucketize.md

diff --git a/rfcs/APIs/20220709_api_design_for_bucketize.md b/rfcs/APIs/20220709_api_design_for_bucketize.md
deleted file mode 100644
index 85e5c202c..000000000
--- a/rfcs/APIs/20220709_api_design_for_bucketize.md
+++ /dev/null
@@ -1,284 +0,0 @@
-# paddle.bucketize 设计文档
-
-| API 名称     |                paddle.bucketize           |
-| ------------ | ---------------------------------------- |
-| 提交作者     | PommesPeter                               |
-| 提交时间     | 2022-07-09                                |
-| 版本号       | V1.0                                      |
-| 依赖飞桨版本  | develop                                   |
-| 文件名       | 20220709_api_design_for_bucketize.md      |
-
-# 一、概述
-
-## 1、相关背景
-
-为了提升飞桨 API 丰富度，支持科学计算相关 API，Paddle 需要扩充 API `paddle.bucketize`。
-
-## 2、功能目标
-
-增加 API `paddle.bucketize`，用于根据 `sorted_sequence` 序列计算出 `x` 中每个元素的区间索引。
-
-## 3、意义
-
-为 Paddle 增加神经网络相关的距离计算函数，丰富 `paddle` 中科学计算相关的 API。
-
-# 二、飞桨现状
-
-- 目前 Paddle 缺少 `bucketize` API，但是存在 `searchsorted` API，参考其他框架可以发现，没有专门针对一维 `sorted_sequence` 进行计算的 api，直接使用 `searchsorted` API 导致花费时间在判断维度上。
-- 该 API 的实现及测试主要参考目前 Paddle 中含有的 `paddle.searchsorted`。
-
-# 三、业内方案调研
-
-## PyTorch
-
-PyTorch 中有 `torch.bucketize` 的API，详细参数为 `torch.bucketize(input, boundaries, *, out_int32=False, right=False, out=None) → Tensor`。
-
-在 PyTorch 中的介绍为：
-
-> Returns the indices of the buckets to which each value in the `input` belongs, where the boundaries of the buckets are set by `boundaries`. Return a new tensor with the same size as `input`. If `right` is False (default), then the left boundary is closed. More formally, the returned index satisfies the following rules:
->
-> | `right` | *returned index satisfies*                                |
-> | ------- | --------------------------------------------------------- |
-> | False   | `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]` |
-> | True    | `boundaries[i-1] <= input[m][n]...[l][x] < boundaries[i]` |
-
-在实现方法上，PyTorch 是通过 C++ API 组合实现的，[代码位置](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Bucketization.cpp)
-
-实现代码：
-```cpp
-namespace at {
-namespace native {
-
-namespace {
-
-// ...
-
-}
-
-// ...
-
-Tensor& searchsorted_out_cpu(
-    const Tensor& sorted_sequence,
-    const Tensor& self,
-    bool out_int32,
-    bool right,
-    const c10::optional<c10::string_view> side_opt,
-    const c10::optional<Tensor>& sorter_opt,
-    Tensor& result) {
-
-  c10::MaybeOwned<Tensor> sorter_maybe_owned = at::borrow_from_optional_tensor(sorter_opt);
-  const Tensor& sorter = *sorter_maybe_owned;
-  searchsorted_pre_check(sorted_sequence, self, result, out_int32, right, side_opt, sorter);
-  resize_output(result, self.sizes());
-
-  bool is_right = side_opt ? *side_opt == "right" : right;
-
-  if (self.numel() == 0) {
-    return result;
-  }
-
-  Tensor out = result;
-  if (!result.is_contiguous()) {
-    out = result.contiguous();
-  }
-  if (sorted_sequence.is_contiguous() && self.is_contiguous() && sorted_sequence.dtype() == self.dtype() && sorter.is_contiguous()) {
-    dispatch(out, self, sorted_sequence, out_int32, is_right, sorter);
-  }
-  else {
-    Tensor trimmed_input;
-    Tensor trimmed_boundaries;
-    Tensor trimmed_sorter;
-    searchsorted_maybe_trim_input_tensors(trimmed_input, trimmed_boundaries, trimmed_sorter, self, sorted_sequence, sorter);
-    const Tensor& final_input = trimmed_input.defined() ? trimmed_input : self;
-    const Tensor& final_boundaries = trimmed_boundaries.defined() ? trimmed_boundaries : sorted_sequence;
-    const Tensor& final_sorter = trimmed_sorter.defined() ? trimmed_sorter : sorter;
-    dispatch(out, final_input, final_boundaries, out_int32, is_right, final_sorter);
-  }
-
-  if (!result.is_contiguous()) {
-    result.copy_(out);
-  }
-  return result;
-}
-
-Tensor& bucketize_out_cpu(const Tensor& self, const Tensor& boundaries, bool out_int32, bool right, Tensor& result) {
-  TORCH_CHECK(boundaries.dim() == 1, "boundaries tensor must be 1 dimension, but got dim(", boundaries.dim(), ")");
-  at::native::searchsorted_out_cpu(boundaries, self, out_int32, right, nullopt, nullopt, result);
-  return result;
-}
-
-Tensor bucketize_cpu(const Tensor& self, const Tensor& boundaries, bool out_int32, bool right) {
-  ScalarType scalar_type = out_int32 ? ScalarType::Int : ScalarType::Long;
-  c10::TensorOptions options = TensorOptions().device(self.options().device()).dtype(scalar_type);
-  Tensor result = at::empty({0}, options, MemoryFormat::Contiguous);
-  at::native::bucketize_out_cpu(self, boundaries, out_int32, right, result);
-  return result;
-}
-
-Tensor bucketize_cpu(const Scalar& self, const Tensor& boundaries, bool out_int32, bool right) {
-  return bucketize_cpu(searchsorted_scalar_tensor(self, boundaries.device()), boundaries, out_int32, right);
-}
-
-}} // namespace at::native
-```
-
-参数表：
-
-- input（Tensor or Scalar）：N-D Tensor，
-
-- boundaries（Tensor）：，1-D Tensor，必须包含一个单调递增的序列。
-
-- out_int32（bool，optional）：指明输出数据类型。如果是True，则输出torch.int32；如果是False，则输出torch.int64。默认是False。
-
-- right（bool，optional）：如果为 False，返回找到的第一个合适的位置； 如果为 True，返回最后一个这样的索引； 如果没有找到合适的索引，则返回0作为非数值值(例如，Nan，Inf)或边界的大小（通过最后一个索引）。
-
-  换句话说，如果为 False，则从边界获取输入中每个值的下界索引； 如果为 True，则获取上界索引。 默认值为 False。
-
-- out（Tensor，optional）：输出的Tensor必须和输出的Tensor大小相同。
-
-## Tensorflow
-
-Tensorflow 中有 `tf.transform.bucketize` API，具体参数为 `tft.bucketize( x: common_types.ConsistentTensorType, num_buckets: int, epsilon: Optional[float] = None, weights: Optional[tf.Tensor] = None, elementwise: bool = False, name: Optional[str] = None) -> common_types.ConsistentTensorType`
-
-在实现方法上，Tensorflow 是通过 Python API 的方式组合实现的，[代码位置](https://github.com/tensorflow/transform/blob/d0c3349403120a2cf1177c111b674c07e9b38398/tensorflow_transform/mappers.py#L1690-L1770)
-
-代码实现：
-```python
-@common.log_api_use(common.MAPPER_COLLECTION)
-def bucketize(x: common_types.ConsistentTensorType,
-              num_buckets: int,
-              epsilon: Optional[float] = None,
-              weights: Optional[tf.Tensor] = None,
-              elementwise: bool = False,
-              name: Optional[str] = None) -> common_types.ConsistentTensorType:
-  with tf.compat.v1.name_scope(name, 'bucketize'):
-    if not isinstance(num_buckets, int):
-      raise TypeError('num_buckets must be an int, got %s' % type(num_buckets))
-
-    if num_buckets < 1:
-      raise ValueError('Invalid num_buckets %d' % num_buckets)
-
-    if isinstance(x, (tf.SparseTensor, tf.RaggedTensor)) and elementwise:
-      raise ValueError(
-          'bucketize requires `x` to be dense if `elementwise=True`')
-
-    if epsilon is None:
-      # See explanation in args documentation for epsilon.
-      epsilon = min(1.0 / num_buckets, 0.01)
-
-    x_values = tf_utils.get_values(x)
-    bucket_boundaries = analyzers.quantiles(
-        x_values,
-        num_buckets,
-        epsilon,
-        weights,
-        reduce_instance_dims=not elementwise)
-
-    if not elementwise:
-      return apply_buckets(x, bucket_boundaries)
-
-    num_features = tf.math.reduce_prod(x.get_shape()[1:])
-    bucket_boundaries = tf.reshape(bucket_boundaries, [num_features, -1])
-    x_reshaped = tf.reshape(x, [-1, num_features])
-    bucketized = []
-    for idx, boundaries in enumerate(tf.unstack(bucket_boundaries, axis=0)):
-      bucketized.append(apply_buckets(x_reshaped[:, idx],
-                                      tf.expand_dims(boundaries, axis=0)))
-    return tf.reshape(tf.stack(bucketized, axis=1),
-                      [-1] + x.get_shape().as_list()[1:])
-```
-
-参数表：
-
-| Args          |                                                              |
-| :------------ | ------------------------------------------------------------ |
-| `x`           | 一个数字输入的 `Tensor`或`CompositeTensor`，其值应被映射到桶中。对于一个`CompositeTensor`，只有非缺失的值才会被包括在定量计算中，`bucketize`的结果将是一个`CompositeTensor`，其非缺失的值被映射到桶中。如果 elementwise=True，那么`x`必须是密集的。 |
-| `num_buckets` | 输入的`x`中的值被分成大小大致相等的桶，桶的数量是`num_buckets`。 |
-| `epsilon`     | （可选）误差容限，通常是一个接近于零的小部分。如果调用者没有指定一个值，将根据实验结果计算出一个合适的值。对于小于 100 的`num_buckets`，选择 0.01 的值来处理高达约 1 万亿的输入数据值的数据集。如果`num_buckets`更大，那么 epsilon 被设置为 (1 / `num_buckets`) 以执行更严格的误差容忍度，因为更多的桶将导致每个桶的范围更小，所以我们希望边界不那么模糊。详情见analyzers.quantiles()。 |
-| `weights`     | （可选）用于定量的权重张量。张量必须与 x 具有相同的形状。    |
-| `elementwise` | （可选）如果为真，对 tensor 的每个元素进行独立的桶化。       |
-| `name`        | (可选) 该操作的名称。                                        |
-
-# 四、对比分析
-
-## 共同点
-
-- 都能实现根据 `sorted_sequence` 计算出输入 `x` 中每个元素所对应的区间索引
-
-## 不同点
-
-- PyTorch 是在 C++ API 基础上实现，使用 Python 调用 C++ 对应的接口。
-- PyTorch 输入参数比较简单，可选的操作比较少。
-- Tensorflow 则是通过 Python API 直接实现其对应的功能。
-- Tensorflow 有 `num_buckets`、`epsilon`、`weights` 等参数的设置，可调整的程度更高。
-
-
-# 五、设计思路与实现方案
-
-## 命名与参数设计
-
-添加 API
-
-```python
-paddle.bucketize(
-    x: Tensor,
-    sorted_sequence: Tensor,
-    out_int32: bool=False,
-    right: bool=False,
-    name: str=None
-)
-```
-
-## 底层 OP 设计
-
-使用已有的 API 组合实现，不再单独设计 OP。
-
-## API 实现方案
-
-该 API 实现于 `python/paddle/tensor/search.py`
-
-首先，`bucketize` 主要针对一维情况下的 `sorted_sequence`，所以需要对输入的维度大小进行判断，通过断言进行判断，当输入维度不为 1 时触发 `AssertError`。
-
-随后，Paddle 中已有 `searchsorted` API 的具体实现逻辑，位于 `python/paddle/tensor/search.py` 下的 `searchsorted` 函数中，因此只需要调用其函数即可。
-
-# 六、测试和验收的考量
-
-测试需要考虑的 case 如下：
-
-- 输出数值结果的一致性，使用 numpy 作为参考标准
-- 参数 `right` 为 True 和 False 时输出的正确性
-- 参数 `out_int32` 为 True 和 False 时 dtype 输出的正确性
-- 参数 `x` 类型的正确性，若类型不为 Tensor 则抛出异常
-- 参数 `sorted_sequence` 的维度正确性，该 API 只针对 `sorted_sequence` 是一维的情况，所以对于输入需要约束
-- 未输入 `right` 时的输出正确性；
-- 未输入 `out_int32` 时的输出正确性；
-
-# 七、可行性分析和排期规划
-
-方案主要依赖现有 Paddle API 组合而成，且依赖的 `paddle.searchsorted` 已经在 Paddle repo 的 [python/paddle/tensor/search.py](https://github.com/PaddlePaddle/Paddle/blob/release/2.3/python/paddle/tensor/search.py#L910)。工期上可以满足在当前版本周期内开发完成。
-
-# 八、影响面
-
-新增 API，对其他模块是否有影响
-
-# 名词解释
-
-无
-
-# 附件及参考资料
-
-## PyTorch
-
-[torch.bucketize](https://pytorch.org/docs/stable/generated/torch.bucketize.html)
-
-[torch.searchsorted](https://pytorch.org/docs/stable/generated/torch.searchsorted.html?highlight=searchsorted#torch.searchsorted)
-
-## tensorflow
-
-[tf.transform.bucketize](https://www.tensorflow.org/tfx/transform/api_docs/python/tft/bucketize)
-
-[tf.searchsorted](https://www.tensorflow.org/api_docs/python/tf/searchsorted)
-
-## Paddle
-
-[paddle.searchsorted](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/searchsorted_cn.html)
\ No newline at end of file

From 5c62c3aeb5ce68d9592039ebeaee815128cb7da2 Mon Sep 17 00:00:00 2001
From: PommesPeter <434596665@qq.com>
Date: Tue, 21 Feb 2023 16:00:36 +0800
Subject: [PATCH 6/7] [Doc] Deleted an error

---
 rfcs/APIs/20230221_api_design_for_polor.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfcs/APIs/20230221_api_design_for_polor.md b/rfcs/APIs/20230221_api_design_for_polor.md
index 417a8cc28..5ed707892 100644
--- a/rfcs/APIs/20230221_api_design_for_polor.md
+++ b/rfcs/APIs/20230221_api_design_for_polor.md
@@ -211,7 +211,7 @@ $$
 
 # 八、影响面
 
-新增 API，对其他模块无有影响
+新增 API，对其他模块无影响
 
 # 名词解释
 

From 1bf37cdc7fc39028ff7426b10e0cb85e3217bd90 Mon Sep 17 00:00:00 2001
From: PommesPeter <434596665@qq.com>
Date: Tue, 21 Feb 2023 16:56:34 +0800
Subject: [PATCH 7/7] [Doc] Updated API rfc doc

---
 rfcs/APIs/20230221_api_design_for_polor.md | 44 +---------------------
 1 file changed, 2 insertions(+), 42 deletions(-)

diff --git a/rfcs/APIs/20230221_api_design_for_polor.md b/rfcs/APIs/20230221_api_design_for_polor.md
index 5ed707892..c3a5eaf0d 100644
--- a/rfcs/APIs/20230221_api_design_for_polor.md
+++ b/rfcs/APIs/20230221_api_design_for_polor.md
@@ -113,44 +113,6 @@ Tensor polar(const Tensor& abs, const Tensor& angle) {
 - angle：复数张量的角度。数据类型必须与abs相同。
 - out：如果输入为 torch.float32，则必须为 torch.complex64。如果输入为 torch.float64，则必须为 torch.complex128。
 
-## SciPy
-
-实现方法上，Scipy 是通过 Python API 的方式组合实现的，[代码位置](https://github.com/scipy/scipy/blob/v1.10.1/scipy/linalg/_decomp_polar.py#L8-L111)
-
-代码实现：
-```python
-def polar(a, side="right"):
-    if side not in ['right', 'left']:
-        raise ValueError("`side` must be either 'right' or 'left'")
-    a = np.asarray(a)
-    if a.ndim != 2:
-        raise ValueError("`a` must be a 2-D array.")
-
-    w, s, vh = svd(a, full_matrices=False)
-    u = w.dot(vh)
-    if side == 'right':
-        # a = up
-        p = (vh.T.conj() * s).dot(vh)
-    else:
-        # a = pu
-        p = (w * s).dot(w.T.conj())
-    return u, p
-```
-
-参数表：
-
-- Parameters:
-    - a: (m, n) array_like
-        The array to be factored.
-    - side: {‘left’, ‘right’}, optional
-        Determines whether a right or left polar decomposition is computed. If side is “right”, then a = up. If side is “left”, then a = pu. The default is “right”.
-
-- Returns:
-    - u: (m, n) ndarray
-        If a is square, then u is unitary. If m > n, then the columns of a are orthonormal, and if m < n, then the rows of u are orthonormal.
-    - p: ndarray
-        p is Hermitian positive semidefinite. If a is nonsingular, p is positive definite. The shape of p is (n, n) or (m, m), depending on whether side is “right” or “left”, respectively.
-
 # 四、对比分析
 
 ## 共同点
@@ -160,8 +122,6 @@ def polar(a, side="right"):
 ## 不同点
 
 - PyTorch 是在 C++ API 基础上实现，使用 Python 调用 C++ 对应的接口。
-- Scipy 则是通过 Python API 直接实现其对应的功能。
-- Tensorflow 有 `a`、`side` 等参数的设置，可调整的程度更高。
 
 # 五、设计思路与实现方案
 
@@ -201,13 +161,13 @@ $$
 
 测试需要考虑的 case 如下：
 
-- 输出数值结果的一致性和数据类型是否正确，使用 pytorch 或 scipy 作为参考标准
+- 输出数值结果的一致性和数据类型是否正确，使用 pytorch 作为参考标准
 - 参数 `abs` 的数据类型准确性判断
 - 参数 `angle` 的数据类型准确性判断、
 
 # 七、可行性分析和排期规划
 
-方案主要依赖现有 Paddle API 组合而成，且依赖的 `paddle.complex` 已经在 Paddle repo 的 [python/paddle/tensor/search.py](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/tensor/creation.py#L2160-L2209)。工期上可以满足在当前版本周期内开发完成。
+方案主要依赖现有 Paddle API 组合而成，且依赖的 `paddle.complex` 已经在 Paddle repo 的 [python/paddle/tensor/creation.py](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/tensor/creation.py#L2160-L2209)。工期上可以满足在当前版本周期内开发完成。
 
 # 八、影响面