From 51212f4a1638155331bf8edae87058dfbd08723b Mon Sep 17 00:00:00 2001 From: megemini Date: Thu, 12 Oct 2023 16:18:41 +0800 Subject: [PATCH 1/6] [Add] add Hackathon 5th No.38 rfc --- ...1009_api_design_for_fractional_max_pool.md | 1126 +++++++++++++++++ 1 file changed, 1126 insertions(+) create mode 100644 rfcs/APIs/20231009_api_design_for_fractional_max_pool.md diff --git a/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md b/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md new file mode 100644 index 000000000..cbb3f8c3c --- /dev/null +++ b/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md @@ -0,0 +1,1126 @@ +# FractionalMaxPool2d / FractionalMaxPool3d API 设计文档 + +| API 名称 | FractionalMaxPool2d / FractionalMaxPool3d | +| - | - | +| 提交作者 | megemini(柳顺) | +| 提交时间 | 2023-10-09 | +| 版本号 | V1.0 | +| 依赖飞桨版本 | develop | +| 文件名 | 20231009_api_design_for_fractional_max_pool.md | + +# 一、概述 + +## 1、相关背景 + +[《Fractional Max-Pooling》](https://arxiv.org/abs/1412.6071) 这篇文章介绍了一种 `fractional` 的池化方法,区别与传统的池化方法,如 `max-pooling`,`《Fractional Max-Pooling》` 的池化因子可以在 `1 < alpha < 2` 之间,也就是说,每次池化操作可以将输入缩小诸如 `sqrt(2)` 倍,而不是简单的 `2` 倍。比如,可以将输入尺寸为 `25` 缩小为输出 `18`,此时 `alpha = 25/18 = 1.39`。 + +文章中提到,这种池化方法可以防止传统池化方式快速缩小输入尺寸,从而影响性能的问题。可以介由网络对于更多不同尺寸输入的识别,以提升模型整体的识别能力。 + +飞桨目前实现了诸如 `max-pooling`、`avg-pooling` 等方法,但没有实现 `fractional max pooling`,此次实现 `fractional max pool2d / fractional max pool3d` 以提升飞桨 API 的丰富程度。 + +## 2、功能目标 + +在一个由多个通道组成的输入信号上施加分数最大池化。分数最大池化请参考论文 [《Fractional Max-Pooling》](https://arxiv.org/abs/1412.6071) +调用形式 +- `paddle.nn.FractionalMaxPool2d` +- `paddle.nn.FractionalMaxPool3d` +- `paddle.nn.functional.fractional_max_pool2d` +- `paddle.nn.functional.fractional_max_pool3d` + +## 3、意义 + +为 `Paddle` 增加 `Fractional Max-Pooling` 操作,丰富 `Paddle` 中池化操作相关的 API。 + +# 二、飞桨现状 + +飞桨目前已经提供了诸多的池化方法,如:`max_poolNd`、`avg_poolNd` 等,但尚未提供 `fractional_max_pool` 方法,底层也没有相关算子的实现。 + +飞桨目前将池化操作相关函数放在 `python/paddle/nn/functional/pooling.py` 文件中,另外,在 `python/paddle/nn/layer/pooling.py` 中提供了构造网络需要的模块。其中对应的 `layer` 层,均可通过调用 `functional` 相关函数实现。 + +由此,`paddle.nn.FractionalMaxPoolNd` 可以通过调用 `paddle.nn.functional.fractional_max_poolNd` 实现。 + +# 三、业内方案调研 + +## 算法逻辑 + +对比 `2*2 max pooling` (2MP) ,2MP 的采样序列为 `22222...`,如果将其中混杂 `1`,如 `1121122112...`,便可以生成 `1 < alpha = N_in/N_out < 2` 的池化结果。 + +因此,算法的关键是如何生成 `1121122112...` 类似的序列,以满足 `output_size` 或 `input_size * output_ratio`。 + +注:这里的 `1` 和 `2` 可以理解为 `kernel/pool size`,也就是每次池化的尺寸,或者是文章中的 `increments`,之所以是 `1`、`2`,前提是 `1 < alpha < 2`,也就是说,这是介于 `原尺寸` 与 `2*2 max pooling` 之间的池化操作。如果 `alpha > 2`,类似于 `3*3 max pooling`,这里的序列可以是任何大于零的整数。后续为简化谈论,假设 `1 < alpha < 2`。 + +文章中介绍了两种方式,`真` 随机(`random`)与 `伪` 随机(`pseudo random`)。 + +- `真` 随机(`random`) + + 随机生成 `1` 和 `2` 的序列,只要满足: + + - 序列长度为 `output_size` + - 序列累加和为 `input_size` + +- `伪` 随机(`pseudo random`) + + 这里生成的累加序列,需要满足: + + `a = ceil(alpha(i+u)), 1 < alpha = N_in/N_out < 2, 0 < u < 1, i = 0,1,2...N_out` + + 长度为 `output_size + 1`,`u` 为随机数,可以利用随机种子固定住。由此生成序列: + + `diff = a[i+1] - a[i]` + +生成随机序列后,便可以利用 `max` 操作,在每个池化窗口取最大值,由此产生最后的输出。 + +## PyTorch + +`PyTorch` 底层通过 c++ 实现 `fractional_max_pool2d / fractional_max_pool3d` 函数,并通过上层的 python 对外开放相应接口。 + +相应的,`FractionalMaxPool2d` 通过 `fractional_max_pool2d` 实现,`FractionalMaxPool3d` 通过 `fractional_max_pool3d` 实现。 + +相应文档: + +- [FRACTIONALMAXPOOL2D](https://pytorch.org/docs/stable/generated/torch.nn.FractionalMaxPool2d.html#fractionalmaxpool2d) +- [FRACTIONALMAXPOOL3D](https://pytorch.org/docs/stable/generated/torch.nn.FractionalMaxPool3d.html#fractionalmaxpool3d) +- 
[TORCH.NN.FUNCTIONAL.FRACTIONAL_MAX_POOL2D](https://pytorch.org/docs/stable/generated/torch.nn.functional.fractional_max_pool2d.html#torch.nn.functional.fractional_max_pool2d) +- [TORCH.NN.FUNCTIONAL.FRACTIONAL_MAX_POOL3D](https://pytorch.org/docs/stable/generated/torch.nn.functional.fractional_max_pool3d.html#torch.nn.functional.fractional_max_pool3d) + + +相应接口为: + +- `torch.nn.FractionalMaxPool2d(kernel_size, output_size=None, output_ratio=None, return_indices=False, _random_samples=None)` + + - 文档描述 + > Applies 2D fractional max pooling over an input signal composed of several input planes. + + - 参数列表 + > kernel_size – the size of the window to take a max over. + > output_size – the target output size + > output_ratio – If one wants to have an output size as a ratio of the input size, this option can be given. This has to be a number or tuple in the range (0, 1) + > return_indices – if True, will return the indices along with the outputs. + + - 返回值 + > output (Tensor) + +- `torch.nn.FractionalMaxPool3d(kernel_size, output_size=None, output_ratio=None, return_indices=False, _random_samples=None)` + + - 文档描述 + > Applies 3D fractional max pooling over an input signal composed of several input planes. + + - 参数列表 + > kernel_size – the size of the window to take a max over. + > output_size – the target output size + > output_ratio – If one wants to have an output size as a ratio of the input size, this option can be given. This has to be a number or tuple in the range (0, 1) + > return_indices – if True, will return the indices along with the outputs. + + - 返回值 + > output (Tensor) + +- `torch.nn.functional.fractional_max_pool2d(input, kernel_size, output_size=None, output_ratio=None, return_indices=False, _random_samples=None)` + + - 文档描述 + > Applies 2D fractional max pooling over an input signal composed of several input planes. + + - 参数列表 + > kernel_size – the size of the window to take a max over. + > output_size – the target output size + > output_ratio – If one wants to have an output size as a ratio of the input size, this option can be given. This has to be a number or tuple in the range (0, 1) + > return_indices – if True, will return the indices along with the outputs. + + - 返回值 + > output (Tensor) + +- `torch.nn.functional.fractional_max_pool3d(input, kernel_size, output_size=None, output_ratio=None, return_indices=False, _random_samples=None)` + + - 文档描述 + > Applies 3D fractional max pooling over an input signal composed of several input planes. + + - 参数列表 + > kernel_size – the size of the window to take a max over. + > output_size – the target output size + > output_ratio – If one wants to have an output size as a ratio of the input size, this option can be given. This has to be a number or tuple in the range (0, 1) + > return_indices – if True, will return the indices along with the outputs. 
+ + - 返回值 + > output (Tensor) + + +实现逻辑: + +由于 `fractional_max_pool2d` 与 `fractional_max_pool3d` 最大的区别是维度,其他逻辑基本相同,所以,后续以 `fractional_max_pool2d` 为主要分析对象。 + +相关源代码涉及文件: + +- `torch/nn/functional.py` * +- `torch/csrc/api/include/torch/nn/options/pooling.h` +- `torch/csrc/api/include/torch/nn/functional/pooling.h` * +- `torch/csrc/api/include/torch/nn/modules/pooling.h` +- `torch/csrc/api/src/nn/modules/pooling.cpp` +- `aten/src/ATen/native/FractionalMaxPooling.h` * +- `aten/src/ATen/native/FractionalMaxPool2d.cpp` * + +这里只分析上述带有 `*` 的主要源文件。 + +- `torch/nn/functional.py` + + 这里对 `fractional_max_pool2d` 开放 API: + + ``` python + def fractional_max_pool2d_with_indices( + input: Tensor, kernel_size: BroadcastingList2[int], + output_size: Optional[BroadcastingList2[int]] = None, + output_ratio: Optional[BroadcastingList2[float]] = None, + return_indices: bool = False, + _random_samples: Optional[Tensor] = None + ) -> Tuple[Tensor, Tensor]: + r""" + fractional_max_pool2d(input, kernel_size, output_size=None, output_ratio=None, return_indices=False, _random_samples=None) + + ... + """ + if has_torch_function_variadic(input, _random_samples): + return handle_torch_function( + fractional_max_pool2d_with_indices, + (input, _random_samples), + input, + kernel_size, + output_size=output_size, + output_ratio=output_ratio, + return_indices=return_indices, + _random_samples=_random_samples, + ) + if output_size is None and output_ratio is None: + raise ValueError("fractional_max_pool2d requires specifying either " "an output_size or an output_ratio") + if output_size is None: + assert output_ratio is not None + if len(output_ratio) > 2: + raise ValueError("fractional_max_pool2d requires output_ratio to either be a single Int or tuple of Ints.") + _output_ratio = _pair(output_ratio) + output_size = [int(input.size(-2) * _output_ratio[0]), int(input.size(-1) * _output_ratio[1])] + + if _random_samples is None: + n_batch = 1 if input.dim() == 3 else input.size(0) + _random_samples = torch.rand(n_batch, input.size(-3), 2, dtype=input.dtype, device=input.device) + return torch._C._nn.fractional_max_pool2d(input, kernel_size, output_size, _random_samples) + + + def _fractional_max_pool2d( + input: Tensor, kernel_size: BroadcastingList2[int], + output_size: Optional[BroadcastingList2[int]] = None, + output_ratio: Optional[BroadcastingList2[float]] = None, + return_indices: bool = False, + _random_samples: Optional[Tensor] = None + ) -> Tensor: + if has_torch_function_variadic(input, _random_samples): + return handle_torch_function( + fractional_max_pool2d, + (input, _random_samples), + input, + kernel_size, + output_size=output_size, + output_ratio=output_ratio, + return_indices=return_indices, + _random_samples=_random_samples, + ) + return fractional_max_pool2d_with_indices( + input, kernel_size, output_size, output_ratio, return_indices, _random_samples + )[0] + + + fractional_max_pool2d = boolean_dispatch( + arg_name="return_indices", + arg_index=4, + default=False, + if_true=fractional_max_pool2d_with_indices, + if_false=_fractional_max_pool2d, + module_name=__name__, + func_name="fractional_max_pool2d", + ) + ``` + + 这里根据是否需要 `indices` 对接口进行分发,最终都是调用 `fractional_max_pool2d_with_indices`。 + + +- `torch/csrc/api/include/torch/nn/functional/pooling.h` + + 上面的接口会调用这里对应的 c++ 实现: + + ``` cpp + namespace detail { + inline std::tuple fractional_max_pool2d_with_indices( + const Tensor& input, + const ExpandingArray<2>& kernel_size, + const c10::optional>& output_size, + const c10::optional>& output_ratio, + const 
Tensor& _random_samples) { + if (output_size == c10::nullopt && output_ratio == c10::nullopt) { + TORCH_CHECK( + false, + "fractional_max_pool2d requires specifying either ", + "an output_size or an output_ratio"); + } + c10::optional> output_size_ = output_size; + if (output_size_ == c10::nullopt) { + TORCH_INTERNAL_ASSERT(output_ratio != c10::nullopt); + output_size_ = { + (int64_t)(static_cast(input.size(-2)) * (*output_ratio.value())[0]), + (int64_t)(static_cast(input.size(-1)) * (*output_ratio.value())[1])}; + } + + Tensor _random_samples_ = _random_samples; + if (!_random_samples_.defined()) { + auto n_batch = input.dim() == 3 ? 1 : input.size(0); + _random_samples_ = torch::rand( + {n_batch, input.size(-3), 2}, + torch::TensorOptions().dtype(input.dtype()).device(input.device())); + } + return torch::fractional_max_pool2d( + input, kernel_size, *output_size_, _random_samples_); + } + } // namespace detail + ``` + + 这里是 `fractional_max_pool2d` 主要入口,主要做了以下几处处理: + + - 如果没有 `output_size`,根据 `output_ratio` 生成 `output_size` + - 如果没有 `_random_samples`,根据输入的维度生成随机序列 + - 调用主要方法 `torch::fractional_max_pool2d(input, kernel_size, *output_size_, _random_samples_);}` + + +- `aten/src/ATen/native/FractionalMaxPool2d.cpp` + + 这里实现了具体的逻辑: + + ``` cpp + template + static void fractional_max_pool2d_out_single_batch_frame( + scalar_t* input, + scalar_t* output, + int64_t* indices, + scalar_t* randomSamples, + int numPlanes, + int inputW, int inputH, + int outputW, int outputH, + int poolSizeW, int poolSizeH) { + at::parallel_for(0, numPlanes, 0, [&](int64_t start, int64_t end) { + for (const auto plane : c10::irange(start, end)) { + /* each plane contains 2 random samples, one for W and one for H */ + scalar_t* randomSamplesForPlane = randomSamples + plane * 2; + + /* Generate interval sequence */ + auto sequenceW = generate_intervals( + randomSamplesForPlane[0], inputW, outputW, poolSizeW); + auto sequenceH = generate_intervals( + randomSamplesForPlane[1], inputH, outputH, poolSizeH); + + /* loop over output */ + // NOLINTNEXTLINE(cppcoreguidelines-init-variables) + int h, w; + + scalar_t* inputForPlane = input + plane * inputW * inputH; + scalar_t* outputForPlane = output + plane * outputW * outputH; + int64_t* indicesForPlane = indices + plane * outputW * outputH; + + for (h = 0; h < outputH; ++h) { + int inputHStart = sequenceH[h]; + + for (w = 0; w < outputW; ++w) { + int inputWStart = sequenceW[w]; + + int h2 = inputHStart, w2 = inputWStart; + scalar_t maxVal = -std::numeric_limits::infinity(); + int64_t maxIndex = h2 * inputW + w2; + + for (h2 = inputHStart; h2 < inputHStart + poolSizeH; ++h2) { + for (w2 = inputWStart; w2 < inputWStart + poolSizeW; ++w2) { + AT_ASSERT(h2 >= 0 && h2 < inputH); + AT_ASSERT(w2 >= 0 && w2 < inputW); + + int planeIndex = h2 * inputW + w2; + scalar_t val = inputForPlane[planeIndex]; + if (val > maxVal || std::isnan(val)) { + maxVal = val; + maxIndex = planeIndex; + } + } + } + + outputForPlane[h * outputW + w] = maxVal; + indicesForPlane[h * outputW + w] = maxIndex; + } + } + } + }); + } + ``` + + 此文件实现了 `fractional_max_pool2d` 的主要逻辑,上面只摘抄了最关键的代码。 + + 主要逻辑为: + + - 生成采样的序列 + - 获取序列中的每个 pool 中的最大值 + + 其中,生成采样序列的逻辑在 `aten/src/ATen/native/FractionalMaxPooling.h`: + + ``` cpp + template + static inline std::vector generate_intervals( + scalar_t sample, + int64_t inputSize, + int64_t outputSize, + int64_t poolSize) { + std::vector sequence(outputSize); + if (outputSize > 1) { + scalar_t alpha = static_cast(inputSize - poolSize) / + static_cast(outputSize - 1); + + for 
(const auto i : c10::irange(outputSize - 1)) { + sequence[i] = + static_cast((i + sample) * alpha) - static_cast(sample * alpha); + } + } + if (outputSize > 0) { + sequence[outputSize - 1] = inputSize - poolSize; + } + return sequence; + } + ``` + +从上面的源代码分析可以看到,`PyTorch` 对于 `fractional_max_pool` 只实现了 `pseudo random` 的方式,而没有 `random` 的方式。 + + +## TensorFlow + +`TensorFlow` 实现了 `tf.nn.fractional_max_pool` 函数,对应 `PyTorch` 的函数为 `fractional_max_pool2d`。 + +相应的,实现了 `tf.raw_ops.FractionalMaxPool` ,对应 `PyTorch` 的 `FractionalMaxPool2d`。 + +`TensorFlow` 并没有 `3D` 相关的实现。 + +`3D` 相对 `2D` ,多了一个 `depth` 或者 `time` 等类似的维度。 + +相应文档: + +- [tf.raw_ops.FractionalMaxPool](https://tensorflow.google.cn/api_docs/python/tf/raw_ops/FractionalMaxPool?hl=en) +- [tf.nn.fractional_max_pool](https://tensorflow.google.cn/api_docs/python/tf/nn/fractional_max_pool?hl=en) + +相应接口为: + +- `tf.raw_ops.FractionalMaxPool` + + - 文档描述 + > Performs fractional max pooling on the input. + + - 参数列表 + > value – A Tensor. 4-D with shape [batch, height, width, channels]. + > pooling_ratio – An int or list of ints that has length 1, 2 or 4. + > pseudo_random – An optional bool. Defaults to False. When set to True, generates the pooling sequence in a pseudorandom fashion, otherwise, in a random fashion. + > overlapping – An optional bool. Defaults to False. When set to True, it means when pooling, the values at the boundary of adjacent pooling cells are used by both cells. + > deterministic – An optional bool. Defaults to False. When set to True, a fixed pooling region will be used when iterating over a FractionalMaxPool node in the computation graph. + > seed – An optional int. Defaults to 0. If set to be non-zero, the random number generator is seeded by the given seed. Otherwise it is seeded by a random seed. + > seed2 – An optional int. Defaults to 0. An second seed to avoid seed collision. + > name – A name for the operation (optional). + + - 返回值 + > output (A tuple of Tensor objects) + +- `tf.nn.fractional_max_pool` + + - 文档描述 + > Performs fractional max pooling on the input. + + - 参数列表 + > value – A Tensor. 4-D with shape [batch, height, width, channels]. + > pooling_ratio – An int or list of ints that has length 1, 2 or 4. + > pseudo_random – An optional bool. Defaults to False. When set to True, generates the pooling sequence in a pseudorandom fashion, otherwise, in a random fashion. + > overlapping – An optional bool. Defaults to False. When set to True, it means when pooling, the values at the boundary of adjacent pooling cells are used by both cells. + > seed – An optional int. Defaults to 0. If set to be non-zero, the random number generator is seeded by the given seed. Otherwise it is seeded by a random seed. + > name – A name for the operation (optional). 
+ + - 返回值 + > output (A tuple of Tensor objects) + +实现逻辑: + +相关源代码涉及文件: + +- `tensorflow/python/ops/nn_ops.py` * +- `tensorflow/core/kernels/fractional_pool_common.h` +- `tensorflow/core/kernels/fractional_pool_common.cc` * +- `tensorflow/core/kernels/fractional_max_pool_op.cc` * + +这里只分析上述带有 `*` 的主要源文件。 + +- `tensorflow/python/ops/nn_ops.py` + + 这里注册 python 接口: + + ``` python + @tf_export("nn.fractional_max_pool", v1=[]) + @dispatch.add_dispatch_support + def fractional_max_pool_v2(value, + pooling_ratio, + pseudo_random=False, + overlapping=False, + seed=0, + name=None): # pylint: disable=redefined-builtin + if (isinstance(pooling_ratio, (list, tuple))): + if (pooling_ratio[0] != 1.0 or pooling_ratio[-1] != 1.0): + raise ValueError( + "`pooling_ratio` should have first and last elements with value 1.0. " + f"Received: pooling_ratio={pooling_ratio}") + for element in pooling_ratio: + if element < 1.0: + raise ValueError( + f"`pooling_ratio` elements should be >= 1.0. " + f"Received: pooling_ratio={pooling_ratio}") + elif (isinstance(pooling_ratio, (int, float))): + if pooling_ratio < 1.0: + raise ValueError( + "`pooling_ratio` should be >= 1.0. " + f"Received: pooling_ratio={pooling_ratio}") + else: + raise ValueError( + "`pooling_ratio` should be an int or a list of ints. " + f"Received: pooling_ratio={pooling_ratio}") + + pooling_ratio = _get_sequence(pooling_ratio, 2, 3, "pooling_ratio") + + if seed == 0: + if config.is_op_determinism_enabled(): + raise ValueError( + f"tf.nn.fractional_max_pool requires a non-zero seed to be passed in " + f"when determinism is enabled, but got seed={seed}. Please pass in a " + f'non-zero seed, e.g. by passing "seed=1".') + return gen_nn_ops.fractional_max_pool(value, pooling_ratio, pseudo_random, + overlapping, deterministic=False, + seed=0, seed2=0, name=name) + else: + seed1, seed2 = random_seed.get_seed(seed) + return gen_nn_ops.fractional_max_pool(value, pooling_ratio, pseudo_random, + overlapping, deterministic=True, + seed=seed1, seed2=seed2, name=name) + + ``` + + 可以看到,与 `PyTorch` 不同的是,`TensorFlow` 多了几个参数: + + - `overlapping` 控制 pool 边界是否计算在内 + - `pseudo_random` 是否是伪随机 + - `seed` 随机种子 + +- `tensorflow/core/kernels/fractional_max_pool_op.cc` + + 这里实现了主要逻辑: + + ``` cpp + template + class FractionalMaxPoolOp : public OpKernel { + public: + explicit FractionalMaxPoolOp(OpKernelConstruction* context) + : OpKernel(context) { + + ... + + if (deterministic_) { + // If both seeds are not set when deterministic_ is true, force set seeds. + if ((seed_ == 0) && (seed2_ == 0)) { + seed_ = random::New64(); + seed2_ = random::New64(); + } + } else { + OP_REQUIRES( + context, (seed_ == 0) && (seed2_ == 0), + errors::InvalidArgument( + "Both seed and seed2 should be 0 if deterministic is false.")); + } + } + + void Compute(OpKernelContext* context) override { + typedef Eigen::Map> + ConstEigenMatrixMap; + typedef Eigen::Map> + EigenMatrixMap; + + constexpr int tensor_in_and_out_dims = 4; + + const Tensor& tensor_in = context->input(0); + + std::vector input_size(tensor_in_and_out_dims); + std::vector output_size(tensor_in_and_out_dims); + for (int i = 0; i < tensor_in_and_out_dims; ++i) { + input_size[i] = tensor_in.dim_size(i); + } + // Output size. + for (int i = 0; i < tensor_in_and_out_dims; ++i) { + // This must match the same logic in the shape function in + // core/ops/nn_ops.cc. + output_size[i] = + static_cast(std::floor(input_size[i] / pooling_ratio_[i])); + DCHECK_GT(output_size[i], 0); + } + + // Generate pooling sequence. 
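+    // height_cum_seq / width_cum_seq are cumulative start-index sequences of
+    // length output_length + 1, produced by GeneratePoolingSequence below
+    // according to the pseudo_random_ flag; adjacent differences give the
+    // size of each pooling window.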
+ std::vector height_cum_seq; + std::vector width_cum_seq; + GuardedPhiloxRandom generator; + generator.Init(seed_, seed2_); + height_cum_seq = GeneratePoolingSequence(input_size[1], output_size[1], + &generator, pseudo_random_); + width_cum_seq = GeneratePoolingSequence(input_size[2], output_size[2], + &generator, pseudo_random_); + + // Prepare output. + Tensor* output_tensor = nullptr; + Tensor* output_height_seq_tensor = nullptr; + Tensor* output_width_seq_tensor = nullptr; + + ConstEigenMatrixMap in_mat(tensor_in.flat().data(), input_size[3], + input_size[2] * input_size[1] * input_size[0]); + + EigenMatrixMap out_mat(output_tensor->flat().data(), output_size[3], + output_size[2] * output_size[1] * output_size[0]); + + // Initializes the output tensor with MIN. + output_tensor->flat().setConstant(Eigen::NumTraits::lowest()); + + auto output_height_seq_flat = output_height_seq_tensor->flat(); + auto output_width_seq_flat = output_width_seq_tensor->flat(); + + // Set output tensors. + for (int i = 0; i < height_cum_seq.size(); ++i) { + output_height_seq_flat(i) = height_cum_seq[i]; + } + + for (int i = 0; i < width_cum_seq.size(); ++i) { + output_width_seq_flat(i) = width_cum_seq[i]; + } + + // For both input and output, + // 0: batch + // 1: height / row + // 2: width / col + // 3: depth / channel + const int64_t height_max = input_size[1] - 1; + const int64_t width_max = input_size[2] - 1; + for (int64_t b = 0; b < input_size[0]; ++b) { + // height sequence. + for (int64_t hs = 0; hs < height_cum_seq.size() - 1; ++hs) { + // height start and end. + const int64_t height_start = height_cum_seq[hs]; + int64_t height_end = + overlapping_ ? height_cum_seq[hs + 1] : height_cum_seq[hs + 1] - 1; + height_end = std::min(height_end, height_max); + + // width sequence. + for (int64_t ws = 0; ws < width_cum_seq.size() - 1; ++ws) { + const int64_t out_offset = + (b * output_size[1] + hs) * output_size[2] + ws; + // width start and end. + const int64_t width_start = width_cum_seq[ws]; + int64_t width_end = + overlapping_ ? width_cum_seq[ws + 1] : width_cum_seq[ws + 1] - 1; + width_end = std::min(width_end, width_max); + for (int64_t h = height_start; h <= height_end; ++h) { + for (int64_t w = width_start; w <= width_end; ++w) { + const int64_t in_offset = + (b * input_size[1] + h) * input_size[2] + w; + out_mat.col(out_offset) = + out_mat.col(out_offset).cwiseMax(in_mat.col(in_offset)); + } + } + } + } + } + } + + }; + ``` + + 其中序列生成的函数在: `tensorflow/core/kernels/fractional_pool_common.cc` + + ``` cpp + static std::vector GeneratePoolingSequencePseudoRandom( + int input_length, int output_length, GuardedPhiloxRandom* generator) { + std::vector cum_seq(output_length + 1, 0); + std::vector diff(output_length, 0); + + double alpha = static_cast(input_length) / output_length; + int k = input_length / output_length; + + double u_max1 = (k + 2) / alpha - 1; + double u_max2 = (input_length + 1 - k) / alpha - (output_length - 1); + double max_u = std::min(u_max1, u_max2); + + // Generate random number in parallel. 
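+    // Sample u in (0, max_u) so that adjacent differences of
+    // cum_seq[i] = ceil(alpha * (i + u)) stay within {k, k+1}.
+    // E.g. for the paper's input_length = 25, output_length = 18 case,
+    // k = 1 and max_u = 1, i.e. exactly the 0 < u < 1 setting.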
+ auto local_gen = generator->ReserveSamples32(2); + random::SimplePhilox random(&local_gen); + const double u = random.RandDouble() * max_u; + + cum_seq[0] = 1; + cum_seq[output_length] = input_length + 1; + for (int i = 1; i < output_length; ++i) { + cum_seq[i] = static_cast(ceil(alpha * (i + u))); + } + + for (int i = 0; i < output_length; ++i) { + diff[i] = cum_seq[i + 1] - cum_seq[i]; + } + + return diff; + } + + static std::vector GeneratePoolingSequenceRandom( + int input_length, int output_length, GuardedPhiloxRandom* generator) { + int k = input_length / output_length; + int num_random_spot = input_length % output_length; + std::vector diff(output_length, k); + + for (int i = 0; i < num_random_spot; ++i) { + diff[i] += 1; + } + + // Randomly shuffle this vector. + auto local_gen = generator->ReserveSamples32(diff.size()); + random::SingleSampleAdapter single(&local_gen); + const auto uniform = [&single](uint32 n) { return single() % n; }; + RandomShuffle(diff.begin(), diff.end(), uniform); + + return diff; + } + + std::vector GeneratePoolingSequence(int input_length, + int output_length, + GuardedPhiloxRandom* generator, + bool pseudo_random) { + std::vector diff; + // This is a case that regular pooling can handle, just return diff with + // each element input_length/output_length. + if (input_length % output_length == 0) { + diff = std::vector(output_length, input_length / output_length); + } + + if (pseudo_random) { + diff = GeneratePoolingSequencePseudoRandom(input_length, output_length, + generator); + } else { + diff = + GeneratePoolingSequenceRandom(input_length, output_length, generator); + } + + // Sanity check. + int k = input_length / output_length; + for (int i = 0; i < output_length; ++i) { + // k<= diff[i] <= k+1. + DCHECK_GE(diff[i], k); + DCHECK_LE(diff[i], k + 1); + } + + // Return cumulative sequence. 
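+    // Accumulate the per-window sizes in diff into start indices,
+    // e.g. diff = {2, 1, 2} -> cum_seq = {0, 2, 3, 5}.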
+ std::vector cum_seq(output_length + 1, 0); + for (int i = 1; i < cum_seq.size(); ++i) { + cum_seq[i] = cum_seq[i - 1] + diff[i - 1]; + } + return cum_seq; + } + ``` + + 这里根据 `pseudo_random` 标记为生成 `伪` 随机序列,或者 `真` 随机序列。 + + +# 四、对比分析 + +抛开 `PyTorch` 与 `TensorFlow` 对于 API 的组织方式不同来说,两者: + +相同: + +- `PyTorch` 与 `TensorFlow` 都实现了 `fractional_max_pool` 函数。 +- `PyTorch` 与 `TensorFlow` 都是通过底层 c++ 实现具体逻辑,并通过 python 公开 API。 + +不同: + +- `PyTorch` 实现了 `2D` 与 `3D` 两种维度的函数,`TensorFlow` 只有 `2D` 这种维度(`channel` 不算在内)。 +- `TensorFlow` 有 `真` 随机与 `伪` 随机两种序列生成方式,`PyTorch` 只有 `伪` 随机一种。 +- `TensorFlow` 的实现更接近文章中的描述 + + 这是 `PyTorch` 与 `TensorFlow` 最大的不同点。文章中的 `fractional` 根据 `N_in/N_out` 得出,也就是说,只需要这两个参数即可。 + `PyTorch` 提供了 `kernel_size`、`output_size`、`output_ratio` 这三个参数,这三个参数都可以影响 `N_in/N_out`,这更像是传统池化的方法。 + `TensorFlow` 只提供了 `pooling_ratio`,利用这个参数即可得到 `N_out`,而且提供了 `overlapping` 参数,利用这个参数可以影响 `kernel_size`。而且,由此可以看出,`TensorFlow` 实现的 `fractional max pooling` 更具有一般性,而 `adaptive max pooling` 则可以看作 `fractional max pooling` 的一种特例。 + + - `fractional max pooling` : `a = ceiling(alpha(i+u)), 1 < alpha = N_in/N_out < 2, 0 < u < 1` + - `adaptive max pooling` : `a = ceiling(alpha(i+1)), 1 < alpha = N_in/N_out < 2` + +另外,两者都有反向梯度的计算(由于不影响主要逻辑分析,且代码较多,上述代码分析没有具体列出)。 + +由于飞桨已经实现了 `AdaptiveMaxPool1D / AdaptiveMaxPool2D / AdaptiveMaxPool3D`,其签名为: + +- `paddle.nn.AdaptiveMaxPool1D(output_size, return_mask=False, name=None)` + +为了保持一致性,这里也只使用 `output_size` 一个必要参数,实现方法更接近文章以及 `TensorFlow`。 + + +# 五、设计思路与实现方案 + +## 命名与参数设计 + +添加 python 上层接口: + +- `paddle.nn.functional.fractional_max_pool2d` +- `paddle.nn.FractionalMaxPool2d` + + ``` python + paddle.nn.functional.fractional_max_pool2d( + x:Tensor, + output_size:Union[int, list, tuple], + pseudo_random:bool=False, + overlapping:bool=False, + return_mask:bool=False, + seed:int=None, + name:str=None) + ``` + + - 参数列表 + > x (Tensor) – 输入的一个 Tensor。数据类型支持:float32、float64、int32、int64。 + > output_size (int|list|tuple) – 输出的尺寸。 + > pseudo_random (bool, optional) – 是否使用伪随机。 + > overlapping (bool, optional) – 是否考虑池化边界重叠。 + > return_mask (bool, optional) – 是否返回最大值的索引。 + > seed (int, optional) – 随机种子。 + > name (str, optional) – 操作名称。 + + - 返回值 + > output (Tensor) + + +- `paddle.nn.functional.fractional_max_pool3d` +- `paddle.nn.FractionalMaxPool3d` + + ``` python + paddle.nn.functional.fractional_max_pool3d( + x:Tensor, + output_size:Union[int, list, tuple], + pseudo_random:bool=False, + overlapping:bool=False, + return_mask:bool=False, + seed:int=None, + name:str=None) + ``` + + - 参数列表 + > x (Tensor) – 输入的一个 Tensor。数据类型支持:float32、float64、int32、int64。 + > output_size (int|list|tuple) – 输出的尺寸。 + > pseudo_random (bool, optional) – 是否使用伪随机。 + > overlapping (bool, optional) – 是否考虑池化边界重叠。 + > return_mask (bool, optional) – 是否返回最大值的索引。 + > seed (int, optional) – 随机种子。 + > name (str, optional) – 操作名称。 + + - 返回值 + > output (Tensor) + +这里重点分析 `paddle.nn.functional.fractional_max_poolNd` 接口的命名与参数设计,`paddle.nn.FractionalMaxPoolNd` 与之类似。 + + +## 底层 OP 设计 + +由于赛题要求直接实现 python 接口,所以此处直接使用 python API 实现,无需设计底层 c++ 相关 OP。 + +## API实现方案 + +之前分析过,`fractional max pooling` 的实现主要分为两步: + +- 生成随机池化序列 +- 根据序列取其中的最大值 + +### 生成随机池化序列 + +这里实现 `真` 随机与 `伪` 随机两种方法。 + +`真` 随机: + +``` python +def random_sequence(input_size, output_size): + base = input_size // output_size + count = input_size % output_size + + diff = [base] * output_size + + for i in range(count): + diff[i] += 1 + + diff = np.random.permutation(diff) + + return [0] + np.cumsum(diff).tolist(), diff.tolist() +``` + +`伪` 随机: + +``` python +def 
pseudo_sequence(input_size, output_size, sample): + seq = [0] * (output_size + 1) + diff = [0] * output_size + + alpha = input_size / output_size + base = input_size // output_size + + # 此处根据 tensorflow 实现中的说明,需要考虑的特殊情况 + u_max1 = (base + 2) / alpha - 1 + u_max2 = (input_size + 1 - base) / alpha - (output_size - 1) + max_u = min(u_max1, u_max2) + u = sample * max_u + + # 由于文章中的公式是 math.ceil(alpha * (i + u)) + # 所以,序列第一个元素肯定为 1, 而在实际运算中,第一个元素需要为 0 + # 所以,这里需要有 seq 中转一下 + seq[0] = 1 + seq[output_size] = input_size + 1 + for i in range(1, output_size): + seq[i] = math.ceil(alpha * (i + u)) + + for i in range(output_size): + diff[i] = seq[i + 1] - seq[i] + + return [0] + np.cumsum(diff).tolist(), diff + +``` + +### 根据序列取其中的最大值 + +``` python +def fractional_max_pool2d( + x:Tensor, + output_size:Union[int, list, tuple], + pseudo_random:bool=False, + overlapping:bool=False, + return_mask:bool=False, + seed:int=None, + name:str=None): + + if seed is not None: + paddle.seed(seed) + np.random.seed(seed) + + # [N, C, H, W] + batch = x.shape[0] + channel = x.shape[1] + h_size = x.shape[2] + w_size = x.shape[3] + + if pseudo_random: + # 假设 output_size 是一个 int + sample = np.random.rand() + h_seq, _ = pseudo_sequence(h_size, output_size, sample) + w_seq, _ = pseudo_sequence(w_size, output_size, sample) + else: + h_seq, _ = random_sequence(h_size, output_size) + w_seq, _ = random_sequence(w_size, output_size) + + output_tensor = paddle.zeros(shape=[batch, channel, output_size, output_size], dtype=x.dtype) + + for _batch in range(batch): + for _channel in range(channel): + _x = x[_batch, _channel, :, :] + _x_h = _pool_along_h(_x, h_seq, overlapping) + output_tensor[_batch, _channel, :, :] = _pool_along_w(_x_h, w_seq, overlapping) + + return output_tensor + +# 参考 TensorFlow 的 python 测试方法,这里使用纯 python 的方式实现 max pooling +def _pool_along_h(x, seq, overlapping): + output_tensor = paddle.zeros(x.shape[1], dtype=x.dtype) + h_max = seq[-1] + for i in range(len(seq) - 1): + h_start = seq[i] + h_end = seq[i+1] + 1 if overlapping else seq[i+1] + h_end = min(h_end, h_max) + output_tensor = paddle.vstack((output_tensor, paddle.amax( + x[h_start:h_end, :], axis=0 + ))) + + return output_tensor[1:, :] + +def _pool_along_w(x, seq, overlapping): + return _pool_along_h(x.T, seq, overlapping).T + +``` + +这里面会用到 `vstack` 函数,会在其他任务中实现,所以这里直接使用了这个函数。 + +另外,`output_size` 为 `list/tuple` 的情况,需要单独计算 `h_seq / w_seq`。 + +如果需要 `return_mask` 则要使用 `argmax`。 + +最后,`fractional_max_pool3d` 的实现方法类似,这里不再赘述。 + + +# 六、测试和验收的考量 + +测试考虑的case如下: + +- **编程范式场景** + 常规覆盖动态图和静态图的测试场景 + +- **硬件场景** + 常规需覆盖 CPU、GPU 两种测试场景 + +- **参数组合场景** + - 需要测试 2D / 3D 两类接口 + - 需要测试 1 < N_in/N_out < 2, N_in/N_out > 2 的情况 + - 需要测试 output_size 为 int/list/tuple 的情况 + - 需要测试 random / pseudo_random + - 需要测试 overlapping + - 需要测试 return_mask + - 需要测试 不同数据类型的场景 + - 需要异常测试,如 N_in/N_out < 1 + +- **计算精度** + 需要保证 `前向/后向` 计算的精度正确性,通过 numpy 实现的函数的对比结果 + +- **维度测试** + - 需要测试 2D / 3D 两类接口 + +# 七、可行性分析及规划排期 + +- 每个接口开发约 3 个工作日 +- 每个接口测试约 3 个工作日 + +计划 2~3 周的工作量可以完成接口的开发预测是。 + +# 八、影响面 + +无其他影响。 + +# 名词解释 + +无 + +# 附件及参考资料 + +- [《Fractional Max-Pooling》](https://arxiv.org/abs/1412.6071) +- [FRACTIONALMAXPOOL2D](https://pytorch.org/docs/stable/generated/torch.nn.FractionalMaxPool2d.html#fractionalmaxpool2d) +- [FRACTIONALMAXPOOL3D](https://pytorch.org/docs/stable/generated/torch.nn.FractionalMaxPool3d.html#fractionalmaxpool3d) +- 
[TORCH.NN.FUNCTIONAL.FRACTIONAL_MAX_POOL2D](https://pytorch.org/docs/stable/generated/torch.nn.functional.fractional_max_pool2d.html#torch.nn.functional.fractional_max_pool2d) +- [TORCH.NN.FUNCTIONAL.FRACTIONAL_MAX_POOL3D](https://pytorch.org/docs/stable/generated/torch.nn.functional.fractional_max_pool3d.html#torch.nn.functional.fractional_max_pool3d) +- [tf.raw_ops.FractionalMaxPool](https://tensorflow.google.cn/api_docs/python/tf/raw_ops/FractionalMaxPool?hl=en) +- [tf.nn.fractional_max_pool](https://tensorflow.google.cn/api_docs/python/tf/nn/fractional_max_pool?hl=en) + +# 补充一下 + +这里有个疑问:是否可以利用 `adaptive max pooling` 的 `方式` 实现 `fractional max pooling`? + +之前也分析过: + +- `fractional max pooling` : `a = ceiling(alpha(i+u)), 1 < alpha = N_in/N_out < 2, 0 < u < 1` +- `adaptive max pooling` : `a = ceiling(alpha(i+1)), 1 < alpha = N_in/N_out < 2` + +也就是说,`adaptive max pooling` 可以看作 `fractional max pooling` 的一种特例?不知道我这里理解的是否有问题? + +如果有一个 `伪` 序列的生成方式: + +``` python +def pseudo_sequence(input_size, output_size, sample): + alpha = input_size / output_size + start_idx = [] + end_idx = [] + for i in range(output_size): + start_idx.append(math.floor(alpha * (i))) + end_idx.append(math.ceil(alpha * (i + sample))) + + return start_idx, end_idx + +``` + +其中 `sample = 1` 是否就是 `adaptive max pooling`? + +我理解的 `fractional max pooling` 其实只要通过 `1221112112...` 这种方式实现 `fractional` 缩小尺寸就达到目的了。 + +如果实现类似 `PyTorch` 的方式,只有一个 `伪` 序列,是不是就足够了?因为文章中也提到,`伪` 序列是更 `faithfully` 的一种方式。 + +只是,这里的序列也不是完全符合 `1221112112...`: + +``` python +In [191]: diffs = [] + ...: for i in range(1, 11): + ...: start_idx, end_idx = pr(input_size, output_size, i/10) + ...: start_idx = np.array(start_idx) + ...: end_idx = np.array(end_idx) + ...: diff = end_idx - start_idx + ...: diffs.append(diff) + ...: if np.any(diff <= 0): + ...: print('error', i) + ...: print(i) + ...: print(start_idx) + ...: print(end_idx) + ...: print(diff) + ...: +1 +[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] +[ 1 2 3 5 6 8 9 10 12 13 15 16 17 19 20 21 23 24] +[1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1] +2 +[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] +[ 1 2 4 5 6 8 9 10 12 13 15 16 17 19 20 22 23 24] +[1 1 2 1 1 2 1 1 1 1 2 1 1 1 1 2 1 1] +3 +[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] +[ 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23 25] +[1 1 2 1 1 2 1 2 1 1 2 1 2 1 1 2 1 2] +4 +[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] +[ 1 2 4 5 7 8 9 11 12 14 15 16 18 19 20 22 23 25] +[1 1 2 1 2 2 1 2 1 2 2 1 2 1 1 2 1 2] +5 +[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] +[ 1 3 4 5 7 8 10 11 12 14 15 16 18 19 21 22 23 25] +[1 2 2 1 2 2 2 2 1 2 2 1 2 1 2 2 1 2] +6 +[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] +[ 1 3 4 5 7 8 10 11 12 14 15 17 18 19 21 22 24 25] +[1 2 2 1 2 2 2 2 1 2 2 2 2 1 2 2 2 2] +7 +[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] +[ 1 3 4 6 7 8 10 11 13 14 15 17 18 20 21 22 24 25] +[1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2] +8 +[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] +[ 2 3 4 6 7 9 10 11 13 14 15 17 18 20 21 22 24 25] +[2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2] +9 +[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] +[ 2 3 5 6 7 9 10 11 13 14 16 17 18 20 21 23 24 25] +[2 2 3 2 2 3 2 2 2 2 3 2 2 2 2 3 2 2] +10 +[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] +[ 2 3 5 6 7 9 10 12 13 14 16 17 19 20 21 23 24 25] +[2 2 3 2 2 3 2 3 2 2 3 2 3 2 2 3 2 2] +``` + +目前 Paddle 的 `adaptive max pooling` 生成序列的文件: + +- `paddle/phi/kernels/funcs/pooling.h` + + ``` cpp + /* used for adaptive pool to calculate start and end index of each divided grid + */ + 
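+  // i.e. start = floor(ph * input_size / output_size),
+  //      end  = ceil((ph + 1) * input_size / output_size)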
HOSTDEVICE inline int AdaptStartIndex(int ph, int input_size, int output_size) { + return static_cast( + floor(static_cast(ph * input_size) / output_size)); + } + + HOSTDEVICE inline int AdaptEndIndex(int ph, int input_size, int output_size) { + return static_cast( + ceil(static_cast((ph + 1) * input_size) / output_size)); + } + + ``` + +是不是 `AdaptEndIndex` 增加一个参数,把 `1` 换成随机数,便可以给 `adaptive max pooling` 和 `fractional max pooling` 使用? + +具体实现的时候,不需要单独实现 `fractional max pooling`,只需要在目前的代码上修改 `adaptive max pooling` 并作分流? + +以上是我对目前实现方式的一点补充思考,之所以有这种想法,主要是因为,目前如果以纯 python 实现 `fractional max pooling`,感觉跟其他池化方法不太搭 ... ... + +还有很多不完善的地方,希望能够讨论一下 ~ + +谢谢! + + From d7c8e04d40c31f27f2e66bb3ac1a2d6152a61801 Mon Sep 17 00:00:00 2001 From: megemini Date: Sat, 28 Oct 2023 19:18:43 +0800 Subject: [PATCH 2/6] [Change] use cpp op instead of python --- ...1009_api_design_for_fractional_max_pool.md | 1136 +++++++++++++---- 1 file changed, 895 insertions(+), 241 deletions(-) diff --git a/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md b/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md index cbb3f8c3c..3e1364677 100644 --- a/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md +++ b/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md @@ -4,10 +4,13 @@ | - | - | | 提交作者 | megemini(柳顺) | | 提交时间 | 2023-10-09 | -| 版本号 | V1.0 | +| 版本号 | V2.0 | | 依赖飞桨版本 | develop | | 文件名 | 20231009_api_design_for_fractional_max_pool.md | +#### 修订记录 +v2.0: 将实现方式由 python 改为 c++ + # 一、概述 ## 1、相关背景 @@ -760,7 +763,17 @@ # 五、设计思路与实现方案 -## 命名与参数设计 +本方案共涉及三部分: + +- 命名与参数设计 (python API) : `paddle.nn.functional.fractional_max_pool2d`, `paddle.nn.functional.fractional_max_pool3d` +- 底层 OP 设计 +- python layer 实现 : `paddle.nn.FractionalMaxPool2d`, `paddle.nn.FractionalMaxPool3d` + +由于 `fractional max pooling` 与 `adaptive max pooling` 接口特性较为相似,后续设计方案以 `共用 adaptive max pooling 底层算子` 为主要设计思路。 + +## 命名与参数设计 (python API) + +涉及文件:`python/paddle/nn/functional/pooling.py` 添加 python 上层接口: @@ -771,25 +784,19 @@ paddle.nn.functional.fractional_max_pool2d( x:Tensor, output_size:Union[int, list, tuple], - pseudo_random:bool=False, - overlapping:bool=False, return_mask:bool=False, - seed:int=None, name:str=None) ``` - 参数列表 > x (Tensor) – 输入的一个 Tensor。数据类型支持:float32、float64、int32、int64。 > output_size (int|list|tuple) – 输出的尺寸。 - > pseudo_random (bool, optional) – 是否使用伪随机。 - > overlapping (bool, optional) – 是否考虑池化边界重叠。 > return_mask (bool, optional) – 是否返回最大值的索引。 - > seed (int, optional) – 随机种子。 > name (str, optional) – 操作名称。 - 返回值 - > output (Tensor) - + > Tensor, return_mask=False + > Tensor and mask, return_mask=True - `paddle.nn.functional.fractional_max_pool3d` - `paddle.nn.FractionalMaxPool3d` @@ -798,166 +805,937 @@ paddle.nn.functional.fractional_max_pool3d( x:Tensor, output_size:Union[int, list, tuple], - pseudo_random:bool=False, - overlapping:bool=False, return_mask:bool=False, - seed:int=None, name:str=None) ``` - 参数列表 > x (Tensor) – 输入的一个 Tensor。数据类型支持:float32、float64、int32、int64。 > output_size (int|list|tuple) – 输出的尺寸。 - > pseudo_random (bool, optional) – 是否使用伪随机。 - > overlapping (bool, optional) – 是否考虑池化边界重叠。 > return_mask (bool, optional) – 是否返回最大值的索引。 - > seed (int, optional) – 随机种子。 > name (str, optional) – 操作名称。 - 返回值 - > output (Tensor) + > Tensor, return_mask=False + > Tensor and mask, return_mask=True 这里重点分析 `paddle.nn.functional.fractional_max_poolNd` 接口的命名与参数设计,`paddle.nn.FractionalMaxPoolNd` 与之类似。 +*注意* : 相较 v1.0 版本的设计文档,这里简化了较多的参数,特说明如下: + +- 不使用 `data_format` + + 分析目前 pooling 接口主要源文件 
`python/paddle/nn/functional/pooling.py`,以 `max_pool2d` 为例: + + - 主要涉及两个底层算子: `max_pool2d_with_index` 和 `pool2d` + - 其中 `max_pool2d_with_index` 可以返回 `mask`,`pool2d` 不可以返回 `mask` + - 其中 `max_pool2d_with_index` 不支持 `data_format`,`pool2d` 支持 `data_format` + + 因此,当使用 `return_mask` 返回 `mask` 时,`data_format must be set to NCHW`。 + 没有一个算子能够完整支持这两个参数,这是目前 pooling 底层算子较大的矛盾。 + + 由于设计方案以 `共用 adaptive max pooling 底层算子` 为主要设计思路,所以,这里参考 `adaptive max pooling` 的接口: + + `adaptive_max_pool2d(x, output_size, return_mask=False, name=None)` + + 不使用 `data_format` 参数。 + +- 移除 `pseudo_random`, `overlapping`, `seed` + + 由于 `共用 adaptive max pooling 底层算子`,且参考 `PyTorch` 的设计方案,这里将只使用 `伪` 随机的方式生成池化序列,并在 c++ 算子内部实现。 ## 底层 OP 设计 -由于赛题要求直接实现 python 接口,所以此处直接使用 python API 实现,无需设计底层 c++ 相关 OP。 +涉及文件: + +- `paddle/phi/api/yaml/ops.yaml` 算子描述及定义 + + ``` yaml + - op : max_pool2d_with_index + args : (Tensor x, int[] kernel_size, int[] strides= {1, 1}, int[] paddings = {0, 0}, bool global_pooling = false, bool adaptive = false, bool fractional = false) + output : Tensor(out), Tensor(mask) + infer_meta : + func : MaxPoolWithIndexInferMeta + kernel : + func : max_pool2d_with_index + backward : max_pool2d_with_index_grad + + - op : max_pool3d_with_index + args : (Tensor x, int[] kernel_size, int[] strides = {1, 1, 1}, int[] paddings = {0, 0, 0}, bool global_pooling = false, bool adaptive = false, bool fractional = false) + output : Tensor(out), Tensor(mask) + infer_meta : + func : MaxPoolWithIndexInferMeta + kernel : + func : max_pool3d_with_index + backward : max_pool3d_with_index_grad + ``` + + 增加 `bool` 类型 `fractional` 参数,默认为 `false` + +- `paddle/phi/api/yaml/backward.yaml` 算子描述及定义 + + ``` yaml + - backward_op : max_pool2d_with_index_grad + forward : max_pool2d_with_index(Tensor x, int[] kernel_size, int[] strides = {1, 1}, int[] paddings = {0, 0}, bool global_pooling = false, bool adaptive = false, bool fractional = false) -> Tensor(out), Tensor(mask) + args : (Tensor x, Tensor mask, Tensor out_grad, int[] kernel_size, int[] strides, int[] paddings, bool global_pooling, bool adaptive, bool fractional) + output : Tensor(x_grad) + infer_meta : + func : MaxPoolWithIndexGradInferMeta + kernel : + func : max_pool2d_with_index_grad + + - backward_op : max_pool3d_with_index_grad + forward : max_pool3d_with_index(Tensor x, int[] kernel_size, int[] strides = {1, 1, 1}, int[] paddings = {0, 0, 0}, bool global_pooling = false, bool adaptive = false, bool fractional = false) -> Tensor(out), Tensor(mask) + args : (Tensor x, Tensor mask, Tensor out_grad, int[] kernel_size, int[] strides, int[] paddings, bool global_pooling, bool adaptive, bool fractional) + output : Tensor(x_grad) + infer_meta : + func : MaxPoolWithIndexGradInferMeta + kernel : + func : max_pool3d_with_index_grad + ``` -## API实现方案 + 增加 `bool` 类型 `fractional` 参数,默认为 `false` -之前分析过,`fractional max pooling` 的实现主要分为两步: +- `paddle/phi/infermeta/unary.h` 算子 InferMeta -- 生成随机池化序列 -- 根据序列取其中的最大值 + ``` cpp + void MaxPoolWithIndexInferMeta(const MetaTensor& x, + const std::vector& kernel_size, + const std::vector& strides, + const std::vector& paddings, + bool global_pooling, + bool adaptive, + bool fractional, + MetaTensor* out, + MetaTensor* mask, + MetaConfig config = MetaConfig()); + ``` + 增加 `fractional` 参数 -### 生成随机池化序列 +- `paddle/phi/infermeta/unary.cc` -这里实现 `真` 随机与 `伪` 随机两种方法。 + ``` cpp + void MaxPoolWithIndexInferMeta(const MetaTensor& x, + const std::vector& kernel_size, + const std::vector& strides, + const std::vector& paddings, + bool global_pooling, + bool adaptive, + 
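+                                   // fractional: new flag proposed by this
+                                   // RFC to enable fractional max pooling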
bool fractional, + MetaTensor* out, + MetaTensor* mask, + MetaConfig config) { + ... + if (adaptive || fractional) { + output_shape.insert( + output_shape.end(), kernel_size_.begin(), kernel_size_.end()); + } else { + ... + } + ... + } + ``` -`真` 随机: + 增加 `fractional` 参数,并且,与 `adaptive` 一样,共用 `kernel_size_` 参数,此参数在此实际为 `output_size`。 -``` python -def random_sequence(input_size, output_size): - base = input_size // output_size - count = input_size % output_size +- `paddle/phi/infermeta/backward.h` - diff = [base] * output_size + ``` cpp + void MaxPoolWithIndexGradInferMeta(const MetaTensor& x, + const MetaTensor& mask, + const MetaTensor& dout, + const std::vector& kernel_size, + const std::vector& strides, + const std::vector& paddings, + bool global_pooling, + bool adaptive, + bool fractional, + MetaTensor* dx); + ``` - for i in range(count): - diff[i] += 1 + 增加 `fractional` 参数。 - diff = np.random.permutation(diff) +- `paddle/phi/infermeta/backward.cc` - return [0] + np.cumsum(diff).tolist(), diff.tolist() -``` + ``` cpp + void MaxPoolWithIndexGradInferMeta(const MetaTensor& x, + const MetaTensor& mask, + const MetaTensor& dout, + const std::vector& kernel_size, + const std::vector& strides, + const std::vector& paddings, + bool global_pooling, + bool adaptive, + bool fractional, + MetaTensor* dx) { + dx->share_meta(x); + } + ``` -`伪` 随机: + 增加 `fractional` 参数。 -``` python -def pseudo_sequence(input_size, output_size, sample): - seq = [0] * (output_size + 1) - diff = [0] * output_size +- `paddle/phi/kernels/pool_kernel.h` 算子 Kernel - alpha = input_size / output_size - base = input_size // output_size + ``` cpp + template + void MaxPool2dWithIndexKernel(const Context& ctx, + const DenseTensor& x, + const std::vector& kernel_size, + const std::vector& strides, + const std::vector& paddings, + bool global_pooling, + bool adaptive, + bool fractional, + DenseTensor* out, + DenseTensor* mask); + + template + void MaxPool3dWithIndexKernel(const Context& ctx, + const DenseTensor& x, + const std::vector& kernel_size, + const std::vector& strides, + const std::vector& paddings, + bool global_pooling, + bool adaptive, + bool fractional, + DenseTensor* out, + DenseTensor* mask); + ``` - # 此处根据 tensorflow 实现中的说明,需要考虑的特殊情况 - u_max1 = (base + 2) / alpha - 1 - u_max2 = (input_size + 1 - base) / alpha - (output_size - 1) - max_u = min(u_max1, u_max2) - u = sample * max_u + 增加 `fractional` 参数。 - # 由于文章中的公式是 math.ceil(alpha * (i + u)) - # 所以,序列第一个元素肯定为 1, 而在实际运算中,第一个元素需要为 0 - # 所以,这里需要有 seq 中转一下 - seq[0] = 1 - seq[output_size] = input_size + 1 - for i in range(1, output_size): - seq[i] = math.ceil(alpha * (i + u)) - for i in range(output_size): - diff[i] = seq[i + 1] - seq[i] +- `paddle/phi/kernels/funcs/pooling.h` - return [0] + np.cumsum(diff).tolist(), diff + ``` cpp + template + class MaxPool2dWithIndexFunctor { + public: + void operator()(const Context& context, + const DenseTensor& input, + const std::vector& ksize, + const std::vector& strides, + const std::vector& paddings, + bool adaptive, + bool fractional, + DenseTensor* output, + DenseTensor* mask); + }; + + template + class MaxPool2dWithIndexGradFunctor { + public: + void operator()(const Context& context, + const DenseTensor& output_grad, + const DenseTensor& mask, + const std::vector& ksize, + const std::vector& strides, + const std::vector& paddings, + bool adaptive, + bool fractional, + DenseTensor* input_grad); + }; + + template + class MaxPool3dWithIndexFunctor { + public: + void operator()(const Context& context, + const DenseTensor& 
input, + const std::vector& ksize, + const std::vector& strides, + const std::vector& paddings, + bool adaptive, + bool fractional, + DenseTensor* output, + DenseTensor* mask); + }; + + template + class MaxPool3dWithIndexGradFunctor { + public: + void operator()(const Context& context, + const DenseTensor& output_grad, + const DenseTensor& mask, + const std::vector& ksize, + const std::vector& strides, + const std::vector& paddings, + bool adaptive, + bool fractional, + DenseTensor* input_grad); + }; + ``` -``` + 增加 `fractional` 参数。 -### 根据序列取其中的最大值 + ``` cpp + HOSTDEVICE inline int FractionalStartIndex() + HOSTDEVICE inline int FractionalEndIndex() + ``` -``` python -def fractional_max_pool2d( - x:Tensor, - output_size:Union[int, list, tuple], - pseudo_random:bool=False, - overlapping:bool=False, - return_mask:bool=False, - seed:int=None, - name:str=None): + 生成池化序列的方法。 - if seed is not None: - paddle.seed(seed) - np.random.seed(seed) +- `paddle/phi/kernels/impl/pool_kernel_impl.h` - # [N, C, H, W] - batch = x.shape[0] - channel = x.shape[1] - h_size = x.shape[2] - w_size = x.shape[3] + ``` cpp + template + void MaxPoolWithIndexRawKernel(const Context& ctx, + const DenseTensor& x, + const std::vector& kernel_size, + const std::vector& strides, + const std::vector& paddings, + bool global_pooling, + bool adaptive, + bool fractional, + DenseTensor* out, + DenseTensor* mask) { + std::vector paddings_ = paddings; + std::vector kernel_size_ = kernel_size; + + if (global_pooling) { + for (size_t i = 0; i < kernel_size_.size(); ++i) { + paddings_[i] = 0; + kernel_size_[i] = static_cast(x.dims()[i + 2]); + } + } - if pseudo_random: - # 假设 output_size 是一个 int - sample = np.random.rand() - h_seq, _ = pseudo_sequence(h_size, output_size, sample) - w_seq, _ = pseudo_sequence(w_size, output_size, sample) - else: - h_seq, _ = random_sequence(h_size, output_size) - w_seq, _ = random_sequence(w_size, output_size) - - output_tensor = paddle.zeros(shape=[batch, channel, output_size, output_size], dtype=x.dtype) - - for _batch in range(batch): - for _channel in range(channel): - _x = x[_batch, _channel, :, :] - _x_h = _pool_along_h(_x, h_seq, overlapping) - output_tensor[_batch, _channel, :, :] = _pool_along_w(_x_h, w_seq, overlapping) - - return output_tensor - -# 参考 TensorFlow 的 python 测试方法,这里使用纯 python 的方式实现 max pooling -def _pool_along_h(x, seq, overlapping): - output_tensor = paddle.zeros(x.shape[1], dtype=x.dtype) - h_max = seq[-1] - for i in range(len(seq) - 1): - h_start = seq[i] - h_end = seq[i+1] + 1 if overlapping else seq[i+1] - h_end = min(h_end, h_max) - output_tensor = paddle.vstack((output_tensor, paddle.amax( - x[h_start:h_end, :], axis=0 - ))) + switch (kernel_size_.size()) { + case 2: { + funcs::MaxPool2dWithIndexFunctor pool2d_forward; + pool2d_forward(ctx, + x, + kernel_size_, + strides, + paddings_, + adaptive, + fractional, + out, + mask); + } break; + case 3: { + funcs::MaxPool3dWithIndexFunctor pool3d_forward; + pool3d_forward(ctx, + x, + kernel_size_, + strides, + paddings_, + adaptive, + fractional, + out, + mask); + } break; + default: { + PADDLE_THROW( + errors::InvalidArgument("Pool op only supports 2D and 3D input.")); + } + } + } + + template + void MaxPool2dWithIndexKernel(const Context& ctx, + const DenseTensor& x, + const std::vector& kernel_size, + const std::vector& strides, + const std::vector& paddings, + bool global_pooling, + bool adaptive, + bool fractional, + DenseTensor* out, + DenseTensor* mask) { + MaxPoolWithIndexRawKernel(ctx, + x, + kernel_size, + strides, + 
paddings, + global_pooling, + adaptive, + fractional, + out, + mask); + } + + template + void MaxPool3dWithIndexKernel(const Context& ctx, + const DenseTensor& x, + const std::vector& kernel_size, + const std::vector& strides, + const std::vector& paddings, + bool global_pooling, + bool adaptive, + bool fractional, + DenseTensor* out, + DenseTensor* mask) { + MaxPoolWithIndexRawKernel(ctx, + x, + kernel_size, + strides, + paddings, + global_pooling, + adaptive, + fractional, + out, + mask); + } + + ``` + + 增加 `fractional` 参数,分发方法时带上 `fracional`。 + +- `paddle/phi/kernels/pool_grad_kernel.h` 反向算子 + + ``` cpp + template + void MaxPool2dWithIndexGradKernel(const Context& ctx, + const DenseTensor& x, + const DenseTensor& mask, + const DenseTensor& dout, + const std::vector& kernel_size, + const std::vector& strides, + const std::vector& paddings, + bool global_pooling, + bool adaptive, + bool fracional, + DenseTensor* dx); + + template + void MaxPool3dWithIndexGradKernel(const Context& ctx, + const DenseTensor& x, + const DenseTensor& mask, + const DenseTensor& dout, + const std::vector& kernel_size, + const std::vector& strides, + const std::vector& paddings, + bool global_pooling, + bool adaptive, + bool fracional, + DenseTensor* dx); + + ``` + + 增加 `fractional` 参数。 + +- `paddle/phi/kernels/impl/pool_grad_kernel_impl.h` + + ``` cpp + template + void MaxPoolWithIndexGradRawKernel(const Context& ctx, + const DenseTensor& x UNUSED, + const DenseTensor& mask, + const DenseTensor& dout, + const std::vector& kernel_size, + const std::vector& strides, + const std::vector& paddings, + bool global_pooling, + bool adaptive, + bool fractional, + DenseTensor* dx) { + std::vector paddings_ = paddings; + std::vector kernel_size_ = kernel_size; + + if (global_pooling) { + for (size_t i = 0; i < kernel_size_.size(); ++i) { + paddings_[i] = 0; + kernel_size_[i] = static_cast(dx->dims()[i + 2]); + } + } + + if (dx) { + ctx.template Alloc(dx); + funcs::set_constant(ctx, dx, 0); + + switch (kernel_size_.size()) { + case 2: { + funcs::MaxPool2dWithIndexGradFunctor pool2d_backward; + pool2d_backward(ctx, + dout, + mask, + kernel_size_, + strides, + paddings_, + adaptive, + fractional, + dx); + } break; + case 3: { + funcs::MaxPool3dWithIndexGradFunctor pool3d_backward; + pool3d_backward(ctx, + dout, + mask, + kernel_size_, + strides, + paddings_, + adaptive, + fractional, + dx); + } break; + default: { + PADDLE_THROW( + errors::InvalidArgument("Pool op only supports 2D and 3D input.")); + } + } + } + } + + template + void MaxPool2dWithIndexGradKernel(const Context& ctx, + const DenseTensor& x, + const DenseTensor& mask, + const DenseTensor& dout, + const std::vector& kernel_size, + const std::vector& strides, + const std::vector& paddings, + bool global_pooling, + bool adaptive, + bool fractional, + DenseTensor* dx) { + MaxPoolWithIndexGradRawKernel(ctx, + x, + mask, + dout, + kernel_size, + strides, + paddings, + global_pooling, + adaptive, + fractional, + dx); + } + + template + void MaxPool3dWithIndexGradKernel(const Context& ctx, + const DenseTensor& x, + const DenseTensor& mask, + const DenseTensor& dout, + const std::vector& kernel_size, + const std::vector& strides, + const std::vector& paddings, + bool global_pooling, + bool adaptive, + bool fractional, + DenseTensor* dx) { + MaxPoolWithIndexGradRawKernel(ctx, + x, + mask, + dout, + kernel_size, + strides, + paddings, + global_pooling, + adaptive, + fractional, + dx); + } + + ``` + + 增加 `fractional` 参数,分发方法时带上 `fracional`。 + +- 
`paddle/phi/kernels/funcs/pooling.cc` 算子 CPU 实现 + + ``` cpp + template + class MaxPool2dWithIndexFunctor { + public: + void operator()(const CPUContext& context, + const DenseTensor& input, + const std::vector& ksize, + const std::vector& strides, + const std::vector& paddings, + bool adaptive, + bool fractional, + DenseTensor* output, + DenseTensor* mask) { + ... + + int hstart = 0, hend = 0; + int wstart = 0, wend = 0; + for (int i = 0; i < batch_size; i++) { + for (int c = 0; c < output_channels; ++c) { + for (int ph = 0; ph < output_height; ++ph) { + if (adaptive) { + hstart = AdaptStartIndex(ph, input_height, output_height); + hend = AdaptEndIndex(ph, input_height, output_height); + } else if (fractional) { + // TODO(megemini) + } else { + hstart = ph * stride_height - padding_height; + hend = std::min(hstart + ksize_height, input_height); + hstart = std::max(hstart, 0); + } + for (int pw = 0; pw < output_width; ++pw) { + if (adaptive) { + wstart = AdaptStartIndex(pw, input_width, output_width); + wend = AdaptEndIndex(pw, input_width, output_width); + } else if (fractional) { + // TODO(megemini) + } else { + wstart = pw * stride_width - padding_width; + wend = std::min(wstart + ksize_width, input_width); + wstart = std::max(wstart, 0); + } + + ... + } + } + ... + } + } + } + }; + + /* + * All tensors are in NCHW format. + * Ksize, strides, paddings are two elements. These two elements represent + * height and width, respectively. + */ + template + class MaxPool2dWithIndexGradFunctor { + public: + void operator()(const CPUContext& context, + const DenseTensor& output_grad, + const DenseTensor& mask, + const std::vector& ksize UNUSED, + const std::vector& strides UNUSED, + const std::vector& paddings UNUSED, + bool adaptive UNUSED, + bool fractional UNUSED, + DenseTensor* input_grad) { + };} + + /* + * All tensors are in NCDHW format. + * Ksize, strides, paddings are three elements. These three elements represent + * depth, height and width, respectively. + */ + template + class MaxPool3dWithIndexFunctor { + public: + void operator()(const CPUContext& context, + const DenseTensor& input, + const std::vector& ksize, + const std::vector& strides, + const std::vector& paddings, + bool adaptive, + bool fractional, + DenseTensor* output, + DenseTensor* mask) { + ... 
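+      // Same structure as the 2D functor above: the fractional branch must
+      // additionally generate a pooling sequence for the depth dimension.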
+ + int dstart = 0, dend = 0; + int hstart = 0, hend = 0; + int wstart = 0, wend = 0; + for (int i = 0; i < batch_size; i++) { + for (int c = 0; c < output_channels; ++c) { + for (int pd = 0; pd < output_depth; ++pd) { + if (adaptive) { + dstart = AdaptStartIndex(pd, input_depth, output_depth); + dend = AdaptEndIndex(pd, input_depth, output_depth); + } else if (fractional) { + /* TODO(megemini) */ + } else { + dstart = pd * stride_depth - padding_depth; + dend = std::min(dstart + ksize_depth, input_depth); + dstart = std::max(dstart, 0); + } + for (int ph = 0; ph < output_height; ++ph) { + if (adaptive) { + hstart = AdaptStartIndex(ph, input_height, output_height); + hend = AdaptEndIndex(ph, input_height, output_height); + } else if (fractional) { + /* TODO(megemini) */ + } else { + hstart = ph * stride_height - padding_height; + hend = std::min(hstart + ksize_height, input_height); + hstart = std::max(hstart, 0); + } + for (int pw = 0; pw < output_width; ++pw) { + if (adaptive) { + wstart = AdaptStartIndex(pw, input_width, output_width); + wend = AdaptEndIndex(pw, input_width, output_width); + } else if (fractional) { + // TODO(megemini) + } else { + wstart = pw * stride_width - padding_width; + wend = std::min(wstart + ksize_width, input_width); + wstart = std::max(wstart, 0); + } + + ... + } + } + ... + } + } + } + };} + + /* + * All tensors are in NCDHW format. + * Ksize, strides, paddings are three elements. These three elements represent + * depth, height and width, respectively. + */ + template + class MaxPool3dWithIndexGradFunctor { + public: + void operator()(const CPUContext& context, + const DenseTensor& output_grad, + const DenseTensor& mask, + const std::vector& ksize UNUSED, + const std::vector& strides UNUSED, + const std::vector& paddings UNUSED, + bool adaptive UNUSED, + bool fractional UNUSED, + DenseTensor* input_grad) { + };} + + ``` + + 这里实现主要的 cpu 算子的逻辑(正向与反向),通过 `fractional` 参数生成池化序列,主要逻辑与 `adaptive` 相似。 - return output_tensor[1:, :] + 这里没有 `data_format` 参数的设计,建议后续能够统一 `poolNd` 与 `max_poolNd` 的算子实现。 -def _pool_along_w(x, seq, overlapping): - return _pool_along_h(x.T, seq, overlapping).T + 另外,这里需要再增加一个 `0 < random < 1` 的随机数,以生成 `伪` 随机池化序列,这个随机数需要可以通过 `paddle.seed` 固定住。 -``` +- `paddle/phi/kernels/funcs/pooling.cu` 算子 GPU 实现 -这里面会用到 `vstack` 函数,会在其他任务中实现,所以这里直接使用了这个函数。 + ``` cpp + template + __global__ void KernelMaxPool2dWithIdx(const int nthreads, + const T1* input_data, + const int channels, + const int input_height, + const int input_width, + const int output_height, + const int output_width, + const int ksize_height, + const int ksize_width, + const int stride_height, + const int stride_width, + const int padding_height, + const int padding_width, + bool adaptive, + bool fractional, + T1* output_data, + T2* mask_data, + FastDivModForPooling divmods) + + template + __global__ void KernelMaxPool2DWithIdxGrad(const int nthreads, + const T1* output_grad, + const T2* mask_data, + const int channels, + const int input_height, + const int input_width, + const int output_height, + const int output_width, + const int ksize_height, + const int ksize_width, + const int stride_height, + const int stride_width, + const int padding_height, + const int padding_width, + bool adaptive, + bool fractional, + T1* input_grad, + FastDivModForPooling divmods) + /* + * All tensors are in NCHW format. + * Ksize, strides, paddings are two elements. These two elements represent + * height and width, respectively. 
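+   * Note: in adaptive / fractional mode, ksize actually carries output_size,
+   * matching the handling in MaxPoolWithIndexInferMeta.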
+ */ + template + class MaxPool2dWithIndexFunctor { + public: + void operator()(const phi::GPUContext& context, + const DenseTensor& input, + const std::vector& ksize, + const std::vector& strides, + const std::vector& paddings, + bool adaptive, + bool fractional, + DenseTensor* output, + DenseTensor* mask) + }; + + /* + * All tensors are in NCHW format. + * Ksize, strides, paddings are two elements. These two elements represent + * height and width, respectively. + */ + template + class MaxPool2dWithIndexGradFunctor { + public: + void operator()(const phi::GPUContext& context, + const DenseTensor& output_grad, + const DenseTensor& mask, + const std::vector& ksize, + const std::vector& strides, + const std::vector& paddings, + bool adaptive, + bool fractional, + DenseTensor* input_grad) + }; + + + template + __global__ void KernelMaxPool3DWithIdx(const int ncd, + const T1* input_data, + const int channels, + const int input_depth, + const int input_height, + const int input_width, + const int output_depth, + const int output_height, + const int output_width, + const int ksize_depth, + const int ksize_height, + const int ksize_width, + const int stride_depth, + const int stride_height, + const int stride_width, + const int padding_depth, + const int padding_height, + const int padding_width, + bool adaptive, + bool fractional, + T1* output_data, + T2* mask_data, + FastDivModForPooling3D divmods_output) + + template + __global__ void KernelMaxPool3DWithIdxGrad( + const int ncd, + const T1* output_grad, + const T2* mask, + const int channels, + const int input_depth, + const int input_height, + const int input_width, + const int output_depth, + const int output_height, + const int output_width, + const int ksize_depth, + const int ksize_height, + const int ksize_width, + const int stride_depth, + const int stride_height, + const int stride_width, + const int padding_depth, + const int padding_height, + const int padding_width, + bool adaptive, + bool fractional, + T1* input_grad, + FastDivModForPooling3D divmods_output) + + /* + * All tensors are in NCDHW format. + * Ksize, strides, paddings are three elements. These three elements represent + * depth, height and width, respectively. + */ + template + class MaxPool3dWithIndexFunctor { + public: + void operator()(const phi::GPUContext& context, + const DenseTensor& input, + const std::vector& ksize, + const std::vector& strides, + const std::vector& paddings, + bool adaptive, + bool fractional, + DenseTensor* output, + DenseTensor* mask) + }; + + /* + * All tensors are in NCDHW format. + * Ksize, strides, paddings are three elements. These three elements represent + * depth, height and width, respectively. 
+ */ + template + class MaxPool3dWithIndexGradFunctor { + public: + void operator()(const phi::GPUContext& context, + const DenseTensor& output_grad, + const DenseTensor& mask, + const std::vector& ksize, + const std::vector& strides, + const std::vector& paddings, + bool adaptive, + bool fractional, + DenseTensor* input_grad) + }; + + ``` + + 主要逻辑与 CPU 算子类似,这里不再赘述,有一个需要单独指出的是,PR:https://github.com/PaddlePaddle/Paddle/pull/45959 中,单独针对 `AdaptiveKernelMaxPool2dWithIdx` 做了优化,本次设计方案暂不进行优化方面的设计。 + +## python layer 实现 -另外,`output_size` 为 `list/tuple` 的情况,需要单独计算 `h_seq / w_seq`。 +涉及文件: -如果需要 `return_mask` 则要使用 `argmax`。 +- `python/paddle/nn/layer/pooling.py` + + ``` python + class FractionalMaxPool2D(Layer): + """ + TODO(megemini) + """ + + def __init__(self, output_size, return_mask=False, name=None): + super().__init__() + self._output_size = output_size + self._return_mask = return_mask + self._name = name + + def forward(self, x): + return F.fractional_max_pool2d( + x, + output_size=self._output_size, + return_mask=self._return_mask, + name=self._name, + ) + + def extra_repr(self): + return ( + f'output_size={self._output_size}, return_mask={self._return_mask}' + ) + + + class FractionalMaxPool3D(Layer): + """ + TODO(megemini) + """ -最后,`fractional_max_pool3d` 的实现方法类似,这里不再赘述。 + def __init__(self, output_size, return_mask=False, name=None): + super().__init__() + self._output_size = output_size + self._return_mask = return_mask + self._name = name + + def forward(self, x): + return F.fractional_max_pool3d( + x, + output_size=self._output_size, + return_mask=self._return_mask, + name=self._name, + ) + def extra_repr(self): + return ( + f'output_size={self._output_size}, return_mask={self._return_mask}' + ) + ``` + + 主要通过调用相应的方法实现。 # 六、测试和验收的考量 测试考虑的case如下: - **编程范式场景** - 常规覆盖动态图和静态图的测试场景 + - 常规覆盖动态图和静态图的测试场景 + - 需要测试 C++ 算子 + - 需要测试 python 接口 - **硬件场景** 常规需覆盖 CPU、GPU 两种测试场景 @@ -966,8 +1744,6 @@ def _pool_along_w(x, seq, overlapping): - 需要测试 2D / 3D 两类接口 - 需要测试 1 < N_in/N_out < 2, N_in/N_out > 2 的情况 - 需要测试 output_size 为 int/list/tuple 的情况 - - 需要测试 random / pseudo_random - - 需要测试 overlapping - 需要测试 return_mask - 需要测试 不同数据类型的场景 - 需要异常测试,如 N_in/N_out < 1 @@ -980,10 +1756,10 @@ def _pool_along_w(x, seq, overlapping): # 七、可行性分析及规划排期 -- 每个接口开发约 3 个工作日 +- 每个接口开发约 7 个工作日 - 每个接口测试约 3 个工作日 -计划 2~3 周的工作量可以完成接口的开发预测是。 +计划 3~4 周的工作量可以完成接口的开发预测是。 # 八、影响面 @@ -1002,125 +1778,3 @@ def _pool_along_w(x, seq, overlapping): - [TORCH.NN.FUNCTIONAL.FRACTIONAL_MAX_POOL3D](https://pytorch.org/docs/stable/generated/torch.nn.functional.fractional_max_pool3d.html#torch.nn.functional.fractional_max_pool3d) - [tf.raw_ops.FractionalMaxPool](https://tensorflow.google.cn/api_docs/python/tf/raw_ops/FractionalMaxPool?hl=en) - [tf.nn.fractional_max_pool](https://tensorflow.google.cn/api_docs/python/tf/nn/fractional_max_pool?hl=en) - -# 补充一下 - -这里有个疑问:是否可以利用 `adaptive max pooling` 的 `方式` 实现 `fractional max pooling`? - -之前也分析过: - -- `fractional max pooling` : `a = ceiling(alpha(i+u)), 1 < alpha = N_in/N_out < 2, 0 < u < 1` -- `adaptive max pooling` : `a = ceiling(alpha(i+1)), 1 < alpha = N_in/N_out < 2` - -也就是说,`adaptive max pooling` 可以看作 `fractional max pooling` 的一种特例?不知道我这里理解的是否有问题? 
- -如果有一个 `伪` 序列的生成方式: - -``` python -def pseudo_sequence(input_size, output_size, sample): - alpha = input_size / output_size - start_idx = [] - end_idx = [] - for i in range(output_size): - start_idx.append(math.floor(alpha * (i))) - end_idx.append(math.ceil(alpha * (i + sample))) - - return start_idx, end_idx - -``` - -其中 `sample = 1` 是否就是 `adaptive max pooling`? - -我理解的 `fractional max pooling` 其实只要通过 `1221112112...` 这种方式实现 `fractional` 缩小尺寸就达到目的了。 - -如果实现类似 `PyTorch` 的方式,只有一个 `伪` 序列,是不是就足够了?因为文章中也提到,`伪` 序列是更 `faithfully` 的一种方式。 - -只是,这里的序列也不是完全符合 `1221112112...`: - -``` python -In [191]: diffs = [] - ...: for i in range(1, 11): - ...: start_idx, end_idx = pr(input_size, output_size, i/10) - ...: start_idx = np.array(start_idx) - ...: end_idx = np.array(end_idx) - ...: diff = end_idx - start_idx - ...: diffs.append(diff) - ...: if np.any(diff <= 0): - ...: print('error', i) - ...: print(i) - ...: print(start_idx) - ...: print(end_idx) - ...: print(diff) - ...: -1 -[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] -[ 1 2 3 5 6 8 9 10 12 13 15 16 17 19 20 21 23 24] -[1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1] -2 -[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] -[ 1 2 4 5 6 8 9 10 12 13 15 16 17 19 20 22 23 24] -[1 1 2 1 1 2 1 1 1 1 2 1 1 1 1 2 1 1] -3 -[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] -[ 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23 25] -[1 1 2 1 1 2 1 2 1 1 2 1 2 1 1 2 1 2] -4 -[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] -[ 1 2 4 5 7 8 9 11 12 14 15 16 18 19 20 22 23 25] -[1 1 2 1 2 2 1 2 1 2 2 1 2 1 1 2 1 2] -5 -[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] -[ 1 3 4 5 7 8 10 11 12 14 15 16 18 19 21 22 23 25] -[1 2 2 1 2 2 2 2 1 2 2 1 2 1 2 2 1 2] -6 -[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] -[ 1 3 4 5 7 8 10 11 12 14 15 17 18 19 21 22 24 25] -[1 2 2 1 2 2 2 2 1 2 2 2 2 1 2 2 2 2] -7 -[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] -[ 1 3 4 6 7 8 10 11 13 14 15 17 18 20 21 22 24 25] -[1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2] -8 -[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] -[ 2 3 4 6 7 9 10 11 13 14 15 17 18 20 21 22 24 25] -[2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2] -9 -[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] -[ 2 3 5 6 7 9 10 11 13 14 16 17 18 20 21 23 24 25] -[2 2 3 2 2 3 2 2 2 2 3 2 2 2 2 3 2 2] -10 -[ 0 1 2 4 5 6 8 9 11 12 13 15 16 18 19 20 22 23] -[ 2 3 5 6 7 9 10 12 13 14 16 17 19 20 21 23 24 25] -[2 2 3 2 2 3 2 3 2 2 3 2 3 2 2 3 2 2] -``` - -目前 Paddle 的 `adaptive max pooling` 生成序列的文件: - -- `paddle/phi/kernels/funcs/pooling.h` - - ``` cpp - /* used for adaptive pool to calculate start and end index of each divided grid - */ - HOSTDEVICE inline int AdaptStartIndex(int ph, int input_size, int output_size) { - return static_cast( - floor(static_cast(ph * input_size) / output_size)); - } - - HOSTDEVICE inline int AdaptEndIndex(int ph, int input_size, int output_size) { - return static_cast( - ceil(static_cast((ph + 1) * input_size) / output_size)); - } - - ``` - -是不是 `AdaptEndIndex` 增加一个参数,把 `1` 换成随机数,便可以给 `adaptive max pooling` 和 `fractional max pooling` 使用? - -具体实现的时候,不需要单独实现 `fractional max pooling`,只需要在目前的代码上修改 `adaptive max pooling` 并作分流? - -以上是我对目前实现方式的一点补充思考,之所以有这种想法,主要是因为,目前如果以纯 python 实现 `fractional max pooling`,感觉跟其他池化方法不太搭 ... ... - -还有很多不完善的地方,希望能够讨论一下 ~ - -谢谢! 
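
针对上面 `sample = 1` 是否等价于 `adaptive max pooling` 的疑问,可以用一段简短的代码快速验证(仅为示意:`adaptive_index` 是按 `pooling.h` 中 `AdaptStartIndex / AdaptEndIndex` 公式的转写,`input_size = 25`、`output_size = 18` 即上文 REPL 输出所用的尺寸):

``` python
import math

def pseudo_sequence(input_size, output_size, sample):
    # 与上文相同:start = floor(alpha * i), end = ceil(alpha * (i + sample))
    alpha = input_size / output_size
    start_idx = [math.floor(alpha * i) for i in range(output_size)]
    end_idx = [math.ceil(alpha * (i + sample)) for i in range(output_size)]
    return start_idx, end_idx

def adaptive_index(i, input_size, output_size):
    # 按 pooling.h 中 AdaptStartIndex / AdaptEndIndex 的公式转写
    alpha = input_size / output_size
    return math.floor(alpha * i), math.ceil(alpha * (i + 1))

input_size, output_size = 25, 18
start_idx, end_idx = pseudo_sequence(input_size, output_size, sample=1)
assert all(
    (start_idx[i], end_idx[i]) == adaptive_index(i, input_size, output_size)
    for i in range(output_size)
)
print("sample = 1 时,start/end 序列与 adaptive 完全一致")
```

也就是说,在这一公式下,`adaptive max pooling` 确实相当于把 `sample`(即随机数 `u`)固定为 `1` 的特例。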
-
-
From 644c6db4bff8458329e8fab282cd7822d84f6cf7 Mon Sep 17 00:00:00 2001
From: megemini
Date: Tue, 14 Nov 2023 19:35:54 +0800
Subject: [PATCH 3/6] [Add] fractional seq cpp

---
 ...1009_api_design_for_fractional_max_pool.md | 239 ++++++++++++++++++
 1 file changed, 239 insertions(+)

diff --git a/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md b/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md
index 3e1364677..fbf69edc7 100644
--- a/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md
+++ b/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md
@@ -1669,6 +1669,245 @@ v2.0: 将实现方式由 python 改为 c++
 
 主要逻辑与 CPU 算子类似,这里不再赘述,有一个需要单独指出的是,PR:https://github.com/PaddlePaddle/Paddle/pull/45959 中,单独针对 `AdaptiveKernelMaxPool2dWithIdx` 做了优化,本次设计方案暂不进行优化方面的设计。
 
+### 池化序列的生成方法
+
+这里编写了一个简化的程序,以演示如何生成 fractional 的池化序列:
+
+``` cpp
+#include <algorithm>
+#include <cmath>
+#include <iostream>
+
+inline int AdaptStartIndex(int ph, int input_size, int output_size) {
+  return static_cast<int>(
+      floor(static_cast<double>(ph * input_size) / output_size));
+}
+
+inline int AdaptEndIndex(int ph, int input_size, int output_size) {
+  return static_cast<int>(
+      ceil(static_cast<double>((ph + 1) * input_size) / output_size));
+}
+
+inline int FractionalStartIndex(int ph, double alpha, double u) {
+  return static_cast<int>(
+      // subtract `1` for index from `0`
+      ceil(alpha * (ph + u) - 1));
+}
+
+inline int FractionalEndIndex(int ph, double alpha, double u) {
+  return static_cast<int>(
+      // subtract `1` for index from `0`
+      ceil(alpha * (ph + 1 + u) - 1));
+}
+
+
+int main()
+{
+    int input_height = 32;
+    int output_height = 25;
+    for (int ph = 0; ph < output_height; ++ph) {
+
+        int hstart = AdaptStartIndex(ph, input_height, output_height);
+        int hend = AdaptEndIndex(ph, input_height, output_height);
+
+        std::cout << "------------" << std::endl;
+        std::cout << "ph " << ph << std::endl;
+        std::cout << "hstart " << hstart << " hend " << hend << " diff " << hend - hstart << std::endl;
+    }
+
+    std::cout << "====================" << std::endl;
+
+    double alpha = static_cast<double>(input_height) / output_height;
+    int base = input_height / output_height;
+
+    double u_max1 = (base + 2) / alpha - 1;
+    double u_max2 = (input_height + 1 - base) / alpha - (output_height - 1);
+    double max_u = std::min(u_max1, u_max2);
+
+    double u = 0.8 * max_u;
+
+    for (int ph = 0; ph < output_height; ++ph) {
+
+        int hstart = FractionalStartIndex(ph, alpha, u);
+        int hend = FractionalEndIndex(ph, alpha, u);
+        hend = std::min(hend, input_height);
+
+        std::cout << "------------" << std::endl;
+        std::cout << "ph " << ph << std::endl;
+        std::cout << "hstart " << hstart << " hend " << hend << " diff " << hend - hstart << std::endl;
+    }
+
+    std::cout << "alpha is " << alpha << " u is " << u << " max u is " << max_u << std::endl;
+}
+
+```
+
+运行后得到结果:
+
+``` shell
+$> g++ n38_index.cc -Wall && ./a.out
+------------
+ph 0
+hstart 0 hend 2 diff 2
+------------
+ph 1
+hstart 1 hend 3 diff 2
+------------
+ph 2
+hstart 2 hend 4 diff 2
+------------
+ph 3
+hstart 3 hend 6 diff 3
+------------
+ph 4
+hstart 5 hend 7 diff 2
+------------
+ph 5
+hstart 6 hend 8 diff 2
+------------
+ph 6
+hstart 7 hend 9 diff 2
+------------
+ph 7
+hstart 8 hend 11 diff 3
+------------
+ph 8
+hstart 10 hend 12 diff 2
+------------
+ph 9
+hstart 11 hend 13 diff 2
+------------
+ph 10
+hstart 12 hend 15 diff 3
+------------
+ph 11
+hstart 14 hend 16 diff 2
+------------
+ph 12
+hstart 15 hend 17 diff 2
+------------
+ph 13
+hstart 16 hend 18 diff 2
+------------
+ph 14
+hstart 17 hend 20 diff 3
+------------
+ph 15
+hstart 19 hend 21 diff 2
+------------ +ph 16 +hstart 20 hend 22 diff 2 +------------ +ph 17 +hstart 21 hend 24 diff 3 +------------ +ph 18 +hstart 23 hend 25 diff 2 +------------ +ph 19 +hstart 24 hend 26 diff 2 +------------ +ph 20 +hstart 25 hend 27 diff 2 +------------ +ph 21 +hstart 26 hend 29 diff 3 +------------ +ph 22 +hstart 28 hend 30 diff 2 +------------ +ph 23 +hstart 29 hend 31 diff 2 +------------ +ph 24 +hstart 30 hend 32 diff 2 +==================== +------------ +ph 0 +hstart 1 hend 2 diff 1 +------------ +ph 1 +hstart 2 hend 3 diff 1 +------------ +ph 2 +hstart 3 hend 4 diff 1 +------------ +ph 3 +hstart 4 hend 6 diff 2 +------------ +ph 4 +hstart 6 hend 7 diff 1 +------------ +ph 5 +hstart 7 hend 8 diff 1 +------------ +ph 6 +hstart 8 hend 9 diff 1 +------------ +ph 7 +hstart 9 hend 11 diff 2 +------------ +ph 8 +hstart 11 hend 12 diff 1 +------------ +ph 9 +hstart 12 hend 13 diff 1 +------------ +ph 10 +hstart 13 hend 15 diff 2 +------------ +ph 11 +hstart 15 hend 16 diff 1 +------------ +ph 12 +hstart 16 hend 17 diff 1 +------------ +ph 13 +hstart 17 hend 18 diff 1 +------------ +ph 14 +hstart 18 hend 20 diff 2 +------------ +ph 15 +hstart 20 hend 21 diff 1 +------------ +ph 16 +hstart 21 hend 22 diff 1 +------------ +ph 17 +hstart 22 hend 24 diff 2 +------------ +ph 18 +hstart 24 hend 25 diff 1 +------------ +ph 19 +hstart 25 hend 26 diff 1 +------------ +ph 20 +hstart 26 hend 27 diff 1 +------------ +ph 21 +hstart 27 hend 29 diff 2 +------------ +ph 22 +hstart 29 hend 30 diff 1 +------------ +ph 23 +hstart 30 hend 31 diff 1 +------------ +ph 24 +hstart 31 hend 32 diff 1 +alpha is 1.28 u is 0.8 max u is 1 + +``` + +可以看到 +- adaptive 的池化序列为 `2...3...` 的样式,fractional 的池化序列为 `1...2...` 的样式。 +- adaptive 的池化序列存在 index 交叉,而 fractional 不存在交叉。 + +另外: +- `FractionalStrartIndex` 和 `FractionalEndIndex` 需要减去 `1`,因为根据论文中的算法要求,使用 `ceil`,将使 index 从 `1` 开始,所以这里需要减去 `1`。 +- `hend = std::min(hend, input_height);` 这里需要与 input 比对取小值,同样是由于 `ceil` 导致。 + ## python layer 实现 涉及文件: From 8ba69c89fbc11a22200ab72174d5d265ffb59a0e Mon Sep 17 00:00:00 2001 From: megemini Date: Thu, 16 Nov 2023 14:16:07 +0800 Subject: [PATCH 4/6] [Update] about pytorch kernel_size --- rfcs/APIs/20231009_api_design_for_fractional_max_pool.md | 1 + 1 file changed, 1 insertion(+) diff --git a/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md b/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md index fbf69edc7..d0c4aed3d 100644 --- a/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md +++ b/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md @@ -748,6 +748,7 @@ v2.0: 将实现方式由 python 改为 c++ 这是 `PyTorch` 与 `TensorFlow` 最大的不同点。文章中的 `fractional` 根据 `N_in/N_out` 得出,也就是说,只需要这两个参数即可。 `PyTorch` 提供了 `kernel_size`、`output_size`、`output_ratio` 这三个参数,这三个参数都可以影响 `N_in/N_out`,这更像是传统池化的方法。 `TensorFlow` 只提供了 `pooling_ratio`,利用这个参数即可得到 `N_out`,而且提供了 `overlapping` 参数,利用这个参数可以影响 `kernel_size`。而且,由此可以看出,`TensorFlow` 实现的 `fractional max pooling` 更具有一般性,而 `adaptive max pooling` 则可以看作 `fractional max pooling` 的一种特例。 + `PyTorch` 只利用随机序列作为 stride,而不是同时将其作为 kernel 进行池化,`TensorFlow` 将随机序列既作为 stride 同时也作为 kernel 进行池化,更符合论文中的描述方式,所以,这里以 `TensorFlow` 的方式进行实现。 - `fractional max pooling` : `a = ceiling(alpha(i+u)), 1 < alpha = N_in/N_out < 2, 0 < u < 1` - `adaptive max pooling` : `a = ceiling(alpha(i+1)), 1 < alpha = N_in/N_out < 2` From 34fb745007154ffbc20586435bdbbea4abe4ed70 Mon Sep 17 00:00:00 2001 From: megemini Date: Fri, 12 Jan 2024 19:32:45 +0800 Subject: [PATCH 5/6] [Change] signature --- ...1009_api_design_for_fractional_max_pool.md | 83 
+++++++++---------- 1 file changed, 39 insertions(+), 44 deletions(-) diff --git a/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md b/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md index d0c4aed3d..24bf7a362 100644 --- a/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md +++ b/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md @@ -3,13 +3,14 @@ | API 名称 | FractionalMaxPool2d / FractionalMaxPool3d | | - | - | | 提交作者 | megemini(柳顺) | -| 提交时间 | 2023-10-09 | -| 版本号 | V2.0 | +| 提交时间 | 2024-01-12 | +| 版本号 | V2.1 | | 依赖飞桨版本 | develop | | 文件名 | 20231009_api_design_for_fractional_max_pool.md | #### 修订记录 v2.0: 将实现方式由 python 改为 c++ +v2.1: 修改接口签名 # 一、概述 @@ -783,8 +784,10 @@ v2.0: 将实现方式由 python 改为 c++ ``` python paddle.nn.functional.fractional_max_pool2d( - x:Tensor, + x:Tensor, output_size:Union[int, list, tuple], + kernel_size:Optional[Union[int, list, tuple]]=None, + random_u:Optional[float]=None, return_mask:bool=False, name:str=None) ``` @@ -792,6 +795,8 @@ v2.0: 将实现方式由 python 改为 c++ - 参数列表 > x (Tensor) – 输入的一个 Tensor。数据类型支持:float32、float64、int32、int64。 > output_size (int|list|tuple) – 输出的尺寸。 + > kernel_size (int|list|tuple, optional) – 核大小。 + > random_u (float, optional) – 随机序列所需随机数。 > return_mask (bool, optional) – 是否返回最大值的索引。 > name (str, optional) – 操作名称。 @@ -804,8 +809,10 @@ v2.0: 将实现方式由 python 改为 c++ ``` python paddle.nn.functional.fractional_max_pool3d( - x:Tensor, + x:Tensor, output_size:Union[int, list, tuple], + kernel_size:Optional[Union[int, list, tuple]]=None, + random_u:Optional[float]=None, return_mask:bool=False, name:str=None) ``` @@ -813,6 +820,8 @@ v2.0: 将实现方式由 python 改为 c++ - 参数列表 > x (Tensor) – 输入的一个 Tensor。数据类型支持:float32、float64、int32、int64。 > output_size (int|list|tuple) – 输出的尺寸。 + > kernel_size (int|list|tuple, optional) – 核大小。 + > random_u (float, optional) – 随机序列所需随机数。 > return_mask (bool, optional) – 是否返回最大值的索引。 > name (str, optional) – 操作名称。 @@ -843,10 +852,27 @@ v2.0: 将实现方式由 python 改为 c++ - 移除 `pseudo_random`, `overlapping`, `seed` - 由于 `共用 adaptive max pooling 底层算子`,且参考 `PyTorch` 的设计方案,这里将只使用 `伪` 随机的方式生成池化序列,并在 c++ 算子内部实现。 + 参考 `PyTorch` 的设计方案,这里将只使用 `伪` 随机的方式生成池化序列,并在 c++ 算子内部实现。 + +*注意* : 相较 v2.0 版本的设计文档,这里增加多个参数,特说明如下: + +- `kernel_size` + + 此参数默认为 `None`,表示使用 `disjoint(non-overlapping)` 模式。 + 当此参数不为 `None` 时,使用 `overlapping` 模式,与 PyTorch 的实现保持一致。此处参考 Fractional Max-Pool 作者 Benjamin Graham 的解释: + + > Hello. My original implementation (for sparse ConvNets) generated regions using this code:https://github.com/btgraham/SparseConvNet-archived/blob/bdde325c28f64b895cebfdbe301a2ddca7870174/SparseConvNet/Regions.cu#L31 + + 并与作者提供的代码保持一致。 + +- `random_u` + + 增加随机序列所需的随机数参数,以方便进行复现。 ## 底层 OP 设计 +> *注意* 以下具体实现以实际代码为准。 + 涉及文件: - `paddle/phi/api/yaml/ops.yaml` 算子描述及定义 @@ -1921,49 +1947,18 @@ alpha is 1.28 u is 0.8 max u is 1 TODO(megemini) """ - def __init__(self, output_size, return_mask=False, name=None): - super().__init__() - self._output_size = output_size - self._return_mask = return_mask - self._name = name - - def forward(self, x): - return F.fractional_max_pool2d( - x, - output_size=self._output_size, - return_mask=self._return_mask, - name=self._name, - ) - - def extra_repr(self): - return ( - f'output_size={self._output_size}, return_mask={self._return_mask}' - ) - + def __init__(self, output_size, kernel_size=None, random_u=None, return_mask=False, name=None): + super().__init__() + ... 
class FractionalMaxPool3D(Layer): - """ - TODO(megemini) - """ + """ + TODO(megemini) + """ - def __init__(self, output_size, return_mask=False, name=None): + def __init__(self, output_size, kernel_size=None, random_u=None, return_mask=False, name=None): super().__init__() - self._output_size = output_size - self._return_mask = return_mask - self._name = name - - def forward(self, x): - return F.fractional_max_pool3d( - x, - output_size=self._output_size, - return_mask=self._return_mask, - name=self._name, - ) - - def extra_repr(self): - return ( - f'output_size={self._output_size}, return_mask={self._return_mask}' - ) + ... ``` 主要通过调用相应的方法实现。 From eccc47c93bffc075b11a783f6dcff8da135573b1 Mon Sep 17 00:00:00 2001 From: megemini Date: Mon, 15 Jan 2024 15:42:46 +0800 Subject: [PATCH 6/6] [Change] implements desc --- ...1009_api_design_for_fractional_max_pool.md | 898 ++++-------------- 1 file changed, 162 insertions(+), 736 deletions(-) diff --git a/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md b/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md index 24bf7a362..324145ed0 100644 --- a/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md +++ b/rfcs/APIs/20231009_api_design_for_fractional_max_pool.md @@ -9,6 +9,7 @@ | 文件名 | 20231009_api_design_for_fractional_max_pool.md | #### 修订记录 + v2.0: 将实现方式由 python 改为 c++ v2.1: 修改接口签名 @@ -26,6 +27,7 @@ v2.1: 修改接口签名 在一个由多个通道组成的输入信号上施加分数最大池化。分数最大池化请参考论文 [《Fractional Max-Pooling》](https://arxiv.org/abs/1412.6071) 调用形式 + - `paddle.nn.FractionalMaxPool2d` - `paddle.nn.FractionalMaxPool3d` - `paddle.nn.functional.fractional_max_pool2d` @@ -59,8 +61,8 @@ v2.1: 修改接口签名 随机生成 `1` 和 `2` 的序列,只要满足: - - 序列长度为 `output_size` - - 序列累加和为 `input_size` + - 序列长度为 `output_size` + - 序列累加和为 `input_size` - `伪` 随机(`pseudo random`) @@ -87,66 +89,64 @@ v2.1: 修改接口签名 - [TORCH.NN.FUNCTIONAL.FRACTIONAL_MAX_POOL2D](https://pytorch.org/docs/stable/generated/torch.nn.functional.fractional_max_pool2d.html#torch.nn.functional.fractional_max_pool2d) - [TORCH.NN.FUNCTIONAL.FRACTIONAL_MAX_POOL3D](https://pytorch.org/docs/stable/generated/torch.nn.functional.fractional_max_pool3d.html#torch.nn.functional.fractional_max_pool3d) - 相应接口为: - `torch.nn.FractionalMaxPool2d(kernel_size, output_size=None, output_ratio=None, return_indices=False, _random_samples=None)` - - 文档描述 + - 文档描述 > Applies 2D fractional max pooling over an input signal composed of several input planes. - - 参数列表 + - 参数列表 > kernel_size – the size of the window to take a max over. > output_size – the target output size > output_ratio – If one wants to have an output size as a ratio of the input size, this option can be given. This has to be a number or tuple in the range (0, 1) > return_indices – if True, will return the indices along with the outputs. - - 返回值 + - 返回值 > output (Tensor) - `torch.nn.FractionalMaxPool3d(kernel_size, output_size=None, output_ratio=None, return_indices=False, _random_samples=None)` - - 文档描述 + - 文档描述 > Applies 3D fractional max pooling over an input signal composed of several input planes. - - 参数列表 + - 参数列表 > kernel_size – the size of the window to take a max over. > output_size – the target output size > output_ratio – If one wants to have an output size as a ratio of the input size, this option can be given. This has to be a number or tuple in the range (0, 1) > return_indices – if True, will return the indices along with the outputs. 
- - 返回值 + - 返回值 > output (Tensor) - `torch.nn.functional.fractional_max_pool2d(input, kernel_size, output_size=None, output_ratio=None, return_indices=False, _random_samples=None)` - - 文档描述 + - 文档描述 > Applies 2D fractional max pooling over an input signal composed of several input planes. - - 参数列表 + - 参数列表 > kernel_size – the size of the window to take a max over. > output_size – the target output size > output_ratio – If one wants to have an output size as a ratio of the input size, this option can be given. This has to be a number or tuple in the range (0, 1) > return_indices – if True, will return the indices along with the outputs. - - 返回值 + - 返回值 > output (Tensor) - `torch.nn.functional.fractional_max_pool3d(input, kernel_size, output_size=None, output_ratio=None, return_indices=False, _random_samples=None)` - - 文档描述 + - 文档描述 > Applies 3D fractional max pooling over an input signal composed of several input planes. - - 参数列表 + - 参数列表 > kernel_size – the size of the window to take a max over. > output_size – the target output size > output_ratio – If one wants to have an output size as a ratio of the input size, this option can be given. This has to be a number or tuple in the range (0, 1) > return_indices – if True, will return the indices along with the outputs. - - 返回值 + - 返回值 > output (Tensor) - 实现逻辑: 由于 `fractional_max_pool2d` 与 `fractional_max_pool3d` 最大的区别是维度,其他逻辑基本相同,所以,后续以 `fractional_max_pool2d` 为主要分析对象。 @@ -242,7 +242,6 @@ v2.1: 修改接口签名 这里根据是否需要 `indices` 对接口进行分发,最终都是调用 `fractional_max_pool2d_with_indices`。 - - `torch/csrc/api/include/torch/nn/functional/pooling.h` 上面的接口会调用这里对应的 c++ 实现: @@ -284,10 +283,9 @@ v2.1: 修改接口签名 这里是 `fractional_max_pool2d` 主要入口,主要做了以下几处处理: - - 如果没有 `output_size`,根据 `output_ratio` 生成 `output_size` - - 如果没有 `_random_samples`,根据输入的维度生成随机序列 - - 调用主要方法 `torch::fractional_max_pool2d(input, kernel_size, *output_size_, _random_samples_);}` - + - 如果没有 `output_size`,根据 `output_ratio` 生成 `output_size` + - 如果没有 `_random_samples`,根据输入的维度生成随机序列 + - 调用主要方法 `torch::fractional_max_pool2d(input, kernel_size, *output_size_, _random_samples_);}` - `aten/src/ATen/native/FractionalMaxPool2d.cpp` @@ -357,11 +355,11 @@ v2.1: 修改接口签名 ``` 此文件实现了 `fractional_max_pool2d` 的主要逻辑,上面只摘抄了最关键的代码。 - + 主要逻辑为: - - 生成采样的序列 - - 获取序列中的每个 pool 中的最大值 + - 生成采样的序列 + - 获取序列中的每个 pool 中的最大值 其中,生成采样序列的逻辑在 `aten/src/ATen/native/FractionalMaxPooling.h`: @@ -391,7 +389,6 @@ v2.1: 修改接口签名 从上面的源代码分析可以看到,`PyTorch` 对于 `fractional_max_pool` 只实现了 `pseudo random` 的方式,而没有 `random` 的方式。 - ## TensorFlow `TensorFlow` 实现了 `tf.nn.fractional_max_pool` 函数,对应 `PyTorch` 的函数为 `fractional_max_pool2d`。 @@ -411,36 +408,36 @@ v2.1: 修改接口签名 - `tf.raw_ops.FractionalMaxPool` - - 文档描述 + - 文档描述 > Performs fractional max pooling on the input. - - 参数列表 + - 参数列表 > value – A Tensor. 4-D with shape [batch, height, width, channels]. - > pooling_ratio – An int or list of ints that has length 1, 2 or 4. + > pooling_ratio – An int or list of ints that has length 1, 2 or 4. > pseudo_random – An optional bool. Defaults to False. When set to True, generates the pooling sequence in a pseudorandom fashion, otherwise, in a random fashion. > overlapping – An optional bool. Defaults to False. When set to True, it means when pooling, the values at the boundary of adjacent pooling cells are used by both cells. - > deterministic – An optional bool. Defaults to False. When set to True, a fixed pooling region will be used when iterating over a FractionalMaxPool node in the computation graph. + > deterministic – An optional bool. Defaults to False. 
When set to True, a fixed pooling region will be used when iterating over a FractionalMaxPool node in the computation graph. > seed – An optional int. Defaults to 0. If set to be non-zero, the random number generator is seeded by the given seed. Otherwise it is seeded by a random seed. > seed2 – An optional int. Defaults to 0. An second seed to avoid seed collision. > name – A name for the operation (optional). - - 返回值 + - 返回值 > output (A tuple of Tensor objects) - `tf.nn.fractional_max_pool` - - 文档描述 + - 文档描述 > Performs fractional max pooling on the input. - - 参数列表 + - 参数列表 > value – A Tensor. 4-D with shape [batch, height, width, channels]. - > pooling_ratio – An int or list of ints that has length 1, 2 or 4. + > pooling_ratio – An int or list of ints that has length 1, 2 or 4. > pseudo_random – An optional bool. Defaults to False. When set to True, generates the pooling sequence in a pseudorandom fashion, otherwise, in a random fashion. > overlapping – An optional bool. Defaults to False. When set to True, it means when pooling, the values at the boundary of adjacent pooling cells are used by both cells. > seed – An optional int. Defaults to 0. If set to be non-zero, the random number generator is seeded by the given seed. Otherwise it is seeded by a random seed. > name – A name for the operation (optional). - - 返回值 + - 返回值 > output (A tuple of Tensor objects) 实现逻辑: @@ -507,10 +504,10 @@ v2.1: 修改接口签名 ``` 可以看到,与 `PyTorch` 不同的是,`TensorFlow` 多了几个参数: - - - `overlapping` 控制 pool 边界是否计算在内 - - `pseudo_random` 是否是伪随机 - - `seed` 随机种子 + + - `overlapping` 控制 pool 边界是否计算在内 + - `pseudo_random` 是否是伪随机 + - `seed` 随机种子 - `tensorflow/core/kernels/fractional_max_pool_op.cc` @@ -730,7 +727,6 @@ v2.1: 修改接口签名 这里根据 `pseudo_random` 标记为生成 `伪` 随机序列,或者 `真` 随机序列。 - # 四、对比分析 抛开 `PyTorch` 与 `TensorFlow` 对于 API 的组织方式不同来说,两者: @@ -751,8 +747,8 @@ v2.1: 修改接口签名 `TensorFlow` 只提供了 `pooling_ratio`,利用这个参数即可得到 `N_out`,而且提供了 `overlapping` 参数,利用这个参数可以影响 `kernel_size`。而且,由此可以看出,`TensorFlow` 实现的 `fractional max pooling` 更具有一般性,而 `adaptive max pooling` 则可以看作 `fractional max pooling` 的一种特例。 `PyTorch` 只利用随机序列作为 stride,而不是同时将其作为 kernel 进行池化,`TensorFlow` 将随机序列既作为 stride 同时也作为 kernel 进行池化,更符合论文中的描述方式,所以,这里以 `TensorFlow` 的方式进行实现。 - - `fractional max pooling` : `a = ceiling(alpha(i+u)), 1 < alpha = N_in/N_out < 2, 0 < u < 1` - - `adaptive max pooling` : `a = ceiling(alpha(i+1)), 1 < alpha = N_in/N_out < 2` + - `fractional max pooling` : `a = ceiling(alpha(i+u)), 1 < alpha = N_in/N_out < 2, 0 < u < 1` + - `adaptive max pooling` : `a = ceiling(alpha(i+1)), 1 < alpha = N_in/N_out < 2` 另外,两者都有反向梯度的计算(由于不影响主要逻辑分析,且代码较多,上述代码分析没有具体列出)。 @@ -762,7 +758,6 @@ v2.1: 修改接口签名 为了保持一致性,这里也只使用 `output_size` 一个必要参数,实现方法更接近文章以及 `TensorFlow`。 - # 五、设计思路与实现方案 本方案共涉及三部分: @@ -771,7 +766,7 @@ v2.1: 修改接口签名 - 底层 OP 设计 - python layer 实现 : `paddle.nn.FractionalMaxPool2d`, `paddle.nn.FractionalMaxPool3d` -由于 `fractional max pooling` 与 `adaptive max pooling` 接口特性较为相似,后续设计方案以 `共用 adaptive max pooling 底层算子` 为主要设计思路。 +由于 `fractional max pooling` 与 `adaptive max pooling` 接口特性较为相似,后续设计方案参考 `adaptive max pooling` 算子的实现方式,并单独实现 `fractional max pooling` 算子。 ## 命名与参数设计 (python API) @@ -792,7 +787,7 @@ v2.1: 修改接口签名 name:str=None) ``` - - 参数列表 + - 参数列表 > x (Tensor) – 输入的一个 Tensor。数据类型支持:float32、float64、int32、int64。 > output_size (int|list|tuple) – 输出的尺寸。 > kernel_size (int|list|tuple, optional) – 核大小。 @@ -800,13 +795,13 @@ v2.1: 修改接口签名 > return_mask (bool, optional) – 是否返回最大值的索引。 > name (str, optional) – 操作名称。 - - 返回值 + - 返回值 > Tensor, return_mask=False > Tensor and mask, 
return_mask=True - `paddle.nn.functional.fractional_max_pool3d` - `paddle.nn.FractionalMaxPool3d` - + ``` python paddle.nn.functional.fractional_max_pool3d( x:Tensor, @@ -817,7 +812,7 @@ v2.1: 修改接口签名 name:str=None) ``` - - 参数列表 + - 参数列表 > x (Tensor) – 输入的一个 Tensor。数据类型支持:float32、float64、int32、int64。 > output_size (int|list|tuple) – 输出的尺寸。 > kernel_size (int|list|tuple, optional) – 核大小。 @@ -825,7 +820,7 @@ v2.1: 修改接口签名 > return_mask (bool, optional) – 是否返回最大值的索引。 > name (str, optional) – 操作名称。 - - 返回值 + - 返回值 > Tensor, return_mask=False > Tensor and mask, return_mask=True @@ -837,9 +832,9 @@ v2.1: 修改接口签名 分析目前 pooling 接口主要源文件 `python/paddle/nn/functional/pooling.py`,以 `max_pool2d` 为例: - - 主要涉及两个底层算子: `max_pool2d_with_index` 和 `pool2d` - - 其中 `max_pool2d_with_index` 可以返回 `mask`,`pool2d` 不可以返回 `mask` - - 其中 `max_pool2d_with_index` 不支持 `data_format`,`pool2d` 支持 `data_format` + - 主要涉及两个底层算子: `max_pool2d_with_index` 和 `pool2d` + - 其中 `max_pool2d_with_index` 可以返回 `mask`,`pool2d` 不可以返回 `mask` + - 其中 `max_pool2d_with_index` 不支持 `data_format`,`pool2d` 支持 `data_format` 因此,当使用 `return_mask` 返回 `mask` 时,`data_format must be set to NCHW`。 没有一个算子能够完整支持这两个参数,这是目前 pooling 底层算子较大的矛盾。 @@ -850,7 +845,7 @@ v2.1: 修改接口签名 不使用 `data_format` 参数。 -- 移除 `pseudo_random`, `overlapping`, `seed` +- 移除 `pseudo_random`, `overlapping`, `seed` 参考 `PyTorch` 的设计方案,这里将只使用 `伪` 随机的方式生成池化序列,并在 c++ 算子内部实现。 @@ -861,7 +856,7 @@ v2.1: 修改接口签名 此参数默认为 `None`,表示使用 `disjoint(non-overlapping)` 模式。 当此参数不为 `None` 时,使用 `overlapping` 模式,与 PyTorch 的实现保持一致。此处参考 Fractional Max-Pool 作者 Benjamin Graham 的解释: - > Hello. My original implementation (for sparse ConvNets) generated regions using this code:https://github.com/btgraham/SparseConvNet-archived/blob/bdde325c28f64b895cebfdbe301a2ddca7870174/SparseConvNet/Regions.cu#L31 + > Hello. 
My original implementation (for sparse ConvNets) generated regions using this code: 并与作者提供的代码保持一致。 @@ -878,47 +873,30 @@ v2.1: 修改接口签名 - `paddle/phi/api/yaml/ops.yaml` 算子描述及定义 ``` yaml - - op : max_pool2d_with_index - args : (Tensor x, int[] kernel_size, int[] strides= {1, 1}, int[] paddings = {0, 0}, bool global_pooling = false, bool adaptive = false, bool fractional = false) - output : Tensor(out), Tensor(mask) - infer_meta : - func : MaxPoolWithIndexInferMeta - kernel : - func : max_pool2d_with_index - backward : max_pool2d_with_index_grad - - - op : max_pool3d_with_index - args : (Tensor x, int[] kernel_size, int[] strides = {1, 1, 1}, int[] paddings = {0, 0, 0}, bool global_pooling = false, bool adaptive = false, bool fractional = false) + - op : fractional_max_pool2d + args : (Tensor x, int[] output_size, int[] kernel_size = {0, 0}, float random_u = 0.0, bool return_mask = true) output : Tensor(out), Tensor(mask) infer_meta : - func : MaxPoolWithIndexInferMeta + func : FractionalMaxPoolInferMeta kernel : - func : max_pool3d_with_index - backward : max_pool3d_with_index_grad + func : fractional_max_pool2d + backward : fractional_max_pool2d_grad ``` - + 增加 `bool` 类型 `fractional` 参数,默认为 `false` - `paddle/phi/api/yaml/backward.yaml` 算子描述及定义 ``` yaml - - backward_op : max_pool2d_with_index_grad - forward : max_pool2d_with_index(Tensor x, int[] kernel_size, int[] strides = {1, 1}, int[] paddings = {0, 0}, bool global_pooling = false, bool adaptive = false, bool fractional = false) -> Tensor(out), Tensor(mask) - args : (Tensor x, Tensor mask, Tensor out_grad, int[] kernel_size, int[] strides, int[] paddings, bool global_pooling, bool adaptive, bool fractional) + - backward_op : fractional_max_pool2d_grad + forward : fractional_max_pool2d(Tensor x, int[] output_size, int[] kernel_size = {0, 0}, float random_u = 0.0, bool return_mask = true) -> Tensor(out), Tensor(mask) + args : (Tensor x, Tensor mask, Tensor out_grad, int[] output_size, int[] kernel_size, float random_u, bool return_mask) output : Tensor(x_grad) infer_meta : - func : MaxPoolWithIndexGradInferMeta + func : UnchangedInferMeta + param : [x] kernel : - func : max_pool2d_with_index_grad - - - backward_op : max_pool3d_with_index_grad - forward : max_pool3d_with_index(Tensor x, int[] kernel_size, int[] strides = {1, 1, 1}, int[] paddings = {0, 0, 0}, bool global_pooling = false, bool adaptive = false, bool fractional = false) -> Tensor(out), Tensor(mask) - args : (Tensor x, Tensor mask, Tensor out_grad, int[] kernel_size, int[] strides, int[] paddings, bool global_pooling, bool adaptive, bool fractional) - output : Tensor(x_grad) - infer_meta : - func : MaxPoolWithIndexGradInferMeta - kernel : - func : max_pool3d_with_index_grad + func : fractional_max_pool2d_grad ``` 增加 `bool` 类型 `fractional` 参数,默认为 `false` @@ -926,175 +904,62 @@ v2.1: 修改接口签名 - `paddle/phi/infermeta/unary.h` 算子 InferMeta ``` cpp - void MaxPoolWithIndexInferMeta(const MetaTensor& x, - const std::vector& kernel_size, - const std::vector& strides, - const std::vector& paddings, - bool global_pooling, - bool adaptive, - bool fractional, - MetaTensor* out, - MetaTensor* mask, - MetaConfig config = MetaConfig()); - ``` - 增加 `fractional` 参数 - -- `paddle/phi/infermeta/unary.cc` - - ``` cpp - void MaxPoolWithIndexInferMeta(const MetaTensor& x, - const std::vector& kernel_size, - const std::vector& strides, - const std::vector& paddings, - bool global_pooling, - bool adaptive, - bool fractional, - MetaTensor* out, - MetaTensor* mask, - MetaConfig config) { - ... 
- if (adaptive || fractional) { - output_shape.insert( - output_shape.end(), kernel_size_.begin(), kernel_size_.end()); - } else { - ... - } - ... - } - ``` - - 增加 `fractional` 参数,并且,与 `adaptive` 一样,共用 `kernel_size_` 参数,此参数在此实际为 `output_size`。 - -- `paddle/phi/infermeta/backward.h` - - ``` cpp - void MaxPoolWithIndexGradInferMeta(const MetaTensor& x, - const MetaTensor& mask, - const MetaTensor& dout, - const std::vector& kernel_size, - const std::vector& strides, - const std::vector& paddings, - bool global_pooling, - bool adaptive, - bool fractional, - MetaTensor* dx); - ``` - - 增加 `fractional` 参数。 - -- `paddle/phi/infermeta/backward.cc` - - ``` cpp - void MaxPoolWithIndexGradInferMeta(const MetaTensor& x, - const MetaTensor& mask, - const MetaTensor& dout, + void FractionalMaxPoolInferMeta(const MetaTensor& x, + const std::vector& output_size, const std::vector& kernel_size, - const std::vector& strides, - const std::vector& paddings, - bool global_pooling, - bool adaptive, - bool fractional, - MetaTensor* dx) { - dx->share_meta(x); - } + float random_u, + bool return_mask, + MetaTensor* out, + MetaTensor* mask, + MetaConfig config = MetaConfig()); ``` - 增加 `fractional` 参数。 - - `paddle/phi/kernels/pool_kernel.h` 算子 Kernel ``` cpp template - void MaxPool2dWithIndexKernel(const Context& ctx, + void FractionalMaxPool2dKernel(const Context& ctx, const DenseTensor& x, + const std::vector& output_size, const std::vector& kernel_size, - const std::vector& strides, - const std::vector& paddings, - bool global_pooling, - bool adaptive, - bool fractional, - DenseTensor* out, - DenseTensor* mask); - - template - void MaxPool3dWithIndexKernel(const Context& ctx, - const DenseTensor& x, - const std::vector& kernel_size, - const std::vector& strides, - const std::vector& paddings, - bool global_pooling, - bool adaptive, - bool fractional, + float random_u, + bool return_mask, DenseTensor* out, DenseTensor* mask); ``` - 增加 `fractional` 参数。 - - - `paddle/phi/kernels/funcs/pooling.h` ``` cpp template - class MaxPool2dWithIndexFunctor { - public: - void operator()(const Context& context, - const DenseTensor& input, - const std::vector& ksize, - const std::vector& strides, - const std::vector& paddings, - bool adaptive, - bool fractional, - DenseTensor* output, - DenseTensor* mask); - }; - - template - class MaxPool2dWithIndexGradFunctor { - public: - void operator()(const Context& context, - const DenseTensor& output_grad, - const DenseTensor& mask, - const std::vector& ksize, - const std::vector& strides, - const std::vector& paddings, - bool adaptive, - bool fractional, - DenseTensor* input_grad); - }; - - template - class MaxPool3dWithIndexFunctor { + class FractionalMaxPool2dFunctor { public: void operator()(const Context& context, const DenseTensor& input, - const std::vector& ksize, - const std::vector& strides, - const std::vector& paddings, - bool adaptive, - bool fractional, + const std::vector& output_size, + const std::vector& kernel_size, + float random_u, + bool return_mask, DenseTensor* output, DenseTensor* mask); }; template - class MaxPool3dWithIndexGradFunctor { + class FractionalMaxPool2dGradFunctor { public: void operator()(const Context& context, const DenseTensor& output_grad, const DenseTensor& mask, - const std::vector& ksize, - const std::vector& strides, - const std::vector& paddings, - bool adaptive, - bool fractional, + const std::vector& output_size, + const std::vector& kernel_size, + float random_u, + bool return_mask, DenseTensor* input_grad); }; ``` - 增加 `fractional` 参数。 - 
``` cpp + HOSTDEVICE inline float FractionalRationalU() HOSTDEVICE inline int FractionalStartIndex() HOSTDEVICE inline int FractionalEndIndex() ``` @@ -1104,420 +969,87 @@ v2.1: 修改接口签名 - `paddle/phi/kernels/impl/pool_kernel_impl.h` ``` cpp - template - void MaxPoolWithIndexRawKernel(const Context& ctx, - const DenseTensor& x, - const std::vector& kernel_size, - const std::vector& strides, - const std::vector& paddings, - bool global_pooling, - bool adaptive, - bool fractional, - DenseTensor* out, - DenseTensor* mask) { - std::vector paddings_ = paddings; - std::vector kernel_size_ = kernel_size; - - if (global_pooling) { - for (size_t i = 0; i < kernel_size_.size(); ++i) { - paddings_[i] = 0; - kernel_size_[i] = static_cast(x.dims()[i + 2]); - } - } - - switch (kernel_size_.size()) { - case 2: { - funcs::MaxPool2dWithIndexFunctor pool2d_forward; - pool2d_forward(ctx, - x, - kernel_size_, - strides, - paddings_, - adaptive, - fractional, - out, - mask); - } break; - case 3: { - funcs::MaxPool3dWithIndexFunctor pool3d_forward; - pool3d_forward(ctx, - x, - kernel_size_, - strides, - paddings_, - adaptive, - fractional, - out, - mask); - } break; - default: { - PADDLE_THROW( - errors::InvalidArgument("Pool op only supports 2D and 3D input.")); - } - } - } - - template - void MaxPool2dWithIndexKernel(const Context& ctx, - const DenseTensor& x, - const std::vector& kernel_size, - const std::vector& strides, - const std::vector& paddings, - bool global_pooling, - bool adaptive, - bool fractional, - DenseTensor* out, - DenseTensor* mask) { - MaxPoolWithIndexRawKernel(ctx, - x, - kernel_size, - strides, - paddings, - global_pooling, - adaptive, - fractional, - out, - mask); - } - - template - void MaxPool3dWithIndexKernel(const Context& ctx, - const DenseTensor& x, - const std::vector& kernel_size, - const std::vector& strides, - const std::vector& paddings, - bool global_pooling, - bool adaptive, - bool fractional, - DenseTensor* out, - DenseTensor* mask) { - MaxPoolWithIndexRawKernel(ctx, - x, - kernel_size, - strides, - paddings, - global_pooling, - adaptive, - fractional, - out, - mask); - } - + void FractionalMaxPoolRawKernel(const Context& ctx, + const DenseTensor& x, + const std::vector& output_size, + const std::vector& kernel_size, + float random_u, + bool return_mask, + DenseTensor* out, + DenseTensor* mask) ``` - 增加 `fractional` 参数,分发方法时带上 `fracional`。 - - `paddle/phi/kernels/pool_grad_kernel.h` 反向算子 ``` cpp template - void MaxPool2dWithIndexGradKernel(const Context& ctx, + void FractionalMaxPool2dGradKernel(const Context& ctx, const DenseTensor& x, const DenseTensor& mask, const DenseTensor& dout, + const std::vector& output_size, const std::vector& kernel_size, - const std::vector& strides, - const std::vector& paddings, - bool global_pooling, - bool adaptive, - bool fracional, + float random_u, + bool return_mask, DenseTensor* dx); template - void MaxPool3dWithIndexGradKernel(const Context& ctx, + void FractionalMaxPool3dGradKernel(const Context& ctx, const DenseTensor& x, const DenseTensor& mask, const DenseTensor& dout, + const std::vector& output_size, const std::vector& kernel_size, - const std::vector& strides, - const std::vector& paddings, - bool global_pooling, - bool adaptive, - bool fracional, + float random_u, + bool return_mask, DenseTensor* dx); - ``` - 增加 `fractional` 参数。 - - `paddle/phi/kernels/impl/pool_grad_kernel_impl.h` ``` cpp template - void MaxPoolWithIndexGradRawKernel(const Context& ctx, - const DenseTensor& x UNUSED, - const DenseTensor& mask, - const 
DenseTensor& dout, - const std::vector& kernel_size, - const std::vector& strides, - const std::vector& paddings, - bool global_pooling, - bool adaptive, - bool fractional, - DenseTensor* dx) { - std::vector paddings_ = paddings; - std::vector kernel_size_ = kernel_size; - - if (global_pooling) { - for (size_t i = 0; i < kernel_size_.size(); ++i) { - paddings_[i] = 0; - kernel_size_[i] = static_cast(dx->dims()[i + 2]); - } - } - - if (dx) { - ctx.template Alloc(dx); - funcs::set_constant(ctx, dx, 0); - - switch (kernel_size_.size()) { - case 2: { - funcs::MaxPool2dWithIndexGradFunctor pool2d_backward; - pool2d_backward(ctx, - dout, - mask, - kernel_size_, - strides, - paddings_, - adaptive, - fractional, - dx); - } break; - case 3: { - funcs::MaxPool3dWithIndexGradFunctor pool3d_backward; - pool3d_backward(ctx, - dout, - mask, - kernel_size_, - strides, - paddings_, - adaptive, - fractional, - dx); - } break; - default: { - PADDLE_THROW( - errors::InvalidArgument("Pool op only supports 2D and 3D input.")); - } - } - } - } - - template - void MaxPool2dWithIndexGradKernel(const Context& ctx, - const DenseTensor& x, - const DenseTensor& mask, - const DenseTensor& dout, - const std::vector& kernel_size, - const std::vector& strides, - const std::vector& paddings, - bool global_pooling, - bool adaptive, - bool fractional, - DenseTensor* dx) { - MaxPoolWithIndexGradRawKernel(ctx, - x, - mask, - dout, - kernel_size, - strides, - paddings, - global_pooling, - adaptive, - fractional, - dx); - } - - template - void MaxPool3dWithIndexGradKernel(const Context& ctx, - const DenseTensor& x, - const DenseTensor& mask, - const DenseTensor& dout, - const std::vector& kernel_size, - const std::vector& strides, - const std::vector& paddings, - bool global_pooling, - bool adaptive, - bool fractional, - DenseTensor* dx) { - MaxPoolWithIndexGradRawKernel(ctx, - x, - mask, - dout, - kernel_size, - strides, - paddings, - global_pooling, - adaptive, - fractional, - dx); - } - + void FractionalMaxPoolGradRawKernel(const Context& ctx, + const DenseTensor& x UNUSED, + const DenseTensor& mask, + const DenseTensor& dout, + const std::vector& output_size, + const std::vector& kernel_size, + float random_u, + bool return_mask, + DenseTensor* dx) ``` - 增加 `fractional` 参数,分发方法时带上 `fracional`。 - - `paddle/phi/kernels/funcs/pooling.cc` 算子 CPU 实现 ``` cpp template - class MaxPool2dWithIndexFunctor { + class FractionalMaxPool2dFunctor { public: void operator()(const CPUContext& context, const DenseTensor& input, - const std::vector& ksize, - const std::vector& strides, - const std::vector& paddings, - bool adaptive, - bool fractional, + const std::vector& output_size, + const std::vector& kernel_size, + float random_u, + bool return_mask, DenseTensor* output, - DenseTensor* mask) { - ... 
- - int hstart = 0, hend = 0; - int wstart = 0, wend = 0; - for (int i = 0; i < batch_size; i++) { - for (int c = 0; c < output_channels; ++c) { - for (int ph = 0; ph < output_height; ++ph) { - if (adaptive) { - hstart = AdaptStartIndex(ph, input_height, output_height); - hend = AdaptEndIndex(ph, input_height, output_height); - } else if (fractional) { - // TODO(megemini) - } else { - hstart = ph * stride_height - padding_height; - hend = std::min(hstart + ksize_height, input_height); - hstart = std::max(hstart, 0); - } - for (int pw = 0; pw < output_width; ++pw) { - if (adaptive) { - wstart = AdaptStartIndex(pw, input_width, output_width); - wend = AdaptEndIndex(pw, input_width, output_width); - } else if (fractional) { - // TODO(megemini) - } else { - wstart = pw * stride_width - padding_width; - wend = std::min(wstart + ksize_width, input_width); - wstart = std::max(wstart, 0); - } + DenseTensor* mask) {}; - ... - } - } - ... - } - } - } - }; - - /* - * All tensors are in NCHW format. - * Ksize, strides, paddings are two elements. These two elements represent - * height and width, respectively. - */ template - class MaxPool2dWithIndexGradFunctor { + class FractionalMaxPool2dGradFunctor { public: void operator()(const CPUContext& context, const DenseTensor& output_grad, const DenseTensor& mask, - const std::vector& ksize UNUSED, - const std::vector& strides UNUSED, - const std::vector& paddings UNUSED, - bool adaptive UNUSED, - bool fractional UNUSED, - DenseTensor* input_grad) { - };} - - /* - * All tensors are in NCDHW format. - * Ksize, strides, paddings are three elements. These three elements represent - * depth, height and width, respectively. - */ - template - class MaxPool3dWithIndexFunctor { - public: - void operator()(const CPUContext& context, - const DenseTensor& input, - const std::vector& ksize, - const std::vector& strides, - const std::vector& paddings, - bool adaptive, - bool fractional, - DenseTensor* output, - DenseTensor* mask) { - ... - - int dstart = 0, dend = 0; - int hstart = 0, hend = 0; - int wstart = 0, wend = 0; - for (int i = 0; i < batch_size; i++) { - for (int c = 0; c < output_channels; ++c) { - for (int pd = 0; pd < output_depth; ++pd) { - if (adaptive) { - dstart = AdaptStartIndex(pd, input_depth, output_depth); - dend = AdaptEndIndex(pd, input_depth, output_depth); - } else if (fractional) { - /* TODO(megemini) */ - } else { - dstart = pd * stride_depth - padding_depth; - dend = std::min(dstart + ksize_depth, input_depth); - dstart = std::max(dstart, 0); - } - for (int ph = 0; ph < output_height; ++ph) { - if (adaptive) { - hstart = AdaptStartIndex(ph, input_height, output_height); - hend = AdaptEndIndex(ph, input_height, output_height); - } else if (fractional) { - /* TODO(megemini) */ - } else { - hstart = ph * stride_height - padding_height; - hend = std::min(hstart + ksize_height, input_height); - hstart = std::max(hstart, 0); - } - for (int pw = 0; pw < output_width; ++pw) { - if (adaptive) { - wstart = AdaptStartIndex(pw, input_width, output_width); - wend = AdaptEndIndex(pw, input_width, output_width); - } else if (fractional) { - // TODO(megemini) - } else { - wstart = pw * stride_width - padding_width; - wend = std::min(wstart + ksize_width, input_width); - wstart = std::max(wstart, 0); - } - - ... - } - } - ... - } - } - } - };} - - /* - * All tensors are in NCDHW format. - * Ksize, strides, paddings are three elements. These three elements represent - * depth, height and width, respectively. 
- */ - template - class MaxPool3dWithIndexGradFunctor { - public: - void operator()(const CPUContext& context, - const DenseTensor& output_grad, - const DenseTensor& mask, - const std::vector& ksize UNUSED, - const std::vector& strides UNUSED, - const std::vector& paddings UNUSED, - bool adaptive UNUSED, - bool fractional UNUSED, - DenseTensor* input_grad) { - };} - + const std::vector& output_size UNUSED, + const std::vector& kernel_size UNUSED, + float random_u UNUSED, + bool return_mask UNUSED, + DenseTensor* input_grad) {}; ``` 这里实现主要的 cpu 算子的逻辑(正向与反向),通过 `fractional` 参数生成池化序列,主要逻辑与 `adaptive` 相似。 - + 这里没有 `data_format` 参数的设计,建议后续能够统一 `poolNd` 与 `max_poolNd` 的算子实现。 另外,这里需要再增加一个 `0 < random < 1` 的随机数,以生成 `伪` 随机池化序列,这个随机数需要可以通过 `paddle.seed` 固定住。 @@ -1526,175 +1058,67 @@ v2.1: 修改接口签名 ``` cpp template - __global__ void KernelMaxPool2dWithIdx(const int nthreads, - const T1* input_data, - const int channels, - const int input_height, - const int input_width, - const int output_height, - const int output_width, - const int ksize_height, - const int ksize_width, - const int stride_height, - const int stride_width, - const int padding_height, - const int padding_width, - bool adaptive, - bool fractional, - T1* output_data, - T2* mask_data, - FastDivModForPooling divmods) - - template - __global__ void KernelMaxPool2DWithIdxGrad(const int nthreads, - const T1* output_grad, - const T2* mask_data, + __global__ void FractionalKernelMaxPool2d(const int ncd, + const T1* input_data, const int channels, const int input_height, const int input_width, const int output_height, const int output_width, - const int ksize_height, - const int ksize_width, - const int stride_height, - const int stride_width, - const int padding_height, - const int padding_width, - bool adaptive, - bool fractional, - T1* input_grad, - FastDivModForPooling divmods) - /* - * All tensors are in NCHW format. - * Ksize, strides, paddings are two elements. These two elements represent - * height and width, respectively. - */ + const int pool_height, + const int pool_width, + float random_u, + uint64_t seed, + uint64_t offset, + T1* output_data, + T2* mask_data, + FastDivModForPooling divmods) + template - class MaxPool2dWithIndexFunctor { - public: - void operator()(const phi::GPUContext& context, - const DenseTensor& input, - const std::vector& ksize, - const std::vector& strides, - const std::vector& paddings, - bool adaptive, - bool fractional, - DenseTensor* output, - DenseTensor* mask) - }; - - /* - * All tensors are in NCHW format. - * Ksize, strides, paddings are two elements. These two elements represent - * height and width, respectively. 
- */ + __global__ void FractionalKernelMaxPool2dGrad(const int ncd, + const T1* output_grad, + const T2* mask_data, + const int channels, + const int input_height, + const int input_width, + const int output_height, + const int output_width, + const int pool_height, + const int pool_width, + float random_u, + uint64_t seed, + uint64_t offset, + T1* input_grad, + FastDivModForPooling divmods) + template - class MaxPool2dWithIndexGradFunctor { - public: - void operator()(const phi::GPUContext& context, - const DenseTensor& output_grad, - const DenseTensor& mask, - const std::vector& ksize, - const std::vector& strides, - const std::vector& paddings, - bool adaptive, - bool fractional, - DenseTensor* input_grad) - }; - - - template - __global__ void KernelMaxPool3DWithIdx(const int ncd, - const T1* input_data, - const int channels, - const int input_depth, - const int input_height, - const int input_width, - const int output_depth, - const int output_height, - const int output_width, - const int ksize_depth, - const int ksize_height, - const int ksize_width, - const int stride_depth, - const int stride_height, - const int stride_width, - const int padding_depth, - const int padding_height, - const int padding_width, - bool adaptive, - bool fractional, - T1* output_data, - T2* mask_data, - FastDivModForPooling3D divmods_output) - - template - __global__ void KernelMaxPool3DWithIdxGrad( - const int ncd, - const T1* output_grad, - const T2* mask, - const int channels, - const int input_depth, - const int input_height, - const int input_width, - const int output_depth, - const int output_height, - const int output_width, - const int ksize_depth, - const int ksize_height, - const int ksize_width, - const int stride_depth, - const int stride_height, - const int stride_width, - const int padding_depth, - const int padding_height, - const int padding_width, - bool adaptive, - bool fractional, - T1* input_grad, - FastDivModForPooling3D divmods_output) - - /* - * All tensors are in NCDHW format. - * Ksize, strides, paddings are three elements. These three elements represent - * depth, height and width, respectively. - */ - template - class MaxPool3dWithIndexFunctor { + class FractionalMaxPool2dFunctor { public: void operator()(const phi::GPUContext& context, const DenseTensor& input, - const std::vector& ksize, - const std::vector& strides, - const std::vector& paddings, - bool adaptive, - bool fractional, + const std::vector& output_size, + const std::vector& kernel_size, + float random_u, + bool return_mask, DenseTensor* output, - DenseTensor* mask) + DenseTensor* mask) {}; }; - /* - * All tensors are in NCDHW format. - * Ksize, strides, paddings are three elements. These three elements represent - * depth, height and width, respectively. 
- */
   template <typename T1, typename T2>
-  class MaxPool3dWithIndexGradFunctor {
+  class FractionalMaxPool2dGradFunctor {
    public:
     void operator()(const phi::GPUContext& context,
                     const DenseTensor& output_grad,
                     const DenseTensor& mask,
-                    const std::vector<int>& ksize,
-                    const std::vector<int>& strides,
-                    const std::vector<int>& paddings,
-                    bool adaptive,
-                    bool fractional,
-                    DenseTensor* input_grad)
+                    const std::vector<int>& output_size,
+                    const std::vector<int>& kernel_size,
+                    float random_u,
+                    bool return_mask,
+                    DenseTensor* input_grad) {};
   };
-
   ```
 
-  主要逻辑与 CPU 算子类似,这里不再赘述,有一个需要单独指出的是,PR:https://github.com/PaddlePaddle/Paddle/pull/45959 中,单独针对 `AdaptiveKernelMaxPool2dWithIdx` 做了优化,本次设计方案暂不进行优化方面的设计。
+  主要逻辑与 CPU 算子类似,这里不再赘述,有一个需要单独指出的是,PR: <https://github.com/PaddlePaddle/Paddle/pull/45959> 中,单独针对 `AdaptiveKernelMaxPool2dWithIdx` 做了优化,本次设计方案暂不进行优化方面的设计。
 
 ### 池化序列的生成方法
@@ -1928,10 +1352,12 @@ alpha is 1.28 u is 0.8 max u is 1
 
 ```
 
 可以看到
+
 - adaptive 的池化序列为 `2...3...` 的样式,fractional 的池化序列为 `1...2...` 的样式。
 - adaptive 的池化序列存在 index 交叉,而 fractional 不存在交叉。
 
 另外:
+
 - `FractionalStartIndex` 和 `FractionalEndIndex` 需要减去 `1`:论文中的算法使用 `ceil`,会使 index 从 `1` 开始,所以这里需要减去 `1`,使其从 `0` 开始。
 - `hend = std::min(hend, input_height);` 这里需要与 `input_height` 比对取小值,同样是由于 `ceil` 的进位可能越界导致。
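
作为对上面两点结论的快速验证,这里再补充一段简短的 Python 示意代码(非实际实现:`fractional_start / fractional_end` 为按上文 C++ 演示程序公式的转写,尺寸与随机数沿用上文的 `32 / 25` 与 `u = 0.8`),检查 `1 < alpha < 2` 时窗口宽度只会是 `1` 或 `2`,且最后一个窗口的右边界恰好落在 `input_size` 上:

``` python
import math

def fractional_start(ph, alpha, u):
    # 对应上文 FractionalStartIndex:ceil 中减 1,使 index 从 0 开始
    return math.ceil(alpha * (ph + u) - 1)

def fractional_end(ph, alpha, u, input_size):
    # 对应上文 FractionalEndIndex,结果还需与 input_size 取小值
    return min(math.ceil(alpha * (ph + 1 + u) - 1), input_size)

input_size, output_size = 32, 25  # 与上文演示程序相同的尺寸
alpha = input_size / output_size  # 1.28
u = 0.8                           # 对应上文 u = 0.8 * max_u(max_u = 1)

widths = []
for ph in range(output_size):
    hstart = fractional_start(ph, alpha, u)
    hend = fractional_end(ph, alpha, u, input_size)
    assert 1 <= hend - hstart <= 2  # 1 < alpha < 2 时窗口宽度只能为 1 或 2
    widths.append(hend - hstart)

# 最后一个窗口的右边界恰好落在 input_size 上
assert fractional_end(output_size - 1, alpha, u, input_size) == input_size
print(widths)  # [1, 1, 1, 2, 1, 1, 1, 2, ...] 与上文输出中的 diff 一致
```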