Torchscript support (#159)

* support torchscript * add nms * add torchscript configs and update deploy process and dump-info * typescript -> torchscript * add torchscript custom extension support * add ts custom ops again * support mmseg unet * [WIP] add optimizer for torchscript (#119) * add passes * add python api * Torchscript optimizer python api (#121) * add passes * add python api * use python api instead of executable * Merge Master, update optimizer (#151) * [Feature] add yolox ncnn (#29) * add yolox ncnn * add ncnn android performance of yolox * add ut * fix lint * fix None bugs for ncnn * test codecov * test codecov * add device * fix yapf * remove if-else for img shape * use channelshuffle optimize * change benchmark after channelshuffle * fix yapf * fix yapf * fuse continuous reshape * fix static shape deploy * fix code * drop pad * only static shape * fix static * fix docstring * Added mask overlay to output image, changed fprintf info messages to … (#55) * Added mask overlay to output image, changed fprintf info messages to stdout * Improved box filtering (filter area/score), make sure roi coordinates stay within bounds * clang-format * Support UNet in mmseg (#77) * Repeatdataset in train has no CLASSES & PALETTE * update result for unet * update docstring for mmdet * remove ppl for unet in docs * fix ort wrap about input type (#81) * Fix memleak (#86) * delete [] * fix build error when enble MMDEPLOY_ACTIVE_LEVEL * fix lint * [Doc] Nano benchmark and tutorial (#71) * add cls benchmark * add nano zh-cn benchmark and en tutorial * add device row * add doc path to index.rst * fix typo * [Fix] fix missing deploy_core (#80) * fix missing deploy_core * mv flag to demo * target link * [Docs] Fix links in Chinese doc (#84) * Fix docs in Chinese link * Fix links * Delete symbolic link and add links to html * delete files * Fix link * [Feature] Add docker files (#67) * add gpu and cpu dockerfile * fix lint * fix cpu docker and remove redundant * use pip instead * add build arg and readme * fix grammar * update readme * add chinese doc for dockerfile and add docker build to build.md * grammar * refine dockerfiles * add FAQs * update Dpplcv_DIR for SDK building * remove mmcls * add sdk demos * fix typo and lint * update FAQs * [Fix]fix check_env (#101) * fix check_env * update * Replace convert_syncbatchnorm in mmseg (#93) * replace convert_syncbatchnorm with revert_sync_batchnorm from mmcv * change logger * [Doc] Update FAQ for TensorRT (#96) * update FAQ * comment * [Docs]: Update doc for openvino installation (#102) * fix docs * fix docs * fix docs * fix mmcv version * fix docs * rm blank line * simplify non batch nms (#99) * [Enhacement] Allow test.py to save evaluation results (#108) * Add log file * Delete debug code * Rename logger * resolve comments * [Enhancement] Support mmocr v0.4+ (#115) * support mmocr v0.4+ * 0.4.0 -> 0.4.1 * fix onnxruntime wrapper for gpu inference (#123) * fix ncnn wrapper for ort-gpu * resolve comment * fix lint * Fix typo (#132) * lock mmcls version (#131) * [Enhancement] upgrade isort in pre-commit config (#141) * [Enhancement] upgrade isort in pre-commit config by refering to mmflow pr #87 * fix lint * remove .isort.cfg and put its known_third_party to setup.cfg * Fix ci for mmocr (#144) * fix mmocr unittests * remove useless * lock mmdet maximum version to 2.20 * pip install -U numpy * Fix capture_output (#125) Co-authored-by: hanrui1sensetime <83800577+hanrui1sensetime@users.noreply.github.com> Co-authored-by: Johannes L <tehkillerbee@users.noreply.github.com> Co-authored-by: RunningLeon <mnsheng@yeah.net> Co-authored-by: VVsssssk <88368822+VVsssssk@users.noreply.github.com> Co-authored-by: lvhan028 <lvhan_028@163.com> Co-authored-by: AllentDan <41138331+AllentDan@users.noreply.github.com> Co-authored-by: Yifan Zhou <singlezombie@163.com> Co-authored-by: 杨培文 (Yang Peiwen) <915505626@qq.com> Co-authored-by: Semyon Bevzyuk <semen.bevzuk@gmail.com> * configs for all tasks * use torchvision roi align * remote unnecessary code * fix ut * fix ut * export * det dynamic * det dynamic * add ut * fix ut * add ut and docs * fix ut * skip torchscript ut if no ops available * add torchscript option to build.md * update benchmark and resolve comments * resolve conflicts * rename configs * fix mrcnn cuda test * remove useless * add version requirements to docs and comments to codes * enable empty image exporting for torchscript and accelerate ORT inference for MRCNN * rebase * update example for torchscript.md * update FAQs for torchscript.md * resolve comments * only use torchvision roi_align for torchscript * fix ut * use torchvision roi align when pool model is avg * resolve comments Co-authored-by: grimoire <streetyao@live.com> Co-authored-by: grimoire <yaoqian@sensetime.com> Co-authored-by: hanrui1sensetime <83800577+hanrui1sensetime@users.noreply.github.com> Co-authored-by: Johannes L <tehkillerbee@users.noreply.github.com> Co-authored-by: RunningLeon <mnsheng@yeah.net> Co-authored-by: VVsssssk <88368822+VVsssssk@users.noreply.github.com> Co-authored-by: lvhan028 <lvhan_028@163.com> Co-authored-by: Yifan Zhou <singlezombie@163.com> Co-authored-by: 杨培文 (Yang Peiwen) <915505626@qq.com> Co-authored-by: Semyon Bevzyuk <semen.bevzuk@gmail.com>
open-mmlab · Mar 7, 2022 · 63a998f · 63a998f
1 parent 4944564
commit 63a998f
Show file tree

Hide file tree

Showing 64 changed files with 1,581 additions and 242 deletions.
diff --git a/configs/_base_/backends/torchscript.py b/configs/_base_/backends/torchscript.py
@@ -0,0 +1 @@
+backend_config = dict(type='torchscript')
diff --git a/configs/_base_/torchscript_config.py b/configs/_base_/torchscript_config.py
@@ -0,0 +1,6 @@
+ir_config = dict(
+    type='torchscript',
+    save_file='end2end.pt',
+    input_names=['input'],
+    output_names=['output'],
+    input_shape=None)
diff --git a/configs/mmcls/classification_torchscript.py b/configs/mmcls/classification_torchscript.py
@@ -0,0 +1,6 @@
+_base_ = [
+    '../_base_/torchscript_config.py', '../_base_/backends/torchscript.py'
+]
+
+ir_config = dict(input_shape=None)
+codebase_config = dict(type='mmcls', task='Classification')
diff --git a/configs/mmdet/_base_/base_instance-seg_torchscript.py b/configs/mmdet/_base_/base_instance-seg_torchscript.py
@@ -0,0 +1,4 @@
+_base_ = ['./base_torchscript.py']
+
+ir_config = dict(output_names=['dets', 'labels', 'masks'])
+codebase_config = dict(post_processing=dict(export_postprocess_mask=False))
diff --git a/configs/mmdet/_base_/base_torchscript.py b/configs/mmdet/_base_/base_torchscript.py
@@ -0,0 +1,16 @@
+_base_ = ['../../_base_/torchscript_config.py']
+
+ir_config = dict(output_names=['dets', 'labels'])
+codebase_config = dict(
+    type='mmdet',
+    task='ObjectDetection',
+    model_type='end2end',
+    post_processing=dict(
+        score_threshold=0.05,
+        confidence_threshold=0.005,  # for YOLOv3
+        iou_threshold=0.5,
+        max_output_boxes_per_class=200,
+        pre_top_k=5000,
+        keep_top_k=100,
+        background_label_id=-1,
+    ))
diff --git a/configs/mmdet/detection/detection_torchscript.py b/configs/mmdet/detection/detection_torchscript.py
@@ -0,0 +1,3 @@
+_base_ = [
+    '../_base_/base_torchscript.py', '../../_base_/backends/torchscript.py'
+]
diff --git a/configs/mmdet/instance-seg/instance-seg_torchscript.py b/configs/mmdet/instance-seg/instance-seg_torchscript.py
@@ -0,0 +1,4 @@
+_base_ = [
+    '../_base_/base_instance-seg_torchscript.py',
+    '../../_base_/backends/torchscript.py'
+]
diff --git a/configs/mmedit/super-resolution/super-resolution_torchscript.py b/configs/mmedit/super-resolution/super-resolution_torchscript.py
@@ -0,0 +1,7 @@
+_base_ = [
+    '../../_base_/torchscript_config.py',
+    '../../_base_/backends/torchscript.py'
+]
+
+ir_config = dict(input_shape=None)
+codebase_config = dict(type='mmedit', task='SuperResolution')
diff --git a/configs/mmocr/text-detection/text-detection_torchscript.py b/configs/mmocr/text-detection/text-detection_torchscript.py
@@ -0,0 +1,7 @@
+_base_ = [
+    '../../_base_/torchscript_config.py',
+    '../../_base_/backends/torchscript.py'
+]
+
+ir_config = dict(input_shape=None)
+codebase_config = dict(type='mmocr', task='TextDetection')
diff --git a/configs/mmocr/text-recognition/text-recognition_torchscript.py b/configs/mmocr/text-recognition/text-recognition_torchscript.py
@@ -0,0 +1,7 @@
+_base_ = [
+    '../../_base_/torchscript_config.py',
+    '../../_base_/backends/torchscript.py'
+]
+
+ir_config = dict(input_shape=None)
+codebase_config = dict(type='mmocr', task='TextRecognition')
diff --git a/configs/mmseg/segmentation_torchscript.py b/configs/mmseg/segmentation_torchscript.py
@@ -0,0 +1,6 @@
+_base_ = [
+    '../_base_/torchscript_config.py', '../_base_/backends/torchscript.py'
+]
+
+ir_config = dict(input_shape=None)
+codebase_config = dict(type='mmseg', task='Segmentation')
diff --git a/csrc/backend_ops/CMakeLists.txt b/csrc/backend_ops/CMakeLists.txt
@@ -30,3 +30,9 @@ if ("ncnn" IN_LIST MMDEPLOY_TARGET_BACKENDS)
     message(STATUS "Build NCNN custom ops")
     add_subdirectory(ncnn)
 endif ()
+
+# build TorchScript ops
+if ("torchscript" IN_LIST MMDEPLOY_TARGET_BACKENDS)
+  message(STATUS "Build torchsciprt custom ops")
+  add_subdirectory(torchscript)
+endif ()
diff --git a/csrc/backend_ops/common/modulated_deform_conv/common_cuda_helper.cuh b/csrc/backend_ops/common/modulated_deform_conv/common_cuda_helper.cuh
@@ -0,0 +1,94 @@
+// Copyright (c) OpenMMLab. All rights reserved.
+#ifndef COMMON_CUDA_HELPER
+#define COMMON_CUDA_HELPER
+
+#include <cublas_v2.h>
+#include <cuda.h>
+
+#include <algorithm>
+
+#define CUDA_1D_KERNEL_LOOP(i, n) \
+  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (n); i += blockDim.x * gridDim.x)
+
+#define THREADS_PER_BLOCK 512
+
+#define DIVUP(m, n) ((m) / (n) + ((m) % (n) > 0))
+inline int GET_BLOCKS(const int N) {
+  int optimal_block_num = DIVUP(N, THREADS_PER_BLOCK);
+  int max_block_num = 4096;
+  return std::min(optimal_block_num, max_block_num);
+}
+
+#define cudaCheckError()                                                               \
+  {                                                                                    \
+    cudaError_t e = cudaGetLastError();                                                \
+    if (e != cudaSuccess) {                                                            \
+      printf("Cuda failure %s:%d: '%s'\n", __FILE__, __LINE__, cudaGetErrorString(e)); \
+      exit(0);                                                                         \
+    }                                                                                  \
+  }
+
+/**
+ * Returns a view of the original tensor with its dimensions permuted.
+ *
+ * @param[out] dst pointer to the destination tensor
+ * @param[in] src pointer to the source tensor
+ * @param[in] src_size shape of the src tensor
+ * @param[in] permute The desired ordering of dimensions
+ * @param[in] src_dim dim of src tensor
+ * @param[in] stream cuda stream handle
+ */
+template <class scalar_t>
+void memcpyPermute(scalar_t* dst, const scalar_t* src, int* src_size, int* permute, int src_dim,
+                   cudaStream_t stream = 0);
+
+template <typename scalar_t>
+cublasStatus_t cublasGemmWrap(cublasHandle_t handle, cublasOperation_t transa,
+                              cublasOperation_t transb, int m, int n, int k, const scalar_t* alpha,
+                              const scalar_t* A, int lda, const scalar_t* B, int ldb,
+                              const scalar_t* beta, scalar_t* C, int ldc);
+
+template <typename scalar_t>
+__device__ scalar_t bilinear_interpolate(const scalar_t* input, const int height, const int width,
+                                         scalar_t y, scalar_t x) {
+  // deal with cases that inverse elements are out of feature map boundary
+  if (y < -1.0 || y > height || x < -1.0 || x > width) return 0;
+
+  if (y <= 0) y = 0;
+  if (x <= 0) x = 0;
+
+  int y_low = (int)y;
+  int x_low = (int)x;
+  int y_high;
+  int x_high;
+
+  if (y_low >= height - 1) {
+    y_high = y_low = height - 1;
+    y = (scalar_t)y_low;
+  } else {
+    y_high = y_low + 1;
+  }
+
+  if (x_low >= width - 1) {
+    x_high = x_low = width - 1;
+    x = (scalar_t)x_low;
+  } else {
+    x_high = x_low + 1;
+  }
+
+  scalar_t ly = y - y_low;
+  scalar_t lx = x - x_low;
+  scalar_t hy = 1. - ly, hx = 1. - lx;
+  // do bilinear interpolation
+  scalar_t v1 = input[y_low * width + x_low];
+  scalar_t v2 = input[y_low * width + x_high];
+  scalar_t v3 = input[y_high * width + x_low];
+  scalar_t v4 = input[y_high * width + x_high];
+  scalar_t w1 = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;
+
+  scalar_t val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);
+
+  return val;
+}
+
+#endif  // COMMON_CUDA_HELPER
diff --git a/csrc/backend_ops/common/modulated_deform_conv/modulated_deform_conv_cpu.h b/csrc/backend_ops/common/modulated_deform_conv/modulated_deform_conv_cpu.h
@@ -0,0 +1,82 @@
+#include <cmath>
+#include <cstdint>
+
+template <typename T>
+T bilinear_interpolate_2d(const T *src, const int64_t src_h, const int64_t src_w, const T h,
+                          const T w) {
+  if (h <= -1 || src_h <= h || w <= -1 || src_w <= w) {
+    return 0;
+  }
+
+  int64_t h_low = floor(h);
+  int64_t w_low = floor(w);
+  int64_t h_high = h_low + 1;
+  int64_t w_high = w_low + 1;
+
+  T lh = h - h_low;
+  T lw = w - w_low;
+  T hh = 1 - lh;
+  T hw = 1 - lw;
+
+  T v1 = 0;
+  if (h_low >= 0 && w_low >= 0) v1 = src[h_low * src_w + w_low];
+  T v2 = 0;
+  if (h_low >= 0 && w_high <= src_w - 1) v2 = src[h_low * src_w + w_high];
+  T v3 = 0;
+  if (h_high <= src_h - 1 && w_low >= 0) v3 = src[h_high * src_w + w_low];
+  T v4 = 0;
+  if (h_high <= src_h - 1 && w_high <= src_w - 1) v4 = src[h_high * src_w + w_high];
+
+  T w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;
+
+  T val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);
+  return val;
+}
+
+// output: (channels * kernel_h * kernel_w, dst_h * dst_w)
+template <typename T>
+void deformable_im2col_2d(const T *input, const T *offset, const T *mask, const int64_t src_h,
+                          const int64_t src_w, const int64_t kernel_h, const int64_t kernel_w,
+                          const int64_t pad_h, const int64_t pad_w, const int64_t stride_h,
+                          const int64_t stride_w, const int64_t dilation_h,
+                          const int64_t dilation_w, const int64_t channels,
+                          const int64_t offset_groups, const int64_t dst_h, const int64_t dst_w,
+                          const bool use_mask, T *columns) {
+  const int64_t workload = channels * dst_h * dst_w;
+  for (int64_t index = 0; index != workload; ++index) {
+    const int64_t ow = index % dst_w;
+    const int64_t oh = (index / dst_w) % dst_h;
+    const int64_t ic = index / (dst_w * dst_h);
+    const int64_t oc = ic * kernel_h * kernel_w;
+
+    int64_t c_per_offset_grp = channels / offset_groups;
+    const int64_t grp_idx = ic / c_per_offset_grp;
+
+    auto columns_ptr = columns + (oc * (dst_h * dst_w) + oh * dst_w + ow);
+    auto input_ptr = input + ic * (src_h * src_w);
+    auto offset_ptr = offset + grp_idx * 2 * kernel_h * kernel_w * dst_h * dst_w;
+    auto mask_ptr = mask;
+    if (use_mask) {
+      mask_ptr += grp_idx * kernel_h * kernel_w * dst_h * dst_w;
+    }
+
+    for (int64_t kh = 0; kh < kernel_h; ++kh) {
+      for (int64_t kw = 0; kw < kernel_w; ++kw) {
+        const int64_t mask_idx = kh * kernel_w + kw;
+        const int64_t offset_idx = 2 * mask_idx;
+
+        T mask_value = 1;
+        if (use_mask) {
+          mask_value = mask_ptr[mask_idx * (dst_h * dst_w) + oh * dst_w + ow];
+        }
+
+        const T offset_h = offset_ptr[offset_idx * (dst_h * dst_w) + oh * dst_w + ow];
+        const T offset_w = offset_ptr[(offset_idx + 1) * (dst_h * dst_w) + oh * dst_w + ow];
+        const T ih = (oh * stride_h - pad_h) + kh * dilation_h + offset_h;
+        const T iw = (ow * stride_w - pad_w) + kw * dilation_w + offset_w;
+        *columns_ptr = mask_value * bilinear_interpolate_2d<T>(input_ptr, src_h, src_w, ih, iw);
+        columns_ptr += dst_h * dst_w;
+      }
+    }
+  }
+}
diff --git a/...conv/trt_modulated_deform_conv_kernel.hpp → ...eform_conv/modulated_deform_conv_cuda.cuh b/...conv/trt_modulated_deform_conv_kernel.hpp → ...eform_conv/modulated_deform_conv_cuda.cuh
@@ -68,7 +68,7 @@
 
 #include <float.h>
 
-#include "common_cuda_helper.hpp"
+#include "common_cuda_helper.cuh"
 
 template <typename T>
 __device__ T dmcn_im2col_bilinear(const T *input, const int data_width, const int height,

diff --git a/csrc/backend_ops/onnxruntime/CMakeLists.txt b/csrc/backend_ops/onnxruntime/CMakeLists.txt
@@ -14,6 +14,7 @@ mmdeploy_export(${PROJECT_NAME}_obj)
 target_include_directories(${PROJECT_NAME}_obj PUBLIC
         $<BUILD_INTERFACE:${ONNXRUNTIME_DIR}/include>
         $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/common>
+        $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/../common>
         $<BUILD_INTERFACE:${CMAKE_SOURCE_DIR}/csrc>)
 target_link_directories(${PROJECT_NAME}_obj PUBLIC
         ${ONNXRUNTIME_DIR}/lib)

diff --git a/csrc/backend_ops/onnxruntime/modulated_deform_conv/modulated_deform_conv.cpp b/csrc/backend_ops/onnxruntime/modulated_deform_conv/modulated_deform_conv.cpp
@@ -4,88 +4,11 @@
 #include <cmath>
 #include <vector>
 
+#include "modulated_deform_conv/modulated_deform_conv_cpu.h"
 #include "ort_utils.h"
 
 namespace mmdeploy {
 
-float bilinear_interpolate_2d(const float *src, const int64_t src_h, const int64_t src_w,
-                              const float h, const float w) {
-  if (h <= -1 || src_h <= h || w <= -1 || src_w <= w) {
-    return 0;
-  }
-
-  int64_t h_low = floor(h);
-  int64_t w_low = floor(w);
-  int64_t h_high = h_low + 1;
-  int64_t w_high = w_low + 1;
-
-  float lh = h - h_low;
-  float lw = w - w_low;
-  float hh = 1 - lh;
-  float hw = 1 - lw;
-
-  float v1 = 0;
-  if (h_low >= 0 && w_low >= 0) v1 = src[h_low * src_w + w_low];
-  float v2 = 0;
-  if (h_low >= 0 && w_high <= src_w - 1) v2 = src[h_low * src_w + w_high];
-  float v3 = 0;
-  if (h_high <= src_h - 1 && w_low >= 0) v3 = src[h_high * src_w + w_low];
-  float v4 = 0;
-  if (h_high <= src_h - 1 && w_high <= src_w - 1) v4 = src[h_high * src_w + w_high];
-
-  float w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;
-
-  float val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);
-  return val;
-}
-
-// output: (channels * kernel_h * kernel_w, dst_h * dst_w)
-void deformable_im2col_2d(const float *input, const float *offset, const float *mask,
-                          const int64_t src_h, const int64_t src_w, const int64_t kernel_h,
-                          const int64_t kernel_w, const int64_t pad_h, const int64_t pad_w,
-                          const int64_t stride_h, const int64_t stride_w, const int64_t dilation_h,
-                          const int64_t dilation_w, const int64_t channels,
-                          const int64_t offset_groups, const int64_t dst_h, const int64_t dst_w,
-                          const bool use_mask, float *columns) {
-  const int64_t workload = channels * dst_h * dst_w;
-  for (int64_t index = 0; index != workload; ++index) {
-    const int64_t ow = index % dst_w;
-    const int64_t oh = (index / dst_w) % dst_h;
-    const int64_t ic = index / (dst_w * dst_h);
-    const int64_t oc = ic * kernel_h * kernel_w;
-
-    int64_t c_per_offset_grp = channels / offset_groups;
-    const int64_t grp_idx = ic / c_per_offset_grp;
-
-    auto columns_ptr = columns + (oc * (dst_h * dst_w) + oh * dst_w + ow);
-    auto input_ptr = input + ic * (src_h * src_w);
-    auto offset_ptr = offset + grp_idx * 2 * kernel_h * kernel_w * dst_h * dst_w;
-    auto mask_ptr = mask;
-    if (use_mask) {
-      mask_ptr += grp_idx * kernel_h * kernel_w * dst_h * dst_w;
-    }
-
-    for (int64_t kh = 0; kh < kernel_h; ++kh) {
-      for (int64_t kw = 0; kw < kernel_w; ++kw) {
-        const int64_t mask_idx = kh * kernel_w + kw;
-        const int64_t offset_idx = 2 * mask_idx;
-
-        float mask_value = 1;
-        if (use_mask) {
-          mask_value = mask_ptr[mask_idx * (dst_h * dst_w) + oh * dst_w + ow];
-        }
-
-        const float offset_h = offset_ptr[offset_idx * (dst_h * dst_w) + oh * dst_w + ow];
-        const float offset_w = offset_ptr[(offset_idx + 1) * (dst_h * dst_w) + oh * dst_w + ow];
-        const float ih = (oh * stride_h - pad_h) + kh * dilation_h + offset_h;
-        const float iw = (ow * stride_w - pad_w) + kw * dilation_w + offset_w;
-        *columns_ptr = mask_value * bilinear_interpolate_2d(input_ptr, src_h, src_w, ih, iw);
-        columns_ptr += dst_h * dst_w;
-      }
-    }
-  }
-}
-
 void gemm_ref_fp32(const float *A, const float *B, const float *V, const float *H,
                    const int32_t trans_A, const int32_t trans_B, const int32_t M, const int32_t N,
                    const int32_t K, const float alpha, const float beta, float *Y) {
@@ -162,12 +85,12 @@ void deformable_conv2d_ref_fp32(const float *src, const float *offset, const flo
 
   for (int64_t b = 0; b < batch; ++b) {
     for (int64_t g = 0; g < group; ++g) {
-      deformable_im2col_2d(src + b * src_c * src_h * src_w + g * ic_per_gp * src_h * src_w,
-                           offset + b * offset_group * 2 * kernel_h * kernel_w * dst_h * dst_w,
-                           mask + b * offset_group * kernel_h * kernel_w * dst_h * dst_w, src_h,
-                           src_w, kernel_h, kernel_w, pad_h, pad_w, stride_h, stride_w, dilation_h,
-                           dilation_w, ic_per_gp, offset_group, dst_h, dst_w, mask != nullptr,
-                           columns);
+      deformable_im2col_2d<float>(
+          src + b * src_c * src_h * src_w + g * ic_per_gp * src_h * src_w,
+          offset + b * offset_group * 2 * kernel_h * kernel_w * dst_h * dst_w,
+          mask + b * offset_group * kernel_h * kernel_w * dst_h * dst_w, src_h, src_w, kernel_h,
+          kernel_w, pad_h, pad_w, stride_h, stride_w, dilation_h, dilation_w, ic_per_gp,
+          offset_group, dst_h, dst_w, mask != nullptr, columns);
       float *dst_ptr = dst + b * dst_c * dst_h * dst_w + g * oc_per_gp * dst_h * dst_w;
       if (bias != nullptr) {
         const float *bias_ptr = bias + g * oc_per_gp;