
Support custom operators cummax and cummin for onnxruntime #1010

Merged: 17 commits merged on May 10, 2021
76 changes: 76 additions & 0 deletions docs/onnxruntime_custom_ops.md
@@ -33,6 +33,18 @@
- [Inputs](#inputs-4)
- [Outputs](#outputs-4)
- [Type Constraints](#type-constraints-4)
- [cummax](#cummax)
- [Description](#description-5)
- [Parameters](#parameters-5)
- [Inputs](#inputs-5)
- [Outputs](#outputs-5)
- [Type Constraints](#type-constraints-5)
- [cummin](#cummin)
- [Description](#description-6)
- [Parameters](#parameters-6)
- [Inputs](#inputs-6)
- [Outputs](#outputs-6)
- [Type Constraints](#type-constraints-6)

<!-- TOC -->

@@ -207,3 +219,67 @@ Perform CornerPool on `input` features. Read [CornerNet -- Detecting Objects as
### Type Constraints

- T:tensor(float32)

## cummax

### Description

Returns a tuple (`values`, `indices`), where `values` contains the cumulative maximum elements of `input` along the dimension `dim`, and `indices` contains the index location of each maximum value found along that dimension. Read [torch.cummax](https://pytorch.org/docs/stable/generated/torch.cummax.html) for more details.
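
The semantics can be sketched in plain Python for the 1-D case. This is an illustration only, not the operator's implementation; on ties the running index advances, mirroring the `std::greater_equal` comparison used by the ONNX Runtime kernel added in this PR:

```python
def cummax_1d(values):
    """Return (cumulative maxima, index of each running maximum)."""
    out, indices = [], []
    best, best_idx = values[0], 0
    for i, v in enumerate(values):
        if v >= best:  # `>=`: a tie moves the index forward
            best, best_idx = v, i
        out.append(best)
        indices.append(best_idx)
    return out, indices

print(cummax_1d([1, 3, 2, 5, 4]))  # ([1, 3, 3, 5, 5], [0, 1, 1, 3, 3])
```

`cummin` is symmetric, with `<=` in place of `>=`.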

### Parameters

| Type | Parameter | Description |
| ------- | --------------- | ---------------------------------------------------------------- |
| `int`   | `dim`           | the dimension along which to apply the operation                 |

### Inputs

<dl>
<dt><tt>input</tt>: T</dt>
<dd>The input tensor, which can have arbitrary shape. Tensors with zero elements are also supported.</dd>
</dl>

### Outputs

<dl>
<dt><tt>output</tt>: T</dt>
<dd>The cumulative maximum elements of `input` along the dimension `dim`, with the same shape and dtype as `input`.</dd>
<dt><tt>indices</tt>: tensor(int64)</dt>
<dd>The index location of each cumulative maximum value found along the dimension `dim`, with the same shape as `input`.</dd>
</dl>

### Type Constraints

- T:tensor(float32)

## cummin

### Description

Returns a tuple (`values`, `indices`), where `values` contains the cumulative minimum elements of `input` along the dimension `dim`, and `indices` contains the index location of each minimum value found along that dimension. Read [torch.cummin](https://pytorch.org/docs/stable/generated/torch.cummin.html) for more details.

### Parameters

| Type | Parameter | Description |
| ------- | --------------- | ---------------------------------------------------------------- |
| `int`   | `dim`           | the dimension along which to apply the operation                 |

### Inputs

<dl>
<dt><tt>input</tt>: T</dt>
<dd>The input tensor, which can have arbitrary shape. Tensors with zero elements are also supported.</dd>
</dl>

### Outputs

<dl>
<dt><tt>output</tt>: T</dt>
<dd>The cumulative minimum elements of `input` along the dimension `dim`, with the same shape and dtype as `input`.</dd>
<dt><tt>indices</tt>: tensor(int64)</dt>
<dd>The index location of each cumulative minimum value found along the dimension `dim`, with the same shape as `input`.</dd>
</dl>

### Type Constraints

- T:tensor(float32)
8 changes: 6 additions & 2 deletions docs/onnxruntime_op.md
@@ -21,7 +21,9 @@
| [RoIAlign](onnxruntime_custom_ops.md#roialign) | Y | N | 1.2.5 |
| [NMS](onnxruntime_custom_ops.md#nms) | Y | N | 1.2.7 |
| [grid_sampler](onnxruntime_custom_ops.md#grid_sampler) | Y | N | master |
| [CornerPool](onnxruntime_custom_ops.md#cornerpool) | Y | N | master |
| [cummax](onnxruntime_custom_ops.md#cummax) | Y | N | master |
| [cummin](onnxruntime_custom_ops.md#cummin) | Y | N | master |

## How to build custom operators for ONNX Runtime

@@ -115,7 +117,9 @@ Take custom operator `soft_nms` for example.

## Known Issues

- "RuntimeError: tuple appears in op that does not forward tuples, unsupported kind: `prim::PythonOp`."
  1. In general, `cummax` and `cummin` are exportable to ONNX as long as torch >= 1.5.0, since `torch.cummax` is only supported by torch >= 1.5.0. However, when `cummax` or `cummin` serves as an intermediate component whose outputs are used as inputs to other modules, torch >= 1.7.0 is required; otherwise the above error may arise when running the exported ONNX model with onnxruntime.
  2. Solution: upgrade torch to 1.7.0 or higher.

## References

12 changes: 12 additions & 0 deletions mmcv/onnx/symbolic.py
@@ -396,6 +396,16 @@ def grid_sampler(g,
align_corners_i=align_corners)


@parse_args('v', 'i')
def cummax(g, input, dim):
return g.op('mmcv::cummax', input, dim_i=dim, outputs=2)


@parse_args('v', 'i')
def cummin(g, input, dim):
return g.op('mmcv::cummin', input, dim_i=dim, outputs=2)


def register_extra_symbolics(opset=11):
register_op('one_hot', one_hot, '', opset)
register_op('im2col', im2col, '', opset)
@@ -421,3 +431,5 @@ def register_extra_symbolics(opset=11):
register_op('upsample_bicubic2d', upsample_bicubic2d, '', opset)
register_op('new_full', new_full, '', opset)
register_op('grid_sampler', grid_sampler, '', opset)
register_op('cummax', cummax, '', opset)
register_op('cummin', cummin, '', opset)
9 changes: 9 additions & 0 deletions mmcv/ops/corner_pool.py
@@ -140,6 +140,15 @@ def __init__(self, mode):

def forward(self, x):
if torch.__version__ != 'parrots' and torch.__version__ >= '1.5.0':
if torch.onnx.is_in_onnx_export():
assert torch.__version__ >= '1.7.0', \
'When `cummax` serves as an intermediate component whose '\
'outputs are used as inputs to other modules, the pytorch '\
'version must be >= 1.7.0. Otherwise an error like '\
'`RuntimeError: tuple appears in op that does not forward '\
'tuples, unsupported kind: prim::PythonOp` will appear.'

dim, flip = self.cummax_dim_flip[self.mode]
if flip:
x = x.flip(dim)
11 changes: 11 additions & 0 deletions mmcv/ops/csrc/onnxruntime/cpu/onnxruntime_register.cpp
@@ -4,6 +4,7 @@
#include "grid_sample.h"
#include "nms.h"
#include "ort_mmcv_utils.h"
#include "reduce_ops.h"
#include "roi_align.h"
#include "roi_align_rotated.h"
#include "soft_nms.h"
@@ -14,6 +15,8 @@ NmsOp c_NmsOp;
MMCVRoiAlignCustomOp c_MMCVRoiAlignCustomOp;
MMCVRoIAlignRotatedCustomOp c_MMCVRoIAlignRotatedCustomOp;
GridSampleOp c_GridSampleOp;
MMCVCumMaxCustomOp c_MMCVCumMaxCustomOp;
MMCVCumMinCustomOp c_MMCVCumMinCustomOp;
MMCVCornerPoolCustomOp c_MMCVCornerPoolCustomOp;

OrtStatus *ORT_API_CALL RegisterCustomOps(OrtSessionOptions *options,
@@ -52,5 +55,13 @@ OrtStatus *ORT_API_CALL RegisterCustomOps(OrtSessionOptions *options,
return status;
}

if (auto status = ortApi->CustomOpDomain_Add(domain, &c_MMCVCumMaxCustomOp)) {
return status;
}

if (auto status = ortApi->CustomOpDomain_Add(domain, &c_MMCVCumMinCustomOp)) {
return status;
}

return ortApi->AddCustomOpDomain(options, domain);
}
187 changes: 187 additions & 0 deletions mmcv/ops/csrc/onnxruntime/cpu/reduce_ops.cpp
@@ -0,0 +1,187 @@
#include "reduce_ops.h"

#include <assert.h>

#include <vector>

#include "../ort_mmcv_utils.h"

// modified from
// https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/ReduceOps.cpp

static inline int64_t maybe_wrap_dim(int64_t dim, int64_t ndims) {
int64_t min = -ndims;
int64_t max = ndims - 1;
assert(dim >= min && dim <= max);
if (dim < 0) dim += ndims;
return dim;
}
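
A plain-Python mirror of `maybe_wrap_dim` (an illustration, not part of the extension) shows the wrapping rule for negative dimension indices:

```python
def maybe_wrap_dim(dim, ndims):
    """Wrap a possibly-negative dimension index into the range [0, ndims)."""
    assert -ndims <= dim <= ndims - 1, 'dim out of range'
    return dim + ndims if dim < 0 else dim

print(maybe_wrap_dim(-1, 3))  # 2: the last of three dimensions
```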

static inline int64_t get_dim_stride(const int64_t dim, const int64_t ndims,
const int64_t *reversed_dim_cumprod) {
return dim == ndims - 1 ? 1 : reversed_dim_cumprod[dim + 1];
}

static inline int64_t get_dim_size(const int64_t dim, const int64_t ndims,
const int64_t *reversed_dim_cumprod) {
return dim == ndims - 1
? reversed_dim_cumprod[dim]
: reversed_dim_cumprod[dim] / reversed_dim_cumprod[dim + 1];
}

template <typename T1, typename T2, typename Operation>
void cummax_cummin_helper(const T1 *input, T1 *output, T2 *indices,
const int64_t input_dim_size, const int64_t stride) {
Operation op;
T1 out = input[0];
int64_t idx = 0;
for (int64_t i = 0; i < input_dim_size; i++) {
T1 curr_elem = input[i * stride];
if (op(curr_elem, out)) {
out = curr_elem;
idx = i;
}
output[i * stride] = out;
indices[i * stride] = idx;
}
}
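
The strided scan above can be sketched in Python; `op` stands in for the `Operation` template parameter (`>=` for cummax, `<=` for cummin), and `stride` selects one 1-D slice of a flat buffer. This is a sketch only, not the shipped kernel:

```python
import operator

def scan_slice(buf, out, indices, dim_size, stride, op, offset=0):
    """Write the running extremum and its index over one strided slice."""
    best, best_idx = buf[offset], 0
    for i in range(dim_size):
        cur = buf[offset + i * stride]
        if op(cur, best):  # ties advance the index, as with >=/<=
            best, best_idx = cur, i
        out[offset + i * stride] = best
        indices[offset + i * stride] = best_idx

# cummax over a flat 1-D buffer (stride 1):
buf = [1.0, 3.0, 2.0]
out, idx = [0.0] * 3, [0] * 3
scan_slice(buf, out, idx, dim_size=3, stride=1, op=operator.ge)
print(out, idx)  # [1.0, 3.0, 3.0] [0, 1, 1]
```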

// modified `tensor_dim_apply3` from
// https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorDimApply.h.
// the differences are: (1) `reversed_dim_cumprod` is used for fast computation
// of the `size` and `stride` of a given tensor dimension; (2) the same
// `stride` is shared by input, output, and indices, since separate values are
// unnecessary. currently `tensor_dim_apply3` is only used for `cummax` and
// `cummin`, matching the upstream pytorch project:
// https://github.com/pytorch/pytorch.
template <typename T1, typename T2, typename Function>
void tensor_dim_apply3(const T1 *input, T1 *output, T2 *indices,
const int64_t dim, const int64_t ndims,
const int64_t *reversed_dim_cumprod, Function func) {
int dim_apply_finished = 0;
int64_t input_dim_size = get_dim_size(dim, ndims, reversed_dim_cumprod);
// the same stride is used for input, output and indices
int64_t stride = get_dim_stride(dim, ndims, reversed_dim_cumprod);
std::vector<int64_t> counter(ndims, 0);

while (!dim_apply_finished) {
// call `func` once to update output and indices
func(input, output, indices, input_dim_size, stride);
if (ndims == 1) break;
for (int64_t dim_i = 0; dim_i < ndims; dim_i++) {
if (dim_i == dim) {
if (dim_i == (ndims - 1)) {
dim_apply_finished = 1;
break;
}
continue;
}
counter[dim_i]++;

// the same stride is used for input, output, and indices
int64_t stride_dim_i = get_dim_stride(dim_i, ndims, reversed_dim_cumprod);
input += stride_dim_i;
output += stride_dim_i;
indices += stride_dim_i;

if (counter[dim_i] == get_dim_size(dim_i, ndims, reversed_dim_cumprod)) {
if (dim_i == ndims - 1) {
dim_apply_finished = 1;
break;
} else {
input -= counter[dim_i] * stride_dim_i;
output -= counter[dim_i] * stride_dim_i;
indices -= counter[dim_i] * stride_dim_i;
counter[dim_i] = 0;
}
} else {
break;
} // if
} // for
} // while
}
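
The slice enumeration that `tensor_dim_apply3` performs with manual counters and pointer arithmetic can be sketched more simply in Python with `itertools.product`; `slice_offsets` is a hypothetical helper name, not part of this PR:

```python
import itertools

def slice_offsets(shape, dim):
    """Yield the base offset of every 1-D slice along `dim` of a
    contiguous (row-major) tensor flattened into a 1-D buffer."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    # iterate over every index combination of the remaining dimensions
    other_ranges = [range(s) for i, s in enumerate(shape) if i != dim]
    other_strides = [s for i, s in enumerate(strides) if i != dim]
    for idx in itertools.product(*other_ranges):
        yield sum(i * s for i, s in zip(idx, other_strides))

print(list(slice_offsets((2, 3), dim=1)))  # [0, 3]
```

Each yielded offset would be passed, together with the dimension's stride, to a per-slice scan such as `cummax_cummin_helper`.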

template <typename T1, typename T2, typename Operation>
void CumMax_CumMin_CPU(const T1 *input, T1 *output, T2 *indices,
int64_t *reversed_dim_cumprod, const int64_t dim,
const OrtTensorDimensions &out_dimensions) {
// calculate numel
const int64_t ndims = out_dimensions.size();
int64_t numel = 1;
for (int64_t dim_i = 0; dim_i < ndims; dim_i++) {
numel *= out_dimensions.data()[dim_i];
}

// cummax/cummin are only applied to non-empty inputs with at least one dim
if (numel) {
// compute the cumulative product of the dimension sizes, which is then
// used to derive the stride or size of a specific `dim`.
reversed_dim_cumprod[ndims - 1] = out_dimensions.data()[ndims - 1];
for (int64_t dim_i = ndims - 2; dim_i >= 0; dim_i--) {
reversed_dim_cumprod[dim_i] =
reversed_dim_cumprod[dim_i + 1] * out_dimensions.data()[dim_i];
}

// do cummax or cummin based on `Operation` type
tensor_dim_apply3<T1, T2>(
input, output, indices, dim, ndims, reversed_dim_cumprod,
cummax_cummin_helper<T1, T2, Operation>);
}
}
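
The `reversed_dim_cumprod` trick can be checked with a small Python sketch (hypothetical helper names): for a contiguous tensor of shape `dims`, entry `r[i]` is the product `dims[i] * ... * dims[-1]`, from which both the stride and the size of any dimension follow:

```python
def reversed_cumprod(dims):
    """r[i] = dims[i] * dims[i+1] * ... * dims[-1]."""
    r = list(dims)
    for i in range(len(dims) - 2, -1, -1):
        r[i] = r[i + 1] * dims[i]
    return r

def dim_stride(dim, r):
    # last dimension is contiguous; otherwise the stride is the
    # product of all trailing dimension sizes
    return 1 if dim == len(r) - 1 else r[dim + 1]

def dim_size(dim, r):
    return r[dim] if dim == len(r) - 1 else r[dim] // r[dim + 1]

r = reversed_cumprod([2, 3, 4])      # [24, 12, 4]
print(dim_stride(1, r), dim_size(1, r))  # 4 3
```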

void MMCVCumMaxKernel::Compute(OrtKernelContext *context) {
// get input
const OrtValue *input = ort_.KernelContext_GetInput(context, 0);
const float *input_data =
reinterpret_cast<const float *>(ort_.GetTensorData<float>(input));

// get output
OrtTensorDimensions out_dimensions(ort_, input);
OrtValue *output = ort_.KernelContext_GetOutput(
context, 0, out_dimensions.data(), out_dimensions.size());
float *output_data = ort_.GetTensorMutableData<float>(output);
OrtValue *indices = ort_.KernelContext_GetOutput(
context, 1, out_dimensions.data(), out_dimensions.size());
int64_t *indices_data = ort_.GetTensorMutableData<int64_t>(indices);

// allocate tmp memory for computing the cumulative product of the
// dimension sizes
const int64_t ndims = out_dimensions.size();
assert(ndims > 0);
int64_t *reversed_dim_cumprod =
(int64_t *)allocator_.Alloc(sizeof(int64_t) * ndims);

// dim should be wrapped if it's negative (e.g. -1)
const int64_t dim = maybe_wrap_dim(dim_, ndims);
CumMax_CumMin_CPU<float, int64_t, std::greater_equal<float>>(
input_data, output_data, indices_data, reversed_dim_cumprod, dim,
out_dimensions);
}

void MMCVCumMinKernel::Compute(OrtKernelContext *context) {
// get input
const OrtValue *input = ort_.KernelContext_GetInput(context, 0);
const float *input_data =
reinterpret_cast<const float *>(ort_.GetTensorData<float>(input));

// get output
OrtTensorDimensions out_dimensions(ort_, input);
OrtValue *output = ort_.KernelContext_GetOutput(
context, 0, out_dimensions.data(), out_dimensions.size());
float *output_data = ort_.GetTensorMutableData<float>(output);
OrtValue *indices = ort_.KernelContext_GetOutput(
context, 1, out_dimensions.data(), out_dimensions.size());
int64_t *indices_data = ort_.GetTensorMutableData<int64_t>(indices);

// allocate tmp memory for computing the cumulative product of the
// dimension sizes
const int64_t ndims = out_dimensions.size();
assert(ndims > 0);
int64_t *reversed_dim_cumprod =
(int64_t *)allocator_.Alloc(sizeof(int64_t) * ndims);

// dim should be wrapped if it's negative (e.g. -1)
const int64_t dim = maybe_wrap_dim(dim_, ndims);
CumMax_CumMin_CPU<float, int64_t, std::less_equal<float>>(
input_data, output_data, indices_data, reversed_dim_cumprod, dim,
out_dimensions);
}