
add QLinearMatMul #59

Merged 2 commits on Nov 29, 2018
29 changes: 27 additions & 2 deletions onnxruntime/contrib_ops/contrib_ops.cc
@@ -12,8 +12,8 @@
namespace onnxruntime {
namespace contrib {
using ::ONNX_NAMESPACE::AttributeProto;
-using ::ONNX_NAMESPACE::OpSchema;
using ::ONNX_NAMESPACE::OPTIONAL;
+using ::ONNX_NAMESPACE::OpSchema;

void RegisterContribSchemas() {
ONNX_CONTRIB_OPERATOR_SCHEMA(SampleOp)
@@ -135,6 +135,31 @@ The linear de-quantization operator. It consumes a quantized data, a scale, a zero point ...
The dequantization formula is y = (x - x_zero_point) * x_scale.
Scale and zero point must have same shape. They must be either scalar (per tensor) or 1-D tensor (per 'axis').)DOC");
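
To make the dequantization formula above concrete, here is a minimal sketch for a single element (illustrative only; DequantizeScalar is a hypothetical helper, not an onnxruntime API, and uint8 is just one of the allowed types):

#include <cstdint>

// Sketch of y = (x - x_zero_point) * x_scale for one uint8 element.
// Hypothetical helper, not part of onnxruntime.
float DequantizeScalar(uint8_t x, float x_scale, uint8_t x_zero_point) {
  return static_cast<float>(static_cast<int32_t>(x) - static_cast<int32_t>(x_zero_point)) * x_scale;
}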

ONNX_CONTRIB_OPERATOR_SCHEMA(QLinearMatMul)
Contributor (commenting on QLinearMatMul):
Do we need a GEMM version to support transpose as well?

Contributor Author:
Hmmm, I'm not sure. Should we just do the transpose in the graph?

Contributor:
It depends on what's offered by the BLAS libraries. If a BLAS library provides GEMM with transpose flags, then we should not block users from leveraging it at the op level.

.SetDomain(kMSDomain)
.SinceVersion(1)
.SetDoc(R"DOC(
Matrix product that behaves like numpy.matmul: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.matmul.html.
It consumes two quantized input tensors, their scales and zero points, and the output's scale and zero point, and computes
the quantized output. The quantization formula is x_quantized = (x_fp32 / x_scale) + x_zero_point. For (x_fp32 / x_scale),
it computes the nearest integer value to the argument (in floating-point format), rounding halfway cases away from zero.
Scale and zero point must have the same shape. They must be either a scalar (per tensor) or a 1-D tensor (per row for a and per column for b).
If scale and zero point are 1-D tensors, the number of elements of the scale and zero point tensors of input 'a' and output 'y'
should be equal to the number of rows of input 'a', and the number of elements of the scale and zero point tensors of input 'b'
should be equal to the number of columns of input 'b'.)DOC")
.Input(0, "a", "N-dimensional quantized matrix a", "T1")
.Input(1, "a_scale", "scale of quantized input a", "tensor(float)")
.Input(2, "a_zero_point", "zero point of quantized input a", "T1")
.Input(3, "b", "N-dimensional quantized matrix b", "T2")
.Input(4, "b_scale", "scale of quantized input b", "tensor(float)")
.Input(5, "b_zero_point", "zero point of quantized input b", "T2")
.Input(6, "y_scale", "scale of quantized output y", "tensor(float)")
.Input(7, "y_zero_point", "zero point of quantized output y", "T3")
.Output(0, "y", "Quantized matrix multiply results from a * b", "T3")
.TypeConstraint("T1", {"tensor(int8)", "tensor(uint8)"}, "Constrain input a and its zero point data types as 8-bit integer tensor")
Contributor (commenting on the int8 type constraint):

If this op is only for int8, should it be named QLinearMatMul8? Or will it be expanded to int16/int32 in the future?

Contributor Author:
I would expand it to 16/32 later if needed, rather than naming it for a specific width.

.TypeConstraint("T2", {"tensor(int8)", "tensor(uint8)"}, "Constrain input b and its zero point data types as 8-bit integer tensor")
.TypeConstraint("T3", {"tensor(int8)", "tensor(uint8)"}, "Constrain output y and its zero point data types as 8-bit integer tensor.");

const char* auto_pad_doc =
"auto_pad must be either NOTSET, SAME_UPPER, SAME_LOWER or VALID. Where "
"default value is NOTSET, which means explicit padding is used. "
@@ -323,7 +348,7 @@ The integer convolution operator consumes an input tensor, a filter, and a padding ...
Matrix product that behaves like numpy.matmul: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.matmul.html.
The product MUST never overflow. The accumulation may overflow if and only if it is done in 32 bits.)DOC")
.Input(0, "A", "N-dimensional matrix A", "T1")
.Input(0, "B", "N-dimensional matrix B", "T2")
.Input(1, "B", "N-dimensional matrix B", "T2")
.Output(0, "Y", "Matrix multiply results from A * B", "T3")
.TypeConstraint("T1", {"tensor(int8)", "tensor(uint8)"}, "Constrain input A data types as 8-bit integer tensor")
.TypeConstraint("T2", {"tensor(int8)", "tensor(uint8)"}, "Constrain input B data types as 8-bit integer tensor")