[RFC] Generalize compute to tensor region #1485

Closed

ZihengJiang opened this issue Jul 25, 2018 · 19 comments

@ZihengJiang
Contributor

ZihengJiang commented Jul 25, 2018

Target:

  • Avoid the verification step in the original tensorize, which we found hard to get right for some patterns, such as distinguishing absolute from relative indexing.
  • Achieve better encapsulation. Vendors can build external packages for their hardware, and users do not need to know the details while writing their TVM declarations.
  • Propose a way to represent a block in a tensor that can be scheduled as a unit.

Motivation

The current tensorize method uses TensorIntrin as a contract. During lowering, it compares the body of the intrinsic's declaration with the body of the original declaration. After verifying that the two are identical, the body of the original declaration is replaced by the intrinsic.

However, we found that it is not easy to guarantee/express this identity in some situations, especially those involving the value of an index. The problem is related to how tensorize is implemented: normally, tensorize replaces the index in the intrinsic with the index from the original declaration, so that the lowered code can be matched exactly. For example, suppose we want to tensorize an addition intrinsic:

# intrinsic declaration
for k in 0...16
    A[k] = B[k] + C[k]

# original declaration
for i in 0...32
    for j in 0...16
        A[i*16+j] = B[i*16+j] + C[i*16+j]

We replace k in the intrinsic declaration with i*16+j, then compare the two IRs for identity. However, if the computation depends on the actual value of the index, things change: the verification will fail unless absolute and relative indexing are expressed explicitly.
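
Concretely, the matching step ends up comparing something like the following (a sketch of the substituted bodies, not actual compiler output):

# intrinsic body after substituting k -> i*16+j
A[i*16+j] = B[i*16+j] + C[i*16+j]

# original body
A[i*16+j] = B[i*16+j] + C[i*16+j]

The two bodies are textually identical after substitution, so the match succeeds. But when the intrinsic cares about the actual value of the index (for example, absolute indexing from offset 0 versus relative indexing within a tile), that requirement cannot be captured by this purely structural comparison.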

Design

Instead of proposing a way to express absolute indexing, we would like to generalize tvm.compute to operate on tensor regions. That is, we can view a tensor intrinsic as an operation between regions, just as + is an operation between elements.

  • region: following numpy, we use a slice to express a region in a tensor: A[i, 0:16]
  • tensor_op: unlike an extern operation, we want to keep the ability to schedule the part outside the tensorized region. This means there are two kinds of axes: 1. axes inside the tensorized region, which cannot be scheduled; 2. axes outside the tensorized region, which can be scheduled through split, reorder, etc.

With this generalization, we can use a tensor intrinsic directly in the declaration, so verification is no longer needed. This also provides better encapsulation: users do not need to know the details of the tensor intrinsic.

Interface

def tensor_op(out_dims,
              in_dims,
              finputs,
              tensor_intrin,
              reduce_axis=[],
              name='tensor_op',
              tag=""):
    """Construct new tensors with intrinsic.

    Parameters
    ----------
    out_dims: tuple
        The dimensions outside the tensorized region, which can be scheduled through `reorder` and `split`.

    in_dims: tuple
        The dimensions inside the tensorized region, which cannot be manipulated.

    finputs: lambda function of out_dims -> list of TensorSlice
        Specifies the involved regions of the input tensors.

    tensor_intrin : TensorIntrin
        The tensor intrinsic used for computation.

    reduce_axis : list of IterVar, optional
        The reduction axes used by the computation.

    name: str, optional
        The name hint of the tensor

    tag: str, optional
        Additional tag information about the compute.
    """

Demo:

# tensor intrinsic
def intrin_vadd(n):
    x = tvm.placeholder((n,), dtype=dtype, name='vx')
    y = tvm.placeholder((n,), dtype=dtype, name='vy')
    z = tvm.compute(x.shape, lambda i: x[i] + y[i], name='z')

    def intrin_func(ins, outs):
        ib = tvm.ir_builder.create()
        ib.emit(tvm.call_extern(outs[0].dtype, 'vadd', ins[0].access_ptr("r"), ins[1].access_ptr('r'), outs[0].access_ptr('wr'), 8, 1, 1, 1, 8, 8, 8))
        return ib.get()

    return tvm.decl_tensor_intrin(z.op, intrin_func)


# tensorize way
def tensorize_vadd():
    A = tvm.placeholder((m/factor, factor), name='A')
    B = tvm.placeholder((m/factor, factor), name='B')
    C = tvm.compute(A.shape, lambda *i: A(*i) + B(*i), name='C')

    s = tvm.create_schedule(C.op)
    xo, xi = C.op.axis
    vadd = intrin_vadd(factor)
    s[C].tensorize(xi, vadd)
    print(tvm.lower(s, [A, B, C], simple_mode=True))


# tensor_op way
def tensor_op_vadd():
    A = tvm.placeholder((m/factor, factor), name="A", dtype=dtype)
    B = tvm.placeholder((m/factor, factor), name="B", dtype=dtype)

    intrin = intrin_vadd(16)
    C = tvm.tensor_op([m/factor,], [factor,],
                      lambda i: [A[i, 0:factor], B[i, 0:factor]], 
                      intrin, name='C')

    s = tvm.create_schedule(C.op)
    print(tvm.lower(s, [A, B, C], simple_mode=True))
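
As a sketch of the scheduling behaviour described above (hypothetical usage built on the proposed tensor_op interface; the exact axis handles are an assumption, not part of the PR), only the axes outside the tensorized region remain schedulable:

# continuing the tensor_op demo: C.op.axis holds only the outer axis
xo = C.op.axis[0]
outer, inner = s[C].split(xo, factor=4)  # allowed: axis outside the tensorized region
# the factor-sized axis inside the tensorized region belongs to the intrinsic
# and cannot be split or reordered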

link to the PR: #1476

@tqchen @xqdan

@tqchen
Member

tqchen commented Jul 25, 2018

cc @dmlc/tvm-team

@tqchen changed the title from "[RFC] Another way to apply tensor intrinsic" to "[RFC] Generalize compute to tensor region" on Jul 26, 2018
@xqdan
Contributor

xqdan commented Aug 1, 2018

any update on this? @tqchen

@tqchen
Member

tqchen commented Aug 1, 2018

@ZihengJiang is going to send more updates on the proposal and the API behavior; we will start discussing from there.

@ZihengJiang
Contributor Author

@tqchen @yzhliu @tmoreau89 @merrymercy @cowanmeg welcome to comment

@tmoreau89
Contributor

I like this proposal as it provides better control to the programmer over how tensorization is applied.

@tmoreau89
Contributor

Another thing that is tricky with the current tensorization is error reporting (when the intrinsic does not match the code we apply tensorization to). Do you know if this new approach will lead to clearer error reporting?

@b-y-e

b-y-e commented Aug 10, 2018

The shape of the result from tensor_op is not obvious, and it is inconsistent with compute.

@ZihengJiang
Contributor Author

@b-y-e Because we need to distinguish the axes in the tensorized region from the others, both out_dims and in_dims are required. The final shape is the concatenation of the two. Other suggestions for the interface are welcome.
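
For instance, in the vadd demo above, out_dims is [m/factor] and in_dims is [factor], so C ends up with shape (m/factor, factor).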

@ZihengJiang
Contributor Author

ZihengJiang commented Aug 13, 2018

@tmoreau89 The only thing that needs to be checked in the new interface is the shape of the tensorized region; it will report an error if the shapes do not match. The body part is replaced directly.

@tqchen self-assigned this Aug 13, 2018
@tqchen added this to the v0.5 milestone Aug 13, 2018
@tqchen
Member

tqchen commented Aug 13, 2018

Thanks for all the discussion so far. This is a very important feature, and as @xqdan and @ZihengJiang suggested, there are concrete use cases for it. Here is a summary of my take:

Significance

So far, there is a two-level separation of "expressions": the scalar index expressions in a compute body, and the tensor-relation dataflow in the chains of operators defined by compute (topi).

Ideally, we want to ask: is it possible to also use tensor-typed expressions within compute itself? Conceptually this offers a sense of "recursion". That is why this proposal is quite significant: it takes a step in that direction.

Pros

It offers an alternative to the current tensorize:

  • The current tensorization requires everything to be declared in scalar form and then matched against the tensor intrinsics.
  • The proposed "tensor_op" allows us to directly put a tensor_intrin (which implies a tensor-level relation) into the body, specifying the compute rule for a "region" instead of a single scalar index.
    • This could be useful when the region is hard to pattern-match, because essentially we perform the matching directly by invoking the intrinsic and passing in the variables.
  • It is a first step toward a "recursive, region-level" declaration, which can be useful in further cases.

Cons

There are certain drawbacks of abusing this notion vs tensorize:

  • It somewhat violates the compute/schedule separation, because the pattern-matching step is done at the compute stage.
    • This is a practical trade-off that is fine for the short term, but we do need to keep it in mind.
  • Divergence from the current tensorization API, and turning the tensor intrinsic into a black box.
    • The current proposal offers a nice solution to this problem by reusing the tensor intrinsic for both tensorize and "tensor_op", which I really like.
    • This also means "tensor_op" is no longer a black box: if we want, we can replace the compute declaration and recover the original form.

@tqchen
Member

tqchen commented Aug 13, 2018

My last post summarizes some of the considerations. Given the significance of this proposal, we would like to invite the community to have a good discussion on the APIs. Specifically, we want to reach the following goals:

  • Natural consistency with the compute API, so that it feels like a tensor-region generalization of compute.
  • Consistency and reuse of the tensorization infrastructure, so that users don't get confused between the two and the APIs feel designed together.

Current Proposal

# tensor_op way
def tensor_op_vadd():
    A = tvm.placeholder((m/factor, factor), name="A", dtype=dtype)
    B = tvm.placeholder((m/factor, factor), name="B", dtype=dtype)

    intrin = intrin_vadd(16)
    C = tvm.tensor_op([m/factor,], [factor,],
                      lambda i: [A[i, 0:factor], B[i, 0:factor]], 
                      intrin, name='C')

Mock Design 1

The tensor intrinsic declaration is the same as in @ZihengJiang's example.

# tensor_op way, via compute
def tensor_op_vadd():
    A = tvm.placeholder((m/factor, factor), name="A", dtype=dtype)
    B = tvm.placeholder((m/factor, factor), name="B", dtype=dtype)
    C = tvm.compute((m/factor, factor),
          lambda i: intrin_add(A[i, 0:factor], B[i, 0:factor]))

The key differences are:

  • intrin_add is directly overloaded to indicate the invocation of the expression.
  • The "internal axes" are implied by the dimensionality of the expression (see the note below).
  • The "internal axes" still exist, but cannot be scheduled.
  • The dependency can be inferred from the expression, as in @ZihengJiang's PR.
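
Concretely, in the sketch above compute receives the full output shape (m/factor, factor) while the lambda binds only i, so the remaining factor-sized axis is treated as the internal, non-schedulable axis.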

Mock Design 2

# tensor_op way, via compute
def tensor_op_vadd():
    A = tvm.placeholder((m/factor, factor), name="A", dtype=dtype)
    B = tvm.placeholder((m/factor, factor), name="B", dtype=dtype)
    C = tvm.compute((m/factor, factor),
          lambda i, j: intrin_add(A[i, 0:size_of(j)], B[i, 0:size_of(j)]))
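
Compared with Mock Design 1, the internal axis j is named explicitly in the lambda here, and its extent is expressed through a size_of helper rather than being implied by the dimensionality of the expression.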

@tqchen
Member

tqchen commented Aug 13, 2018

@merrymercy @tmoreau89 @xqdan @cowanmeg @eqy @yzhliu @were @anijain2305 @dmlc/tvm-team please weigh in on the API proposal and the technical questions.

@tmoreau89
Contributor

Regarding the concerns that @tqchen raised, would it make sense to keep tensorization as it is, and introduce this new form of tensorization as "encapsulation" or "inlining"? This would maintain our clean separation of compute/scheduling, while introducing a convenient way to inline high-performance intrinsics in schedules.

@ZihengJiang
Contributor Author

Regarding the point "The internal axis is implied via the dimension of the expression": I doubt that the internal axes can be inferred from the expression directly. For example, here is a matmul case where the relation between the internal axes of C and those of A and B is not obvious.

# tensor_op way, via compute
def tensor_op_gemm():
    A = tvm.placeholder((M/factor1, L/factor, factor1, factor), name="A", dtype=dtype)
    B = tvm.placeholder((N/factor2, L/factor, factor2, factor), name="B", dtype=dtype)
    k = tvm.reduce_axis((0, L/factor), name='k')
    C = tvm.compute((M/factor1, N/factor2, factor1, factor2),
          lambda i, j: intrin_gemm(A[i, k, 0:factor1, 0:factor], B[j, k, 0:factor2, 0:factor], reduce_axis=k))

@tqchen
Member

tqchen commented Aug 20, 2018

We decide the number of internal axes by how many axes are not passed into compute. In your case, C is supposed to have 4 axes, but only two are passed into compute, so there are two internal axes.

@ZihengJiang
Contributor Author

ZihengJiang commented Sep 2, 2018

@tqchen As for the implementation, what is the return value of intrin_add(A[i, 0:factor], B[i, 0:factor])? It should be a statement, but in ComputeOp, body is an Expr.

Also, when we do intrin_add(A[i, 0:factor], B[i, 0:factor]), we do not have the information of the output tensor; should we use a placeholder for it?

@were
Contributor

were commented Sep 2, 2018

I like the 1st one better.

@tqchen
Member

tqchen commented Sep 6, 2018

  • intrin_add should be overloaded to check the types of its inputs and return an intermediate data structure when possible (like TensorIntrinCall).
  • The compute function should be able to detect the special return type and return a different Op, e.g. TensorComputeOp (a toy sketch of this dispatch follows below).
    • But from the API's point of view, the user calls the same compute API.
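
A minimal toy sketch of this dispatch idea (plain Python, no TVM; names such as TensorIntrinCall and TensorComputeOp are placeholders taken from this discussion, not an actual API):

import inspect

class TensorIntrinCall:
    # intermediate value produced when a tensor intrinsic is invoked on regions
    def __init__(self, intrin_name, inputs, reduce_axis=None):
        self.intrin_name = intrin_name
        self.inputs = inputs
        self.reduce_axis = reduce_axis

def intrin_add(*regions):
    # overloaded invocation: record the regions instead of computing anything
    return TensorIntrinCall("vadd", list(regions))

def compute(shape, fcompute):
    # bind symbolic names for the axes the user's lambda takes (the outer axes)
    nargs = len(inspect.signature(fcompute).parameters)
    body = fcompute(*["i%d" % d for d in range(nargs)])
    if isinstance(body, TensorIntrinCall):
        # tensor-region body: this is where a TensorComputeOp would be built
        return ("TensorComputeOp", shape, body.intrin_name, body.inputs)
    # ordinary scalar expression: the existing ComputeOp path
    return ("ComputeOp", shape, body)

# the same compute entry point handles both the scalar and the region case
print(compute((4, 16), lambda i, j: "A[%s, %s] + B[%s, %s]" % (i, j, i, j)))
print(compute((4, 16), lambda i: intrin_add("A[%s, 0:16]" % i, "B[%s, 0:16]" % i)))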

@tqchen
Member

tqchen commented Oct 18, 2018

closed by #1476
