[RFC] Generalize compute to tensor region #1485

Closed

ZihengJiang opened this issue Jul 25, 2018 · 19 comments

@ZihengJiang
Contributor

ZihengJiang commented Jul 25, 2018

Target:

  • Avoid the verification step in the original tensorize, which we found hard to get right for some patterns, such as distinguishing absolute from relative indexing.
  • Achieve better encapsulation. Vendors can build external packages for their hardware, and users do not need to know the details while writing their TVM declarations.
  • Propose a way to represent a block in a tensor that can be scheduled as a unit.

Motivation

The current tensorize method uses TensorIntrin as a contract. During lowering, it compares the body of the intrinsic's declaration with the body of the original declaration. After verifying that the two are identical, the body of the original declaration is replaced by the intrinsic.

However, we found that it is not easy to guarantee/express this identity in some situations, especially those involving the value of an index. The problem is related to how tensorize is implemented: normally, tensorize replaces the index in the intrinsic with the index from the original declaration, so that the lowered code can be matched exactly. For example, suppose we want to tensorize an addition intrinsic:

# intrinsic declaration
for k in 0...16
    A[k] = B[k] + C[k]

# original declaration
for i in 0...32
    for j in 0...16
        A[i*16+j] = B[i*16+j] + C[i*16+j]

We replace k in the intrinsic declaration with i*16+j, then compare the two IRs for identity. However, if the computation depends on the actual value of the index, things change: the verification will fail unless absolute and relative indexing are expressed explicitly.
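
Concretely, the matching step ends up comparing something like the following (a sketch of the substituted bodies, not actual compiler output):

# intrinsic body after substituting k -> i*16+j
A[i*16+j] = B[i*16+j] + C[i*16+j]

# original body
A[i*16+j] = B[i*16+j] + C[i*16+j]

The two bodies are textually identical after substitution, so the match succeeds. But when the intrinsic cares about the actual value of the index (for example, absolute indexing from offset 0 versus relative indexing within a tile), that requirement cannot be captured by this purely structural comparison.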

Design

Instead of proposing a way to express absolute indexing, we would like to generalize tvm.compute to operate on tensor regions. That is, we can view a tensor intrinsic as an operation between regions, just as + is an operation between elements.

  • region: following numpy, we use a slice to express a region in a tensor: A[i, 0:16]
  • tensor_op: unlike an extern operation, we want to keep the ability to schedule the part outside the tensorized region. This means there are two kinds of axes: 1. axes inside the tensorized region, which cannot be scheduled; 2. axes outside the tensorized region, which can be scheduled through split, reorder, etc.

With this generalization, we can use a tensor intrinsic directly in the declaration, so verification is no longer needed. This also provides better encapsulation: users do not need to know the details of the tensor intrinsic.

Interface

def tensor_op(out_dims,
              in_dims,
              finputs,
              tensor_intrin,
              reduce_axis=[],
              name='tensor_op',
              tag=""):
    """Construct new tensors with intrinsic.

    Parameters
    ----------
    out_dims: tuple
        The dimensions outside the tensorized region, which can be scheduled through `reorder` and `split`.

    in_dims: tuple
        The dimensions inside the tensorized region, which cannot be manipulated.

    finputs: lambda function of out_dims -> list of TensorSlice
        Specifies the involved regions of the input tensors.

    tensor_intrin : TensorIntrin
        The tensor intrinsic used for computation.

    reduce_axis : list of IterVar, optional
        The reduction axes used by the computation.

    name: str, optional
        The name hint of the tensor

    tag: str, optional
        Additional tag information about the compute.
    """

Demo:

# tensor intrinsic
def intrin_vadd(n):
    x = tvm.placeholder((n,), dtype=dtype, name='vx')
    y = tvm.placeholder((n,), dtype=dtype, name='vy')
    z = tvm.compute(x.shape, lambda i: x[i] + y[i], name='z')

    def intrin_func(ins, outs):
        ib = tvm.ir_builder.create()
        ib.emit(tvm.call_extern(outs[0].dtype, 'vadd', ins[0].access_ptr("r"), ins[1].access_ptr('r'), outs[0].access_ptr('wr'), 8, 1, 1, 1, 8, 8, 8))
        return ib.get()

    return tvm.decl_tensor_intrin(z.op, intrin_func)


# tensorize way
def tensorize_vadd():
    A = tvm.placeholder((m/factor, factor), name='A')
    B = tvm.placeholder((m/factor, factor), name='B')
    C = tvm.compute(A.shape, lambda *i: A(*i) + B(*i), name='C')

    s = tvm.create_schedule(C.op)
    xo, xi = C.op.axis
    vadd = intrin_vadd(factor)
    s[C].tensorize(xi, vadd)
    print(tvm.lower(s, [A, B, C], simple_mode=True))


# tensor_op way
def tensor_op_vadd():
    A = tvm.placeholder((m/factor, factor), name="A", dtype=dtype)
    B = tvm.placeholder((m/factor, factor), name="B", dtype=dtype)

    intrin = intrin_vadd(16)
    C = tvm.tensor_op([m/factor,], [factor,],
                      lambda i: [A[i, 0:factor], B[i, 0:factor]], 
                      intrin, name='C')

    s = tvm.create_schedule(C.op)
    print(tvm.lower(s, [A, B, C], simple_mode=True))
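
As a sketch of the scheduling behaviour described above (hypothetical usage built on the proposed tensor_op interface; the exact axis handles are an assumption, not part of the PR), only the axes outside the tensorized region remain schedulable:

# continuing the tensor_op demo: C.op.axis holds only the outer axis
xo = C.op.axis[0]
outer, inner = s[C].split(xo, factor=4)  # allowed: axis outside the tensorized region
# the factor-sized axis inside the tensorized region belongs to the intrinsic
# and cannot be split or reordered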

link to the PR: #1476

@tqchen @xqdan

@tqchen
Member

tqchen commented Jul 25, 2018

cc @dmlc/tvm-team

@tqchen changed the title from "[RFC] Another way to apply tensor intrinsic" to "[RFC] Generalize compute to tensor region" on Jul 26, 2018
@xqdan
Contributor

xqdan commented Aug 1, 2018

any update on this? @tqchen

@tqchen
Member

tqchen commented Aug 1, 2018

@ZihengJiang is going to send more updates on the proposal and the API behavior; we will start discussing from there.

@ZihengJiang
Contributor Author

@tqchen @yzhliu @tmoreau89 @merrymercy @cowanmeg welcome to comment

@tmoreau89
Contributor

I like this proposal as it provides better control to the programmer over how tensorization is applied.

@tmoreau89
Contributor

Another thing that is tricky with the current tensorization is error reporting (when the intrinsic does not match the code we apply tensorization to). Do you know if this new approach will lead to clearer error reporting?

@b-y-e

b-y-e commented Aug 10, 2018

The shape of the result from tensor_op is not obvious, and it is inconsistent with compute.

@ZihengJiang
Contributor Author

@b-y-e Because we need to distinguish the axes in the tensorized region from the others, both out_dims and in_dims are required. The final shape is the concatenation of the two. Other suggestions for the interface are welcome.
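
For instance, in the vadd demo above, out_dims is [m/factor] and in_dims is [factor], so C ends up with shape (m/factor, factor).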

@ZihengJiang
Contributor Author

ZihengJiang commented Aug 13, 2018

@tmoreau89 The only thing that needs to be checked in the new interface is the shape of the tensorized region; it will report an error if the shapes do not match. The body part is replaced directly.

@tqchen self-assigned this Aug 13, 2018
@tqchen added this to the v0.5 milestone Aug 13, 2018
@tqchen
Member

tqchen commented Aug 13, 2018

Thanks for all the discussion so far. This is a very important feature, and as @xqdan and @ZihengJiang suggested, there are concrete use cases for it. Here is a summary of my take:

Significance

So far, there is a two-level separation of "expressions": the scalar index expressions in a compute body, and the tensor-relation dataflow in the chains of operators defined by compute (topi).

Ideally, we want to ask: is it possible to also use tensor-typed expressions within compute itself? Conceptually this offers a sense of "recursion". That is why this proposal is quite significant: it takes a step in that direction.

Pros

It offers an alternative to the current tensorize:

  • The current tensorization requires everything to be declared in scalar form and then matched against the tensor intrinsics.
  • The proposed "tensor_op" allows us to directly put a tensor_intrin (which implies a tensor-level relation) into the body, specifying the compute rule for a "region" instead of a single scalar index.
    • This could be useful when the region is hard to pattern-match, because essentially we perform the matching directly by invoking the intrinsic and passing in the variables.
  • It is a first step toward a "recursive, region-level" declaration, which can be useful in further cases.

Cons

There are certain drawbacks of abusing this notion vs tensorize:

  • It somewhat violates the compute/schedule separation, because the pattern-matching step is done at the compute stage.
    • This is a practical trade-off that is fine for the short term, but we do need to keep it in mind.
  • Divergence from the current tensorization API, and turning the tensor intrinsic into a black box.
    • The current proposal offers a nice solution to this problem by reusing the tensor intrinsic for both tensorize and "tensor_op", which I really like.
    • This also means "tensor_op" is no longer a black box: if we want, we can replace the compute declaration and recover the original form.

@tqchen
Member

tqchen commented Aug 13, 2018

My last post summarizes some of the considerations. Given the significance of this proposal, we would like to invite the community to have a good discussion on the APIs. Specifically, we want to reach the following goals:

  • Natural consistency with the compute API, so that it feels like a tensor-region generalization of compute.
  • Consistency and reuse of the tensorization infrastructure, so that users don't get confused between the two and the APIs feel designed together.

Current Proposal

# tensor_op way
def tensor_op_vadd():
    A = tvm.placeholder((m/factor, factor), name="A", dtype=dtype)
    B = tvm.placeholder((m/factor, factor), name="B", dtype=dtype)

    intrin = intrin_vadd(16)
    C = tvm.tensor_op([m/factor,], [factor,],
                      lambda i: [A[i, 0:factor], B[i, 0:factor]], 
                      intrin, name='C')

Mock Design 1

The tensor intrinsic declaration is the same as in @ZihengJiang's example.

# tensor_op way, via compute
def tensor_op_vadd():
    A = tvm.placeholder((m/factor, factor), name="A", dtype=dtype)
    B = tvm.placeholder((m/factor, factor), name="B", dtype=dtype)
    C = tvm.compute((m/factor, factor),
          lambda i: intrin_add(A[i, 0:factor], B[i, 0:factor]))

The key differences are:

  • intrin_add is directly overloaded to indicate the invocation of the expression.
  • The "internal axes" are implied by the dimensionality of the expression (see the note below).
  • The "internal axes" still exist, but cannot be scheduled.
  • The dependency can be inferred from the expression, as in @ZihengJiang's PR.
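
Concretely, in the sketch above compute receives the full output shape (m/factor, factor) while the lambda binds only i, so the remaining factor-sized axis is treated as the internal, non-schedulable axis.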

Mock Design 2

# tensor_op way, via compute
def tensor_op_vadd():
    A = tvm.placeholder((m/factor, factor), name="A", dtype=dtype)
    B = tvm.placeholder((m/factor, factor), name="B", dtype=dtype)
    C = tvm.compute((m/factor, factor),
          lambda i, j: intrin_add(A[i, 0:size_of(j)], B[i, 0:size_of(j)]))
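
Compared with Mock Design 1, the internal axis j is named explicitly in the lambda here, and its extent is expressed through a size_of helper rather than being implied by the dimensionality of the expression.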

@tqchen
Member

tqchen commented Aug 13, 2018

@merrymercy @tmoreau89 @xqdan @cowanmeg @eqy @yzhliu @were @anijain2305 @dmlc/tvm-team please weigh in on the API proposal and the technical questions.

@tmoreau89
Contributor

Regarding the concerns that @tqchen raised, would it make sense to keep tensorization as it is, and introduce this new form of tensorization as "encapsulation" or "inlining"? This would maintain our clean separation of compute/scheduling, while introducing a convenient way to inline high-performance intrinsics in schedules.

@ZihengJiang
Contributor Author

Regarding the point "The internal axis is implied via the dimension of the expression": I doubt that the internal axes can be inferred from the expression directly. For example, here is a matmul case where the relation between the internal axes of C and those of A and B is not obvious.

# tensor_op way, via compute
def tensor_op_gemm():
    A = tvm.placeholder((M/factor1, L/factor, factor1, factor), name="A", dtype=dtype)
    B = tvm.placeholder((N/factor2, L/factor, factor2, factor), name="B", dtype=dtype)
    k = tvm.reduce_axis((0, L/factor), name='k')
    C = tvm.compute((M/factor1, N/factor2, factor1, factor2),
          lambda i, j: intrin_gemm(A[i, k, 0:factor1, 0:factor], B[j, k, 0:factor2, 0:factor], reduce_axis=k))

@tqchen
Member

tqchen commented Aug 20, 2018

We decide the number of internal axes by how many axes are not passed into compute. In your case, C is supposed to have 4 axes, but only two are passed into compute, so there are two internal axes.

@ZihengJiang
Contributor Author

ZihengJiang commented Sep 2, 2018

@tqchen As for the implementation, what is the return value of intrin_add(A[i, 0:factor], B[i, 0:factor])? It should be a statement, but in ComputeOp, body is an Expr.

Also, when we do intrin_add(A[i, 0:factor], B[i, 0:factor]), we do not have the information of the output tensor; should we use a placeholder for it?

@were
Contributor

were commented Sep 2, 2018

I like the 1st one better.

@tqchen
Member

tqchen commented Sep 6, 2018

  • intrin_add should be overloaded to check the types of its inputs and return an intermediate data structure when possible (like TensorIntrinCall).
  • The compute function should be able to detect the special return type and return a different Op, e.g. TensorComputeOp (a toy sketch of this dispatch follows below).
    • But from the API's point of view, the user calls the same compute API.
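
A minimal toy sketch of this dispatch idea (plain Python, no TVM; names such as TensorIntrinCall and TensorComputeOp are placeholders taken from this discussion, not an actual API):

import inspect

class TensorIntrinCall:
    # intermediate value produced when a tensor intrinsic is invoked on regions
    def __init__(self, intrin_name, inputs, reduce_axis=None):
        self.intrin_name = intrin_name
        self.inputs = inputs
        self.reduce_axis = reduce_axis

def intrin_add(*regions):
    # overloaded invocation: record the regions instead of computing anything
    return TensorIntrinCall("vadd", list(regions))

def compute(shape, fcompute):
    # bind symbolic names for the axes the user's lambda takes (the outer axes)
    nargs = len(inspect.signature(fcompute).parameters)
    body = fcompute(*["i%d" % d for d in range(nargs)])
    if isinstance(body, TensorIntrinCall):
        # tensor-region body: this is where a TensorComputeOp would be built
        return ("TensorComputeOp", shape, body.intrin_name, body.inputs)
    # ordinary scalar expression: the existing ComputeOp path
    return ("ComputeOp", shape, body)

# the same compute entry point handles both the scalar and the region case
print(compute((4, 16), lambda i, j: "A[%s, %s] + B[%s, %s]" % (i, j, i, j)))
print(compute((4, 16), lambda i: intrin_add("A[%s, 0:16]" % i, "B[%s, 0:16]" % i)))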

@tqchen
Member

tqchen commented Oct 18, 2018

closed by #1476
