
[TVM] Automatic differentiation for tensor expressions #2498

Closed · wants to merge 11 commits

Conversation

@sgrechanik-h (Contributor)

This PR adds automatic differentiation for tensor expressions (#1996). The implementation can be
naturally divided into two parts:

  • The automatic differentiation itself (autodiff.cc, autodiff.h, autodiff.py, test_pass_autodiff.py). It contains an implementation of reverse-mode automatic differentiation (the function Differentiate) and standard differentiation rules (the class JacobianMutator).
  • Transformations for optimizing the code generated by the autodiff (zero_elimination.cc, zero_elimination.h, test_pass_zero_elimination.py). This part is called zero elimination because its goal is to eliminate the generation of zero values in intermediate computations by fusing tensors and transforming iteration domains. Its central part is Fourier-Motzkin elimination (a toy sketch of the idea follows this list). Currently we use our own implementation (the function SolveSystemOfInequalities), but in the future we might want to consider using a library for manipulating polyhedra (which may be useful for other parts of TVM as well).
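To make the Fourier-Motzkin step concrete, here is a toy Python sketch of one variable-elimination step. This only illustrates the textbook algorithm; it is not the PR's SolveSystemOfInequalities:

from fractions import Fraction

# Each inequality is (coeffs, bound), meaning
# sum(c * x for c, x in zip(coeffs, xs)) <= bound, with rational coefficients.
def eliminate_var(ineqs, k):
    pos, neg, rest = [], [], []
    for coeffs, bound in ineqs:
        c = coeffs[k]
        (pos if c > 0 else neg if c < 0 else rest).append((coeffs, bound))
    # Pair every upper bound on x_k with every lower bound; each combination
    # no longer mentions x_k. This pairing is the source of FM's blowup.
    for pc, pb in pos:
        for nc, nb in neg:
            cp, cn = Fraction(pc[k]), Fraction(-nc[k])
            combined = [cn * a + cp * b for a, b in zip(pc, nc)]
            rest.append((combined, cn * pb + cp * nb))
    return rest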

There are also several helpers: functions for printing out tensors recursively in a human-readable form (dump_tensor.cc) and functions for applying transformations to compute bodies (in op_util.cc).

Testing of the automatic differentiation is done in two ways:

  • The correctness is tested by comparing against numerically computed gradients (a sketch of the idea follows this list).
  • We try to catch performance regressions by statically estimating some metrics (the number of operations, iterations, etc.). An exception is thrown if some metric is more than 1.5x worse than the reference value.
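For illustration, a minimal NumPy version of the numerical check described above (a hypothetical helper, not the code in test_pass_autodiff.py):

import numpy as np

def check_grad_numerically(f, x, analytic_grad, eps=1e-4, rtol=1e-2):
    # Compare an analytic gradient of a scalar-valued f against
    # central finite differences, element by element.
    numeric = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        orig = x[idx]
        x[idx] = orig + eps
        f_plus = f(x)
        x[idx] = orig - eps
        f_minus = f(x)
        x[idx] = orig
        numeric[idx] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    np.testing.assert_allclose(analytic_grad, numeric, rtol=rtol)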

Currently there are several things missing from this PR:

  • There is no solver for systems of linear integer equations. Instead, some ad-hoc transformations are used which don't work in many cases and are very sensitive to things like the order of variables. For this reason gradients for some operations like dilated and strided convolutions are not as performant as they could be. This is the main deficiency which should be fixed in subsequent PRs.
  • No tutorials or documentation; I'm planning to add them in a separate PR.
  • Some operations (e.g. scan) are not supported.
  • No integration with Relay or NNVM; we plan to add it in the future.

@masahi (Member) commented Jan 24, 2019

This looks super cool :) Just wondering, what is the plan to schedule the generated gradient expression (for CPU and GPU)? Is it intended to be used with the auto-scheduler @merrymercy is working on?

@sgrechanik-h (Contributor, Author)

@masahi Yes, it is supposed to be used with an autoscheduler. Scheduling the generated code manually is possible, but I don't think it's practical.

@kaitingwang (Contributor)

Would using a linear equation solver help (e.g. IBM CPLEX, which is proprietary, or the open-source GLPK)? Another question: will control flow be supported? e.g. https://arxiv.org/abs/1810.07951

@Ravenwater

@kaitingwang I would advise against GLPK, as it wouldn't be able to take advantage of different arithmetics. I would be looking for templated C++ code that provides the specific algorithm you are going to select for the scheduling. No need to pull in a general linear-programming library and weigh down this part of TVM.

adjoints and populate the corresponding dict, but the list of results
will be empty).

head : Tensor
Review comment (Contributor):

Can this be understood as the gradient of the output tensor w.r.t. the final output of a math expression? Could you explain why its shape is output.shape+output.shape when None is passed? My understanding is that it should default to ones(output.shape). Thanks.

Follow-up (Contributor):

Please disregard my question above. After reading your implementation of DiffBuildingBlock, I think I understand why head defaults to shape output.shape + output.shape.
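For other readers, a NumPy illustration of the two conventions discussed above (my sketch, not code from the PR): with a scalar loss the seed adjoint is ones(output.shape); with no head given, the differentiation is seeded with the identity map on the output, which has shape output.shape + output.shape and makes the result the full Jacobian:

import numpy as np

out_shape = (2, 3)
head_scalar = np.ones(out_shape)                          # seed for a scalar loss

n = int(np.prod(out_shape))
head_identity = np.eye(n).reshape(out_shape + out_shape)  # default seed

# Contracting the identity seed with a Jacobian leaves it unchanged:
jac = np.random.rand(*(out_shape + (4,)))                 # made-up d(out)/d(x)
assert np.allclose(np.tensordot(head_identity, jac, axes=2), jac)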

@sgrechanik-h (Contributor, Author)

Would using a linear equation solver help (e.g. IBM CPLEX, which is proprietary, or the open-source GLPK)?

Using third-party libraries for well-known tasks like solving systems of linear integer equations is a good idea; however, we should think through which particular libraries we should use in TVM.

Another question: will control flow be supported? e.g. https://arxiv.org/abs/1810.07951

No, differentiating through control flow structures is done separately on the Relay level.

@reminisce (Contributor)

@sgrechanik-h I have a question regarding the performance of autodiff, specifically about DiffBuildingBlock.

Taking the final loss as a scalar value, for example: suppose we have an input tensor of shape (i1, i2, ..., i_m), an output tensor of shape (j1, j2, ..., j_n), the adjoint of the output with shape (j1, j2, ..., j_n), and the Jacobian of the output w.r.t. the input with shape (j1, j2, ..., j_n, i1, i2, ..., i_m). DiffBuildingBlock performs a tensor dot operation on the head and the Jacobian to get the adjoint of the input. While this is a generic way of realizing autodiff, the tensor dot operation can actually be avoided for many operators whose forward algorithms are known and whose Jacobians are sparse.
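In NumPy terms, the generic propagation step described above looks like this (illustration only; the shapes are made up):

import numpy as np

out_shape, in_shape = (4, 5), (2, 3)
head = np.random.rand(*out_shape)                  # adjoint of the output
jac = np.random.rand(*(out_shape + in_shape))      # d(output)/d(input)
# Contract head with the Jacobian over all output axes:
adjoint_in = np.tensordot(head, jac, axes=len(out_shape))
assert adjoint_in.shape == in_shape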

Maybe you have already taken situations like this into account in the subsequent optimization passes. Just want to ask:

  1. Have you compared the performance of autodiff with the gradient kernels manually devised in MXNet/PyTorch/TF?
  2. Any guideline on when users should provide their own fdiff functions to tvm.autodiff.differentiate?

@sgrechanik-h (Contributor, Author)

@reminisce Yes, DiffBuildingBlock performs several transformations to fuse the tensor dot product with the Jacobian computation.

  1. Have you compared the performance of autodiff with the gradient kernels manually devised in MXNet/PyTorch/TF?

I haven't yet. The main problem with such a comparison is that we need to schedule the generated gradients somehow. So far I've only tried to compare the overall performance of a couple of simple neural networks with Keras/TF on CPU. I used a very naive scheduling method (just fuse axes, then split the resulting axis to parallelize; see the sketch below), so the forward pass was about 1.7x slower on TVM, and the backward pass was about 3x slower. So I would currently estimate the generated gradients to be about 2x "worse" than the manually written ones. However, this is very operation-dependent, so I should compare performance for separate operations in the future.
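For reference, the naive recipe above would look roughly like this with the tvm 0.x API used elsewhere in this thread (a sketch on a stand-in compute, not the actual benchmark code):

import tvm

n = 1024
A = tvm.placeholder((n, n), name='A')
B = tvm.compute((n, n), lambda i, j: A[i, j] * 2.0, name='B')
s = tvm.create_schedule(B.op)
fused = s[B].fuse(*B.op.axis)               # fuse all output axes into one
outer, inner = s[B].split(fused, factor=64) # split the fused axis
s[B].parallel(outer)                        # parallelize the outer chunk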

  2. Any guideline on when users should provide their own fdiff functions to tvm.autodiff.differentiate?

I was thinking of two use cases:

  • When the autodiff does a poor job and we have to perform differentiation manually for certain tensors.
  • When the user wants to try a custom set or order of optimizations (different from the ones used in DiffBuildingBlock).

Currently there is no way to find out whether the autodiff did a poor job, other than running and measuring the performance or looking at the generated code, but I'm thinking about writing a simple analyzer that will issue a warning if the generated code looks bad.

@kaitingwang (Contributor)

I've gone over the paper "Automatic Differentiation for Tensor Algebras" on which your work is based. (I have not dived into your code yet.) Please correct me if I've misunderstood. Your work is based on solving A * alpha = beta where 1 <= alpha <= Nf (equation 39), with a fallback to equation 40 (if it cannot be solved, one must sum over the sparsity of zeros). The math is quite involved, so I'm wondering why you've picked this paper and whether the mathematical soundness has been checked... Thanks.

@sgrechanik-h (Contributor, Author)

Your work is based on solving A * alpha = beta where 1 <= alpha <= Nf (equation 39), with a fallback to equation 40 (if it cannot be solved, one must sum over the sparsity of zeros).

Essentially, yes. More precisely, I first differentiate naively, generating something similar to equation (40), and then optimize this expression by generating sets of the form (39) for each loop and trying to simplify them.
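In other words (my paraphrase of the thread's own description, not the paper's exact equations): the naive form sums over the whole iteration domain with a delta-like guard, and the optimization tries to solve the guard instead:

  adjoint_x[alpha] = sum_{gamma=1..Nf} adjoint_y[gamma] * [A*gamma == alpha]   (eq. 40 style: guarded sum)
  adjoint_x[alpha] = adjoint_y[gamma*], where gamma* solves A*gamma = alpha    (eq. 39 style: direct solve, no sum)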

The math is quite involved, so I'm wondering why you've picked this paper and whether the mathematical soundness has been checked... Thanks.

It's quite overcomplicated in the paper, but the overall procedure is quite modular, so the soundness should follow from the soundness of every small step (the naive differentiation and every optimizing transformation). At the same time, the paper contains all the necessary details, which must be why I chose it.

@reminisce (Contributor)

@sgrechanik-h Thanks for the reply. This is really nice work. IMHO, the value of this auto diff feature lies in providing a generic way of generating gradient expressions for operators, so that developers can avoid tedious implementation of gradient expressions. To achieve reasonable performance in BP, we can attach schedules at the operator level.

Two more questions:

  1. How are race conditions handled in BP in this PR? For example, topi.take uses an input indices tensor to get elements from the input data. When duplicate indices are provided, there is a race condition at the time of back-propagating the output gradient to the same location of the input gradient tensor. How is this problem resolved in this PR?

  2. I have tried differentiate on several operators in topi. Most of them return correct results, but there are a few errors. The following is one of them. Any idea what went wrong?

import tvm
import topi

m, n = 40, 50
data = tvm.placeholder((m, n), name='data')
indices = tvm.placeholder((1,), dtype='int32', name='indices')
ret = topi.take(data, indices)
head = topi.full_like(ret, 2.0)
diff_ret = tvm.differentiate(ret, [data], head)

# error message:
# Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

@sgrechanik-h (Contributor, Author) commented Jan 29, 2019

@reminisce,

  1. How are race conditions handled in BP in this PR? For example, topi.take uses an input indices tensor to get elements from the input data. When duplicate indices are provided, there is a race condition at the time of back-propagating the output gradient to the same location of the input gradient tensor.

This implementation of autodiff doesn't create race conditions in the first place; instead it generates huge naive loops directly. For example, in the case of take it will check every single index for every single element of the output:

  for (ax0, 0, 40) {
    for (ax1, 0, 50) {
      grad[((ax0*50) + ax1)] = 0.000000f
      for (k0, 0, 10) {
        if (((ax0 == ((indices[k0]/50) % 40)) && (ax1 == (indices[k0] % 50)))) {
          grad[((((indices[k0]/50) % 40)*50) + (indices[k0] % 50))] = grad[((((indices[k0]/50) % 40)*50) + (indices[k0] % 50))] + head[k0]
        }
      }
    }
  }

That is, take may be an example where this differentiation algorithm is not a good choice. (Upd: I think it is possible to rearrange the loops so that k0 becomes the outer variable, and then eliminate the variables ax0 and ax1 by merging the corresponding loops with the condition. However, it seems that this cannot be expressed at the tensor expression level and should be done during or after the scheduling phase.)
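In NumPy terms, the rearranged scatter form would look roughly like this (my sketch; it also shows why parallelizing it reintroduces the race that the naive loops avoid):

import numpy as np

data_shape = (40, 50)
indices = np.array([3, 7, 3])          # a duplicate index -> colliding writes
head = np.array([1.0, 2.0, 3.0])       # adjoint of take's output
grad = np.zeros(int(np.prod(data_shape)))
for k0 in range(len(indices)):         # O(len(indices)) instead of O(40*50*len)
    grad[indices[k0]] += head[k0]      # += must be atomic if parallelized
grad = grad.reshape(data_shape)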

  2. I have tried differentiate on several operators in topi. Most of them return correct results, but there are a few errors. The following is one of them. Any idea what went wrong?

Thank you for reporting, I'll upload the fix soon.

@@ -0,0 +1,146 @@
/*!
Review comment (Contributor):

The current location of autodiff.h prevents it from being used by standalone C++ applications, since they don't have access to the src folder. Could we move this file to tvm/include?

Reply (Author):

Sure

@reminisce (Contributor)

@sgrechanik-h

This implementation of autodiff doesn't create race conditions in the first place; instead it generates huge naive loops directly. For example, in the case of take it will check every single index for every single element of the output:

You are right. I misunderstood the problem. It should be handled at the schedule level. Thanks.

@reminisce (Contributor)

@sgrechanik-h I'm reporting another crash. It is fine if data1 and data2 have the same shape, but it crashes when broadcasting is involved. Could you take a look? Thanks.

import tvm
import topi


m = tvm.var('m')
n = tvm.var('n')
data1 = tvm.placeholder((m, n), name='data1')
data2 = tvm.placeholder((n,), name='data2')
out = topi.add(data1, data2)
head = topi.full_like(out, 2.0)
tvm.differentiate(out, [data1, data2], head)

Error message:

libc++abi.dylib: terminating with uncaught exception of type std::out_of_range: unordered_map::at: key not found

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

@sgrechanik-h (Contributor, Author)

@reminisce Thanks, the problem is due to the free variables m and n. I've fixed this particular bug. However, such free variables are not really supported: the generated gradients should be correct (although I expect more bugs, since I haven't thoroughly tested this case), but I would expect their performance to be much worse.

@kaitingwang (Contributor)

@sgrechanik-h I tried the following test case and am puzzled about why you needed to 'shift' the identity matrix to avoid the out-of-bounds situation (i.e. allocating 10x3x17x3 for identity) even when you had put in the if-guard to prevent this situation (i.e. if ((((((k1 <= ax2) && ((ax2 - k1) <= 9)) && ((ax2 + -9) <= k1)) && (k1 <= ax2)) && (ax2 <= 11)))).

from __future__ import absolute_import
import tvm
import topi

A = tvm.placeholder((15,), name="A")
B = tvm.compute((10, 3), lambda i, j: A[i + j], name="B")

print("-----------FORWARD-----------------")
print(tvm.PrintTensorRecursively(B))
sout = tvm.create_schedule(B.op)
ir = tvm.lower(sout, [B, A], simple_mode=True)
print(ir)

print("-----------BACKWARD---------------------")
[jacs] = list(tvm.differentiate(B, [A]))
print(tvm.PrintTensorRecursively(jacs))
sjac = tvm.create_schedule(jacs.op)
ir_jacs = tvm.lower(sjac, [jacs], simple_mode=True)
print(ir_jacs)

Lowered code:

// attr [identity] storage_scope = "global"
allocate identity[float32 * 10 * 3 * 17 * 3]
// attr [tensor] storage_scope = "global"
allocate tensor[float32 * 15 * 3 * 10]
produce identity {
  for (ax0, 0, 10) {
    for (ax1, 0, 3) {
      for (ax2, 0, 17) {
        for (ax3, 0, 3) {
          if (likely((2 <= ax2))) {
            if (likely((ax2 < 12))) {
              identity[((((((ax0*3) + ax1)*17) + ax2)*3) + ax3)] = float32((((ax0 - ax2) == -2) && (ax1 == ax3)))
            }
          }
        }
      }
    }
  }
}
produce tensor {
  for (ax2, 0, 15) {
    for (ax1, 0, 3) {
      for (ax0, 0, 10) {
        tensor[((((ax2*3) + ax1)*10) + ax0)] = 0.000000f
        for (k1, 0, 3) {
          if (likely((ax2 < 12))) {
            if ((((((k1 <= ax2) && ((ax2 - k1) <= 9)) && ((ax2 + -9) <= k1)) && (k1 <= ax2)) && (ax2 <= 11))) {
              tensor[((((ax2*3) + ax1)*10) + ax0)] = (tensor[((((ax2*3) + ax1)*10) + ax0)] + identity[(((((ax2 + (ax1*17)) + (ax0*51))*3) - (k1*2)) + 6)])
            }
          }
        }
      }
    }
  }
}

@sgrechanik-h (Contributor, Author)

@kaitingwang This is a problem in TVM. At some point during or immediately after scheduling, TVM analyzes the bounds of tensor arguments and expands the bounds of tensors if it thinks that out-of-bounds accesses might happen. However, bound analysis is difficult, and this sometimes results in overapproximation. In this case TVM couldn't properly infer the bounds from the conditions (or it may have ignored the conditions completely, I'm not sure). I tried to fix this behavior of TVM once, but failed: #2104

(Note that there is another problem with this example: the identity tensor was not inlined, although this would have been very beneficial. The reason is that I don't perform inlining of adjoints since in the case of gradients the adjoints are rarely sparse and inlining them results in worse performance. However, in the case of arbitrary Jacobians, like in this case, inlining of adjoints is necessary, so I have to think what to do about it.)

@sgrechanik-h (Contributor, Author)

I've updated the PR with a Relay integration. Basically, calling relay._ir_pass.AutogeneratePrimalGradientForAll() will populate the FPrimalGradient attribute of every Relay operation with automatically generated gradients. For some operations the automatically generated gradients fail with compile-time or run-time errors; usually this is caused by multi-output operations (which don't seem to be fully supported by the Relay AD yet) and by integer inputs and outputs.

@sgrechanik-h (Contributor, Author)

@tqchen Zero elimination is independent, and it makes sense to create a separate PR; I will do it soon.

@sgrechanik-h (Contributor, Author)

A link to my comment containing things to discuss before merging.

@@ -206,6 +206,34 @@ Expr ReplaceTensor(Expr expr,
}


void ReplaceTensorRecursivelyImpl(Tensor tensor,
std::unordered_map<Tensor, Tensor>* replace) {
Review comment (Member):

Should we use std::unordered_map<Tensor, Tensor, NodeHash, NodeEqual>*?

Reply (Author):

I don't think so. The Tensor class has its own, coarser notion of equality, and it has its own specialization of std::hash. I think using ordinary reference equality might cause problems even for this particular function.
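To illustrate the coarser equality (a sketch assuming tvm 0.x semantics): two distinct Tensor handles compare equal when they refer to the same operation output, even though they are different objects by reference:

import tvm

A = tvm.placeholder((10,), name='A')
B = tvm.compute((10,), lambda i: A[i] + 1.0, name='B')
B_again = B.op.output(0)    # a second handle to the same output
assert B == B_again         # Tensor equality: same (op, value_index) pair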

@sgrechanik-h (Contributor, Author)

Zero elimination PR #2634

@altanh (Contributor) commented May 14, 2019

@sgrechanik-h @tqchen Hi, what is the status of this and #2634, #2588? In trying to implement the missing primal gradients for #2562 (to enable training), it seems there will be a high per-operator time cost, so we'd like to understand where this AD (along with the other features) is in terms of progress.

@sgrechanik-h (Contributor, Author)

@altanh I've recently implemented a solver for integer systems of equations, which improved the performance of the code generated by the autodiff. However, this triggered some bugs which are probably connected to mis-simplification by the Halide simplifier (due to different division semantics), so I'm currently moving the autodiff to the new TVM simplifiers. When I'm done, I'll update the pull requests.

@altanh (Contributor) commented May 17, 2019

@sgrechanik-h that sounds great, thank you for the update! For now we'll continue working through the list of gradients and implementing them by hand, since the consensus is that they'll be invaluable for future things like optimizing after AD, higher order AD, etc.

@yzhliu (Member) commented Jul 21, 2019

@sgrechanik-h Does this PR realize the linear inequality solver (to eliminate the Kronecker delta within a summation) described in the paper you mentioned?

@sgrechanik-h (Contributor, Author)

@yzhliu Yes, it does. The more up-to-date autodiff-dev branch also implements the solver for systems of linear equations (and should work on top of the current TVM master).

@kaitingwang (Contributor) left a comment:

Perhaps this is not possible, since softmax is not an element-wise operation; therefore, one would need to fully complete the forward softmax computation in order to use it in the backward computation.

return Mul::make(Mutate(op->args[0]), e);
} else if (op->name == "log") {
return Div::make(Mutate(op->args[0]), op->args[0]);
} else if (op->name == "sigmoid") {
Review comment (Contributor):

@sgrechanik-h I like this approach of expressing the gradient of sigmoid in terms of the forward sigmoid (i.e. held in "e"). I'm wondering how to extend this to softmax, which is a bit trickier. The reason is that the gradient of softmax can also be expressed in terms of the forward softmax, but it depends on whether i == j:

  d softmax_i / d x_j = softmax_i * (delta_ij - softmax_j)

How would you express this here?

PS. The softmax gradient is taken from: https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/

Reply (Author):

Yes, since softmax is not elementwise, we can't express it here, so I think it should be done through the gradient overriding mechanism (or on the Relay level).

@kaitingwang (Contributor) commented Aug 14, 2019:

I agree. However, softmax can be inside a larger expression tree (e.g. focal loss), and as such we'd want to autodiff that expression tree (using the Jacobian logic you have there) but, when encountering the softmax operator, use the Relay-overridden softmax gradient instead. How would you invoke a Relay-registered gradient from the Jacobian function?

Reply (Author):

I think there are several problems:

  • How to override differentiation rules for some specific tensor: in my implementation there is a mechanism for this, e.g. the option override which can be used as described here
  • How to detect the operation whose gradient we want to override, in this case softmax. The overriding mechanism simply compares tensors by reference, but this equality seems to be too fine-grained in practice, because softmax(A) and softmax(B) are two different tensors at the tensor expression level. I think the practical solution would be to perform gradient overriding by the tensor's tag or name; however, this is not currently implemented.
  • How to reuse gradients implemented in Relay. I don't know, but I think it should be relatively straightforward to lower a Relay graph representing a gradient into a tensor expression (however I have no idea what to do with schedules).

@kaitingwang (Contributor) commented Aug 21, 2019:

Thanks for the reply! It took me a while to understand, but I like your clever design: when override is specified, fdiff is essentially a wrapper. That is, depending on whether the out tensor is specified in the override or not, one either uses the default fdiff or the one specified inside the override dictionary. However, I found your example below unintuitive:

# a generalization of my_fdiff which works for non-immediate dependencies
# (needed because z1 is not an immediate dependency of z2, due to padding)
def my_diff(out, inputs, head):
    return tvm.differentiate(out, inputs, head, fdiff=my_fdiff)

# using a custom differentiation function only for z2
res = tvm.differentiate(y, [w1, w2], override={z2: ([z1, w2], my_diff)})

If I write my own custom fdiff for z2 w.r.t. z1 or w2, I'd prefer my_diff to return a tensor (i.e. a tvm.compute that is the custom gradient) instead of a DifferentiationResult (not sure how to construct such an object in Python...).

Follow-up (Contributor):

Actually, I just realized that I can indeed do so, by putting the custom gradient tensors into a list and returning that from my_diff. A DifferentiationResult is simply a list of tensors!

@MarisaKirisame (Contributor)

@sgrechanik-h What if you use a more traditional method of doing automatic differentiation: instead of working on sparse tensors, generate TVM IR code with mutation and do the standard Wengert list there? See Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Second Edition, for example.

@sgrechanik-h (Contributor, Author)

@MarisaKirisame I guess it will result in lots of += operations, which will require synchronization if we want to parallelize the resulting code.

@merrymercy (Member) commented Feb 9, 2020

@sgrechanik-h What's the current status of this project? Are there any updates?

@sgrechanik-h (Contributor, Author)

@merrymercy Unfortunately I don't have time to work on it now. The last time I rebased my branch on the latest TVM and checked that it still worked was 2 months ago (the branch is here: https://github.com/sgrechanik-h/tvm/tree/autodiff-dev).

@yzhliu (Member) commented Feb 10, 2020

@sgrechanik-h Would you mind if someone takes over and rebuilds it on top of your work?

@sergei-grechanik (Contributor)

@yzhliu Yeah, absolutely, I don't mind.

@yzhliu (Member) commented Mar 6, 2020

@sgrechanik-h I have difficulty understanding this part; it would be helpful if you could provide some guidance.

Say we have A x = b => U^T S V^T x = b. The solution for x is V S^{-1} U b + K z, in which K is the rightmost m-r columns of V and z is an arbitrary vector. In my understanding,

  • old_to_new is V^T
  • matrix is S
  • rhs is U b
  • solution[] in the code is V^T V S^{-1} U b + V^T K z

Then the for-loops in the code calculate V^T * solution; how does it equal the original x?

@sergei-grechanik (Contributor)

@yzhliu solution in my code seems to be S^{-1} U b + V^{-1} K z, i.e. it is the solution for y of S y = U b (where y = V^{-1} x). Sorry for the confusing naming.

@yzhliu (Member) commented Mar 6, 2020

@sgrechanik-h Thanks for the reply. That's the same as what I thought. If solution = V^{-1} x, what does old_to_new * solution mean? In my understanding old_to_new = V^{-1}; then old_to_new * solution = V^{-2} x. Did I miss something?

@sergei-grechanik
Copy link
Contributor

@yzhliu Oh, I guess old_to_new must be V then, not V^{-1}.
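Putting the two clarifications together (my summary of the naming, in the thread's own notation):

  x = V * y = old_to_new * solution,   where y = solution = S^{-1} U b + V^{-1} K z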

@tqchen (Member) commented Mar 30, 2020

Superseded by #5121

Labels: status: need RFC