[TVM] Automatic differentiation for tensor expressions #2498
Conversation
This looks super cool :) Just wondering, what is the plan for scheduling the generated gradient expressions (for CPU and GPU)? Is it intended to be used with the auto scheduler @merrymercy is working on?
@masahi Yes, it is supposed to be used with an autoscheduler. Scheduling the generated code manually is possible, but I don't think it's practical.
Would using a linear equation solver (e.g. IBM CPLEX, which is proprietary, or the open-source GLPK) help? Another question: will control flow be supported? e.g. https://arxiv.org/abs/1810.07951
@kaitingwang I would advise against GLPK, as it wouldn't be able to take advantage of different arithmetics. I would be looking for templated C++ code that provides the specific algorithm you are going to select for the scheduling. There is no need to pull in a general linear programming library and weigh down this part of TVM.
adjoints and populate the corresponding dict, but the list of results
will be empty).

head : Tensor
Can this be understood as the gradient of the output tensor w.r.t. the final output of a math expression? Could you explain why its shape is output.shape + output.shape when None is passed? My understanding is that it should default to ones(output.shape). Thanks.
Please disregard my question above. After reading your implementation of DiffBuildingBlock, I think I got the meaning of defaulting head to shape output.shape + output.shape.
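(For later readers, a minimal sketch of the two modes, assuming the tvm.differentiate Python API proposed in this PR; the tensors x and y and the sizes m, n are made up for illustration. As discussed above, head=None behaves like an identity head of shape output.shape + output.shape, so the result per input is a full Jacobian, while an explicit head of shape output.shape contracts it to an ordinary gradient.)

```python
# Sketch only; assumes the tvm.differentiate API from this PR.
import tvm
import topi

m, n = 4, 5
x = tvm.placeholder((m, n), name='x')
y = topi.exp(x)

# head=None acts like an identity head of shape y.shape + y.shape,
# so the result w.r.t. x has shape y.shape + x.shape (a full Jacobian).
jac = tvm.differentiate(y, [x])

# An explicit head of shape y.shape (here all ones) contracts the Jacobian
# into an ordinary gradient of shape x.shape.
grad = tvm.differentiate(y, [x], topi.full_like(y, 1.0))
```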
Using third-party libraries for well-known tasks like solving systems of linear integer equations is a good idea; however, we should think through which particular libraries we should use in TVM.
No, differentiating through control flow structures is done separately, on the Relay level.
@sgrechanik-h I have a question regarding the performance of the auto diff, specifically about the size of the intermediate results. Taking the final loss as a scalar value, for example, and supposing we have an input tensor of some large shape, the intermediate results of differentiation could become very large. Maybe you have taken situations like this into account in the subsequent optimization passes. Just wanted to ask.
@reminisce Yes,
I haven't yet. The main problem of such a comparison is that we need to schedule the generated gradients somehow. So far I've only tried to compare the overall performance of a couple of simple neural networks with Keras/TF on CPU. I used a very naive scheduling method (just fuse axes, and then split the resulting axis to parallelize), so the forward pass was about 1.7x slower on tvm, and the backward pass was about 3x slower. So I would currently estimate the generated gradients to be about 2x "worse" than the manually written ones. However, this is very operation-dependent, so I should compare performance for separate operations in the future.
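For reference, the "naive scheduling" described above could look roughly like this (a sketch under the assumption of the pre-0.7 TVM schedule API; x, grad and the sizes are made up, with grad standing in for a generated gradient tensor):

```python
# Sketch of "fuse the axes, then split the resulting axis to parallelize".
# `grad` is a stand-in for a gradient tensor produced by the autodiff.
import tvm

n, m = 128, 128
x = tvm.placeholder((n, m), name='x')
grad = tvm.compute((n, m), lambda i, j: 2.0 * x[i, j], name='grad')

s = tvm.create_schedule(grad.op)
fused = s[grad].fuse(*s[grad].op.axis)      # fuse all axes into one
outer, inner = s[grad].split(fused, factor=64)
s[grad].parallel(outer)                     # parallelize the outer part
print(tvm.lower(s, [x, grad], simple_mode=True))
```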
I was thinking of two use cases:
Currently there is no way to find out whether the autodiff did a poor job, other than running and measuring the performance or looking at the generated code, but I'm thinking about writing a simple analyzer which will issue a warning if the generated code looks bad.
I've gone over the paper "Automatic Differentiation for Tensor Algebras", on which your work is based (I have not dived into your code yet). Please correct me if I've misunderstood: your work is based on solving A * alpha = beta where 1 <= alpha <= Nf (equation 39), with a fallback to equation 40 (if the system cannot be solved, one must sum over the sparsity of zeros). The math is quite involved, so I'm wondering why you've picked this paper and whether its mathematical soundness has been checked... Thanks.
Essentially, yes. More precisely, I first differentiate naively, generating something similar to equation (40), and then optimize this expression by generating sets of the form (39) for each loop and trying to simplify them.
It's quite overcomplicated in the paper, but the overall procedure is quite modular, so the soundness should follow from the soundness of every small step (the naive differentiation and every optimizing transformation). At the same time the paper contains all the necessary details, which is probably why I've chosen it.
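To give a flavour of the procedure (an illustrative example, not copied from the paper; the bar denotes adjoints here), the naive adjoint contains Kronecker deltas that force summation over the whole iteration domain, and solving the associated linear index equations collapses those sums:

```latex
\bar{X}_{k\ell} \;=\; \sum_{i,j} \delta_{ik}\,\delta_{j\ell}\,\bar{Y}_{ij}
\quad\Longrightarrow\quad
\bar{X}_{k\ell} \;=\; \bar{Y}_{k\ell}
```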
@sgrechanik-h Thanks for the reply. This is really nice work. IMHO, the value of this auto diff feature lies in providing a generic way of generating gradient expressions for operators, so that developers can avoid tedious implementation of gradient expressions. To achieve reasonable performance in BP, we can attach schedules at the operator level. Two more questions:
```python
import tvm
import topi

m, n = 40, 50
data = tvm.placeholder((m, n), name='data')
indices = tvm.placeholder((1,), dtype='int32', name='indices')
ret = topi.take(data, indices)
head = topi.full_like(ret, 2.0)
diff_ret = tvm.differentiate(ret, [data], head)
# error message:
# Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
```
This implementation of autodiff doesn't create race conditions in the first place; instead it generates huge naive loops directly. For example, in the case of take, the adjoint of the input is computed with a reduction over the output positions rather than by scattering into it.
Thank you for reporting, I'll upload the fix soon.
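To illustrate what "naive loops instead of a scatter" means here (a hypothetical sketch, not the exact code this pass generates; grad_data and the reduce axis k are made up, the other names follow the crashing example above): every element of the adjoint of data sums over the output positions whose index matches it, so no two iterations ever write the same location.

```python
# Hypothetical sketch of a gather-style gradient written as a reduction
# rather than a scatter (shapes follow the crashing example above).
import tvm
import topi

m, n = 40, 50
data = tvm.placeholder((m, n), name='data')
indices = tvm.placeholder((1,), dtype='int32', name='indices')
ret = topi.take(data, indices)          # flattened take, ret.shape == (1,)
head = topi.full_like(ret, 2.0)         # adjoint of ret

k = tvm.reduce_axis((0, 1), name='k')
grad_data = tvm.compute(
    (m, n),
    lambda i, j: tvm.sum(
        tvm.if_then_else(indices[k] == i * n + j,
                         head[k],
                         tvm.const(0.0, head.dtype)),
        axis=k),
    name='grad_data')
```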
src/pass/autodiff.h
The current location of autodiff.h prevents it from being used by standalone C++ applications, since they don't have access to the src folder. Could we move this file to tvm/include?
Sure
You are right. I misunderstood the problem. It should be handled at the schedule level. Thanks.
@sgrechanik-h I'm reporting another crash. It is fine if m and n are concrete values, but the following crashes with symbolic shapes:
```python
import tvm
import topi

m = tvm.var('m')
n = tvm.var('n')
data1 = tvm.placeholder((m, n), name='data1')
data2 = tvm.placeholder((n,), name='data2')
out = topi.add(data1, data2)
head = topi.full_like(out, 2.0)
tvm.differentiate(out, [data1, data2], head)
```
Error message:
@reminisce Thanks, the problem is due to the free variables (m and n) in the shapes.
@sgrechanik-h I tried the following test case and am puzzled about why you needed to 'shift' the identity matrix to avoid the out-of-bounds situation (i.e. allocating 10x3x17x3 for the identity tensor) even when you had put in the if-guard to prevent this situation.
Lowered code:
@kaitingwang This is a problem of TVM. At some point during or immediately after scheduling, TVM analyzes the bounds of tensor arguments and expands the bounds of tensors if it thinks that out-of-bounds accesses might happen. However, bound analysis is difficult, and this sometimes results in overapproximation. In this case TVM couldn't properly infer the bounds from the conditions (or it may have ignored the conditions completely, I'm not sure). I tried to fix this behavior of TVM once, but failed: #2104
(Note that there is another problem with this example: the identity tensor was not inlined, although this would have been very beneficial. The reason is that I don't perform inlining of adjoints, since in the case of gradients the adjoints are rarely sparse and inlining them results in worse performance. However, in the case of arbitrary Jacobians, like in this case, inlining of adjoints is necessary, so I have to think about what to do about it.)
I've updated the PR with a Relay integration. Basically, calling
@tqchen Zero elimination is independent, and it makes sense to create a separate PR; I will do it soon.
A link to my comment containing things to discuss before merging.
void ReplaceTensorRecursivelyImpl(Tensor tensor,
                                  std::unordered_map<Tensor, Tensor>* replace) {
Should we use std::unordered_map<Tensor, Tensor, NodeHash, NodeEqual>*?
I don't think so. The Tensor class has its own, coarser notion of equality, and it has its own specialization of std::hash. I think using ordinary reference equality may cause problems even for this particular function.
Zero elimination PR: #2634
@sgrechanik-h @tqchen Hi, what is the status of this and #2634, #2588? In trying to implement the missing primal gradients for #2562 (to enable training), it seems like there will be a high per-operator time cost, so we'd like to understand where this AD (along with the other features) is in terms of progress.
@altanh I've recently implemented a solver for integer systems of equations which improved the performance of the code generated by the autodiff; however, this triggered some bugs which are probably connected to mis-simplification by the Halide simplifier (due to different division semantics), so I'm currently moving the autodiff to the new TVM simplifiers. When I'm done, I'll update the pull requests.
@sgrechanik-h That sounds great, thank you for the update! For now we'll continue working through the list of gradients and implementing them by hand, since the consensus is that they'll be invaluable for future things like optimizing after AD, higher order AD, etc.
@sgrechanik-h Does this PR realize the linear inequality solver (to eliminate Kronecker deltas within summations) described in the paper you mentioned?
@yzhliu Yes, it does. The more up-to-date autodiff-dev branch also implements the solver for systems of linear equations (and should work on top of the current TVM master).
Perhaps this is not possible since softmax is not an element-wise operation. Therefore, one would need to fully complete the forward softmax computation, in order to use it in the backward computation.
  return Mul::make(Mutate(op->args[0]), e);
} else if (op->name == "log") {
  return Div::make(Mutate(op->args[0]), op->args[0]);
} else if (op->name == "sigmoid") {
@sgrechanik-h I like this approach of expressing the gradient of sigmoid in terms of the forward sigmoid (i.e. the value held in "e"). I'm wondering how to extend this to softmax, which is a bit trickier. The reason is that the gradient of softmax can also be expressed in terms of the forward softmax, but it depends on whether i == j.
How would you express this here?
PS. The softmax gradient is taken from: https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/
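For reference, the softmax Jacobian from that link (with S_i the softmax outputs and a_j the inputs, following the article's notation) is:

```latex
\frac{\partial S_i}{\partial a_j} \;=\; S_i\,\bigl(\delta_{ij} - S_j\bigr)
```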
Yes, since softmax is not elementwise, we can't express it here, so I think it should be done through the gradient overriding mechanism (or on the Relay level).
I agree. However, softmax can be inside a large expression tree (e.g. focal loss), and as such we'd want to autodiff that expression tree (using the Jacobian logic you have there) but, when encountering the softmax operator, use the Relay-overridden softmax gradient instead. How would you invoke a Relay-registered gradient from the Jacobian function?
I think there are several problems:
- How to override differentiation rules for some specific tensor: in my implementation there is a mechanism for this, e.g. the option override, which can be used as described here.
- How to detect the operation for which we want to override gradients, in this case softmax. The overriding mechanism simply compares tensors by reference, but this equality seems to be too fine-grained in practice, because softmax(A) and softmax(B) are two different tensors on the tensor expression level. I think the practical solution would be to perform gradient overriding by the tensor's tag or name; however, this is currently not implemented.
- How to reuse gradients implemented in Relay. I don't know, but I think it should be relatively straightforward to lower a Relay graph representing a gradient into a tensor expression (however, I have no idea what to do with schedules).
Thanks for the reply! It took me a while to understand, but I like your clever design: when override is specified, fdiff is essentially a wrapper. That is, depending on whether the out tensor is specified in the override or not, one either uses the default fdiff or the one specified inside the override dictionary. However, I found your example below unintuitive:
```python
# a generalization of my_fdiff which works for non-immediate dependencies;
# this is necessary because z1 is not an immediate dep of z2 because of padding
def my_diff(out, inputs, head):
    return tvm.differentiate(out, inputs, head, fdiff=my_fdiff)

# using a custom differentiation function only for z2
res = tvm.differentiate(y, [w1, w2], override={z2: ([z1, w2], my_diff)})
```
If I write my own custom fdiff for z2 w.r.t. z1 or w2, I'd prefer my_diff to return a tensor (i.e. a tvm.compute that is the custom gradient) instead of a DifferentiationResult. (I'm not sure how to construct such an object in Python...)
Actually, I just realized that I can indeed do so, by putting the custom gradient tensors into a list and returning that from my_diff. DifferentiationResult is simply a list of tensors!
@sgrechanik-h What if you used a more traditional method of doing automatic differentiation: instead of working on sparse tensors, generate TVM IR code with mutation and do the standard Wengert list there. See Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Second Edition for example.
@MarisaKirisame I guess it will result in a lot of
@sgrechanik-h What's the current status of this project? Are there any updates?
@merrymercy Unfortunately I don't have time to work on it now. The last time I rebased my branch on the latest TVM and checked that it still worked was 2 months ago (the branch is here: https://github.com/sgrechanik-h/tvm/tree/autodiff-dev).
@sgrechanik-h Would you mind if someone takes over and rebuilds it on top of your work?
@yzhliu Yeah, absolutely, I don't mind.
@sgrechanik-h I have difficulty understanding this part; it would be helpful if you could provide some guidance. Say we have
Then the for-loops in the code calculate
@yzhliu
@sgrechanik-h Thanks for the reply. That's the same as what I thought. If
@yzhliu Oh, I guess
Superseded by #5121
This PR adds automatic differentiation for tensor expressions (#1996). The implementation can be naturally divided into two parts:
- The automatic differentiation itself (autodiff.cc, autodiff.h, autodiff.py, test_pass_autodiff.py). It contains an implementation of reverse-mode automatic differentiation (the function Differentiate) and standard differentiation rules (the class JacobianMutator).
- Zero elimination (zero_elimination.cc, zero_elimination.h, test_pass_zero_elimination.py). This part is called zero elimination because its goal is to eliminate the generation of zero values in intermediate computations by fusing tensors and transforming domains of iteration. Its central part is Fourier-Motzkin elimination; currently we use our own implementation (the function SolveSystemOfInequalities), however in the future we might want to consider using some libraries for manipulating polyhedra (which may be useful for other parts of TVM as well).

There are also several helpers: functions for printing out tensors recursively in a human-readable form (dump_tensor.cc) and functions for applying transformations to compute bodies (in op_util.cc).

Testing of the automatic differentiation is done in two ways:

Currently there are several things missing from this PR: