Update AD comparison #988
Conversation
What about just keeping Tapir on the earlier version for now?

I opened an issue at Tapir.jl, let's see what @willtebbutt thinks.

Since it sounds like, per JuliaDiff/DifferentiationInterface.jl#339 and others, DI is likely to have a nontrivial performance loss on Enzyme for things like closures, can we just add Enzyme.gradient/autodiff benchmarks here directly?

The benchmarking utilities of DifferentiationInterfaceTest are one of the main appeals of the package, and so this notebook is a benchmark of AD backends as called through DI. I think it can also be useful for non-advanced users who want to use DI precisely because it is simple and backend-agnostic. Thus I would recommend we keep this page as it is, possibly with a warning. And I am aware that you're not happy about Enzyme possibly being a bit slower in DI @wsmoses ;). I just ask that you give me time to come up with a solution that makes everyone happy. In the meantime, I'm adding a warning to our docs: JuliaDiff/DifferentiationInterface.jl#344

We should measure the difference, that would make the discussion much more concrete.

Yeah, I think having both here would be useful. For this function in particular I imagine the perf would be mostly the same (given it's already a single-argument function without a closure), but we could see how things actually shake out for the cases being discussed.

Oh okay. I misunderstood this as a benchmark set of AD packages in general, rather than a DI-specific benchmark. If this is the case, maybe this should have a warning at the top saying something like: "This benchmark set is for testing the performance of DI, and its performance may differ from using the AD packages directly. If you are looking for a benchmark of different AD backends, please see "

I'm okay with things going either way. I agree that having a comparison with "native" Enzyme would make sense, especially in some complicated cases like closures. But I have already done the work of coding this backend-agnostic benchmarking pipeline, and so far it is the only thing close to a standard that we have. So my suggestion is the following: if AD developers are unhappy with the performance of their backend on a given benchmark, they are free to contribute the necessary code to add rows to the DataFrame in this notebook (e.g. something like the sketch below). Of course, said rows should contain the exact same quantities that DifferentiationInterfaceTest computes by itself, to enable comparison. In other words, now that there is a standard, if you want to beat the DI implementation, the burden falls on you to play by the same rules. Or better yet, to contribute ideas to DI and help me fix performance pitfalls. What do you think?
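(For concreteness, here is a hypothetical sketch of what such a contributed row could look like. The column names are taken from the DataFrame shown further down in this thread; the `push!` mechanics and the string labels are assumptions, not an agreed-upon format, and `data` / `y` refer to the DataFrame and callable from the example in the next comment.)

```julia
using DataFrames, BenchmarkTools, Enzyme

# Benchmark the "native" call, then append a row with the same quantities
# that DifferentiationInterfaceTest reports.
b = @benchmark Enzyme.autodiff(Reverse, $y, Active(1.0))

push!(data, (
    backend = "Enzyme (direct)",
    scenario = "gradient",
    operator = :gradient,
    calls = 1,
    samples = length(b.times),
    evals = b.params.evals,
    time = minimum(b.times) * 1e-9,       # convert ns to s
    allocs = Float64(minimum(b).allocs),
    bytes = Float64(minimum(b).memory),
    gc_fraction = 0.0,
    compile_fraction = 0.0,
); promote = true)
```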
@gdalle are the DIT results basically compatible with BenchmarkTools? In which case I imagine it should be fairly simple to just add a `@benchmark` / `@btime` like below.

As for the tangible perf implications, here's an example. Asking for the closure derivatives, like in JuliaDiff/DifferentiationInterface.jl#341, will cause the derivative code to become arbitrarily asymptotically worse (here demonstrated going from O(1) to O(N)). The linked PR failed to compile for me, so I demonstrate with what the Enzyme calls would be directly.

As a top-level comment, I simultaneously think the work you've done is immensely valuable (especially here to showcase a comparison), and also think it is important to provide reasonable default benchmarks/examples/docs/etc. People look to docs/benchmarks as examples of how things should be written/used, and showing them something suboptimal or with an error has led to issues for Julia in the past (e.g. there was a Julia language doc for `@inbounds` that itself misused `@inbounds`, leading to lots of headaches -- and various code examples using type-unstable global variables led to type-unstable global variables in a lot more places). If this is meant to be something which represents how to use AD in Julia [in contrast to how to use DI], I think it is important that we give examples that show how to use these tools most effectively (as well as how to use these tools most easily, thanks to DI!).

```julia
struct MyClosure
    y::Vector{Float64}
end

# The callable closes over the vector `y`
function (c::MyClosure)(x)
    sum(c.y) + x
end

y = MyClosure(rand(10000))

using Enzyme
using DifferentiationInterface
using BenchmarkTools

# Constant callable: cost is independent of length(y), O(1)
@btime Enzyme.autodiff(Reverse, y, Active(1.0))
# 36.380 ns (1 allocation: 16 bytes)

# Duplicated callable: the closed-over vector is tracked too, O(N)
@btime Enzyme.autodiff(Reverse, Duplicated(y, make_zero(y)), Active(1.0))
# 3.422 μs (11 allocations: 78.67 KiB)
using DifferentiationInterfaceTest

backends = [
    AutoEnzyme(mode=Enzyme.Reverse),
];

scenarios = [
    GradientScenario(x -> y(x[1]); x=[1.0], y=1.0, nb_args=1),
];

# Using DI main
data = benchmark_differentiation(backends, scenarios, logging=true)
# 3×11 DataFrame
#  Row │ backend      scenario                          operator             calls  samples  evals  time      allocs  bytes  gc_fraction  compile_fraction
# ─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#    1 │ AutoEnzyme…  Scenario{:gradient,1,:inplace} #…  prepare_gradient         0        1      1  0.0          0.0    0.0          0.0               0.0
#    2 │ AutoEnzyme…  Scenario{:gradient,1,:inplace} #…  value_and_gradient!      1    20198      1  3.375e-6    20.0  544.0          0.0               0.0
#    3 │ AutoEnzyme…  Scenario{:gradient,1,:inplace} #…  gradient!                1    20006      1  3.375e-6    11.0  288.0          0.0               0.0
# Using a rebase of https://github.com/gdalle/DifferentiationInterface.jl/pull/341:
#
# julia> using DifferentiationInterface, Enzyme
# Precompiling DifferentiationInterfaceEnzymeExt
#   Info Given DifferentiationInterfaceEnzymeExt was explicitly requested, output will be shown live
# ERROR: LoadError: UndefVarError: `@compat` not defined
```
Thanks Billy for the example, which indeed gives proof that JuliaDiff/DifferentiationInterface.jl#339 needs a little more thought.
You're mixing two related but different concepts. This has nothing to do with JuliaDiff/DifferentiationInterface.jl#339. JuliaDiff/DifferentiationInterface.jl#339 needs to be done for correctness reasons. If you check `@btime Enzyme.autodiff(Reverse, y, Active(1.0))`, it simply gives the wrong derivatives in many cases. Yes it's fast, but with

```julia
function (c::MyClosure)(x)
    c.y[1] = x
    sum(c.y) + x
end
```

you'll get the wrong answer.

The second concept is that DI should not build closures, and should instead pass in const arguments, so that what it's using for caches and parameters is not caught up in the closure duplication. That's the multi-argument form, and having this be a two-argument function with a const on the second argument is the way to make it clear not to differentiate that argument. But I would be very wary of calling an approach a "performance optimization" when it makes a very common use case simply compute the wrong answer.
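(A minimal runnable sketch of that wrong-answer case, under the names used above but with a renamed struct to avoid redefinition clashes; per this discussion, the constant-callable version silently misses the derivative flowing through `c.y`.)

```julia
using Enzyme

# Same shape as MyClosure above, but the call now mutates the closed-over data
struct MutClosure
    y::Vector{Float64}
end
function (c::MutClosure)(x)
    c.y[1] = x        # an active value is written into closed-over storage
    sum(c.y) + x      # true derivative: 1 (through c.y[1]) + 1 (direct) = 2
end

c = MutClosure(zeros(10))
# Constant callable: the path through c.y is dropped -> derivative 1.0 (wrong)
Enzyme.autodiff(Reverse, Const(c), Active(1.0))

c2 = MutClosure(zeros(10))
# Duplicated callable: closed-over data is tracked -> derivative 2.0 (right)
Enzyme.autodiff(Reverse, Duplicated(c2, make_zero(c2)), Active(1.0))
```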
My thinking was that having multiple arguments would solve both problems at once, but you're right, people may still want to put stuff in the function.

There is no performance penalty in this case. There is a bug I'm trying to isolate, see SciML/SciMLSensitivity.jl#1067, but I can verify that it's zero overhead.
The case I'd worry about isn't the ghost-type closure case, but one where there's inactive data being closed over. For example a model, where you just want the loss wrt the input and not the model [the closure version would end up computing the derivative wrt everything]. And I do think that this is related to the multiple arguments/activity support. For example, @ChrisRackauckas, your use case could be done with something like the sketch below.
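(The code block from this comment did not survive the export; the following is a plausible reconstruction of the idea using Enzyme's multi-argument activities — the `loss`/`cache` names are placeholders. The cache is passed as an explicit `Duplicated` argument while the function itself stays `Const`.)

```julia
using Enzyme

# The mutated buffer is an explicit argument instead of closed-over state
function loss(x, cache)
    cache[1] = x          # active data flows through the buffer
    sum(cache) + x
end

cache = zeros(10)
# Const function, Active scalar input, Duplicated cache: derivatives through
# `cache` are tracked, but nothing closed over by `loss` is differentiated.
Enzyme.autodiff(Reverse, Const(loss), Active,
                Active(1.0), Duplicated(cache, make_zero(cache)))
```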
However, if the function were always differentiable, it would be impossible to represent the case where you don't differentiate wrt the function/model/etc., which is the default of other AD tools. I agree DI should support both, but in the absence of an option, in my mind it's a matter of giving reasonable defaults (e.g. what is expected from other AD tools). You can always use Enzyme directly.
For multiple arguments and activities, check out my proposal in JuliaDiff/DifferentiationInterface.jl#311 (comment); I think I'm getting closer to an interesting interface. As for the default behavior, I think most other backends consider the function constant, so in that sense keeping the current behavior for Enzyme seems reasonable indeed, as long as I warn about it in the docs. Once multiple arguments are supported, closing over a cache is no longer necessary.
The issue is that other AD backends throw an error. For example, we have this in the DiffEq FAQ because it comes up often. People will differentiate code like that and get an error saying that there is a conversion between Float64 and Dual undefined, etc., and they will then ask, or read the FAQ, and then use PreallocationTools.jl to fix it. The issue with non-duplicated closures is that you simply get the wrong answer with no warning or error. SciML/SciMLSensitivity.jl#962 was done because, on those examples, if it's not done you just get the wrong derivative. I would be okay with the approach of not handling it by default if there were an error or warning to disambiguate const vs active, but you really can't just assume that every user's function won't have active elements, because that's not true, and it's so not true that we specifically have a section of the FAQ dedicated to it. DI shouldn't make closures and differentiate what's not needed, but making the assumption that the user's function is not active is just not always right, and it's also silent about the fact that it's not always right.
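(For illustration, a minimal sketch of that failure mode — the function and buffer names are made up here. ForwardDiff fails loudly when its Dual numbers hit a preallocated Float64 buffer, rather than returning a silently wrong gradient.)

```julia
using ForwardDiff

const buffer = zeros(3)              # preallocated Float64 cache
function loss_buffered(x)
    buffer .= 2 .* x                 # Dual numbers cannot be stored here
    sum(buffer) + sum(x)
end

# Throws a conversion error along the lines of
# "Cannot convert ForwardDiff.Dual to Float64" -- loud, not silent:
# ForwardDiff.gradient(loss_buffered, [1.0, 2.0, 3.0])
```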
But if we clearly document in DI that the function is assumed to be constant?
Every other backend throws an error though. I would rank the options as:

1. Differentiate the closed-over data correctly (duplicate the function).
2. Throw an error and make the user disambiguate const vs. active.
3. Silently assume the function is constant, potentially returning a wrong derivative.

All other AD backends do (2). Enzyme supports (1) or (3), but does not currently support (2), at least to my knowledge. Choosing (3) is not up to the standard we put on the other AD libraries. I understand wanting to go fast, but what about clearly documenting a proposed speedup boolean, so that the fast constant-function path is an explicit opt-in?
Yeah, I had thought of amending ADTypes too. Perhaps something along the lines of the sketch below.
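(A hypothetical sketch of what such an ADTypes-level switch might look like; the `function_annotation` keyword is an illustration of the proposal, not an API that exists at the time of this discussion.)

```julia
using ADTypes, Enzyme

# Today: the mode is the only Enzyme-specific knob on the backend type
backend = AutoEnzyme(mode=Enzyme.Reverse)

# Proposal: let the user state how the function object should be annotated,
# Const for speed (asserting no active closed-over data) or Duplicated for
# safety. Hypothetical keyword:
# backend = AutoEnzyme(mode=Enzyme.Reverse, function_annotation=Enzyme.Const)
```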
I think an ADTypes arg is a good solution to this (also useful for others outside DI, like SciMLSensitivity etc.). @ChrisRackauckas I completely agree with your comments/ordering, though I am going to push back on Enzyme giving an incorrect result. A user tells Enzyme whether any var (including the function) is differentiable. Enzyme gives a correct answer in both cases, subject to the problem it was given (e.g. considering uses of the function to be constant means any dataflow through a constant has 0 derivative). Of course there is still an issue here of "what problem do we want to compute", for which picking a non-differentiable closure when we do want differentiable dataflow is not correct. But the error is not in Enzyme, but in whatever picked the wrong Enzyme call. By a similar token, if DI specified that the function is always non-differentiable, it is not a bug in DI, since that is using DI for a problem out of spec. The question at hand, however, is what the spec should be (e.g. a question for users of DI, one within DI, etc.).
I agree here. It's not an issue with Enzyme, but:

> if DI specified that the function is always non-differentiable, it is not a bug in DI, since that is using DI for a problem out of spec
I think that's a misunderstanding of how DI is expected to be used. If we expect DI to be used in packages, say DiffEq, NonlinearSolve, Turing, etc., then you can't expect most users to have read the documentation of DI at all. In fact, I'd argue that most users of DI won't read most of its documentation either (unless there's an error message that tells them to read a specific page). The issue that I'm seeing is that DI, as a general differentiation interface, is planned to be used pervasively, but if it makes a choice like this, then this is a choice made for downstream as well. And it would be very confusing to a Turing user if they used a mutable cache to make their code non-allocating with pre-allocated buffers, because this blog post https://viralinstruction.com/posts/optimise/#cut_down_on_allocations said to do that, and now their code does not converge correctly. At least today, a user of Turing will get an error message.
On this PR though, @gdalle, should we merge with AutoTapir giving NaNs?

I don't know. The Tapir issue is related to both DITest and Tapir using JET as a dependency, and some weird world-age shenanigans happening (see compintell/Mooncake.jl#193). @willtebbutt your call.

I'm working on resolving this by removing JET as a dep of Tapir.jl. I'm expecting to have something merged today; then hopefully this will all work nicely, and you can use the most recent version of Tapir.jl.

@gdalle could you please try restricting the version of Tapir to 0.2.26 and up, and see what happens? It should work now.
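(For reference, a sketch of that compat bound, assuming it is set through Pkg; equivalently, `Tapir = "0.2.26"` under `[compat]` in the benchmark's Project.toml.)

```julia
using Pkg
# Allow 0.2.26 and newer releases within the 0.2.x series
Pkg.compat("Tapir", "0.2.26")
```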
Force-pushed from 7559d5c to a6c7b0c (Co-authored-by: Will Tebbutt <wt0881@my.bristol.ac.uk>)
I updated the Project.toml and rebased.

Does it usually take an hour to find an agent for the AD testing?

The sparse matrix run is 3 days, so that's occupying one. Then there's the bump here https://buildkite.com/julialang/scimlbenchmarks-dot-jl/builds/2671#0190dd16-0e34-4ef6-bb22-a261e6ae81d9, which should probably be done in a few hours.

It looks like this went through and we're getting numbers. Merging. I'll probably tag a new set of the benchmarks soon too, so this might be the first tagged version.
Update AD benchmark to the latest versions of DifferentiationInterface
For some reason Tapir fails, will investigate