Decide inline-worthiness based on a more nuanced cost model #22210
Conversation
Just realized there's a really big hole in this: a …
Not an expert on inlining heuristics, but I think using the number of operations in a function to decide inline-worthiness should be OK. It's perfectly fine to inline a small loop. As long as the inlining doesn't blow up the size of the parent function, it shouldn't hurt much to inline it. Obviously it might not help much, but that's true in general, and you'd probably need a much more involved cost model of the inlining, or to actually try inlining it, to tell that.
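A minimal sketch of the operation-count idea described above (purely illustrative; `toy_inline_worthy` is a hypothetical helper, not anything in Base):

```julia
# Toy size-based heuristic: count the statements in a function's lowered
# code and consider it inline-worthy below a fixed size threshold. A
# small loop passes this test even though it may iterate many times.
function toy_inline_worthy(f, argtypes::Tuple; max_statements::Int = 40)
    ci = first(code_lowered(f, argtypes))   # lowered CodeInfo for f
    return length(ci.code) <= max_statements
end

sumfirst3(a) = a[1] + a[2] + a[3]
toy_inline_worthy(sumfirst3, (Vector{Int},))  # tiny body, inline-worthy
```

The point of the comment above is exactly what this sketch exposes: the statement count says nothing about how many times a loop body runs, only about how much the parent function grows when the body is copied in.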
The changes to `isassigned(::SimpleVector, ...)` and `isassigned(::BitArray, ...)` are currently no-ops, since they are unlikely to be inlined. The `SimpleVector` version is also currently unsafe to inline. These might change with a better inlining heuristic (#22210) and a smarter GC frame allocation pass (#21888).
Am I reading it correctly that there's one case that shows a 10^6 speedup?
> Well, then all optimizing compilers have been cheating all along....

I'm pretty sure most C/C++ compilers do much more aggressive inlining than we do, so I wouldn't call it cheating. The important thing is to measure the overhead. Maybe measure the time to compile the sysimg, or some other standard precompilation test cases. You can also measure the total size of the inferred IR by recording a statistic of non-void expressions in inference after finishing each function, and if you want more low-level info you can also check …
Wow, that's really cool. If you re-enable manual inlining declarations, do the regressions go away?
Yes, that's basically right; several other cases get better than a 10^4 speedup. Or rather, "not slowed down"; I am sure that our performance annotations take care of all the egregious cases. I very much doubt we'll see much improvement compared to "real" master from this, but of course it's worth testing.
Great idea.
So there's definitely a slowdown with this. Would be interested in people's thoughts on how bad this is.
Will test. But right now I'm testing how this fares against "real" master, with all of its performance annotations; I'm curious to know whether we could scale back on our usage of …
I believe we also have a …
Here's a comparison against master with all of its performance annotations: definitely worse on balance, but better than I expected. To my eye, this branch without performance annotations is closer to master with annotations than it is to master with the annotations stripped out. Obviously some of the regressions are pretty severe, but I've done literally no performance tuning here. Probably worth looking into some of the regressions by hand and seeing what I can come up with; in particular, the ones that have ~10-100ns of runtime on master seem likely to be just me assigning the wrong cost, and the heuristic therefore failing to inline a pretty simple function.
I think it'd be really cool to expose the internal cost model here via profiling; it'd be a neat alternative to time/allocations/samples and another good signal for identifying code performance bottlenecks and the relative "cost" of various code paths. Sometimes you don't realize the relative costs of various parts of a workflow, or you have some kind of type instability, all of which seems like it would stick out when seeing this kind of data.
@quinnj, the cost model is almost certainly not accurate enough to expose to users; it doesn't take looping into account, and the cost of non-inferrability is handled crudely. However, it can be useful for understanding why a particular function didn't get inlined. I'll document how to call …
So, I'm thinking of putting this on the back burner until #21888 merges; I'm seeing quite a few issues that I hope will be resolved by better GC marking. The last commit here (c981dd1) is just one example; I'm also seeing a few big performance regressions that seem to occur because we allocate when we don't need to. A good example is …

Of course the question is whether its benefits on LLVM 3.9 will fix these issues, or whether that will have to wait until we move to LLVM 5.
Speaking of bugs triggered by this, https://travis-ci.org/JuliaLang/julia/builds/240808331 hit an assertion in …
@nanosoldier …
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels
Cool stuff!
This greatly reduces the cost of failure-to-inline
This switches to a model in which we estimate the runtime of the function; if it's fast, then we should inline it. The estimate of runtime is extremely crude, and doesn't even take loops into account.
If this makes it through CI, I'll merge in the morning.
AppVeyor is failing due to libgit2-online. Do all the intermediate commits pass tests as well?
Yes, they do. (EDIT: only the last commit addresses the real topic of this PR. The earlier ones compensate for issues I detected at various points along the way while developing this. As with the …
Yay!
Hm, looks like this makes the ACME.jl tests crash. Not sure whether the problem was introduced here or just uncovered by different inlining. I haven't really gotten anywhere investigating the issue, but somewhat up the call stack of a slightly reduced case, I found this:
Note that the type of the (inner) args is an instance of …
As long as it doesn't crash in …

However, it's perfectly capable of exposing old bugs. Indeed, it's already done so: at least #22516, and (at an intermediate stage) what is probably the same bug as #22592. I've not seen that one before, though.
Seems indeed to be a missing GC root again, ref. #22770 (comment). |
Sometime later then, if you have time?
In #22202 I noted a disadvantage of our current inlining heuristics: that we penalize verbosity even when two bits of code do the same thing. I decided to explore the feasibility of replacing our cost model with something based a little more closely on CPU cycles, with the idea that we should inline things that are obviously cheaper than a function call. I decided to be a bit generous and set a default threshold of 200 "cycles", and assigned costs to operations loosely inspired by this chart.
This is a horribly rough approximation: for example, there's a very large difference between success and failure when it comes to branch prediction, and it might be reasonable to guess that the branch in …
is much less likely to be hit than …
Such subtleties (not to mention cache prefetching, etc.) are far beyond the simple approach taken here. In general I tried to aim for an optimistic model, but worked in some amount of penalty for the possibility that this wouldn't hold, with the amount depending entirely on how I was feeling at that particular moment 😄.
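The two code snippets from the paragraph above were lost in extraction; as a purely hypothetical stand-in for the asymmetry being described, compare a branch that guards an error path with a branch whose outcome depends on the data:

```julia
# Hypothetical examples (the original snippets are not preserved here);
# illustrative only. A branch guarding an error path is almost never
# taken in correct code, so the predictor gets it right nearly always:
function checked_get(a::Vector{Int}, i::Int)
    if i < 1 || i > length(a)   # branch almost never hit
        throw(BoundsError(a, i))
    end
    return a[i]
end

# By contrast, a data-dependent branch may go either way ~50% of the
# time, making mispredictions (and hence the real cost) much higher:
absval(x::Int) = x < 0 ? -x : x # branch outcome depends on the input
```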
It's also worth pointing out that I've done nothing whatsoever to test how well this works. My entire goal was to get julia to build.
This still has debugging statements in the submitted code because I can't imagine that this isn't going to need at least a teensy bit of revision 😉.
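The cycle-based model described above can be sketched roughly as follows (the names, costs, and structure here are made up for illustration; they are not the actual tables or functions in Base):

```julia
# Illustrative sketch only: a per-operation cost table in the spirit of
# the "cycles" chart mentioned above, with made-up numbers.
const TOY_COST = Dict{Symbol,Int}(
    :add_int   => 1,  :mul_int   => 3,  :sdiv_int  => 40,
    :add_float => 4,  :mul_float => 4,  :div_float => 20,
    :getfield  => 1,  :arrayref  => 4,
)
const TOY_THRESHOLD = 200   # mirrors the PR's default of 200 "cycles"

# Sum the estimated cost of each operation in a function body; unknown
# operations get a pessimistic default (roughly the cost of a call).
toy_cost(ops::Vector{Symbol}) = sum(get(TOY_COST, op, 20) for op in ops)

# Inline when the body is obviously cheaper than a function call.
toy_should_inline(ops::Vector{Symbol}) = toy_cost(ops) <= TOY_THRESHOLD

toy_should_inline([:getfield, :add_int, :mul_int])  # cheap body: inline
```

As with the real model, this estimates a straight-line cost: there is no accounting for loop trip counts, branch probabilities, or cache behavior, only a sum of optimistic per-operation guesses checked against a fixed threshold.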