Performance regression in dynamic dispatch #15541
Never mind, I think I messed up the bisect script. Will look again.
OK, I did the bisect again properly, and the cause seems to be one of the commits from #13412. Comparing the LLVM IR before and after the PR, I don't know if it is relevant, but the following calls, for example, are not there before the PR:
[LLVM IR listing elided]
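(The IR dumps are not reproduced above. For reference, this kind of before/after comparison can be made by dumping the LLVM IR with `@code_llvm` on each build — a generic technique, not necessarily the exact commands used in this bisect:)

```julia
# Minimal allocating loop whose IR shows the GC-frame stores/loads
# under discussion.
@noinline function alloc_loop(n::Int)
    for s = 1:n
        Ref(s)
    end
end

# Print the optimized LLVM IR for this method; diffing this output
# between two Julia builds reveals any extra GC-frame traffic.
@code_llvm alloc_loop(10)
```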
The extra calls are from #15044 but are irrelevant to this issue since they are outside the loop. Other than that, the excess GC frame stores and loads make loops with allocation slightly slower (a few percent of total time at most), and the main issue seems to be that dynamic dispatch is much slower (>30%) after jb/functions.
```julia
# Baseline: call the GC allocator directly via ccall, so the
# generated code needs no GC frame.
@noinline function bare_gc(n::Int)
    for s = 1:n
        ccall(:jl_gc_alloc_1w, Ptr{Void}, ())
    end
end

# Same number of allocations, but the Ref forces a GC frame to be
# set up around the loop.
@noinline function with_gc_frame(n::Int)
    for s = 1:n
        Ref(s)
    end
end

@noinline f(a, b) = b

# Type-unstable loop with no allocation inside it: `a` is inferred as
# Union{Ref{Int},Ref{Float64}}, so every call to `f` goes through
# dynamic dispatch.
@noinline function unstable_no_alloc(n::Int)
    a = Ref(1)
    b = Ref(1.0)
    for s = 1:n
        a = f(a, b)
    end
    a
end

function main()
    @time bare_gc(4_600_000_000)
    @time with_gc_frame(4_600_000_000)
    @time unstable_no_alloc(4_600_000_000)
end

main()
```

Difference between the results on master and release-0.4:

```diff
--- timing-0.4	2016-04-15 19:06:23.230300849 -0400
+++ timing-master	2016-04-15 19:06:23.230300849 -0400
@@ -1,3 +1,3 @@
- 9.611569 seconds (4.60 G allocations: 68.545 GB, 8.67% gc time)
+ 9.444634 seconds (4.60 G allocations: 68.545 GB, 9.20% gc time)
- 11.631434 seconds (4.60 G allocations: 68.545 GB, 6.97% gc time)
+ 12.529561 seconds (4.60 G allocations: 68.545 GB, 6.45% gc time)
- 35.423600 seconds (53 allocations: 3.453 KB)
+ 58.950526 seconds (33 allocations: 2.078 KB)
```

Each second is roughly 1 CPU cycle per loop iteration. The timing fluctuations are ~0.2 s on the first two tests and 2–3 s on the third. So the GC frame causes a 1-cycle regression per loop iteration, which is no more than 3% of the total time (of the third test; even less in the original code), but the time spent in …
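As a sanity check on the "one second ≈ one cycle per iteration" reading: the loops run 4.6 × 10⁹ iterations, so on a hypothetical 4.6 GHz core (an assumed clock, not the reporter's measured frequency) seconds and cycles per iteration coincide:

```julia
# Convert a wall-clock time into CPU cycles per loop iteration.
# The 4.6 GHz clock is an assumption for illustration; substitute
# the actual frequency of the machine being measured.
cycles_per_iter(seconds; hz = 4.6e9, iters = 4_600_000_000) =
    seconds * hz / iters

# Because iters == hz here, one second maps to exactly one cycle:
cycles_per_iter(35.423600)  # ≈ 35.4 cycles/iteration (release-0.4)
cycles_per_iter(58.950526)  # ≈ 59.0 cycles/iteration (master)
```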
Also note that the code in this issue is not an accurate representation of the code reported on the julia-users list. The original code (from the mailing list) is type stable, while the code in this report is not.
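For contrast, a type-stable version of the same loop (a sketch for illustration — not the original mailing-list code, which is not reproduced in this issue) keeps `a` at one concrete type, so the calls need no dynamic dispatch:

```julia
@noinline f_stable(a, b) = b

# Type-stable counterpart of unstable_no_alloc: both Refs hold
# Float64, so `a` is inferred as Ref{Float64} throughout and each
# call to f_stable resolves statically.
@noinline function stable_no_alloc(n::Int)
    a = Ref(1.0)
    b = Ref(2.0)
    for s = 1:n
        a = f_stable(a, b)
    end
    a
end
```

A variant like this would be expected to avoid the >30% dispatch penalty discussed above, since the loop body compiles to a direct call.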
In addition to the type instability, the incorrect definition of the custom …
Some of the difference was recovered by #15779:

[timings elided]
On master today (with llvm-3.3):

[timings elided]
Is this the same machine as #15541 (comment)? (So it's faster with threading off and the same with threading on?)
Yes, it should be the same machine, so these numbers can be directly compared.
On latest master we noticed a performance regression with the following simple code (it is quite similar to the regression reported in https://groups.google.com/forum/#!topic/julia-users/gZVTjUF_9E0 by @wbhart, but does not involve GMP):
Here is the timing for 0.4.3:
Here is the timing for latest master: