Performance regression in dynamic dispatch #15541

Closed
thofma opened this issue Mar 17, 2016 · 10 comments · Fixed by #16439
Labels: performance, potential benchmark, regression

Comments

@thofma (Contributor) commented Mar 17, 2016

On latest master we noticed a performance regression with the following simple code. (It is quite similar to the regression reported by @wbhart in https://groups.google.com/forum/#!topic/julia-users/gZVTjUF_9E0, but does not involve GMP.)

import Base: +

typealias ZZ Array{UInt, 1}

function +(a::ZZ, b::Int)
   r = ZZ(length(a))
   r[3] = r[3] + b
end

function doit(n::Int)
   a = ZZ(5)
   a[1] = UInt(2)
   a[2] = UInt(2)
   a[3] = UInt(2)
   a[4] = UInt(2)
   a[5] = UInt(2)

   for s = 1:n
      a += s
   end

   return a
end

@time doit(1000000000)

Here is the timing for 0.4.3:

julia> versioninfo()
Julia Version 0.4.3
Commit a2f713d (2016-01-12 21:37 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> @time doit(1000000000)
 31.858017 seconds (2.00 G allocations: 29.803 GB, 7.72% gc time)

Here is the timing for latest master:

julia> versioninfo()
Julia Version 0.5.0-dev+3182
Commit add0abe* (2016-03-17 05:03 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, nehalem)

julia> @time doit(1000000000)
 40.403079 seconds (2.00 G allocations: 29.803 GB, 5.60% gc time)
@KristofferC (Member) commented Mar 17, 2016

Bisecting points to 6b5a05e. cc @JeffBezanson.

Does that make sense?

Never mind, I think I messed up the bisect script. I will look again.

@tkelman added the potential benchmark label Mar 17, 2016
@KristofferC (Member) commented Mar 17, 2016

OK, I did the bisect again properly, and the cause seems to be one of the commits for #13412.

The LLVM IR for +(a::ZZ, b::Int) before and after the PR: https://gist.github.com/KristofferC/2e1335940ecee1fb6be5

I don't know if it is relevant, but the following calls, for example, are not present before the PR:

%23 = call %jl_value_t* @jl_box_int64(i64 signext %22)
...
%24 = call %jl_value_t* @julia_convert.2(%jl_value_t* inttoptr (i64 139820670238864 to %jl_value_t*), %jl_value_t** %5, i32 2)
...
%25 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %4, i32 3)
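For reference, IR like that in the gist can be regenerated with the standard reflection macro; a minimal sketch (the argument values are arbitrary, only the signature matters):

a = ZZ(5)
@code_llvm a + 1   # print the LLVM IR generated for +(::ZZ, ::Int)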

@StefanKarpinski added the performance and regression labels Mar 17, 2016
@yuyichao (Contributor) commented Apr 15, 2016

The extra calls are from #15044, but they are irrelevant to this issue since they happen outside the loop.

Other than that, the excess GC frame stores and loads make loops with allocation slightly slower (a few percent of the total time at most), while the main issue seems to be that dynamic dispatch is much slower (>30%) after jb/functions.
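One way to confirm where the time goes is the built-in sampling profiler; a minimal sketch (the iteration count is arbitrary):

@profile doit(10^7)       # sample the type-unstable loop from the report
Profile.print(C = true)   # include C frames; jl_apply_generic should dominate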

@yuyichao changed the title from "Performance regression" to "Performance regression in dynamic dispatch" Apr 15, 2016
@yuyichao (Contributor) commented Apr 15, 2016

Script used for benchmarking:

# Allocation without a GC frame: call the GC's one-word allocator directly;
# the result is discarded, so no rooting is needed in the Julia function.
@noinline function bare_gc(n::Int)
    for s = 1:n
        ccall(:jl_gc_alloc_1w, Ptr{Void}, ())
    end
end

# The same per-iteration allocation, but through Ref, which forces the
# compiler to emit a GC frame with its stores and loads.
@noinline function with_gc_frame(n::Int)
    for s = 1:n
        Ref(s)
    end
end

@noinline f(a, b) = b

# Type-unstable loop with no per-iteration allocation: inference only knows
# a::Union{Ref{Int},Ref{Float64}}, so every call to f goes through dynamic
# dispatch (jl_apply_generic).
@noinline function unstable_no_alloc(n::Int)
    a = Ref(1)
    b = Ref(1.0)
    for s = 1:n
        a = f(a, b)
    end
    a
end

function main()
    @time bare_gc(4_600_000_000)
    @time with_gc_frame(4_600_000_000)
    @time unstable_no_alloc(4_600_000_000)
end

main()

Difference between the results on master and release-0.4:

--- timing-0.4  2016-04-15 19:06:23.230300849 -0400
+++ timing-master       2016-04-15 19:06:23.230300849 -0400
@@ -1,3 +1,3 @@
-  9.611569 seconds (4.60 G allocations: 68.545 GB, 8.67% gc time)
+  9.444634 seconds (4.60 G allocations: 68.545 GB, 9.20% gc time)
- 11.631434 seconds (4.60 G allocations: 68.545 GB, 6.97% gc time)
+ 12.529561 seconds (4.60 G allocations: 68.545 GB, 6.45% gc time)
- 35.423600 seconds (53 allocations: 3.453 KB)
+ 58.950526 seconds (33 allocations: 2.078 KB)

Each second is roughly 1 CPU cycle per loop iteration (see the arithmetic sketch below). The time fluctuations are ~0.2 s on the first two tests and 2-3 s on the third. So the GC frame causes a regression of about 1 cycle per iteration, which is no more than 3% of the total time (of the third test; even less in the original code), while the time spent in jl_apply_generic (third test) is more than 60% longer.
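The cycle arithmetic, as a sketch; the ~4.6 GHz effective clock is an assumption chosen to match the statement above:

n = 4_600_000_000   # iterations per test
clock = 4.6e9       # assumed effective CPU frequency in Hz
n / clock           # ≈ 1.0, so 1 s of runtime ≈ 1 extra cycle per iteration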

@yuyichao (Contributor) commented Apr 15, 2016

Also note that the code in this issue is not an accurate representation of the code reported on the julia-users list: the original code (from the mailing list) is type-stable, while the code in this report is not (see the demonstration below).
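To see the instability concretely: the custom + ends with the assignment r[3] = r[3] + b, and an assignment expression evaluates to its right-hand side, so the function returns a UInt rather than a ZZ. A minimal demonstration, assuming the definitions from the report above:

a = ZZ(5)
fill!(a, UInt(2))
typeof(a)       # Array{UInt64,1} (on a 64-bit system)
a = a + 1       # dispatches to the custom +(::ZZ, ::Int)
typeof(a)       # UInt64: a has changed type, so the loop in doit is unstable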

@yuyichao (Contributor) commented Apr 15, 2016

In addition to the type instability, the incorrect definition of the custom + (and the use of + at all...) also causes the code in this report to spend no time in the custom + at all, and therefore to create no new arrays. The regression in the original code is actually #15044. The proof is that the performance difference disappears when ccall(:jl_alloc_array_1d) is used to allocate the array (a sketch of such a call follows).
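For illustration, a sketch of the allocation experiment mentioned above; the exact ccall signature is an assumption modeled on how Base allocates arrays:

# Allocate an Array{UInt,1} of length n directly through the runtime,
# bypassing the generic array construction path affected by #15044.
alloc_zz(n::Int) = ccall(:jl_alloc_array_1d, ZZ, (Any, Csize_t), ZZ, n)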

@yuyichao added the types and dispatch label Apr 15, 2016
@JeffBezanson removed the types and dispatch label Apr 15, 2016
@vtjnash (Member) commented Apr 18, 2016

Some of the difference was recovered by #15779:

Version 0.4.3 (2016-01-12 21:37 UTC):
julia> main()
 23.209517 seconds (4.60 G allocations: 68.545 GB, 13.34% gc time)
 25.301924 seconds (4.60 G allocations: 68.545 GB, 11.80% gc time)
 72.551418 seconds (53 allocations: 3.453 KB)

Commit 61f2d2b* (12 days old master):
julia> main()
 28.933383 seconds (4.60 G allocations: 68.545 GB, 11.25% gc time)
 33.202012 seconds (4.60 G allocations: 68.545 GB, 9.95% gc time)
112.725341 seconds (33 allocations: 2.078 KB)

Commit 2aa5d4d* (0 days old master):
julia> main()
 28.117986 seconds (4.60 G allocations: 68.545 GB, 11.30% gc time)
 32.764293 seconds (4.60 G allocations: 68.545 GB, 9.83% gc time)
 95.710122 seconds (32 allocations: 1.781 KB)

@vtjnash (Member) commented Jun 22, 2016

On master today (with llvm-3.3):

JULIA_THREADS=0 build:
 29.302094 seconds (4.60 G allocations: 68.545 GB, 9.76% gc time)
  0.000000 seconds
 60.803695 seconds (2 allocations: 32 bytes)

default build:
julia> main()
 54.611353 seconds (4.60 G allocations: 68.545 GB, 5.14% gc time)
  0.000000 seconds
 73.095928 seconds (16 allocations: 960 bytes)

@yuyichao (Contributor) commented Jun 22, 2016

Is this the same machine as #15541 (comment)? (So it is faster with threading off and the same with threading on?)

@vtjnash (Member) commented Jun 22, 2016

Yes, it should be the same machine, so these numbers can be directly compared.
