Performance regression in dynamic dispatch #15541

Closed
thofma opened this issue Mar 17, 2016 · 10 comments · Fixed by #16439
Labels: performance, potential benchmark, regression

Comments

@thofma (Contributor) commented Mar 17, 2016

On latest master we noticed a performance regression with the following simple code. (It is quite similar to the regression reported by @wbhart in https://groups.google.com/forum/#!topic/julia-users/gZVTjUF_9E0, but does not involve GMP.)

import Base: +

typealias ZZ Array{UInt, 1}

function +(a::ZZ, b::Int)
   r = ZZ(length(a))
   r[3] = r[3] + b
end

function doit(n::Int)
   a = ZZ(5)
   a[1] = UInt(2)
   a[2] = UInt(2)
   a[3] = UInt(2)
   a[4] = UInt(2)
   a[5] = UInt(2)

   for s = 1:n
      a += s
   end

   return a
end

@time doit(1000000000)

Here is the timing for 0.4.3:

julia> versioninfo()
Julia Version 0.4.3
Commit a2f713d (2016-01-12 21:37 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> @time doit(1000000000)
 31.858017 seconds (2.00 G allocations: 29.803 GB, 7.72% gc time)

Here is the timing for latest master:

julia> versioninfo()
Julia Version 0.5.0-dev+3182
Commit add0abe* (2016-03-17 05:03 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, nehalem)

julia> @time doit(1000000000)
 40.403079 seconds (2.00 G allocations: 29.803 GB, 5.60% gc time)
@KristofferC (Member) commented Mar 17, 2016

Bisecting points to 6b5a05e. cc @JeffBezanson.

Does that make sense?

Never mind, I think I messed up the bisect script. I will look again.

@tkelman added the potential benchmark label Mar 17, 2016
@KristofferC (Member) commented Mar 17, 2016

OK, I did the bisect again properly, and the cause seems to be one of the commits for #13412.

The LLVM IR for +(a::ZZ, b::Int) before and after the PR: https://gist.github.com/KristofferC/2e1335940ecee1fb6be5

I don't know if it is relevant, but the following calls, for example, are not present before the PR:

%23 = call %jl_value_t* @jl_box_int64(i64 signext %22)
...
%24 = call %jl_value_t* @julia_convert.2(%jl_value_t* inttoptr (i64 139820670238864 to %jl_value_t*), %jl_value_t** %5, i32 2)
...
%25 = call %jl_value_t* @jl_apply_generic(%jl_value_t** %4, i32 3)
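For reference, IR like that in the gist can be regenerated with the standard reflection macro; a minimal sketch (the argument values are arbitrary, only the signature matters):

a = ZZ(5)
@code_llvm a + 1   # print the LLVM IR generated for +(::ZZ, ::Int)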

@StefanKarpinski added the performance and regression labels Mar 17, 2016
@yuyichao (Contributor) commented Apr 15, 2016

The extra calls are from #15044, but they are irrelevant to this issue since they happen outside the loop.

Other than that, the excess GC frame stores and loads make loops with allocation slightly slower (a few percent of the total time at most), while the main issue seems to be that dynamic dispatch is much slower (>30%) after jb/functions.
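One way to confirm where the time goes is the built-in sampling profiler; a minimal sketch (the iteration count is arbitrary):

@profile doit(10^7)       # sample the type-unstable loop from the report
Profile.print(C = true)   # include C frames; jl_apply_generic should dominate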

@yuyichao changed the title from "Performance regression" to "Performance regression in dynamic dispatch" Apr 15, 2016
@yuyichao (Contributor) commented Apr 15, 2016

Script used for benchmarking:

# Allocation without a GC frame: call the GC's one-word allocator directly;
# the result is discarded, so no rooting is needed in the Julia function.
@noinline function bare_gc(n::Int)
    for s = 1:n
        ccall(:jl_gc_alloc_1w, Ptr{Void}, ())
    end
end

# The same per-iteration allocation, but through Ref, which forces the
# compiler to emit a GC frame with its stores and loads.
@noinline function with_gc_frame(n::Int)
    for s = 1:n
        Ref(s)
    end
end

@noinline f(a, b) = b

# Type-unstable loop with no per-iteration allocation: inference only knows
# a::Union{Ref{Int},Ref{Float64}}, so every call to f goes through dynamic
# dispatch (jl_apply_generic).
@noinline function unstable_no_alloc(n::Int)
    a = Ref(1)
    b = Ref(1.0)
    for s = 1:n
        a = f(a, b)
    end
    a
end

function main()
    @time bare_gc(4_600_000_000)
    @time with_gc_frame(4_600_000_000)
    @time unstable_no_alloc(4_600_000_000)
end

main()

Difference between the results on master and release-0.4:

--- timing-0.4  2016-04-15 19:06:23.230300849 -0400
+++ timing-master       2016-04-15 19:06:23.230300849 -0400
@@ -1,3 +1,3 @@
-  9.611569 seconds (4.60 G allocations: 68.545 GB, 8.67% gc time)
+  9.444634 seconds (4.60 G allocations: 68.545 GB, 9.20% gc time)
- 11.631434 seconds (4.60 G allocations: 68.545 GB, 6.97% gc time)
+ 12.529561 seconds (4.60 G allocations: 68.545 GB, 6.45% gc time)
- 35.423600 seconds (53 allocations: 3.453 KB)
+ 58.950526 seconds (33 allocations: 2.078 KB)

Each second is roughly 1 CPU cycle per loop iteration (see the arithmetic sketch below). The time fluctuations are ~0.2 s on the first two tests and 2-3 s on the third. So the GC frame causes a regression of about 1 cycle per iteration, which is no more than 3% of the total time (of the third test; even less in the original code), while the time spent in jl_apply_generic (third test) is more than 60% longer.
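The cycle arithmetic, as a sketch; the ~4.6 GHz effective clock is an assumption chosen to match the statement above:

n = 4_600_000_000   # iterations per test
clock = 4.6e9       # assumed effective CPU frequency in Hz
n / clock           # ≈ 1.0, so 1 s of runtime ≈ 1 extra cycle per iteration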

@yuyichao (Contributor) commented Apr 15, 2016

Also note that the code in this issue is not an accurate representation of the code reported on the julia-users list: the original code (from the mailing list) is type-stable, while the code in this report is not (see the demonstration below).
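To see the instability concretely: the custom + ends with the assignment r[3] = r[3] + b, and an assignment expression evaluates to its right-hand side, so the function returns a UInt rather than a ZZ. A minimal demonstration, assuming the definitions from the report above:

a = ZZ(5)
fill!(a, UInt(2))
typeof(a)       # Array{UInt64,1} (on a 64-bit system)
a = a + 1       # dispatches to the custom +(::ZZ, ::Int)
typeof(a)       # UInt64: a has changed type, so the loop in doit is unstable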

@yuyichao (Contributor) commented Apr 15, 2016

In addition to the type instability, the incorrect definition of the custom + (and the use of + at all...) also causes the code in this report to spend no time in the custom + at all, and therefore to create no new arrays. The regression in the original code is actually #15044. The proof is that the performance difference disappears when ccall(:jl_alloc_array_1d) is used to allocate the array (a sketch of such a call follows).
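For illustration, a sketch of the allocation experiment mentioned above; the exact ccall signature is an assumption modeled on how Base allocates arrays:

# Allocate an Array{UInt,1} of length n directly through the runtime,
# bypassing the generic array construction path affected by #15044.
alloc_zz(n::Int) = ccall(:jl_alloc_array_1d, ZZ, (Any, Csize_t), ZZ, n)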

@yuyichao added the types and dispatch label Apr 15, 2016
@JeffBezanson removed the types and dispatch label Apr 15, 2016
@vtjnash (Member) commented Apr 18, 2016

Some of the difference was recovered by #15779:

Version 0.4.3 (2016-01-12 21:37 UTC):
julia> main()
 23.209517 seconds (4.60 G allocations: 68.545 GB, 13.34% gc time)
 25.301924 seconds (4.60 G allocations: 68.545 GB, 11.80% gc time)
 72.551418 seconds (53 allocations: 3.453 KB)

Commit 61f2d2b* (12 days old master):
julia> main()
 28.933383 seconds (4.60 G allocations: 68.545 GB, 11.25% gc time)
 33.202012 seconds (4.60 G allocations: 68.545 GB, 9.95% gc time)
112.725341 seconds (33 allocations: 2.078 KB)

Commit 2aa5d4d* (0 days old master):
julia> main()
 28.117986 seconds (4.60 G allocations: 68.545 GB, 11.30% gc time)
 32.764293 seconds (4.60 G allocations: 68.545 GB, 9.83% gc time)
 95.710122 seconds (32 allocations: 1.781 KB)

@vtjnash (Member) commented Jun 22, 2016

On master today (with llvm-3.3):

JULIA_THREADS=0 build:
 29.302094 seconds (4.60 G allocations: 68.545 GB, 9.76% gc time)
  0.000000 seconds
 60.803695 seconds (2 allocations: 32 bytes)

default build:
julia> main()
 54.611353 seconds (4.60 G allocations: 68.545 GB, 5.14% gc time)
  0.000000 seconds
 73.095928 seconds (16 allocations: 960 bytes)

@yuyichao (Contributor) commented Jun 22, 2016

Is this the same machine as #15541 (comment)? (So it is faster with threading off and the same with threading on?)

@vtjnash (Member) commented Jun 22, 2016

Yes, it should be the same machine, so these numbers can be directly compared.
