
misc. benchmark regressions since 0.4 #16128

Closed
JeffBezanson opened this issue Apr 29, 2016 · 68 comments
Labels: help wanted (Indicates that a maintainer wants help on an issue or pull request) · performance (Must go faster) · regression (Regression in behavior compared to a previous version)

@JeffBezanson (Member) commented Apr 29, 2016

Just ran make test-perf and observed this. Most results are the same, and we've improved on about 6 tests, but unfortunately there are some significant regressions:

```
test           0.4      4f65737   4f65737 -O3  b029f50
parse_int       0.190    0.315     0.245
cons           80.837   91.86     90.14
gk             52.035   62.92     63.84
sparsemul      32.5     37.14 (threads)
sparsemul2     62.836   65.73     63.23        68.5
simplex        39.209   55.02     39.90        43.83
splitline      49.86    56.87     60.86        67.44
add1           33.63    44.09     37.82        36.6
add1_logical   41.797  105.3      60.12        45.4
binary_trees   25.5     29.64
k_nucleotide   61       75.34 (threads)
```
@JeffBezanson JeffBezanson added performance Must go faster regression Regression in behavior compared to a previous version labels Apr 29, 2016
@JeffBezanson JeffBezanson added this to the 0.5.0 milestone Apr 29, 2016
@JeffBezanson (Member, Author)

There is also a significant regression in the spell benchmark.

0.4:

```
spell              10090.812 10533.558 10285.806  183.614
```

0.5:

```
spell              11888.845 12585.925 12231.564  297.824
```

I tried changing some of the containers to use String instead of AbstractString, but it didn't seem to help much. Using SubString{String} as the key type of the Dict in train gives

```
spell              11395.126 11790.546 11550.910  169.565
```

which is slightly better, but still slower than 0.4 and still 2.5x slower than Python.
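The kind of change described can be sketched as follows. This is a hypothetical, Norvig-style `train`, not the benchmark's actual code; the point is that matches against a `String` yield `SubString{String}` views, so using that key type inserts tokens without copying each one into a fresh `String`:

```julia
# Hypothetical sketch of keying the training Dict on SubString{String}.
# `train` here is modeled on a Norvig-style spell benchmark, not the
# benchmark's actual code. Matches against a String are SubString{String}
# views, so inserting them directly avoids one String allocation per token.
function train(text::AbstractString)
    model = Dict{SubString{String},Int}()
    for m in eachmatch(r"[a-z]+", lowercase(text))
        w = m.match                      # SubString{String}: a view, no copy
        model[w] = get(model, w, 0) + 1
    end
    return model
end

counts = train("The spelling of SPELLING is spelling")
```

Lookups with plain `String` keys still work, since `hash`/`isequal` on strings compare by value across `AbstractString` subtypes.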

@StefanKarpinski (Member)

Some of the planned changes may improve this – in particular, the new Char type.

@tkelman (Contributor) commented May 6, 2016

We should probably try to track down what led to the slowdown, though; then any internal representation changes would be speedups, not just countering regressions elsewhere.

@hayd (Member) commented May 7, 2016

A retrospective daily (?) history of the perf tests might help; that way we could see whether a single commit, or just a few, were to blame. cc @jrevels

@JeffBezanson (Member, Author)

Looks like #16236 improved things a bit, but not enough.

@JeffBezanson (Member, Author)

Spell test after #16236:

```
spell              11070.436 11496.856 11296.590  204.591
```

@StefanKarpinski (Member)

It probably makes sense to run benchmarks on the version right before the String change to figure out how much of this is due to various other changes and how much is due to that change.

@jrevels (Member) commented May 9, 2016

> It probably makes sense to run benchmarks on the version right before the String change to figure out how much of this is due to various other changes and how much is due to that change.

See here. Hopefully I got the commit range correct - I meant to do the most recent commit related to #16058 vs. the commit right before #16058 in the history.

@vtjnash (Member) commented May 17, 2016

I wonder how much of this can be traced back to #15997. I was trying to track down a recent performance regression in #15541 (thinking I had broken typemap), and found that @tkelman's commit 5366586 was responsible for the ~15-20% performance regression in that test.

@JeffBezanson (Member, Author)

Excellent point. I'll retry without threading.

@yuyichao (Contributor)

I expect threading to have a larger performance impact on the C runtime (e.g. #15541) than on generated code (especially runtime-JITted code), since the optimization we use for TLS access doesn't really work for C code.

@vtjnash (Member) commented May 19, 2016

After #16439 is merged, it would be helpful to see an updated table with JULIA_THREADS=0. I think most of these have been addressed or are due to Box (such as the json test), which is being tracked separately.

@JeffBezanson (Member, Author)

Measured again, still with threads enabled. Not many changes, but rand_mat_stat is much worse. k_nucleotide is so much faster that it's suspicious. Needs investigating.

@JeffBezanson (Member, Author)

Tried with JULIA_THREADS = 0 and the differences don't seem significant. Would be good to try on other systems though.

@pabloferz (Contributor)

Does this change much with different LLVM versions?

@ViralBShah (Member) commented Jun 2, 2016

The rand_mat_stat slowdown seems to be caused by slowdowns in hcat and hvcat. Those don't show up when profiling the code on 0.4.

julia> Profile.print()
498 ./REPL.jl:92; macro expansion
 498 ./REPL.jl:62; eval_user_input(::Any, ::Base.REP...
  498 ./boot.jl:225; eval(::Module, ::Any)
   498 ./profile.jl:16; macro expansion;
    3   ./REPL[1]:6; randmatstat(::Int64)
     1 ./random.jl:83; mt_pop!
     1 ./random.jl:1105; randn
     1 ./random.jl:1107; randn
      1 ./random.jl:1119; randn_unlikely(::MersenneTwist...
    3   ./REPL[1]:7; randmatstat(::Int64)
     1 ./random.jl:1105; randn
     2 ./random.jl:1107; randn
      1 ./random.jl:1119; randn_unlikely(::MersenneTwist...
    119 ./REPL[1]:10; randmatstat(::Int64)
     10  ./abstractarray.jl:653; typed_hcat(::Type{T}, ::Array{...
     1   ./abstractarray.jl:658; typed_hcat(::Type{T}, ::Array{...
     101 ./abstractarray.jl:666; typed_hcat(::Type{T}, ::Array{...
      5 ./boot.jl:307; Array{Float64,N}(::Tuple{Int64,...
     1   ./abstractarray.jl:670; typed_hcat(::Type{T}, ::Array{...
     5   ./abstractarray.jl:672; typed_hcat(::Type{T}, ::Array{...
      1 ./array.jl:59; copy!(::Array{Float64,2}, ::Int6...
    283 ./REPL[1]:11; randmatstat(::Int64)
     11  ./abstractarray.jl:910; typed_hvcat(::Type{T}, ::Tuple...
     177 ./abstractarray.jl:914; typed_hvcat(::Type{T}, ::Tuple...
      134 ./array.jl:719; hcat(::Array{Float64,2}, ::Arr...
       129 ./abstractarray.jl:666; typed_hcat(::Type{T}, ::Array...
        4 ./boot.jl:307; Array{Float64,N}(::Tuple{Int64...
       5   ./abstractarray.jl:672; typed_hcat(::Type{T}, ::Array...
        1 ./array.jl:58; copy!(::Array{Float64,2}, ::In...
      25  ./tuple.jl:10; getindex(::Tuple{Array{Float64...
     94  ./abstractarray.jl:917; typed_hvcat(::Type{T}, ::Tuple...
      5  ./abstractarray.jl:690; typed_vcat(::Type{T}, ::Array{...
      2  ./abstractarray.jl:691; typed_vcat(::Type{T}, ::Array{...
       2 ./reduce.jl:64; mapfoldl(::Base.##30#32, ::Func...
      71 ./abstractarray.jl:698; typed_vcat(::Type{T}, ::Array{...
       2 ./boot.jl:307; Array{Float64,N}(::Tuple{Int64...
      7  ./abstractarray.jl:703; typed_vcat(::Type{T}, ::Array{...
       1 ./multidimensional.jl:339; _setindex!(::Base.LinearFast, ...
       4 ./multidimensional.jl:340; _setindex!(::Base.LinearFast, ...
        2 ./multidimensional.jl:380; macro expansion
         1 ./operators.jl:420; setindex_shape_check(::Array{...
         1 ./operators.jl:424; setindex_shape_check(::Array{...
        1 ./multidimensional.jl:383; macro expansion
    28  ./REPL[1]:12; randmatstat(::Int64)
     1  ./linalg/dense.jl:148; trace(::Array{Float64,2})
     20 ./linalg/matmul.jl:148; At_mul_B!
      20 ./linalg/matmul.jl:256; syrk_wrapper!(::Array{Float64,...
       16 ./linalg/blas.jl:1145; syrk!(::Char, ::Char, ::Float...
       1  ./linalg/blas.jl:1152; syrk!(::Char, ::Char, ::Float...
       1  ./linalg/matmul.jl:201; copytri!(::Array{Float64,2}, ...
       2  ./linalg/matmul.jl:203; copytri!(::Array{Float64,2}, ...
     1  ./linalg/matmul.jl:0; At_mul_B(::Array{Float64,2}, ::...
     4  ./linalg/matmul.jl:146; At_mul_B(::Array{Float64,2}, ::...
    27  ./REPL[1]:13; randmatstat(::Int64)
     12 ./linalg/dense.jl:172; ^(::Array{Float64,2}, ::Int64)
      8 ./intfuncs.jl:90; power_by_squaring(::Array{Float...
       8 ./linalg/matmul.jl:330; gemm_wrapper!(::Array{Float64,...
        8 ./linalg/blas.jl:963; gemm!(::Char, ::Char, ::Float6...
      4 ./linalg/matmul.jl:129; *
     6  ./linalg/matmul.jl:148; At_mul_B!
      1 ./linalg/matmul.jl:255; syrk_wrapper!(::Array{Float64,2...
       1 ./abstractarray.jl:42; stride(::Array{Float64,2}, ::Int64)
      5 ./linalg/matmul.jl:256; syrk_wrapper!(::Array{Float64,2...
       2 ./linalg/blas.jl:1145; syrk!(::Char, ::Char, ::Float6...
       3 ./linalg/matmul.jl:203; copytri!(::Array{Float64,2}, :...
     5  ./linalg/matmul.jl:146; At_mul_B(::Array{Float64,2}, ::...
    32  ./linalg/dense.jl:172; ^
     22 ./intfuncs.jl:90; power_by_squaring(::Array{Float...
      1  ./linalg/matmul.jl:304; gemm_wrapper!(::Array{Float64,...
      1  ./linalg/matmul.jl:311; gemm_wrapper!(::Array{Float64,...
      2  ./linalg/matmul.jl:329; gemm_wrapper!(::Array{Float64,...
       2 ./abstractarray.jl:0; stride(::Array{Float64,2}, ::Int64)
      18 ./linalg/matmul.jl:330; gemm_wrapper!(::Array{Float64,...
       18 ./linalg/blas.jl:963; gemm!(::Char, ::Char, ::Float6...
     10 ./linalg/matmul.jl:129; *
    3   ./random.jl:1135; randn
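For reference, the profiled kernel is essentially the following (a modernized sketch of the rand_mat_stat microbenchmark using current `LinearAlgebra`/`Statistics` names; the 0.5-era code used `trace` and needed no imports). Building `P` via `[a b c d]` is the `typed_hcat` call in the profile above, `Q` via `[a b; c d]` is `typed_hvcat`, and the `(X'X)^4` steps account for the `At_mul_B!`/gemm time:

```julia
using LinearAlgebra, Statistics

# Sketch of the rand_mat_stat kernel (modernized: `tr` was `trace` in 0.5).
function randmatstat(t, n=5)
    v = zeros(t)
    w = zeros(t)
    for i in 1:t
        a = randn(n, n); b = randn(n, n)
        c = randn(n, n); d = randn(n, n)
        P = [a b c d]          # hcat  -> typed_hcat in the profile
        Q = [a b; c d]         # hvcat -> typed_hvcat in the profile
        v[i] = tr((P' * P)^4)  # At_mul_B! / gemm time
        w[i] = tr((Q' * Q)^4)
    end
    return std(v) / mean(v), std(w) / mean(w)
end
```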

@pkofod (Contributor) commented Jun 3, 2016

In the following "mine" refers to commit e280a27

One commit before mine

julia,fib,0.042998,0.128659,0.046870,0.002392
julia,parse_int,0.242335,2.028559,0.268087,0.078824
julia,mandel,0.136463,0.198907,0.138009,0.003110
julia,quicksort,0.316253,0.580323,0.334022,0.013232
julia,pi_sum,40.930918,41.945148,41.060571,0.152101
julia,rand_mat_stat,17.376123,20.571832,18.228378,0.572528
julia,rand_mat_mul,44.568483,56.915150,47.338284,2.601669
julia,printfd,20.593159,21.428679,20.693425,0.141257
julia,micro.mem,270.566406,270.566406,270.566406,0.000000

mine

julia,fib,0.042982,0.124890,0.046680,0.002268
julia,parse_int,0.251515,1.852985,0.279358,0.079412
julia,mandel,0.136527,0.205742,0.138098,0.002956
julia,quicksort,0.313997,0.465756,0.332772,0.007824
julia,pi_sum,40.936211,41.703891,41.049687,0.113519
julia,rand_mat_stat,39.680477,46.019009,41.446778,1.222534
julia,rand_mat_mul,44.922529,55.156609,47.373832,2.114620
julia,printfd,20.579925,21.573189,20.672500,0.166164
julia,micro.mem,271.417969,271.417969,271.417969,0.000000

almost latest master (from yesterday I think)

julia,fib,0.044243,0.119606,0.046837,0.003259
julia,parse_int,0.289132,7.768136,0.322119,0.152336
julia,mandel,0.136763,0.250258,0.138992,0.006331
julia,quicksort,0.318064,0.534072,0.334189,0.016374
julia,pi_sum,40.511055,55.959972,46.687331,4.624115
julia,rand_mat_stat,50.049147,76.699907,59.645703,10.819121
julia,rand_mat_mul,71.592445,157.663542,89.907365,20.233438
julia,printfd,23.629916,42.501379,30.292729,5.762164
julia,micro.mem,268.761719,268.761719,268.761719,0.000000

It does seem that there's a doubling (ouch) from my commit. However, since then it seems to have gotten even worse (especially rand_mat_mul). This is just one run of test/perf/micro/perf.jl on each commit.

@tkelman (Contributor) commented Jun 3, 2016

Splatting penalty, probably? Some manually written-out cases for small numbers of inputs may help.
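The suggestion can be sketched generically (a hypothetical `myhcat`, not Base's actual definitions): give the compiler fixed-arity methods for the common small argument counts, keeping the splatted `Vararg` version only as a fallback for rare arities.

```julia
# Hypothetical sketch of "manually written out cases for small numbers of
# inputs": fixed-arity methods let dispatch see concrete argument tuples
# and avoid the splatting penalty on the hot paths; the Vararg method
# remains as a slow-path fallback.
myhcat(a::AbstractMatrix) = copy(a)
myhcat(a::AbstractMatrix, b::AbstractMatrix) = hcat(a, b)
myhcat(a::AbstractMatrix, b::AbstractMatrix, c::AbstractMatrix) = hcat(a, b, c)
myhcat(as::AbstractMatrix...) = hcat(as...)   # rare arities: slow path
```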

@JeffBezanson (Member, Author)

@pkofod Thanks for looking into it.

We need to be able to take advantage of the definition

```julia
function hvcat{T}(rows::Tuple{Vararg{Int}}, as::AbstractMatrix{T}...)
```

which is particularly fast.

@JeffBezanson (Member, Author) commented Jun 3, 2016

Another way to look at it is that typed_hvcat is really slow, since it calls hcat on all the arguments followed by vcat. That could certainly be improved.
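The two shapes can be sketched for the simplest case, a 2×2 grid of equal n×n blocks (a simplified illustration, not the general Base code): the hcat-then-vcat form allocates an intermediate matrix per row, while a direct form writes each block into one preallocated result.

```julia
# Simplified illustration for a 2x2 grid of equally sized n×n blocks
# (not the general Base implementation). The "slow" shape mirrors the
# hcat-per-row-then-vcat strategy described above; the direct shape
# copies each block straight into one preallocated output.
slow_hvcat2x2(a, b, c, d) = vcat(hcat(a, b), hcat(c, d))

function direct_hvcat2x2(a, b, c, d)
    n = size(a, 1)
    out = similar(a, 2n, 2n)
    out[1:n,    1:n]    .= a
    out[1:n,    n+1:2n] .= b
    out[n+1:2n, 1:n]    .= c
    out[n+1:2n, n+1:2n] .= d
    return out
end
```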

@timholy (Member) commented Jun 14, 2016

> addendum: for some reason, putting @inbounds in front of the for-loop elides bounds checking inside next, but I think that's an inliner bug for the effect-free computation of getindex

If you added the addendum after the merger of #16260, note that next now has a @_propagate_inbounds_meta.
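For illustration, the package-facing form of that meta is `Base.@propagate_inbounds`: a small forwarding function marked with it inherits the caller's `@inbounds` context instead of re-checking (a minimal sketch, not the actual Base `next`/`getindex` code).

```julia
# Minimal sketch of inbounds propagation (not the actual Base code):
# `second` is a small forwarding function; with @propagate_inbounds, an
# @inbounds at the call site elides the bounds check inside it.
Base.@propagate_inbounds second(v::AbstractVector) = v[2]

function sum_seconds(vs)
    s = 0.0
    @inbounds for v in vs
        s += second(v)    # bounds check elided via propagation
    end
    return s
end
```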

@JeffBezanson (Member, Author) commented Jun 15, 2016

Progress! We're starting to do pretty well; several of the regressions are fixed, and most of the ones that remain are minor, around 10%. Regressions marked (threads) are only a problem with threads enabled. Most of the numbers in the table are affected by enabling threads, but for these, threading makes the difference between being a regression and not.

@StefanKarpinski (Member)

Shall we move this to 0.5.x then? (And make sure that this time we actually work though the .x issues.)

@timholy (Member) commented Jun 16, 2016

Has anyone done this yet, and does it work? @nanosoldier runbenchmarks(ALL, vs=":release-0.4")`

@timholy (Member) commented Jun 16, 2016

Missed a backtick, try again: @nanosoldier runbenchmarks(ALL, vs=":release-0.4")

@tkelman (Contributor) commented Jun 16, 2016

Will that work on an issue? Try on the latest commit to master maybe?

jrevels referenced this issue Jun 16, 2016
try to simplify information flow around inference
@jrevels (Member) commented Jun 16, 2016

It doesn't work in issues, since there's no explicit commit that can be assumed by context (though master would be a reasonable default). You can comment on commits though, as Tony mentioned. I triggered the job here.

@KristofferC (Member) commented Jun 16, 2016

I did a comparison between the two reports and this is a filtered (not completely exhaustive) summary of what is consistently slower. The number to the right is the slowdown.

String

["string","join"] 1.19
["string","replace"] 2.54

Sparse

["sparse","index",("spvec","integer",10000)] 1.81
["sparse","index",("spmat","array",1000)] 3.00
["sparse","index",("spmat","col","array",100)] 3.15

Sort

["sort","mergesort",("sortperm! reverse","random")] 1.23
and all other mergesort ones are consistently slower

["sort","issorted",("reverse","random")] 2.71
["sort","issorted",("reverse","ascending")] 2.85

Simd

["simd",("two_reductions","Float64",4096)] 1.29
["simd",("local_arrays","Int32",4096)] 27.83
["simd",("local_arrays","Float64",4096)] 18.57
["simd",("conditional_loop!","Float32",4095)] 2.95

Shootout

["shootout","revcomp"] 23.50
["shootout","fasta"] 1.24

Scalar

["scalar","fastmath",("add","Complex{Float32}")] 1.50
["scalar","fastmath",("add","Complex{Float64}")] 2.00

Basically all scalar arithmetic on BigInt and BigFloat is slower, e.g.
["scalar","arithmetic",("div","Complex{BigFloat}","UInt64")]

Problem

["problem","stockcorr"] 1.23
["problem","ziggurat"] 1.22
["problem","simplex"] 1.36
["problem","laplacian","laplace_sparse_matvec"] 3.98

Parallel

["parallel","remotecall",("identity",1024)] 1.20
and the other remotecalls

LinAlg

Basically everything with Tridiagonal and related types; some examples:

["linalg","arithmetic",("-","Tridiagonal","Tridiagonal",256)] 1.45
["linalg","arithmetic",("-","Vector","Vector",256)] 1.44

["linalg","arithmetic",("-","Bidiagonal","Bidiagonal",1024)]

Array

["array","index",("sumrange","BaseBenchmarks.ArrayBenchmarks.ArrayLF{Float32,2}")] 1.38
["array","index",("sumlogical","BaseBenchmarks.ArrayBenchmarks.ArrayLF{Float32,2}")] 1.20
and as well for other number types

["array","index",("sumlinear","BaseBenchmarks.ArrayBenchmarks.ArrayLF{Int32,2}")] 1.82
["array","index",("sumelt_boundscheck","linspace(1.0,2.0,10000000)")] 2.06

["array","index",("sumcartesian","1:100000000")] 3.99
["array","growth",("push_single!",2048)] 1.78
["array","bool","bitarray_true_load!"] 1.17
["array","bool","boolarray_bool_load!"] 1.28

@vtjnash (Member) commented Jun 18, 2016

> If you added the addendum after the merger of #16260, note that next now has a @_propagate_inbounds_meta:

It's good to know that's now there, but I checked my source tree before posting to confirm this was a known inliner bug, as it preceded that PR.

@JeffBezanson Are there concrete actions still required to move this issue to v0.5.x, or is something left that seems unexplained / uncorrected?

@JeffBezanson (Member, Author)

The regressions in splitline and gk are a bit worrying, but hopefully we can work on them in the RC period.

@JeffBezanson JeffBezanson modified the milestones: 0.5.x, 0.5.0 Jun 18, 2016
@timholy (Member) commented Jun 18, 2016

I'm more worried about the 27x regression on the @simd benchmarks. Anyone know why this loop:

```julia
function perf_local_arrays(V)
    # SIMD loop on local arrays declared without type annotations
    T, n = eltype(V), length(V)
    X = rand(T, n)
    Y = rand(T, n)
    Z = rand(T, n)
    @simd for i in eachindex(X)
        @inbounds X[i] = Y[i] * Z[i]
    end
    return X
end
```

doesn't vectorize? (Input: rand(Float64, 4096))
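One way to answer that empirically (an inspection recipe, not a fix; the function is reproduced so the snippet is self-contained) is to dump the LLVM IR for the method and look for vector types such as `<4 x double>`:

```julia
using InteractiveUtils  # provides code_llvm on current Julia

# The loop from the comment above, reproduced for self-containment.
function perf_local_arrays(V)
    T, n = eltype(V), length(V)
    X = rand(T, n)
    Y = rand(T, n)
    Z = rand(T, n)
    @simd for i in eachindex(X)
        @inbounds X[i] = Y[i] * Z[i]
    end
    return X
end

# Crude textual check: vector instructions like <4 x double> appear in
# the IR only when the @simd loop actually vectorized on this build.
ir = sprint(code_llvm, perf_local_arrays, Tuple{Vector{Float64}})
println(occursin(" x double>", ir) ? "vectorized" : "not vectorized")
```

Whether vector instructions appear depends on the LLVM version and `-O` level, which is exactly the point of the follow-up comments below.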

@ViralBShah (Member)

@ArchRobison ^

@yuyichao (Contributor)

It vectorizes on LLVM 3.8

@yuyichao (Contributor)

And on 3.7.1 it vectorizes too with -O3. When it doesn't, it's because of #15369. All the TBAA metadata is correctly attached in codegen and maintained by LLVM (at least with #16897).

vtjnash added a commit that referenced this issue Jul 18, 2016
this makes for faster dynamic dispatch and apply-type construction

ref #16128
@tkelman (Contributor) commented Aug 13, 2016

@timholy (Member) commented Aug 13, 2016

With the exception of the horrifying regressions on "simd", fairly satisfying overall. I'm especially pleased with the improvements in so many array benchmarks. Worried about the linalg regressions, but an optimist can always hope that my barrage of PRs this morning may help.

@KristofferC (Member) commented Aug 13, 2016

It feels like the array arithmetic is just some inlining problem or something like that.

mfasi pushed a commit to mfasi/julia that referenced this issue Sep 5, 2016
this makes for faster dynamic dispatch and apply-type construction

ref JuliaLang#16128
@StefanKarpinski StefanKarpinski added help wanted Indicates that a maintainer wants help on an issue or pull request and removed help wanted Indicates that a maintainer wants help on an issue or pull request labels Oct 27, 2016
@JeffBezanson (Member, Author)

Closing this, since at this point the experiment will need to be repeated for 0.6.
