
misc. benchmark regressions since 0.4 #16128

Closed
JeffBezanson opened this issue Apr 29, 2016 · 68 comments
Labels: help wanted (Indicates that a maintainer wants help on an issue or pull request) · performance (Must go faster) · regression (Regression in behavior compared to a previous version)

@JeffBezanson (Member) commented Apr 29, 2016

Just ran make test-perf and observed this. Most results are the same, and we've improved on about 6 tests, but unfortunately there are some significant regressions:

```
test           0.4      4f65737   4f65737 -O3  b029f50
parse_int       0.190    0.315     0.245
cons           80.837   91.86     90.14
gk             52.035   62.92     63.84
sparsemul      32.5     37.14 (threads)
sparsemul2     62.836   65.73     63.23        68.5
simplex        39.209   55.02     39.90        43.83
splitline      49.86    56.87     60.86        67.44
add1           33.63    44.09     37.82        36.6
add1_logical   41.797  105.3      60.12        45.4
binary_trees   25.5     29.64
k_nucleotide   61       75.34 (threads)
```
@JeffBezanson JeffBezanson added performance Must go faster regression Regression in behavior compared to a previous version labels Apr 29, 2016
@JeffBezanson JeffBezanson added this to the 0.5.0 milestone Apr 29, 2016
@JeffBezanson (Member, Author)

There is also a significant regression in the spell benchmark.

0.4:

```
spell              10090.812 10533.558 10285.806  183.614
```

0.5:

```
spell              11888.845 12585.925 12231.564  297.824
```

I tried changing some of the containers to use String instead of AbstractString, but it didn't seem to help much. Using SubString{String} as the key type of the Dict in train gives

```
spell              11395.126 11790.546 11550.910  169.565
```

which is slightly better, but still slower than 0.4 and still 2.5x slower than Python.
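The kind of change described can be sketched as follows. This is a hypothetical, Norvig-style `train`, not the benchmark's actual code; the point is that matches against a `String` yield `SubString{String}` views, so using that key type inserts tokens without copying each one into a fresh `String`:

```julia
# Hypothetical sketch of keying the training Dict on SubString{String}.
# `train` here is modeled on a Norvig-style spell benchmark, not the
# benchmark's actual code. Matches against a String are SubString{String}
# views, so inserting them directly avoids one String allocation per token.
function train(text::AbstractString)
    model = Dict{SubString{String},Int}()
    for m in eachmatch(r"[a-z]+", lowercase(text))
        w = m.match                      # SubString{String}: a view, no copy
        model[w] = get(model, w, 0) + 1
    end
    return model
end

counts = train("The spelling of SPELLING is spelling")
```

Lookups with plain `String` keys still work, since `hash`/`isequal` on strings compare by value across `AbstractString` subtypes.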

@StefanKarpinski (Member)

Some of the planned changes may improve this – in particular, the new Char type.

@tkelman (Contributor) commented May 6, 2016

We should probably try to track down what led to the slowdown, though; then any internal representation changes would be speedups, not just countering regressions elsewhere.

@hayd (Member) commented May 7, 2016

A retrospective daily (?) history of the perf tests might help; that way we could see whether a single commit, or just a few, were to blame. cc @jrevels

@JeffBezanson (Member, Author)

Looks like #16236 improved things a bit, but not enough.

@JeffBezanson (Member, Author)

Spell test after #16236:

```
spell              11070.436 11496.856 11296.590  204.591
```

@StefanKarpinski (Member)

It probably makes sense to run benchmarks on the version right before the String change to figure out how much of this is due to various other changes and how much is due to that change.

@jrevels (Member) commented May 9, 2016

> It probably makes sense to run benchmarks on the version right before the String change to figure out how much of this is due to various other changes and how much is due to that change.

See here. Hopefully I got the commit range correct - I meant to do the most recent commit related to #16058 vs. the commit right before #16058 in the history.

@vtjnash (Member) commented May 17, 2016

I wonder how much of this can be traced back to #15997. I was trying to track down a recent performance regression in #15541 (thinking I had broken typemap), and found that @tkelman's commit 5366586 was responsible for the ~15-20% performance regression in that test.

@JeffBezanson (Member, Author)

Excellent point. I'll retry without threading.

@yuyichao (Contributor)

I expect threading to have a larger performance impact on the C runtime (e.g. #15541) than on generated code (especially runtime-JITted code), since the optimization we use for TLS access doesn't really work for C code.

@vtjnash (Member) commented May 19, 2016

After #16439 is merged, it would be helpful to see an updated table with JULIA_THREADS=0. I think most of these have been addressed or are due to Box (such as the json test), which is being tracked separately.

@JeffBezanson (Member, Author)

Measured again, still with threads enabled. Not many changes, but rand_mat_stat is much worse. k_nucleotide is so much faster that it's suspicious. Needs investigating.

@JeffBezanson (Member, Author)

Tried with JULIA_THREADS = 0 and the differences don't seem significant. Would be good to try on other systems though.

@pabloferz (Contributor)

Does this change much with different LLVM versions?

@ViralBShah (Member) commented Jun 2, 2016

The rand_mat_stat slowdown seems to be caused by slowdowns in hcat and hvcat. Those don't show up when profiling the code on 0.4.

julia> Profile.print()
498 ./REPL.jl:92; macro expansion
 498 ./REPL.jl:62; eval_user_input(::Any, ::Base.REP...
  498 ./boot.jl:225; eval(::Module, ::Any)
   498 ./profile.jl:16; macro expansion;
    3   ./REPL[1]:6; randmatstat(::Int64)
     1 ./random.jl:83; mt_pop!
     1 ./random.jl:1105; randn
     1 ./random.jl:1107; randn
      1 ./random.jl:1119; randn_unlikely(::MersenneTwist...
    3   ./REPL[1]:7; randmatstat(::Int64)
     1 ./random.jl:1105; randn
     2 ./random.jl:1107; randn
      1 ./random.jl:1119; randn_unlikely(::MersenneTwist...
    119 ./REPL[1]:10; randmatstat(::Int64)
     10  ./abstractarray.jl:653; typed_hcat(::Type{T}, ::Array{...
     1   ./abstractarray.jl:658; typed_hcat(::Type{T}, ::Array{...
     101 ./abstractarray.jl:666; typed_hcat(::Type{T}, ::Array{...
      5 ./boot.jl:307; Array{Float64,N}(::Tuple{Int64,...
     1   ./abstractarray.jl:670; typed_hcat(::Type{T}, ::Array{...
     5   ./abstractarray.jl:672; typed_hcat(::Type{T}, ::Array{...
      1 ./array.jl:59; copy!(::Array{Float64,2}, ::Int6...
    283 ./REPL[1]:11; randmatstat(::Int64)
     11  ./abstractarray.jl:910; typed_hvcat(::Type{T}, ::Tuple...
     177 ./abstractarray.jl:914; typed_hvcat(::Type{T}, ::Tuple...
      134 ./array.jl:719; hcat(::Array{Float64,2}, ::Arr...
       129 ./abstractarray.jl:666; typed_hcat(::Type{T}, ::Array...
        4 ./boot.jl:307; Array{Float64,N}(::Tuple{Int64...
       5   ./abstractarray.jl:672; typed_hcat(::Type{T}, ::Array...
        1 ./array.jl:58; copy!(::Array{Float64,2}, ::In...
      25  ./tuple.jl:10; getindex(::Tuple{Array{Float64...
     94  ./abstractarray.jl:917; typed_hvcat(::Type{T}, ::Tuple...
      5  ./abstractarray.jl:690; typed_vcat(::Type{T}, ::Array{...
      2  ./abstractarray.jl:691; typed_vcat(::Type{T}, ::Array{...
       2 ./reduce.jl:64; mapfoldl(::Base.##30#32, ::Func...
      71 ./abstractarray.jl:698; typed_vcat(::Type{T}, ::Array{...
       2 ./boot.jl:307; Array{Float64,N}(::Tuple{Int64...
      7  ./abstractarray.jl:703; typed_vcat(::Type{T}, ::Array{...
       1 ./multidimensional.jl:339; _setindex!(::Base.LinearFast, ...
       4 ./multidimensional.jl:340; _setindex!(::Base.LinearFast, ...
        2 ./multidimensional.jl:380; macro expansion
         1 ./operators.jl:420; setindex_shape_check(::Array{...
         1 ./operators.jl:424; setindex_shape_check(::Array{...
        1 ./multidimensional.jl:383; macro expansion
    28  ./REPL[1]:12; randmatstat(::Int64)
     1  ./linalg/dense.jl:148; trace(::Array{Float64,2})
     20 ./linalg/matmul.jl:148; At_mul_B!
      20 ./linalg/matmul.jl:256; syrk_wrapper!(::Array{Float64,...
       16 ./linalg/blas.jl:1145; syrk!(::Char, ::Char, ::Float...
       1  ./linalg/blas.jl:1152; syrk!(::Char, ::Char, ::Float...
       1  ./linalg/matmul.jl:201; copytri!(::Array{Float64,2}, ...
       2  ./linalg/matmul.jl:203; copytri!(::Array{Float64,2}, ...
     1  ./linalg/matmul.jl:0; At_mul_B(::Array{Float64,2}, ::...
     4  ./linalg/matmul.jl:146; At_mul_B(::Array{Float64,2}, ::...
    27  ./REPL[1]:13; randmatstat(::Int64)
     12 ./linalg/dense.jl:172; ^(::Array{Float64,2}, ::Int64)
      8 ./intfuncs.jl:90; power_by_squaring(::Array{Float...
       8 ./linalg/matmul.jl:330; gemm_wrapper!(::Array{Float64,...
        8 ./linalg/blas.jl:963; gemm!(::Char, ::Char, ::Float6...
      4 ./linalg/matmul.jl:129; *
     6  ./linalg/matmul.jl:148; At_mul_B!
      1 ./linalg/matmul.jl:255; syrk_wrapper!(::Array{Float64,2...
       1 ./abstractarray.jl:42; stride(::Array{Float64,2}, ::Int64)
      5 ./linalg/matmul.jl:256; syrk_wrapper!(::Array{Float64,2...
       2 ./linalg/blas.jl:1145; syrk!(::Char, ::Char, ::Float6...
       3 ./linalg/matmul.jl:203; copytri!(::Array{Float64,2}, :...
     5  ./linalg/matmul.jl:146; At_mul_B(::Array{Float64,2}, ::...
    32  ./linalg/dense.jl:172; ^
     22 ./intfuncs.jl:90; power_by_squaring(::Array{Float...
      1  ./linalg/matmul.jl:304; gemm_wrapper!(::Array{Float64,...
      1  ./linalg/matmul.jl:311; gemm_wrapper!(::Array{Float64,...
      2  ./linalg/matmul.jl:329; gemm_wrapper!(::Array{Float64,...
       2 ./abstractarray.jl:0; stride(::Array{Float64,2}, ::Int64)
      18 ./linalg/matmul.jl:330; gemm_wrapper!(::Array{Float64,...
       18 ./linalg/blas.jl:963; gemm!(::Char, ::Char, ::Float6...
     10 ./linalg/matmul.jl:129; *
    3   ./random.jl:1135; randn
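For reference, the profiled kernel is essentially the following (a modernized sketch of the rand_mat_stat microbenchmark using current `LinearAlgebra`/`Statistics` names; the 0.5-era code used `trace` and needed no imports). Building `P` via `[a b c d]` is the `typed_hcat` call in the profile above, `Q` via `[a b; c d]` is `typed_hvcat`, and the `(X'X)^4` steps account for the `At_mul_B!`/gemm time:

```julia
using LinearAlgebra, Statistics

# Sketch of the rand_mat_stat kernel (modernized: `tr` was `trace` in 0.5).
function randmatstat(t, n=5)
    v = zeros(t)
    w = zeros(t)
    for i in 1:t
        a = randn(n, n); b = randn(n, n)
        c = randn(n, n); d = randn(n, n)
        P = [a b c d]          # hcat  -> typed_hcat in the profile
        Q = [a b; c d]         # hvcat -> typed_hvcat in the profile
        v[i] = tr((P' * P)^4)  # At_mul_B! / gemm time
        w[i] = tr((Q' * Q)^4)
    end
    return std(v) / mean(v), std(w) / mean(w)
end
```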

@pkofod (Contributor) commented Jun 3, 2016

In the following "mine" refers to commit e280a27

One commit before mine

julia,fib,0.042998,0.128659,0.046870,0.002392
julia,parse_int,0.242335,2.028559,0.268087,0.078824
julia,mandel,0.136463,0.198907,0.138009,0.003110
julia,quicksort,0.316253,0.580323,0.334022,0.013232
julia,pi_sum,40.930918,41.945148,41.060571,0.152101
julia,rand_mat_stat,17.376123,20.571832,18.228378,0.572528
julia,rand_mat_mul,44.568483,56.915150,47.338284,2.601669
julia,printfd,20.593159,21.428679,20.693425,0.141257
julia,micro.mem,270.566406,270.566406,270.566406,0.000000

mine

julia,fib,0.042982,0.124890,0.046680,0.002268
julia,parse_int,0.251515,1.852985,0.279358,0.079412
julia,mandel,0.136527,0.205742,0.138098,0.002956
julia,quicksort,0.313997,0.465756,0.332772,0.007824
julia,pi_sum,40.936211,41.703891,41.049687,0.113519
julia,rand_mat_stat,39.680477,46.019009,41.446778,1.222534
julia,rand_mat_mul,44.922529,55.156609,47.373832,2.114620
julia,printfd,20.579925,21.573189,20.672500,0.166164
julia,micro.mem,271.417969,271.417969,271.417969,0.000000

almost latest master (from yesterday I think)

julia,fib,0.044243,0.119606,0.046837,0.003259
julia,parse_int,0.289132,7.768136,0.322119,0.152336
julia,mandel,0.136763,0.250258,0.138992,0.006331
julia,quicksort,0.318064,0.534072,0.334189,0.016374
julia,pi_sum,40.511055,55.959972,46.687331,4.624115
julia,rand_mat_stat,50.049147,76.699907,59.645703,10.819121
julia,rand_mat_mul,71.592445,157.663542,89.907365,20.233438
julia,printfd,23.629916,42.501379,30.292729,5.762164
julia,micro.mem,268.761719,268.761719,268.761719,0.000000

It does seem that there's a doubling (ouch) from my commit. However, since then it seems to have gotten even worse (especially rand_mat_mul). This is just one run of test/perf/micro/perf.jl on each commit.

@tkelman (Contributor) commented Jun 3, 2016

Splatting penalty, probably? Some manually written-out cases for small numbers of inputs may help.
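The suggestion can be sketched generically (a hypothetical `myhcat`, not Base's actual definitions): give the compiler fixed-arity methods for the common small argument counts, keeping the splatted `Vararg` version only as a fallback for rare arities.

```julia
# Hypothetical sketch of "manually written out cases for small numbers of
# inputs": fixed-arity methods let dispatch see concrete argument tuples
# and avoid the splatting penalty on the hot paths; the Vararg method
# remains as a slow-path fallback.
myhcat(a::AbstractMatrix) = copy(a)
myhcat(a::AbstractMatrix, b::AbstractMatrix) = hcat(a, b)
myhcat(a::AbstractMatrix, b::AbstractMatrix, c::AbstractMatrix) = hcat(a, b, c)
myhcat(as::AbstractMatrix...) = hcat(as...)   # rare arities: slow path
```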

@JeffBezanson (Member, Author)

@pkofod Thanks for looking into it.

We need to be able to take advantage of the definition

```julia
function hvcat{T}(rows::Tuple{Vararg{Int}}, as::AbstractMatrix{T}...)
```

which is particularly fast.

@JeffBezanson (Member, Author) commented Jun 3, 2016

Another way to look at it is that typed_hvcat is really slow, since it calls hcat on all the arguments followed by vcat. That could certainly be improved.
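The two shapes can be sketched for the simplest case, a 2×2 grid of equal n×n blocks (a simplified illustration, not the general Base code): the hcat-then-vcat form allocates an intermediate matrix per row, while a direct form writes each block into one preallocated result.

```julia
# Simplified illustration for a 2x2 grid of equally sized n×n blocks
# (not the general Base implementation). The "slow" shape mirrors the
# hcat-per-row-then-vcat strategy described above; the direct shape
# copies each block straight into one preallocated output.
slow_hvcat2x2(a, b, c, d) = vcat(hcat(a, b), hcat(c, d))

function direct_hvcat2x2(a, b, c, d)
    n = size(a, 1)
    out = similar(a, 2n, 2n)
    out[1:n,    1:n]    .= a
    out[1:n,    n+1:2n] .= b
    out[n+1:2n, 1:n]    .= c
    out[n+1:2n, n+1:2n] .= d
    return out
end
```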

@timholy (Member) commented Jun 14, 2016

> addendum: for some reason, putting @inbounds in front of the for-loop elides bounds checking inside next, but I think that's an inliner bug for the effect-free computation of getindex

If you added the addendum after the merger of #16260, note that next now has a @_propagate_inbounds_meta.
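For illustration, the package-facing form of that meta is `Base.@propagate_inbounds`: a small forwarding function marked with it inherits the caller's `@inbounds` context instead of re-checking (a minimal sketch, not the actual Base `next`/`getindex` code).

```julia
# Minimal sketch of inbounds propagation (not the actual Base code):
# `second` is a small forwarding function; with @propagate_inbounds, an
# @inbounds at the call site elides the bounds check inside it.
Base.@propagate_inbounds second(v::AbstractVector) = v[2]

function sum_seconds(vs)
    s = 0.0
    @inbounds for v in vs
        s += second(v)    # bounds check elided via propagation
    end
    return s
end
```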

@JeffBezanson (Member, Author) commented Jun 15, 2016

Progress! We're starting to do pretty well; several of the regressions are fixed, and most of the ones that remain are minor, around 10%. Regressions marked (threads) are only a problem with threads enabled. Most of the numbers in the table are affected by enabling threads, but for these, threading makes the difference between being a regression and not.

@StefanKarpinski (Member)

Shall we move this to 0.5.x then? (And make sure that this time we actually work though the .x issues.)

@timholy (Member) commented Jun 16, 2016

Has anyone done this yet, and does it work? @nanosoldier runbenchmarks(ALL, vs=":release-0.4")`

@timholy (Member) commented Jun 16, 2016

Missed a backtick, try again: @nanosoldier runbenchmarks(ALL, vs=":release-0.4")

@tkelman (Contributor) commented Jun 16, 2016

Will that work on an issue? Try on the latest commit to master maybe?

jrevels referenced this issue Jun 16, 2016
try to simplify information flow around inference
@jrevels (Member) commented Jun 16, 2016

It doesn't work in issues, since there's no explicit commit that can be assumed by context (though master would be a reasonable default). You can comment on commits though, as Tony mentioned. I triggered the job here.

@KristofferC (Member) commented Jun 16, 2016

I did a comparison between the two reports and this is a filtered (not completely exhaustive) summary of what is consistently slower. The number to the right is the slowdown.

String

["string","join"] 1.19
["string","replace"] 2.54

Sparse

["sparse","index",("spvec","integer",10000)] 1.81
["sparse","index",("spmat","array",1000)] 3.00
["sparse","index",("spmat","col","array",100)] 3.15

Sort

["sort","mergesort",("sortperm! reverse","random")] 1.23
and all other mergesort ones are consistently slower

["sort","issorted",("reverse","random")] 2.71
["sort","issorted",("reverse","ascending")] 2.85

Simd

["simd",("two_reductions","Float64",4096)] 1.29
["simd",("local_arrays","Int32",4096)] 27.83
["simd",("local_arrays","Float64",4096)] 18.57
["simd",("conditional_loop!","Float32",4095)] 2.95

Shootout

["shootout","revcomp"] 23.50
["shootout","fasta"] 1.24

Scalar

["scalar","fastmath",("add","Complex{Float32}")] 1.50
["scalar","fastmath",("add","Complex{Float64}")] 2.00

Basically all scalar arithmetic on BigInt and BigFloat is slower, e.g.
["scalar","arithmetic",("div","Complex{BigFloat}","UInt64")]

Problem

["problem","stockcorr"] 1.23
["problem","ziggurat"] 1.22
["problem","simplex"] 1.36
["problem","laplacian","laplace_sparse_matvec"] 3.98

Parallel

["parallel","remotecall",("identity",1024)] 1.20
and the other remotecalls

LinAlg

Basically everything with Tridiagonal and related types; some examples:

["linalg","arithmetic",("-","Tridiagonal","Tridiagonal",256)] 1.45
["linalg","arithmetic",("-","Vector","Vector",256)] 1.44

["linalg","arithmetic",("-","Bidiagonal","Bidiagonal",1024)]

Array

["array","index",("sumrange","BaseBenchmarks.ArrayBenchmarks.ArrayLF{Float32,2}")] 1.38
["array","index",("sumlogical","BaseBenchmarks.ArrayBenchmarks.ArrayLF{Float32,2}")] 1.20
and as well for other number types

["array","index",("sumlinear","BaseBenchmarks.ArrayBenchmarks.ArrayLF{Int32,2}")] 1.82
["array","index",("sumelt_boundscheck","linspace(1.0,2.0,10000000)")] 2.06

["array","index",("sumcartesian","1:100000000")] 3.99
["array","growth",("push_single!",2048)] 1.78
["array","bool","bitarray_true_load!"] 1.17
["array","bool","boolarray_bool_load!"] 1.28

@vtjnash (Member) commented Jun 18, 2016

> If you added the addendum after the merger of #16260, note that next now has a @_propagate_inbounds_meta:

It's good to know that's now there, but I checked my source tree before posting to confirm this was a known inliner bug, as it preceded that PR.

@JeffBezanson Are there concrete actions still required to move this issue to v0.5.x, or is something left that seems unexplained / uncorrected?

@JeffBezanson (Member, Author)

The regressions in splitline and gk are a bit worrying, but hopefully we can work on them in the RC period.

@JeffBezanson JeffBezanson modified the milestones: 0.5.x, 0.5.0 Jun 18, 2016
@timholy (Member) commented Jun 18, 2016

I'm more worried about the 27x regression on the @simd benchmarks. Anyone know why this loop:

```julia
function perf_local_arrays(V)
    # SIMD loop on local arrays declared without type annotations
    T, n = eltype(V), length(V)
    X = rand(T, n)
    Y = rand(T, n)
    Z = rand(T, n)
    @simd for i in eachindex(X)
        @inbounds X[i] = Y[i] * Z[i]
    end
    return X
end
```

doesn't vectorize? (Input: rand(Float64, 4096))
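One way to answer that empirically (an inspection recipe, not a fix; the function is reproduced so the snippet is self-contained) is to dump the LLVM IR for the method and look for vector types such as `<4 x double>`:

```julia
using InteractiveUtils  # provides code_llvm on current Julia

# The loop from the comment above, reproduced for self-containment.
function perf_local_arrays(V)
    T, n = eltype(V), length(V)
    X = rand(T, n)
    Y = rand(T, n)
    Z = rand(T, n)
    @simd for i in eachindex(X)
        @inbounds X[i] = Y[i] * Z[i]
    end
    return X
end

# Crude textual check: vector instructions like <4 x double> appear in
# the IR only when the @simd loop actually vectorized on this build.
ir = sprint(code_llvm, perf_local_arrays, Tuple{Vector{Float64}})
println(occursin(" x double>", ir) ? "vectorized" : "not vectorized")
```

Whether vector instructions appear depends on the LLVM version and `-O` level, which is exactly the point of the follow-up comments below.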

@ViralBShah (Member)

@ArchRobison ^

@yuyichao (Contributor)

It vectorizes on LLVM 3.8

@yuyichao (Contributor)

And on 3.7.1 it vectorizes too with -O3. When it doesn't, it's because of #15369. All the TBAA metadata is correctly attached in codegen and maintained by LLVM (at least with #16897).

vtjnash added a commit that referenced this issue Jul 18, 2016
this makes for faster dynamic dispatch and apply-type construction

ref #16128
@tkelman (Contributor) commented Aug 13, 2016

@timholy (Member) commented Aug 13, 2016

With the exception of the horrifying regressions on "simd", fairly satisfying overall. I'm especially pleased with the improvements in so many array benchmarks. Worried about the linalg regressions, but an optimist can always hope that my barrage of PRs this morning may help.

@KristofferC (Member) commented Aug 13, 2016

It feels like the array arithmetic is just some inlining problem or something like that.

mfasi pushed a commit to mfasi/julia that referenced this issue Sep 5, 2016
this makes for faster dynamic dispatch and apply-type construction

ref JuliaLang#16128
@StefanKarpinski StefanKarpinski added help wanted Indicates that a maintainer wants help on an issue or pull request and removed help wanted Indicates that a maintainer wants help on an issue or pull request labels Oct 27, 2016
@JeffBezanson (Member, Author)

Closing this, since at this point the experiment will need to be repeated for 0.6.
