performance regressions since 0.2 #6112

Closed
JeffBezanson opened this issue Mar 11, 2014 · 35 comments
Labels: performance (Must go faster), regression (Regression in behavior compared to a previous version)
Milestone: 0.3

Comments

@JeffBezanson
Member

Some of these are quite bad.

benchmark            0.2         0.3
printfd           25.737      34.571
sparsemul        105.659     118.917
stockcorr        447.858     653.166
bench_eu_vec     111.484     122.564
actorgraph       647.293     836.588
laplace_vec     1088.022    1298.435
k_nucleotide      72.413      96.685
meteor_contest  3494.511    3757.757
JeffBezanson added this to the 0.3 milestone on Mar 11, 2014
@JeffBezanson
Member Author

Some findings:

I also see slightly slower performance when starting with sys.so. Could be #5125, or possibly the way ccalls are generated.

@JeffBezanson
Member Author

Another finding: in 0.2 llvm was able to optimize out sqrt of a constant, and convert pow(x,2) to x*x. The way ccalls are generated for static compilation blocks these optimizations. Using llvm's intrinsic sqrt and pow might fix this (#5983).
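
As a rough, hypothetical illustration (not from the issue; the function names are made up), one way to check whether these optimizations fire is to inspect the generated code for a constant sqrt and a literal power:

f_const() = sqrt(2.0)           # should constant-fold when the LLVM intrinsic is used
f_pow(x::Float64) = x^2         # should lower to x*x rather than a call to pow

code_llvm(f_const, ())          # look for a folded constant vs. a runtime sqrt call
code_llvm(f_pow, (Float64,))    # look for a plain fmul vs. a call to pow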

JeffBezanson added a commit that referenced this issue Mar 29, 2014
restores the performance of stockcorr
tkelman pushed a commit to tkelman/julia that referenced this issue Mar 30, 2014
restores the performance of stockcorr
@vtjnash
Member

vtjnash commented Apr 26, 2014

most of these look pretty good now, comparing 0.2.1 to master (+ #6599 patch, tested at many inlining thresholds):

these two look much worse (100% slower):
bench_eu_vec
quicksort

these two look slightly worse (10% slower):
bench_eu_devec
stockcorr

the rest seem about the same (within experimental tolerance) or faster

quicksort seems to have taken a 50% hit to speed recently. it looks like the only difference in code_typed is the order of lowering for the while loop conditions, which appears to be confusing llvm into emitting unnecessary copies of the conditional. @JeffBezanson?
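
A minimal, made-up example (not from the issue) for eyeballing that lowering difference: compare code_typed and code_llvm for a simple while loop and check whether the loop condition gets duplicated.

function countdown(n::Int)
    s = 0
    while n > 0                 # the branch structure of this condition is what changed
        s += n
        n -= 1
    end
    return s
end

code_typed(countdown, (Int,))   # lowered form: shows how the while condition is branched
code_llvm(countdown, (Int,))    # LLVM IR: check for repeated comparisons of n against 0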

also, strangely, the performance of small and large on the following tests swapped:
hvcat_small
hvcat_large
hvcat_setind_small
hvcat_setind_large
hcat_small
hcat_large
hcat_setind_small
hcat_setind_large
vcat_small
vcat_large
vcat_setind_small
vcat_setind_large
I'm not sure if this is real, or a change in the tests.

@blakejohnson
Contributor

It still looks like we have significant performance regressions almost across the board in the test suite. For example, add1 is about 30% slower.

@vtjnash
Member

vtjnash commented May 8, 2014

Are you using a 0.3 binary (core2) on a newer processor?

I did not see this in my testing last week

@JeffBezanson
Member Author

Using a core2 binary would be a good explanation of this. I only see the slowdown on codespeed and not in manual testing anywhere else.

@blakejohnson
Contributor

I was just referring to codespeed results.

@vtjnash
Member

vtjnash commented May 8, 2014

In that case, I suspect it is using a core2 binary distribution build

@mlubin
Member

mlubin commented May 24, 2014

The SimplexBenchmarks also show some performance degradation between 0.2 and 0.3, at least on my machine. Both are built from source, not using a binary distribution.

Geometric mean (relative to C++ with bounds checking):
        Julia 0.2  Julia 0.3  C++
mtvec:  1.18    1.26    0.76
smtvec: 1.10    1.19    0.89
rto2:   1.34    1.44    0.85
srto2:  1.31    1.37    0.87
updul:  1.20    1.25    0.70
supdul: 1.38    1.47    0.74

@mlubin
Member

mlubin commented May 24, 2014

Note: to get that output, I modified runBenchmarks.jl to run both 0.2 and 0.3

@mauro3
Contributor

mauro3 commented Jun 9, 2014

In PR #7177 I added performance tests for sparse getindex. I ran them for 0.2.1, for 0.3 in March (when getindex was virtually the same as in 0.2.1), and for a current Julia build:
https://gist.github.com/mauro3/20e0d7136f6cc2147e42

Performance decreased in many tests from 0.2.1 to 0.3, even though the getindex methods did not change. For instance, the performance of this function, essentially a binary search, dropped by roughly 100% (i.e., it took about twice as long; see the sparse_getindex_medium2 test):

# Scalar getindex for SparseMatrixCSC: a binary search for row i0 within column i1.
function getindex{T}(A::SparseMatrixCSC{T}, i0::Integer, i1::Integer)
    if !(1 <= i0 <= A.m && 1 <= i1 <= A.n); error(BoundsError); end
    first = A.colptr[i1]            # range of stored entries for column i1
    last = A.colptr[i1+1]-1
    while first <= last             # binary search over A.rowval[first:last]
        mid = (first + last) >> 1
        t = A.rowval[mid]
        if t == i0
            return A.nzval[mid]
        elseif t > i0
            last = mid - 1
        else
            first = mid + 1
        end
    end
    return zero(T)                  # structural zero: entry is not stored
end
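
As a reference point, here is a minimal timing sketch (my own construction, not the actual test/perf/sparse/perf.jl benchmark; sumindex and the sizes are made up) that exercises the scalar getindex above:

using SparseArrays               # needed on current Julia; in 0.2/0.3, sprand lived in Base

A = sprand(10000, 10000, 1e-3)
rows = rand(1:size(A, 1), 10000)
cols = rand(1:size(A, 2), 10000)

function sumindex(A, rows, cols)
    s = zero(eltype(A))
    for k in 1:length(rows)
        s += A[rows[k], cols[k]] # each lookup goes through the binary search above
    end
    return s
end

sumindex(A, rows[1:1], cols[1:1])   # warm up / compile
@time sumindex(A, rows, cols)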

@JeffBezanson
Member Author

Looked into it; it turns out this particular regression is caused by 14d0a7d, where I changed the branch structure of while loops. That change improved performance in #5469, so we have a bit of a problem. It'd be nice if this could be fixed by some miraculous rearrangement of LLVM optimizer passes.

@ViralBShah
Member

Ouch.

@JeffBezanson
Member Author

I did some fuzzy binary search of the passes, and was able to get significant improvements in many benchmarks by removing all of the CFGSimplificationPasses. Didn't seem to cause much regression either. Some benchmarks improve a bit more if just the first CFGSimplificationPass is kept.

@StefanKarpinski
Member

I'm tempted to open an issue about using genetic programming to optimize optimization passes. I'm not generally a fan of GAs, but this does seem like a particularly well-suited problem.

@johnmyleswhite
Member

What about simulated annealing instead?

@StefanKarpinski
Member

That would also quite possibly be good for this.

@JeffBezanson
Member Author

This seems to be a quasi-bug in LLVM, perhaps a quirk of the legacy JIT's native code generator. Doing a CFG simplification pass at the end seriously screws up code gen, which doesn't seem like it should happen. @Keno it would be interesting to check MCJIT.

@Keno
Member

Keno commented Jun 9, 2014

Not sure what exactly you're looking at, but you can easily try it yourself with a second copy of julia. Just set LLVM_VER to svn.

@mauro3
Contributor

mauro3 commented Jun 9, 2014

I re-ran the performance tests with MCJIT for the 0.3-March version. No improvement, no difference in fact. File v0.3-cc307ea-March-MCJIT added to:
https://gist.github.com/mauro3/20e0d7136f6cc2147e42

@JeffBezanson
Member Author

Thanks. Could you try with my latest change?

@mauro3
Contributor

mauro3 commented Jun 9, 2014

I added the run with the latest source to:
https://gist.github.com/mauro3/20e0d7136f6cc2147e42#file-v0-3-247e7f844
However, to better compare to v0.2.1, I backported the old getindex functions to the latest and ran perf.jl:
https://gist.github.com/mauro3/20e0d7136f6cc2147e42#file-v0-3-247e7f844-backport

The binary search is now slightly faster than ever before. However, there are still quite a few performance regressions in those tests, of up to ~25%. @JeffBezanson: if you are interested, I can look into it more closely tomorrow to narrow it down to some specific functions.

@JeffBezanson
Member Author

We might have to live with a few 25% regressions instead of the 100% regression.

@ViralBShah
Member

We should keep track of all regressions once 0.3 is released so that they can be improvement targets in 0.4, where we will probably move to MCJIT.

@ViralBShah
Member

@mauro3 It would be great to narrow down the cause of performance loss and have a short test just for the record here.

@mauro3
Contributor

mauro3 commented Jun 10, 2014

Julia sure is a moving target: I isolated one of the offenders, only to find out that those performance regressions have been fixed over the last 24 hours. (Here is the test, if anyone is interested: https://gist.github.com/mauro3/4274870c64c38aeeb722)

The other one I found (still there) is due to sortperm, which is presumably the quicksort regression mentioned above. Here is the test case: https://gist.github.com/mauro3/8745144b120763fbf225. I think this is now the only bit causing regressions in test/perf/sparse/perf.jl since 0.2.1.
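
For context, a rough approximation (my own, not the exact gist) of the kind of sortperm workload that shows the slowdown:

v = rand(10^6)
p = sortperm(v)        # warm up / compile
@time sortperm(v)      # compare this timing between 0.2.1 and 0.3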

@ViralBShah
Member

That is a relief. One of the reasons I first wrote the sparse matrix support was to push the compiler, and it continues to do so.

@mlubin
Member

mlubin commented Jun 12, 2014

Any comment on the performance degradation on the simplex benchmarks?

@ViralBShah
Member

@JeffBezanson 's comments suggest that some benchmarks have improved. Are the simplex benchmarks still slower?

@mlubin
Member

mlubin commented Jun 12, 2014

I see a slight improvement, but definitely not back to 0.2 levels. Probably not worth holding up the release for this, though.

@quinnj
Member

quinnj commented Jul 3, 2014

Is there anything specific keeping this open? Maybe we can open a specific simplex regression issue if that's something we want to take another stab at in the future.

@JeffBezanson
Member Author

There is still a small regression in meteor_contest, but everything else is fixed.

@mauro3
Contributor

mauro3 commented Jul 3, 2014

In the test I posted above (https://gist.github.com/mauro3/8745144b120763fbf225), which uses sortperm, I still see an almost 100% regression:

$ julia0.2.1 /tmp/sp.jl
elapsed time: 0.048070613 seconds (41598672 bytes allocated)
$ julia /tmp/sp.jl     
elapsed time: 0.086982457 seconds (21445720 bytes allocated)

@JeffBezanson
Member Author

Looks like we allocate much less memory though, so ... win? :)

JeffBezanson added a commit that referenced this issue Jul 3, 2014
on my system, now even faster than before
@JeffBezanson
Member Author

fixed!
