performance regressions since 0.2 #6112

Closed
JeffBezanson opened this issue Mar 11, 2014 · 35 comments
Labels: performance (Must go faster), regression (Regression in behavior compared to a previous version)
Milestone: 0.3

Comments

@JeffBezanson
Member

Some of these are quite bad.

benchmark            0.2         0.3
printfd           25.737      34.571
sparsemul        105.659     118.917
stockcorr        447.858     653.166
bench_eu_vec     111.484     122.564
actorgraph       647.293     836.588
laplace_vec     1088.022    1298.435
k_nucleotide      72.413      96.685
meteor_contest  3494.511    3757.757
JeffBezanson added this to the 0.3 milestone on Mar 11, 2014
@JeffBezanson
Member Author

Some findings:

I also see slightly slower performance when starting with sys.so. Could be #5125, or possibly the way ccalls are generated.

@JeffBezanson
Member Author

Another finding: in 0.2 llvm was able to optimize out sqrt of a constant, and convert pow(x,2) to x*x. The way ccalls are generated for static compilation blocks these optimizations. Using llvm's intrinsic sqrt and pow might fix this (#5983).
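
As a rough, hypothetical illustration (not from the issue; the function names are made up), one way to check whether these optimizations fire is to inspect the generated code for a constant sqrt and a literal power:

f_const() = sqrt(2.0)           # should constant-fold when the LLVM intrinsic is used
f_pow(x::Float64) = x^2         # should lower to x*x rather than a call to pow

code_llvm(f_const, ())          # look for a folded constant vs. a runtime sqrt call
code_llvm(f_pow, (Float64,))    # look for a plain fmul vs. a call to pow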

JeffBezanson added a commit that referenced this issue Mar 29, 2014
restores the performance of stockcorr
tkelman pushed a commit to tkelman/julia that referenced this issue Mar 30, 2014
restores the performance of stockcorr
@vtjnash
Member

vtjnash commented Apr 26, 2014

most of these look pretty good now, comparing 0.2.1 to master (+ #6599 patch, tested at many inlining thresholds):

these two look much worse (100% slower):
bench_eu_vec
quicksort

these two look slightly worse (10% slower):
bench_eu_devec
stockcorr

the rest seem about the same (within experimental tolerance) or faster

quicksort seems to have taken a 50% hit to speed recently. it looks like the only difference in code_typed is the order of lowering for the while loop conditions, which appears to be confusing llvm into emitting unnecessary copies of the conditional. @JeffBezanson?
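
A minimal, made-up example (not from the issue) for eyeballing that lowering difference: compare code_typed and code_llvm for a simple while loop and check whether the loop condition gets duplicated.

function countdown(n::Int)
    s = 0
    while n > 0                 # the branch structure of this condition is what changed
        s += n
        n -= 1
    end
    return s
end

code_typed(countdown, (Int,))   # lowered form: shows how the while condition is branched
code_llvm(countdown, (Int,))    # LLVM IR: check for repeated comparisons of n against 0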

also, strangely, the performance of small and large on the following tests swapped:
hvcat_small
hvcat_large
hvcat_setind_small
hvcat_setind_large
hcat_small
hcat_large
hcat_setind_small
hcat_setind_large
vcat_small
vcat_large
vcat_setind_small
vcat_setind_large
I'm not sure if this is real, or a change in the tests.

@blakejohnson
Contributor

It still looks like we have significant performance regressions almost across the board in the test suite. For example, add1 is about 30% slower.

@vtjnash
Member

vtjnash commented May 8, 2014

Are you using a 0.3 binary (core2) on a newer processor?

I did not see this in my testing last week

@JeffBezanson
Member Author

Using a core2 binary would be a good explanation of this. I only see the slowdown on codespeed and not in manual testing anywhere else.

@blakejohnson
Contributor

I was just referring to codespeed results.

@vtjnash
Member

vtjnash commented May 8, 2014

In that case, I suspect it is using a core2 binary distribution build

@mlubin
Member

mlubin commented May 24, 2014

The SimplexBenchmarks also show some performance degradation between 0.2 and 0.3, at least on my machine. Both are built from source, not using a binary distribution.

Geometric mean (relative to C++ with bounds checking):
        Julia 0.2  Julia 0.3  C++
mtvec:  1.18    1.26    0.76
smtvec: 1.10    1.19    0.89
rto2:   1.34    1.44    0.85
srto2:  1.31    1.37    0.87
updul:  1.20    1.25    0.70
supdul: 1.38    1.47    0.74

@mlubin
Member

mlubin commented May 24, 2014

Note: to get that output, I modified runBenchmarks.jl to run both 0.2 and 0.3

@mauro3
Contributor

mauro3 commented Jun 9, 2014

In PR #7177 I added performance tests for sparse getindex. I ran them for 0.2.1, for 0.3 in March (when getindex was virtually the same as in 0.2.1), and for a current Julia build:
https://gist.github.com/mauro3/20e0d7136f6cc2147e42

Performance decreased in many tests from 0.2.1 to 0.3, even though the getindex methods did not change. For instance, the performance of this function, essentially a binary search, dropped by roughly 100% (i.e., it took about twice as long; see the sparse_getindex_medium2 test):

# Scalar getindex for SparseMatrixCSC: a binary search for row i0 within column i1.
function getindex{T}(A::SparseMatrixCSC{T}, i0::Integer, i1::Integer)
    if !(1 <= i0 <= A.m && 1 <= i1 <= A.n); error(BoundsError); end
    first = A.colptr[i1]            # range of stored entries for column i1
    last = A.colptr[i1+1]-1
    while first <= last             # binary search over A.rowval[first:last]
        mid = (first + last) >> 1
        t = A.rowval[mid]
        if t == i0
            return A.nzval[mid]
        elseif t > i0
            last = mid - 1
        else
            first = mid + 1
        end
    end
    return zero(T)                  # structural zero: entry is not stored
end
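
As a reference point, here is a minimal timing sketch (my own construction, not the actual test/perf/sparse/perf.jl benchmark; sumindex and the sizes are made up) that exercises the scalar getindex above:

using SparseArrays               # needed on current Julia; in 0.2/0.3, sprand lived in Base

A = sprand(10000, 10000, 1e-3)
rows = rand(1:size(A, 1), 10000)
cols = rand(1:size(A, 2), 10000)

function sumindex(A, rows, cols)
    s = zero(eltype(A))
    for k in 1:length(rows)
        s += A[rows[k], cols[k]] # each lookup goes through the binary search above
    end
    return s
end

sumindex(A, rows[1:1], cols[1:1])   # warm up / compile
@time sumindex(A, rows, cols)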

@JeffBezanson
Member Author

Looked into it; it turns out this particular regression is caused by 14d0a7d, where I changed the branch structure of while loops. That change improved performance in #5469, so we have a bit of a problem. It'd be nice if this could be fixed by some miraculous rearrangement of LLVM optimizer passes.

@ViralBShah
Member

Ouch.

@JeffBezanson
Member Author

I did some fuzzy binary search of the passes, and was able to get significant improvements in many benchmarks by removing all of the CFGSimplificationPasses. Didn't seem to cause much regression either. Some benchmarks improve a bit more if just the first CFGSimplificationPass is kept.

@StefanKarpinski
Member

I'm tempted to open an issue about using genetic programming to optimize optimization passes. I'm not generally a fan of GAs, but this does seem like a particularly well-suited problem.

@johnmyleswhite
Member

What about simulated annealing instead?

@StefanKarpinski
Member

That would also quite possibly be good for this.

@JeffBezanson
Member Author

This seems to be a quasi-bug in LLVM, perhaps a quirk of the legacy JIT's native code generator. Doing a CFG simplification pass at the end seriously screws up code gen, which doesn't seem like it should happen. @Keno it would be interesting to check MCJIT.

@Keno
Member

Keno commented Jun 9, 2014

Not sure what exactly you're looking at, but you can easily try it yourself with a second copy of julia. Just set LLVM_VER to svn.

@mauro3
Contributor

mauro3 commented Jun 9, 2014

I re-ran the performance tests with MCJIT for the 0.3-March version. No improvement, no difference in fact. File v0.3-cc307ea-March-MCJIT added to:
https://gist.github.com/mauro3/20e0d7136f6cc2147e42

@JeffBezanson
Member Author

Thanks. Could you try with my latest change?

@mauro3
Contributor

mauro3 commented Jun 9, 2014

I added the run with the latest source to:
https://gist.github.com/mauro3/20e0d7136f6cc2147e42#file-v0-3-247e7f844
However, to better compare to v0.2.1, I backported the old getindex functions to the latest and ran perf.jl:
https://gist.github.com/mauro3/20e0d7136f6cc2147e42#file-v0-3-247e7f844-backport

The binary search is now slightly faster than ever before. However, there are still quite a few performance regressions in those tests, of up to ~25%. @JeffBezanson: if you are interested, I can look into it more closely tomorrow to narrow it down to some specific functions.

@JeffBezanson
Member Author

We might have to live with a few 25% regressions instead of the 100% regression.

@ViralBShah
Member

We should keep track of all regressions once 0.3 is released so that they can be improvement targets in 0.4, where we will probably move to MCJIT.

@ViralBShah
Member

@mauro3 It would be great to narrow down the cause of performance loss and have a short test just for the record here.

@mauro3
Contributor

mauro3 commented Jun 10, 2014

Julia sure is a moving target: I isolated one of the offenders, only to find out that those performance regressions have been fixed over the last 24 hours. (Here is the test, if anyone is interested: https://gist.github.com/mauro3/4274870c64c38aeeb722)

The other one I found (still there) is due to sortperm, which is presumably the quicksort regression mentioned above. Here is the test case: https://gist.github.com/mauro3/8745144b120763fbf225. I think this is now the only bit causing regressions in test/perf/sparse/perf.jl since 0.2.1.
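
For context, a rough approximation (my own, not the exact gist) of the kind of sortperm workload that shows the slowdown:

v = rand(10^6)
p = sortperm(v)        # warm up / compile
@time sortperm(v)      # compare this timing between 0.2.1 and 0.3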

@ViralBShah
Member

That is a relief. One of the reasons I first wrote the sparse matrix support was to push the compiler, and it continues to do so.

@mlubin
Member

mlubin commented Jun 12, 2014

Any comment on the performance degradation on the simplex benchmarks?

@ViralBShah
Member

@JeffBezanson 's comments suggest that some benchmarks have improved. Are the simplex benchmarks still slower?

@mlubin
Member

mlubin commented Jun 12, 2014

I see a slight improvement, but definitely not back to 0.2 levels. Probably not worth holding up the release for this, though.

@quinnj
Member

quinnj commented Jul 3, 2014

Is there anything specific keeping this open? Maybe we can open a specific simplex regression issue if that's something we want to take another stab at in the future.

@JeffBezanson
Member Author

There is still a small regression in meteor_contest, but everything else is fixed.

@mauro3
Contributor

mauro3 commented Jul 3, 2014

In the test I posted above (https://gist.github.com/mauro3/8745144b120763fbf225), which uses sortperm, I still see an almost 100% regression:

$ julia0.2.1 /tmp/sp.jl
elapsed time: 0.048070613 seconds (41598672 bytes allocated)
$ julia /tmp/sp.jl     
elapsed time: 0.086982457 seconds (21445720 bytes allocated)

@JeffBezanson
Member Author

Looks like we allocate much less memory though, so ... win? :)

JeffBezanson added a commit that referenced this issue Jul 3, 2014
on my system, now even faster than before
@JeffBezanson
Member Author

fixed!
