Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

laplace equation benchmark performance #1168

Closed
ViralBShah opened this issue Aug 17, 2012 · 45 comments
Closed

laplace equation benchmark performance #1168

ViralBShah opened this issue Aug 17, 2012 · 45 comments
Labels
performance Must go faster

Comments

@ViralBShah
Copy link
Member

Track performance of the laplace equation benchmark. Discussion here:

https://groups.google.com/d/topic/julia-dev/VDlr3_6Szos/discussion

@ViralBShah
Copy link
Member Author

Benchmark has been added to the perf2 suite.

https://github.com/JuliaLang/julia/tree/master/test/perf2

@JeffBezanson
Copy link
Member

I don't think the vectorized version will ever be as fast as the other one.
These performance issues are getting out of hand. They need to be analyzed and narrowed down.

@ViralBShah
Copy link
Member Author

Yes, I was just thinking the same. The vectorized version does far too much work, but there is certainly room for improvement.

-viral

On 17-Aug-2012, at 3:57 PM, Jeff Bezanson wrote:

I don't think the vectorized version will ever be as fast as the other one.
These performance issues are getting out of hand. They need to be analyzed and narrowed down.


Reply to this email directly or view it on GitHub.

ViralBShah added a commit that referenced this issue Aug 20, 2012
@ViralBShah
Copy link
Member Author

Original multi-language performance comparison:

    Octave (0):  14.9932 seconds
     GNU R (0):  56.4161 seconds
     Julia (0):  39.1679 seconds
              gcc     intel
  1      :  13.4774     *     NumPy 
  2      :      *     1.6633  Cilk Plus 
  3,   4 :   2.4963   1.9179  Vectorized Fortran (pure) 

@ViralBShah
Copy link
Member Author

Devectorized julia for the same number of iterations (2^15) now runs in 3.8 seconds for me.

Does numpy implement views and ends up evaluating lazily in this case, and hence end up being much faster? Since octave ends up taking the same amount of time, I presume that we just have a lot of room for improvement in julia.

@toivoh
Copy link
Contributor

toivoh commented Aug 20, 2012

Yes, numpy indexing with scalars and slices (ranges) always produces views.

@ViralBShah
Copy link
Member Author

Does octave do the same? I didn't think so - since it is matching numpy's performance. I'll write the benchmark in a few other languages for comparison.

@toivoh
Copy link
Contributor

toivoh commented Aug 20, 2012

I would imagine that it does copy on write, since it strives to emulate matlab. But since I'm not an octave user, I really don't know.

@StefanKarpinski
Copy link
Member

I really think it's arguable that our slices should be views.

@toivoh
Copy link
Contributor

toivoh commented Aug 20, 2012

+1
Except for copy on write (which I'm definitely not suggesting),
I'm not aware of any other sane way to get performance with slices.

@JeffBezanson
Copy link
Member

I could get on board with that. We either have to make all ops on SubArray fast, or build views in to the Array type. This is actually a tough decision. Of course we want it to be possible to write something like SubArray and have it be fast.
An issue that immediately comes up is that linear indexing is no longer fast with views, though it is insanely fast for plain dense arrays. This makes me think we need a smarter index abstraction so there is a fast and generic way to iterate over more kinds of arrays. I'm willing to go to extreme lengths to get this. Syntax, changes to the compiler IR, everything is on the table. Rewriting array.jl is no big deal. We can do that in maybe a day.

@timholy
Copy link
Member

timholy commented Aug 21, 2012

An issue that immediately comes up is that linear indexing is no longer fast with views

In my currently-stalled rewrite of the Image library, I have a substitute brewing:

type Counter{N}
    max::NTuple{N, Int}
end
Counter(sz) = Counter{length(sz)}(to_tuple(sz))

function start{N}(c::Counter{N})
    state = ones(Int,N)
    state[1] = 0 # because of start/done/next sequence, start out one behind
    return state
end
function done(c::Counter, state)
    # we do the increment as part of "done" to make exit-testing more efficient
    state[1] += 1
    i = 1
    max = c.max
    while state[i] > max[i] && i < length(state)
        state[i] = 1
        i += 1
        state[i] += 1
    end
    state[end] > c.max[end]
end
function next(c::Counter, state)
    return state, state
end

This can be pretty fast, particularly if you handle the inner loop like this:

szv = [sz...]  # vector from tuple
szv[1] = 1
for c = Counter(szv)
    cprod = prod(c)
    for i = 1:sz[1]
        p[index] = i*cprod
        index += 1
    end
end

Then for most applications the performance is essentially indistinguishable from linear indexing. (However, there are traps: this is the usage that prompted me to report performance issue 73d8ca4). Assuming it could be generalized to work on general ranges, and to return a linear index, would that help?

I think in my initial version of this I tried returning both the linear index and the coordinate representation, and also ran into performance problems due to tuple creation, but memory of exactly what was going on is now hazy.

@timholy
Copy link
Member

timholy commented Aug 21, 2012

Should have specified: on the inner loop I was demoing how you'd set each element to the product of coordinates, i.e., p[i,j,k] = i*j*k. And you'd need to set index = 1 at the beginning.

@JeffBezanson
Copy link
Member

This seems to rely on making the first dimension a special case for performance, or at least assuming it is large enough for the inner loop to dominate. I think we need something more drastic. We may also need to depend on better compiler optimizations like allocating tuples on the stack or in registers.

@timholy
Copy link
Member

timholy commented Aug 21, 2012

Yes, that's basically true.

OK, might one be able to generate a user-friendly version of gen_array_index_map that would make it trivial to write linear-indexing code wherever it might appear? Or is that something that would have to go into the compiler?

@toivoh
Copy link
Contributor

toivoh commented Aug 21, 2012

This is turning into a full-blown discussion. Perhaps best to move it over to julia-dev to invite some more ideas?

@ViralBShah
Copy link
Member Author

I would rather invite anyone to have a discussion here, so that all the perf issues can be tracked here and are addressed.

@ViralBShah
Copy link
Member Author

On the pure scalar indexing bit, we are still a factor of 1.5x-2x slower than C. So long as we do not close that gap either through better code generation and other optimizations, vectorized julia codes will continue to remain that much slower compared to other languages where the vectorized operations are implemented in C.

@timholy
Copy link
Member

timholy commented Aug 22, 2012

There was a mailing list discussion a while back that found that bounds-checking accounts for some of this. Code with bounds-checks disabled (by commenting out two lines in codegen.cpp) ran in about 75% of the time of the code with bounds-checks, pretty constantly on three different machines (mine, Krys', and Bill's, if memory serves). Not a huge performance difference, but if you're thinking about factors of 2 that will be some chunk of it.

@kk49
Copy link
Contributor

kk49 commented Aug 22, 2012

I think with bounds-checking removed the code generated in loops was about the same as what clang generates for C/C++. You just have to feed the JIT engine enough code to be able to optimize it.

@StefanKarpinski
Copy link
Member

I really think it's arguable that our slices should be views.

I could get on board with that. We either have to make all ops on SubArray fast, or build views in to the Array type. This is actually a tough decision. Of course we want it to be possible to write something like SubArray and have it be fast.

Using views as the default behavior seems appealing for a few reasons:

  1. It's a very fast default behavior that's extremely powerful if you know how to use it.
  2. It doesn't in practice feel very different from Matlab's behavior; sure there are cases where you assign into a view and that would have an effect, but they're relatively rare and can be fixed with an explicit copy.
  3. Instead of two ways of slicing things that we currently have now, we would have only one way of slicing things; the other way would become a slice + copy, which is exactly what it is; essentially, we'd be exposing the more fundamental behavior as core, rather than the opposite.

An issue that immediately comes up is that linear indexing is no longer fast with views, though it is insanely fast for plain dense arrays. This makes me think we need a smarter index abstraction so there is a fast and generic way to iterate over more kinds of arrays. I'm willing to go to extreme lengths to get this. Syntax, changes to the compiler IR, everything is on the table. Rewriting array.jl is no big deal. We can do that in maybe a day.

Now that we have many more examples of types of arrays (sparse, distributed, etc.), we can maybe come up with something that unifies them better and has good performance.

@ViralBShah
Copy link
Member Author

I ran the Laplace benchmark on master:

Julia (vec) - 1874 ms
Matlab (vec) - 220 ms

Julia (devec) - 135 ms
C - 162 ms

It would be nice to figure out if this is a problem of too many temporaries, and to what extent we need to rethink our indexing by using views and such.

@ViralBShah
Copy link
Member Author

I am marking this as 0.2 just to figure out if there is something we can do to improve our performance for vectorized codes by the next release.

@diegozea
Copy link
Contributor

Some other issues related to this:
#1392
#2360
#1802
#1802

@timholy
Copy link
Member

timholy commented Mar 18, 2013

It's definitely a problem of too many temporaries, as a simple sprofile run will show: out of 1104 captures on perf2/laplace.jl; laplace_iter_vec; line: 25, there were 970 at array.jl; +; line: 875. That line allocates the output array:

  F = Array(promote_type(S,T), promote_shape(size(A),size(B)))

(Out of those 970, 963 were in jl_gc_collect, so the other function calls on that line have negligible impact.)

However, I'd guess that this is not the kind of thing we're likely to solve in 2-3 weeks. I'd also tentatively suggest that Julia's inlining behavior may be a higher priority. The reason is simple: often a careful programmer can "write around" the library's current tendency to allocate temporaries. It can be inconvenient, but it's usually possible. Case in point: laplace_iter_devec is probably more readable than laplace_iter_vec. In contrast, the mere fact that map(abs, A) is much slower than abs(A)---and there's nothing that one not steeped in LLVM lore can do to change that---influences language design and usage patterns in deep ways.

But coming from me this is all hot air, because ATM I don't feel up to the challenge of tackling this myself.

@StefanKarpinski
Copy link
Member

I think that's a good point, @timholy. cc: @JeffBezanson, who is, of course, already following this thread.

@JeffBezanson
Copy link
Member

I concur that it will not be solved in 2 weeks just by setting a milestone. But I have plans. I have been ruminating a lot about julia's theory of "what is a function", and I may be starting to get somewhere.

@timholy
Copy link
Member

timholy commented Mar 18, 2013

Sounds exciting! Here, have a nice cup of tea to help your ruminations.

@ViralBShah
Copy link
Member Author

Right, I didn't expect anything to be solved in 2 weeks, but even being able to give a rough plan on tackling this would be an important signal for everyone.

Also, as @JeffBezanson has often said, the best way to avoid GC is to not allocate memory. Perhaps we can revisit our indexing along these lines, as is discussed in this thread.

@JeffBezanson
Copy link
Member

57e010e helps both laplace_vec and laplace_devec significantly (but does not fix the gap between them).

@ViralBShah
Copy link
Member Author

I just reran this benchmark on julia.mit.edu, and find that octave takes 0.8 seconds, whereas julia takes 2.2 seconds.

@ViralBShah
Copy link
Member Author

I added a subarray version here, and it is significantly slower. I guess we have to wait for tuple overhaul. In any case, I was wondering if we can characterize why the vectorized version is so much slower (subarrays or otherwise), compared to the devectorized version. I was kind of hoping that @carnaval 's GC work would help here, and I wonder if a better GC will help (perhaps tune some knobs in the new GC), or if the gains need to come from elsewhere.

julia> @time laplace_vec_sub();
elapsed time: 4.229344435 seconds (1028 MB allocated, 3.58% gc time in 47 pauses with 0 full sweep)

julia> @time laplace_vec();
elapsed time: 1.707854514 seconds (1715 MB allocated, 13.36% gc time in 78 pauses with 0 full sweep)

julia> @time laplace_devec();
elapsed time: 0.162232645 seconds (1 MB allocated)

@JeffBezanson
Copy link
Member

I thought the new gc already helped here significantly. Though of course it does not close the whole gap. We should work on subarrays.

@ViralBShah
Copy link
Member Author

If you see my older comment and the new results, the difference went from 13x to 10x. The new GC helped significantly on bench_eu, but not much on laplace.

@timholy
Copy link
Member

timholy commented Apr 16, 2015

I tried a subarray version a while ago, and it was a little faster. Can you link to your code?

@ViralBShah
Copy link
Member Author

I have checked in the subarray version into laplace.jl. With tuple overhaul now, there are significant gains, but still too much memory allocation.

julia> @time laplace_vec_sub();
elapsed time: 2.958093484 seconds (1027 MB allocated, 3.30% gc time in 47 pauses with 0 full sweep)

@timholy
Copy link
Member

timholy commented Apr 17, 2015

It was an indexing problem, not a subarray problem. Fixed in #10858. Will be improved further if #9080 can be fixed.

Almost all of the remaining memory allocation comes from A+B, c*A, etc.

@timholy
Copy link
Member

timholy commented Apr 17, 2015

I should add: technically, with the new GC the performance problems are presumably more an issue of cache efficiency than memory allocation per se. When you allocate a temporary C = A+B, you traverse A, B, and C; then in D = 2*C, you traverse C again. In contrast, the devectorized version traverses everything once.

@carnaval
Copy link
Contributor

Cache behavior is especially bad if you are allocating temps in a loop because you will get fresh (read: not cached) memory at each iteration up until collection, and start all over again.
With preallocation you get very good temporal coherency of course.
We may have a good solution for this kind of things at the end of the semester if all goes well :-)

@timholy
Copy link
Member

timholy commented Apr 17, 2015

We may have a good solution for this kind of things at the end of the semester if all goes well :-)

Sounds like something to look forward to!

@ViralBShah
Copy link
Member Author

I am really looking forward to it too!

@stevengj
Copy link
Member

stevengj commented Oct 6, 2016

Hopefully the performance of the vectorized version will be fixed by #17623, assuming we change the code to use .+ etcetera.

@StefanKarpinski
Copy link
Member

Syntactic fusion seems like a completely fair solution to this performance issue.

@stevengj
Copy link
Member

stevengj commented Dec 20, 2016

Even with loop fusion, this is still taking a lot more allocations than I would have expected: Nevermind, this was a typo in my code (I accidentally used a global).

julia> function laplace!(u, Niter, dx2, dy2)
           Ny, Nx = size(u)
           c = 1 / (2*(dx2+dy2))
           u0 = @view u[2:Ny-1, 2:Nx-1]
           u1 = @view u[1:Ny-2, 2:Nx-1]
           u2 = @view u[3:Ny, 2:Nx-1]
           u3 = @view u[2:Ny-1,1:Nx-2]
           u4 = @view u[2:Ny-1, 3:Nx]
           for i = 1:Niter
             u0 .= ((u1 .+ u2) .* dy2 .+ (u3 .+ u4) .* dx2) .* c
           end
           return u0
       end
laplace! (generic function with 1 method)

julia> u = zeros(150,150); u[1,:] = 1
1

julia> @time laplace!(u, 2^15, 0.01, 0.01);
  7.999779 seconds (301.50 k allocations: 10.248 MB, 0.58% gc time)

julia> @time laplace!(u, 2^15, 0.01, 0.01);
  7.747673 seconds (10 allocations: 496 bytes)

@yuyichao, @pabloferz, any ideas? Is this some problem with views?

@stevengj stevengj added the potential benchmark Could make a good benchmark in BaseBenchmarks label Dec 20, 2016
@stevengj
Copy link
Member

In comparison, an unrolled version does slightly better, which is not too unexpected — there is some overhead (less than a factor of 2) to using all of those views and generic broadcasting iteration:

julia> function laplace_iter!(u, Niter, dx2, dy2)
                  Ny, Nx = size(u)
                  c = 1 / (2*(dx2+dy2))
                  for iter = 1:Niter
                    for i = 2:size(u,1)-1, j = 2:size(u,2)-1 
                      u[i,j] = ((u[i-1,j]+u[i+1,j])*dx2 + (u[i,j-1]+u[i,j+1])*dy2)*c 
                    end 
                  end
                  return u
              end
laplace_iter! (generic function with 1 method)

julia> @time laplace_iter!(u, 2^15, 0.01, 0.01);
  4.281638 seconds (7.69 k allocations: 327.032 KB)

julia> @time laplace_iter!(u, 2^15, 0.01, 0.01);
  4.221090 seconds (5 allocations: 176 bytes)

But I think the vectorized version is fast enough that we can now close this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

No branches or pull requests

9 participants