
reducedim much slower for user-defined functions #3426

Closed
simonster opened this issue Jun 17, 2013 · 5 comments

@simonster
Member

using Benchmark
const A = rand(1000, 1000);
f1() = reducedim(+, A, 2, 0.0)
plus(x, y) = x + y
f2() = reducedim(plus, A, 2, 0.0)
compare([f1, f2], 100)

gives:

        Function  Elapsed Relative Replications
[1,]        "f1" 0.593566      1.0          100
[2,]        "f2"  3.70654  6.24453          100

This seems to be due to lack of inlining. I don't see an easy solution that can be implemented in reducedim() itself, although maybe others do. Alternatively, it would be nice to move gen_reducedim_func outside of the let block in array.jl so that other code can use it.

See also my comment at https://groups.google.com/forum/?fromgroups=#!topic/julia-dev/ipbrvflVY1k

@lindahua
Contributor

Failure to inline user-defined functions is one issue that leads to suboptimal performance of reducedim. However, it is not the only problem.

Issue #2325 shows that sum(x, 2) is about 5x slower than a carefully crafted loop because the order in which it scans the array is not cache friendly. I am considering writing a generic yet efficient reduction function that addresses both problems.
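To make the cache-friendliness point concrete, here is a minimal sketch (written in current Julia syntax rather than the 0.2-era syntax used elsewhere in this thread; the function name is hypothetical) of summing a matrix along dimension 2 with the loop order that matches Julia's column-major layout: the row index runs in the inner loop, so memory is touched contiguously.

```julia
# Summing along dim 2 with a column-major-friendly loop order.
# Julia stores arrays column-major, so iterating i (rows) in the
# inner loop walks contiguous memory; swapping the loops would
# stride across columns and thrash the cache.
function sum_dim2_cache_friendly(A::AbstractMatrix)
    r = zeros(eltype(A), size(A, 1))
    @inbounds for j in axes(A, 2)      # columns in the outer loop
        for i in axes(A, 1)            # rows in the inner loop: contiguous
            r[i] += A[i, j]
        end
    end
    return r
end
```

This is the "carefully crafted loop" style the 5x comparison refers to: same arithmetic, different traversal order.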

@lindahua
Contributor

Here is what I am considering to address the inlining problem: use typed functors instead of Julia functions.

abstract BinaryOp

type Add <: BinaryOp end

evaluate(op::Add, x, y) = x + y

function fast_reduce{R<:Number}(op::BinaryOp, initvalue::R, a::AbstractArray)
    v::R = initvalue
    for i in 1 : length(a)
        v = evaluate(op, v, a[i])
    end
    v
end

plus(x, y) = x + y

a = rand(10^6)

# warm up (trigger compilation)
sum(a)
fast_reduce(Add(), 0., a)
reduce(plus, 0., a)

# benchmark
@time for i in 1 : 500 sum(a) end                      # 0.3955 sec
@time for i in 1 : 500 fast_reduce(Add(), 0., a) end   # 0.3938 sec
@time for i in 1 : 500 reduce(plus, 0., a) end         # 21.62 sec

Note that fast_reduce is comparable to the specialized sum, while reduce with a user-defined function is about 55x slower!

I believe this addresses the performance issue without losing generality.
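The generality claim can be illustrated with a short sketch: each operation is its own type, `evaluate` dispatches on that type, and the one reduction loop specializes per operation. (This is written in current Julia syntax, not the 0.2-era `abstract`/`type` syntax of the snippet above; the `Max` functor is an added example, not from the thread.)

```julia
# Functor pattern: operations are types, so dispatch on the functor
# lets the compiler specialize (and inline) the reduction loop per op.
abstract type BinaryOp end

struct Add <: BinaryOp end
struct Max <: BinaryOp end

evaluate(::Add, x, y) = x + y
evaluate(::Max, x, y) = max(x, y)

# One generic loop serves every functor; a specialized method is
# compiled for each concrete BinaryOp type that is passed in.
function fast_reduce(op::BinaryOp, initvalue::R, a::AbstractArray) where {R<:Number}
    v::R = initvalue
    @inbounds for i in eachindex(a)
        v = evaluate(op, v, a[i])
    end
    return v
end
```

Adding a new reduction is just a new empty type plus one `evaluate` method; the loop itself is never rewritten.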

@simonster
Member Author

Yes, both inlining and cache-unfriendliness are big issues here. Even along the first dimension, where the cache isn't a problem, things are an order of magnitude slower than they could be. Along the second dimension it's more like two orders of magnitude.

I think the best solution to the inlining problem is to specialize code generation on function arguments, which I think has been mentioned in the past, although I can't find the issue for it. As a stopgap, I'd be happy with a type-based approach like yours above. If the order of iteration were fixed in gen_reducedim_func and it were moved out of the let block, that should also achieve the same performance.
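For what "specialize code generation on function arguments" would mean in practice, here is a hypothetical sketch (current Julia syntax, illustrative helper name): if the compiler generated a separate specialization of the reduction for each function passed in, a plain higher-order loop would be as fast as hand-written code, with no functor types needed.

```julia
# A plain higher-order reduction. Under per-function specialization,
# each distinct `f` would get its own compiled (and inlinable) copy
# of this loop, so no wrapper types would be required.
function my_reduce(f, v0, a::AbstractArray)
    v = v0
    for x in a
        v = f(v, x)   # with specialization, this call inlines
    end
    return v
end
```

This is exactly the interface `reduce(plus, 0., a)` already has; the change discussed here is in how it would be compiled, not how it would be called.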

@lindahua
Contributor

Yes, specializing code generation on function arguments is the ideal approach; however, it will probably take much longer, since it may require changes in the core.

At this moment, I am going to work on an improved version of reduction based on typed functors.

@JeffBezanson
Member

Special case of #210.
