Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sum(f, itr; init) slower than sum(f, itr) #47216

Open
giordano opened this issue Oct 18, 2022 · 3 comments
Open

sum(f, itr; init) slower than sum(f, itr) #47216

giordano opened this issue Oct 18, 2022 · 3 comments
Labels
fold sum, maximum, reduce, foldl, etc. performance Must go faster

Comments

@giordano
Copy link
Contributor

In https://github.com/niklas-heer/speed-comparison we noticed that sum(f, itr; init) is slower than sum(f, itr)

julia> f(n) = sum(i -> 4/(4i-2n-3), 1:n)
f (generic function with 1 method)

julia> g(n) = sum(i -> 4/(4i-2n-3), 1:n; init=0.0)
g (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime f(100_000_000)
  217.560 ms (0 allocations: 0 bytes)
3.1415926435897923

julia> @btime g(100_000_000)
  398.015 ms (0 allocations: 0 bytes)
3.1415926435899633

julia> versioninfo()
Julia Version 1.9.0-DEV.1600
Commit 392bc97a3a (2022-10-16 17:39 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, haswell)
  Threads: 1 on 8 virtual cores

which is a pity because sum(f, itr; init) can be statically compiled, while sum(f, itr) cannot (because of some internal function throwing).

@giordano giordano added performance Must go faster fold sum, maximum, reduce, foldl, etc. labels Oct 18, 2022
@Moelf
Copy link
Contributor

Moelf commented Oct 19, 2022

the fast path

f(n) = Base._mapreduce_dim(i->4/(4i-2n-3), Base.add_sum, Base._InitialValue(), 1:n, :)

## so totally different path
_mapreduce_dim(f, op, ::_InitialValue, A::AbstractArrayOrBroadcasted, ::Colon) =
    _mapreduce(f, op, IndexStyle(A), A)

for array longer than 16, eventually calls:

mapreduce_impl(f, op, A, first(inds), last(finds))

mapreduce_impl(f, op, A::AbstractArrayOrBroadcasted, ifirst::Integer, ilast::Integer, blksize=1024)

the slow path:

g(n) = Base.mapfoldl_impl(i->4/(4i-2n-3), Base.add_sum, Base._InitialValue(), 1:n)

@N5N3
Copy link
Member

N5N3 commented Oct 19, 2022

Our fold routine has no @simd thus we need @fastmath here to make LLVM vectorlize the sum reduction.

g_fast(n) = @fastmath mapfoldl(i -> 4/(4i-2n-3), +, 1:n)

In fact g_fast should be faster than f as it doesn't use pairwise reduction.

@matthias314
Copy link
Contributor

Let me repeat my suggestion from #49763: sum and prod use add_sum and mul_prod. What about making these functions use the fast versions of + and *? I think there is no fixed order of the terms required for sum and prod. Moreover, the fast versions seem to respect NaN (at least on my machine). Then we would get:

julia> Base.add_sum(x::Real, y::Real) = @fastmath x+y
julia> v = rand(1000);
julia> @btime sum($v);
  72.881 ns (0 allocations: 0 bytes)   # before: 73.487 ns
julia> @btime sum($v; init = 0.0);
  71.168 ns (0 allocations: 0 bytes)   # before: 1.020 μs

In order to extend this to maximum and minimum, there are two issues:

  • The current versions of max_fast and min_fast do not vectorize.
  • It seems tricky to get versions that vectorize and treat NaN as it's done now.

For example, the function

@generated function fastmax(x::Float64, y::Float64)
    quote
        $(Expr(:meta, :inline));
        Base.llvmcall("""
    %f = fcmp fast ugt double %0, %1
    %r = select i1 %f, double %0, double %1
    ret double %r
            """, Float64, Tuple{Float64,Float64}, x, y)
    end
end

vectorizes, but it doesn't always return NaN if one of the arguments is NaN. Maybe one could add a flag to maximum and minimum to indicate that such a function can be used. It would be much faster than the current implementation:

julia> v = rand(1000);
# currently:
julia> @btime maximum($v);
  1.019 μs (0 allocations: 0 bytes)
julia> @btime maximum($v; init = 0.0);
  1.988 μs (0 allocations: 0 bytes)
julia> @btime @fastmath reduce(max, $v);
  1.024 μs (0 allocations: 0 bytes)
julia> @btime @fastmath reduce(max, $v; init = 0.0);
  1.020 μs (0 allocations: 0 bytes)
# new
julia> @btime reduce(fastmax, $v);
  74.428 ns (0 allocations: 0 bytes)
julia> @btime reduce(fastmax, $v; init = 0.0);
  71.370 ns (0 allocations: 0 bytes)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fold sum, maximum, reduce, foldl, etc. performance Must go faster
Projects
None yet
Development

No branches or pull requests

4 participants