Performance regression with non-integer powers (v1.6 and newer) #39976

Closed
jkrimmer opened this issue Mar 10, 2021 · 42 comments
Labels: performance (Must go faster), regression (Regression in behavior compared to a previous version)

@jkrimmer

With the 1.6.0-rc1 release candidate and the most recent build from master, I have observed a performance regression when taking non-integer powers of floats on Linux (Fedora 33, and Ubuntu 20.04 under WSL2). Although the MWE is limited to Float64, similar behavior can be observed with Float32.

MWE on 1.6.0-rc1:

julia> using BenchmarkTools

julia> x = 2.0;

julia> @benchmark ($x)^(7/6)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     63.099 ns (0.00% GC)
  median time:      63.710 ns (0.00% GC)
  mean time:        65.135 ns (0.00% GC)
  maximum time:     718.552 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     981

MWE on 1.5.3:

julia> using BenchmarkTools

julia> x = 2.0;

julia> @benchmark ($x)^(7/6)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     13.614 ns (0.00% GC)
  median time:      13.714 ns (0.00% GC)
  mean time:        13.885 ns (0.00% GC)
  maximum time:     88.889 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

I wonder whether this issue is related to LLVM, since the output of @code_llvm is pretty much identical on 1.6.0-rc1 and 1.5.3 (both just call the llvm.pow.f64 intrinsic).

1.6.0-rc1:

julia> @code_llvm (x)^(7/6)
;  @ math.jl:898 within `^'
define double @"julia_^_151"(double %0, double %1) {
top:
;  @ math.jl:899 within `^'
  %2 = call double @llvm.pow.f64(double %0, double %1)
;  @ math.jl:900 within `^'
; ┌ @ float.jl:449 within `isnan'
; │┌ @ float.jl:368 within `!='
    %3 = fcmp ord double %2, 0.000000e+00
; └└
; ┌ @ float.jl:326 within `+'
   %4 = fadd double %0, %1
; └
; ┌ @ float.jl:449 within `isnan'
; │┌ @ float.jl:368 within `!='
    %5 = fcmp uno double %4, 0.000000e+00
; └└
  %6 = or i1 %3, %5
  br i1 %6, label %L10, label %L8

L8:                                               ; preds = %top
;  @ math.jl:901 within `^'
  %7 = call nonnull {}* @j_throw_exp_domainerror_153(double %0)
  call void @llvm.trap()
  unreachable

L10:                                              ; preds = %top
;  @ math.jl:903 within `^'
  ret double %2
}

1.5.3:

julia> @code_llvm (x)^(7/6)

;  @ math.jl:885 within `^'
define double @"julia_^_100"(double, double) {
top:
;  @ math.jl:886 within `^'
  %2 = call double @llvm.pow.f64(double %0, double %1)
;  @ math.jl:887 within `^'
; ┌ @ float.jl:536 within `isnan'
; │┌ @ float.jl:456 within `!='
    %3 = fcmp ord double %2, 0.000000e+00
; └└
; ┌ @ float.jl:401 within `+'
   %4 = fadd double %0, %1
; └
; ┌ @ float.jl:536 within `isnan'
; │┌ @ float.jl:456 within `!='
    %5 = fcmp uno double %4, 0.000000e+00
; └└
  %6 = or i1 %5, %3
  br i1 %6, label %L10, label %L8

L8:                                               ; preds = %top
;  @ math.jl:888 within `^'
  %7 = call nonnull %jl_value_t* @j_throw_exp_domainerror_101(double %0)
  call void @llvm.trap()
  unreachable

L10:                                              ; preds = %top
;  @ math.jl:890 within `^'
  ret double %2
}
@KristofferC (Member)

using BenchmarkTools
x = 2.0;
powf64(x, y) = ccall("llvm.pow.f64", llvmcall, Float64, (Float64, Float64), x, y)
@btime powf64($x, 7/6)

is a somewhat smaller reproducer, isolating the issue to the llvmcall

KristofferC added the performance (Must go faster) and regression (Regression in behavior compared to a previous version) labels on Mar 10, 2021
@anaveragehuman (Contributor)

Here's my attempt to bisect this with JULIA_PRECOMPILE=0 make -j4 && ./julia followed by running the four lines above:

git bisect start
# bad: [a58bdd90101796eb0ec761a7a8e5103bd96c2d13] release-1.6: set VERSION to 1.6-rc1 (#39487)
git bisect bad a58bdd90101796eb0ec761a7a8e5103bd96c2d13
# good: [788b2c77c10c2160f4794a4d4b6b81a95a90940c]  Set VERSION to 1.5.3 (#38345)
git bisect good 788b2c77c10c2160f4794a4d4b6b81a95a90940c
# good: [0c388fc340cc6db7f531ec3ff5b172e55ff8ceb4] RFC: treat small negative λ as 0 for sqrt(::Hermitian) (#35057)
git bisect good 0c388fc340cc6db7f531ec3ff5b172e55ff8ceb4
# good: [b884dcad9598f862ab6c08c01e2800942769470c] pass correct typeinfo IOContext when showing keys and values in dict (#37567)
git bisect good b884dcad9598f862ab6c08c01e2800942769470c
# bad: [d562a97f2bdb8acfb679c395b7283d6e3b74ea66] Merge pull request #38329 from JuliaLang/ksh/fldmethoderr
git bisect bad d562a97f2bdb8acfb679c395b7283d6e3b74ea66
# bad: [c91a75062866312989ce2f91af4c88f0f3f5896f] bump Statistics stdlib version (#37950)
git bisect bad c91a75062866312989ce2f91af4c88f0f3f5896f
# bad: [8ccc4ccef799143f87a0f9d8a1223d932ac9dc64] Show any compile time spent in time and timev macros (#37678)
git bisect bad 8ccc4ccef799143f87a0f9d8a1223d932ac9dc64
# good: [75ac63245c02ca06bb0eb81ca443e543378afdf4] [LLVM] stage git checkout through a bare repository
git bisect good 75ac63245c02ca06bb0eb81ca443e543378afdf4
# bad: [d7b391dc60f9d30a1ca57668895b5900665dae1c] doc: extend mkpath docstring (#37745)
git bisect bad d7b391dc60f9d30a1ca57668895b5900665dae1c
# bad: [47d1f62035ffb844e96809f7fb28a6c4df1c7adc] inference: also handle typesubtract for tuples where only one parameter remains
git bisect bad 47d1f62035ffb844e96809f7fb28a6c4df1c7adc
# bad: [d148eb249119d53500ef08ad09e57de8eddfaab6] improve precompile generation for vararg signatures (#37715)
git bisect bad d148eb249119d53500ef08ad09e57de8eddfaab6
# good: [f26a8c352fee56b845396f79152ccd01505d8973] Merge pull request #37719 from vchuravy/vc/fixup_rv
git bisect good f26a8c352fee56b845396f79152ccd01505d8973
# bad: [197c8c5d5de0e6e4b7243845127704581b0a97b2] Revert "Update PCRE2 version to 10.34, fixes #35322, fixes #35459. (#37688)" (#37713)
git bisect bad 197c8c5d5de0e6e4b7243845127704581b0a97b2
# bad: [7e8f2c05cd39aec8d4349ec915bf5c68cdf4c072] Introduce `libjuliarepl` to break dependence on runtime libraries (#36588)
git bisect bad 7e8f2c05cd39aec8d4349ec915bf5c68cdf4c072
# first bad commit: [7e8f2c05cd39aec8d4349ec915bf5c68cdf4c072] Introduce `libjuliarepl` to break dependence on runtime libraries (#36588)

@KristofferC (Member)

@staticfloat, does that bisect make sense to you?

@staticfloat (Member)

This is hilarious. Apparently, libm.so was previously getting linked in so early (e.g. before libopenlibm) that LLVM's lookup of pow() ended up using the system libm's version. Now that we use the loader, it correctly isolates us from the system libm, and we get OpenLibm's pow() function, which is slower.

You can verify this by using Profile and looking at the backtraces that it spits out.
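
For anyone following along, a minimal sketch of that verification (the loop function here is illustrative, and x is varied inside the loop so LLVM cannot hoist the loop-invariant pow call):

using Profile

pow_loop(x, n) = (s = 0.0; for i in 1:n; s += (x + i)^(7/6); end; s)
pow_loop(2.0, 10)              # warm up so compilation doesn't dominate the profile
@profile pow_loop(2.0, 10^7)
Profile.print(C = true)        # C frames show which pow implementation is being hit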

@KristofferC (Member) commented Mar 26, 2021

@staticfloat Is there anything that should be done here? If the libm version is so much faster, does it make sense to use it intentionally? There are reported regressions popping up that point to this (for example https://discourse.julialang.org/t/julia-1-6-versus-1-5-4/57993). I see a 6x slowdown locally.
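
For anyone who wants to compare against the system libm directly, a rough sketch (the library name is platform-specific; "libm.so.6" is typical on Linux, and this calls the raw pow with none of Julia's domain-error checks):

using BenchmarkTools
x = 2.0
libm_pow(x, y) = ccall((:pow, "libm.so.6"), Float64, (Float64, Float64), x, y)
@btime libm_pow($x, 7/6)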

@stevengj (Member)

We just need @oscardssmith to write a Julia-native pow 😉

@oscardssmith (Member)

I'll see what I can do. Two-argument functions are about 1000x harder to verify, though, so it might take a bit.

@oscardssmith (Member)

The first thing I would need for this is someone to do the GPL dance with me on https://github.com/JuliaMath/openlibm/blob/master/src/e_pow.c (DO NOT OPEN IF YOU DON'T WANT THE GPL TO CONTAMINATE YOU). Any volunteers?

@Seelengrab (Contributor) commented Mar 26, 2021

GPL dance as in clean-room implementor? I'd be up for it, provided there are no specific OS/system requirements (currently only Windows and/or WSL available here). I'd just need some information on when/how this would be done.

@oscardssmith (Member)

Thanks so much! Can you DM me in slack or discord?

@giordano (Contributor)

Why GPL? Am I missing something, or is there no GPL anywhere in that code?

@vtjnash (Member) commented Mar 26, 2021

yeah, did you mean to send a different link?

 * ====================================================
 * Copyright (C) 2004 by Sun Microsystems, Inc. All rights reserved.
 *
 * Permission to use, copy, modify, and distribute this
 * software is freely granted, provided that this notice 
 * is preserved.
 * ====================================================

@oscardssmith (Member) commented Mar 26, 2021

Wait, really? I thought openlibm was GPL? If not, that makes this much easier. Edit: I just assumed based on the library; I didn't actually open the link.

@giordano (Contributor)

> I thought openlibm was GPL?

Not at all. Only a couple of test files are under LGPL (!= GPL); the rest is under various more permissive licenses: https://github.com/JuliaMath/openlibm/blob/f052f42bb393918acd6fd51ec1f018c5679dfe30/LICENSE.md

@eirikbrandsaas

This bites many applications in economics, where power utility (x^a) is used very frequently (e.g., https://discourse.julialang.org/t/julia-1-6-versus-1-5-4/57993).

FWIW, this also bites with negative integer powers. On 1.6.0-rc1:

@benchmark ($x)^(-2)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     65.689 ns (0.00% GC)
  median time:      67.702 ns (0.00% GC)
  mean time:        70.026 ns (0.00% GC)
  maximum time:     210.530 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     977

vs. on 1.5.3:

@benchmark ($x)^(-2)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.909 ns (0.00% GC)
  median time:      13.443 ns (0.00% GC)
  mean time:        13.845 ns (0.00% GC)
  maximum time:     105.045 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

@KristofferC (Member)

Can we go back to using the libm so that things are the way they were? This big a regression in basic functionality stings quite a lot.

@Pramodh-G (Contributor)

Just here to bump this.

@Pramodh-G (Contributor)

Hi, bumping again.

@Pramodh-G (Contributor)

Hi, not to be annoying, but bump. Is there anything I can do to help with this issue? Thanks!

@Jovansam commented Sep 25, 2021

I really hope this gets fixed. I am stuck on 1.5.4 because of a performance hit that persists even in 1.6.3.

@stevengj (Member) commented Sep 25, 2021

Should eventually be fixed by #42271?

@staticfloat (Member)

The situation has improved significantly on master:

version   time
v1.5      21.3 ns
v1.6      84.6 ns
master    34.12 ns

This was tested on all three versions with the following bash loop:

for JULIA_EXE in julia-1.5.3 julia-1.6.3 julia-master; do
    echo $JULIA_EXE
    $JULIA_EXE -e 'using BenchmarkTools; x = 1.1; @btime ($x)^($x)';
done

We're not quite at the same speed as the old libm, but the situation is much more manageable: the implementation is now pure Julia, so we have a path forward for improving it.

@oscardssmith (Member)

darn, I was hoping I'd matched 1.5

@staticfloat (Member)

On the plus side, your `^(::Float64, ::Int64)` is absolutely killing it. ;)

$ for J in julia-1.5.3 julia-1.6.3 julia-master; do echo $J; $J -e 'using BenchmarkTools; x = 1.1; p=-2; @btime ($x)^($p)'; done
julia-1.5.3
  21.486 ns (0 allocations: 0 bytes)
julia-1.6.3
  85.310 ns (0 allocations: 0 bytes)
julia-master
  5.708 ns (0 allocations: 0 bytes)

But of course, this issue is specifically about `^(::Float64, ::Float64)`.

@oscardssmith (Member)

Well, float^float calls float^int when applicable... Also, since it's power-by-squaring, -2 is a decent bit faster than a bigger exponent.
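
For reference, a minimal power-by-squaring sketch (not Base's actual implementation): the loop runs once per bit of the exponent, which is why a small exponent like -2 costs only a couple of iterations.

function pow_by_squaring(x::Float64, n::Integer)
    n < 0 && return inv(pow_by_squaring(x, -n))
    r = 1.0
    while n > 0
        isodd(n) && (r *= x)   # fold in the current exponent bit
        x *= x                 # square once per bit
        n >>= 1
    end
    return r
end

pow_by_squaring(2.0, -2)  # 0.25, after just two loop iterations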

@eirikbrandsaas

Thanks for the work on this. This is good enough that I can finally switch my production code from 1.5 to 1.8.

On my end we (you) are halfway there:

for JULIA_EXE in ~/julia-1.5.4/bin/julia ~/julia-1.7.3/bin/julia ~/julia-1.8.0-rc1/bin/julia; do
    echo $JULIA_EXE
    $JULIA_EXE -e 'using BenchmarkTools; x = 1.1; @btime ($x)^($x)'
done
/home/eirik/julia-1.5.4/bin/julia
  14.383 ns (0 allocations: 0 bytes)
/home/eirik/julia-1.7.3/bin/julia
  66.392 ns (0 allocations: 0 bytes)
/home/eirik/julia-1.8.0-rc1/bin/julia
  26.401 ns (0 allocations: 0 bytes)

Tagging @Jovansam and @Pramodh-G so they see the improvement and @oscardssmith to say thanks <3

@oscardssmith (Member)

Try out 1.9. It should be in the ~18 ns range. (I sped up the extended-precision log by about 2x since 1.8.)
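
For context, Float64^Float64 is essentially exp(y * log(x)) evaluated with extra internal precision, so the speed and accuracy of log dominate. A sketch of why plain double precision isn't enough (values are illustrative):

# The ~1e-16 relative rounding error in log(x) gets multiplied by y, so a
# naive implementation can drift by hundreds of ulps for large exponents:
naive_pow(x, y) = exp(y * log(x))
naive_pow(2.0, 1000.0)   # differs from 2.0^1000.0 in the trailing digits
2.0^1000.0               # Base's result, computed via an extended-precision log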

@eirikbrandsaas

for JULIA_EXE in ~/julia-1.5.4/bin/julia ~/julia-1.7.3/bin/julia ~/julia-1.8.0-rc1/bin/julia ~/julia-803f90db91/bin/julia; do
    echo $JULIA_EXE
    $JULIA_EXE -e 'using BenchmarkTools; x = 1.1; @btime ($x)^($x)'
done
/home/eirik/julia-1.5.4/bin/julia
  14.727 ns (0 allocations: 0 bytes)
/home/eirik/julia-1.7.3/bin/julia
  57.552 ns (0 allocations: 0 bytes)
/home/eirik/julia-1.8.0-rc1/bin/julia
  26.404 ns (0 allocations: 0 bytes)
/home/eirik/julia-803f90db91/bin/julia
  14.416 ns (0 allocations: 0 bytes)

The nightly is even faster than 1.5!

@oscardssmith (Member)

I believe that means we can close this. There's still more that can be done here, but we are no longer slower on Linux and Windows. (I think M1 Macs are still faster with the macOS libm.)

@Jovansam commented Nov 22, 2022

@oscardssmith, many thanks for all the work you have put into this. Much appreciated. @eirikbrandsaas, jealous of your use case.
Just wanted to flag: I tried 1.9 for my use case (see the code posted below and here); 1.5 is still superior, and I don't see much improvement of 1.9 over 1.6/1.7/1.8 either.

#using Plots
using BenchmarkTools


function optimal_growth_model()
    σ = 1.5
    δ = 0.1
    β = 0.95
    α = 0.30
    ks = ((1-β*(1-δ))/(α*β))^(1/(α-1))
    csy = 1-α*β*δ/(1-β*(1-δ))
    Δ = 0.9
    kmin = (1-Δ)*ks
    kmax = (1+Δ)*ks
    Nk = 1000
    ∂k = (kmax-kmin)/(Nk-1)
    k = kmin:∂k:kmax
    v′ = zeros(Nk)
    i_k′ = zeros(Int64,Nk)
    v′ = (((csy*k.^α)).^(1-σ) .- 1) ./ ((1-σ)*(1-β))
    #v = Array{Float64}(undef,Nk)
    v = zeros(Nk)

    iter, crit, tol = 1,1.0, 1e-6

    while crit>tol
        for i in eachindex(1:Nk)
            # start by narrowing the interval on which to look for the solution
            i_min = Int64(max(ceil(((1-δ)*k[i] - kmin)/∂k)+1,1))
            i_max = Int64(min(floor((k[i]^α+(1-δ)*k[i]-kmin)/∂k)+1,Nk))

            # find the optimal consumption profile on the narrower interval;
            # note that there is no need for a second loop here, as the operation is vectorized
            c = k[i]^α + (1-δ)*k[i] .- k[i_min:i_max]
            u = (c.^(1-σ) .-1)/(1-σ)

            # for the narrow interval, now find the choice that maximizes utility plus the discounted continuation value
            #(v[i],i_tmp) = findmax(u+β*v′[i_min:i_max],dims=1) #this is nK×1 vector. Operate/vary on rows
            res = findmax(u+β*v′[i_min:i_max],dims=1) #this is nK×1 vector. Operate/vary on rows

            # extract both the maximizer and the index that we shall later use to recover the optimal consumption profile
            v[i] = res[1][1]    # we index multiple times to make types consistent
            i_tmp = res[2][1]    # we index multiple times to make types consistent
            i_k′[i] = i_min-1+i_tmp
        end
        crit = maximum(abs.(v-v′))
        v′ .= v
        iter+=1
    end
    k′ = k[i_k′]
    c = k.^α + (1-δ)*k - k′
    u = (c.^(1-σ) .- 1)/(1-σ)
    v = u/(1-β)
    return k′,k,c,u,v
end

function plot_figure()
    k′,k,c,u,v = optimal_growth_model()
    #plt1 = plot(k,[k′ k],label=["k′ vs k" "k vs k"],leg=:bottomright,lw=2)
    #plt2 = plot(k,c,label="c vs k",leg=:bottomright,lw=2)
    #plt3 = plot(k,u,label="u vs k",leg=:bottomright,lw=2)
    #plt4 = plot(k,v,label="v vs k",leg=:bottomright,lw=2)
    #plot(plt1,plt2,plt3,plt4)
end

@btime plot_figure()

@oscardssmith (Member)

That's very odd. On my computer I see 1.3 seconds on 1.9 (technically 1.10, but there haven't been any changes since 1.9 that would affect this), compared to 4.8 seconds on 1.6.

@Jovansam

> That's very odd. On my computer I see 1.3 seconds on 1.9 (technically 1.10, but there haven't been any changes since 1.9 that would affect this), compared to 4.8 seconds on 1.6.

Very interesting. I must be doing something wrong then? How are you running the code? I am using include("...") in the 1.9 REPL on Windows 11, 64-bit.

@oscardssmith (Member) commented Nov 22, 2022

What do you get for the originally reported regression, i.e.

julia> using BenchmarkTools
julia> x = 2.0;
julia> @benchmark ($x)^(7/6)

on 1.5, 1.6 and 1.9?

Also, what processor do you have?

@Jovansam

> What do you get for the originally reported regression, i.e.
>
> julia> using BenchmarkTools
> julia> x = 2.0;
> julia> @benchmark ($x)^(7/6)
>
> on 1.5, 1.6 and 1.9?
>
> Also, what processor do you have?

1.10 = 47.785 ns
1.8.3 = 34.778 ns
1.6.7 = 36.354 ns
1.5.4 = 98.092 ns

Processor: Intel Core i7-8750H @ 2.20 GHz

@staticfloat (Member)

Just to point out the somewhat obvious: in the benchmark you just ran, you're getting significantly better performance on 1.10 than on 1.5 (as expected from this issue). I suggest you open a new issue for your code, as the problem likely lies somewhere other than non-integer powers.

@oscardssmith (Member)

However, those numbers make no sense, since 1.10 should be strictly better than 1.8: they're both pure-Julia implementations, and the 1.10 one is faster on everything I've tested it on.

@Jovansam

> However, those numbers make no sense, since 1.10 should be strictly better than 1.8: they're both pure-Julia implementations, and the 1.10 one is faster on everything I've tested it on.

Many thanks, both. I ran it again and obtained the same ranking. I will try a different computer later today and report back.

@Jovansam

I have now tested my use case pasted above on a second laptop with an 11th Gen Intel Core i5-1135G7 @ 2.4 GHz processor. This laptop is less well specced than my other laptop, apart from the slightly faster processor. Julia 1.5 computes in 0.7 s, 1.8 in 3 s, and 1.10 in 2 s; 1.5 is several times faster than both 1.8 and 1.10.
For the small example you pasted, @oscardssmith, 1.8 and 1.10 give 20 and 19 ns respectively, while 1.5 gives 61 ns, so there is a big improvement there. At the moment I can only conclude that outcomes really depend on how the machine is built. Both my machines are Lenovos (an Extreme, the one I used earlier, and an X13, the one I just used) running 64-bit Windows 11.

@oscardssmith (Member)

Are you sure your example is bottlenecked by floating-point pow? If pure floating-point pow is faster on 1.6/1.8 than on 1.5 and your code is slower, I have a hard time believing that floating-point pow is the problem.
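
A quick way to check, sketched with the stdlib profiler (assuming optimal_growth_model from the snippet above is already defined):

using Profile

optimal_growth_model()        # run once so compilation doesn't pollute the profile
@profile optimal_growth_model()
Profile.print(format = :flat, sortedby = :count)   # see whether ^ / pow frames dominate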

@alecloudenback (Contributor) commented Feb 25, 2024

Not sure if a new issue should be created or if this should be re-opened... but there's a Discourse post indicating that 1.10 is 2-4x slower than prior versions: https://discourse.julialang.org/t/exponentiation-of-floats/105951/18

And a further comparison: 1.10.1 is ~2x slower than numpy and ~6x slower than pytorch on a Mac M2:

❯ python3 -m timeit  -s "import torch;m = torch.rand(10000, 241, dtype=torch.float64)" "m ** (1/12)"
50 loops, best of 5: 4.49 msec per loop
❯ python3 -m timeit  -s "import numpy;m = numpy.random.rand(10000, 241)" "m ** (1/12)"
20 loops, best of 5: 18.5 msec per loop
julia> @btime m .^ (1/12) setup=m=rand(10000,241)
  29.251 ms (2 allocations: 18.39 MiB)

Timing for v1.9.3 was 26.246 ms.

I echo the sentiment above that fast powers are important in economics/finance.

Edit - some more testing:

Nightly (b8a0a39) is a little bit slower than 1.10:

julia> @btime m .^ (1/12) setup=m=rand(10000,241)
  31.360 ms (3 allocations: 18.39 MiB)

AppleAccelerate is 3x faster:

julia> m = rand(10000,241);  p = fill(1/12,10000,241)
julia> @btime $m .^ $p
  26.818 ms (2 allocations: 18.39 MiB)
julia> @btime AppleAccelerate.pow($m,$p)
  9.662 ms (2 allocations: 18.39 MiB)

@SyxP (Contributor) commented Feb 25, 2024

This should be fixed in 1.11. #52079

@alecloudenback (Contributor)

> This should be fixed in 1.11. #52079

On an M2 Mac, the scalar power timing doesn't seem to match what that issue describes (see below). It also doesn't explain the vector performance vs. numpy/pytorch. This seems consistent with the OP of the linked issue, which shows numpy calculating 100 floats in a vector in the time it takes Julia to do 50.

julia> @btime x^y setup=begin x=rand(); y=rand() end  # 1.9.4
  10.844 ns (0 allocations: 0 bytes)

julia> @btime x^y setup=begin x=rand(); y=rand() end  # 1.10.1
  11.720 ns (0 allocations: 0 bytes)

julia> @btime x^y setup=begin x=rand(); y=rand() end  # nightly
  11.887 ns (0 allocations: 0 bytes)
