
fix Math.pow_body for huge powers #53886

Closed · wants to merge 20 commits

Conversation

@KlausC (Contributor) commented Mar 28, 2024

Fix Math.pow_body(a, n) for huge values of n.

Fixes #53881

@oscardssmith (Member)

Is my understanding correct that the theory behind this is that it moves the first iteration out of the loop? It's somewhat unclear to me why this makes a difference.

@KlausC (Contributor, author) commented Mar 28, 2024

> Is my understanding correct that the theory behind this is that it moves the first iteration out of the loop? It's somewhat unclear to me why this makes a difference.

Yes, I took the first execution out of the loop because I wanted to halve the size of n, so that n = -n does not break (for n == typemin(Int64)).
As a side effect, the inverse of x is no longer required, only that of x * x, which makes the difference for x = prevfloat(1.0):

julia> x
0.9999999999999999

julia> xx = x * x
0.9999999999999998

julia> (inv(x) - inv(big(x))) / eps(inv(x))
0.4999999999999999444888487687421668158425945563035703681048001261471687007391349

julia> (inv(xx) - inv(big(xx))) / eps(inv(xx))
-2.220446049250313573885329099314128483772878678075059462127716467837134251888541e-16
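
A minimal sketch of the restructured square-and-multiply loop, with illustrative names and omitting the extended-precision compensation that the real pow_body performs:

# Illustrative sketch only, not the PR's code. Peeling the lowest bit of n
# out of the loop means only n >> 1 is negated, so the negation cannot
# overflow for n == typemin(Int64), and a negative exponent only ever needs
# inv(x * x), never inv(x).
function pow_sketch(x::Float64, n::Integer)
    y = (n & 1) == 1 ? x : 1.0   # x^n == x^(n & 1) * (x * x)^(n >> 1)
    m = n >> 1                   # floor division by 2; |m| <= 2^62
    xx = x * x
    if m < 0
        xx = inv(xx)             # invert the square, not x itself
        m = -m                   # safe even when n == typemin(Int64)
    end
    while m > 0
        (m & 1) == 1 && (y *= xx)
        xx *= xx
        m >>= 1
    end
    return y
end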

@oscardssmith (Member)

Overall this looks good. I do want to do a little more testing with random giant exponents to make sure this fully fixes the issues.

@KlausC marked this pull request as draft March 29, 2024 03:56
@KlausC (Contributor, author) left a comment:

The previous test run failed for two 32-bit AMD CPUs. Reverted muladd to fma to check whether that was the reason.

That was indeed the reason. So on these architectures, muladd behaves worse than fma!
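
Presumably that is because muladd only promises to fuse when profitable: on hardware without an FMA instruction it may lower to a separate multiply and add, which rounds twice, whereas fma guarantees a single correctly rounded result (via software emulation if needed). A small demo of the difference:

# a * b is exactly 1 - 2^-54, which rounds to 1.0 in Float64, so the
# separately rounded multiply-add loses the answer entirely:
a = 1.0 + 2.0^-27
b = 1.0 - 2.0^-27
a * b - 1.0         # 0.0 (the product already rounded up to 1.0)
fma(a, b, -1.0)     # -5.551115123125783e-17 == -2.0^-54, correctly rounded
muladd(a, b, -1.0)  # either result, depending on whether the target CPU fuses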


@KlausC marked this pull request as ready for review April 2, 2024 07:10
@KlausC (Contributor, author) commented Apr 2, 2024

The current state of the PR is an essential improvement in accuracy for big exponents, while performance is maintained: only 150 ppm of the samples show more than 1 ULP error, whereas the existing implementation exceeds 1 ULP for more than 34% of the same sampled argument range.
The case of relatively small negative exponents needs more evaluation.

@KlausC marked this pull request as draft April 2, 2024 07:47
@KlausC (Contributor, author) commented Apr 3, 2024

The statistics are perfect now: 99.4% of the samples show 0 ULPs error, 0.6% show 1 ULP, and no sample shows more!

# Testing code:
using BenchmarkTools

# Measure the rounding error of p(x, n) in ULPs, using BigFloat as reference.
function ulps(x, n, p=^)
    a = p(x, n)
    b = big(x)^n
    u = abs(a - Float64(b)) / eps(a)
    return isfinite(u) ? BigInt(ceil(u)) : -1
end

# Draw a random (exponent, base) pair. The exponent magnitude is sampled
# between minex and maxex, biased towards small values; the base is then
# back-solved so that base^exponent lands near a random representable Float64.
function randpower(; minex=3, maxex=typemax(Int64))
    maxex = min(maxex, log2(floatmax()) / log2(nextfloat(1.0)))
    minex = max(minex, 1)
    l1, l2 = sqrt(log2(minex)), sqrt(log2(maxex))
    ab = NaN
    while !isfinite(ab)
        ab = reinterpret(Float64, rand(UInt64))
    end
    s = sign(ab)
    ab = log(abs(ab))
    lb = (rand() * (l2 - l1) + l1)^2
    b = Int64(floor(2.0^lb)) * rand([-1, 1])
    ab /= b
    a = exp(ab) * s
    c = a^b
    if !isfinite(c)    # retry with a tamer exponent if the power overflowed
        a *= s
        b = b ÷ 2
        c = a^b
    end
    return b, a
end

# Group the sample pairs by their ULP error.
function stat_ulps(nx::AbstractVector{<:Tuple{<:Integer,<:AbstractFloat}})
    D = Dict{Integer,Any}()
    for (b, a) in nx
        u = ulps(a, b)
        R = get!(() -> [], D, u)
        push!(R, (b, a))
    end
    for k in keys(D)
        D[k] = sort(collect(Set(D[k])); by = t -> abs(t[1]) + log2(abs(t[2]) + eps()) * eps())
    end
    return D
end

# Print a histogram of ULP errors plus median, mean, and standard deviation.
function overview(D::Dict{<:Integer,<:Any}, maxulps=53)
    s0 = s1 = s2 = 0.0
    m = NaN
    ky = sort!(collect(keys(D)))
    n = sum(length(D[k]) for k in ky); println(n)
    smax = 0
    for k in ky
        l = float(length(D[k]))
        k > maxulps && (k = maxulps + 1)
        if s0 + l > n / 2 && isnan(m)
            m = k - 0.5 + (n / 2 - s0) / l
        end
        s0 += l
        s1 += l * k
        s2 += l * k * k
        if k <= maxulps
            println("$k => $l")
        else
            smax += l
        end
    end
    if smax > 0
        println(">$maxulps => $smax")
    end
    u = round(s1 / s0; digits=3)
    v = round(sqrt(s2 / s0 - u * u); digits=3)
    u, m, v = Float64.(round.((u, m, v); digits=3))
    println("total: n: $s0 - median: $m - mean: $u ± $v ulps")
end

nx = [randpower() for _ in 1:10^6];
D = stat_ulps(nx);
overview(D)

@benchmark [x^n for (n, x) in nx]

Output with the version of this PR:

julia> nx = [randpower() for _ in 1:10^6];

julia> D = stat_ulps(nx);

julia> overview(D)
1000000
0 => 994416.0
1 => 5584.0
total: n: 1.0e6 - median: 0.003 - mean: 0.006 ± 0.074 ulps

julia> @benchmark [x^n for (n, x) in nx]
BenchmarkTools.Trial: 46 samples with 1 evaluation.
 Range (min … max):  108.233 ms … 117.334 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     108.887 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   109.416 ms ±   1.533 ms  ┊ GC (mean ± σ):  0.24% ± 0.44%

   █                                                             
  ▄█▆▁▁▁▇▅▃▃▁▁▁▁▃▇▃▁▁▃▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  108 ms           Histogram: frequency by time          117 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 4.

The output of the same commands on v1.10.2:

julia> nx = [randpower() for _ in 1:10^6];

julia> D = stat_ulps(nx);

julia> overview(D, 10)
1000000
0 => 627937.0
1 => 18924.0
2 => 6065.0
3 => 4011.0
4 => 3136.0
5 => 2520.0
6 => 2157.0
7 => 1774.0
8 => 1626.0
9 => 1494.0
10 => 1303.0
>10 => 329053.0
total: n: 1.0e6 - median: 0.296 - mean: 3.753 ± 5.142 ulps

julia> @benchmark [x^n for (n, x) in nx]
BenchmarkTools.Trial: 53 samples with 1 evaluation.
 Range (min … max):  94.458 ms … 101.908 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     94.580 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   95.161 ms ±   1.386 ms  ┊ GC (mean ± σ):  0.13% ± 0.44%

  ▃█▂                                                           
  ███▅▁▁▅▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▅▇▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▅ ▁
  94.5 ms       Histogram: log(frequency) by time      98.8 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 3.

@oscardssmith (Member)

That looks great! It's also really nice to have a second person who actually understands this code well. Can you also post the benchmarks for nx = rand(1:20, 10^6)? I do think in the real world most uses are with pretty small powers, so it would be good to make sure that those don't regress.

@KlausC (Contributor, author) commented Apr 3, 2024

In the realm of the old implementation, where exponents are less than 2^25, the results are comparable in both accuracy and performance:

The benchmark evaluates 10^6 powers at a mean of about 59 ns (new) versus 51 ns (old) per power.

New:

julia> nx = [randpower(maxex = 2^25) for _ in 1:10^6];

julia> D = stat_ulps(nx);

julia> overview(D)
1000000
0 => 994076.0
1 => 5924.0
total: n: 1.0e6 - median: 0.003 - mean: 0.006 ± 0.077 ulps

julia> @benchmark [x^n for (n, x) in nx]
BenchmarkTools.Trial: 84 samples with 1 evaluation.
 Range (min … max):  59.079 ms … 67.466 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     59.197 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.655 ms ±  1.073 ms  ┊ GC (mean ± σ):  0.46% ± 0.68%

  ▅█▂         ▂▄▂▁                                             
  ███▅▁▁▁▁▁▁▁▁████▅▅▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▇▅▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  59.1 ms      Histogram: log(frequency) by time      62.3 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 4.

For comparison, the old implementation:

julia> nx = [randpower(maxex = 2^25) for _ in 1:10^6];

julia> D = stat_ulps(nx);

julia> overview(D, 10)
1000000
0 => 993824.0
1 => 6176.0
total: n: 1.0e6 - median: 0.003 - mean: 0.006 ± 0.078 ulps

julia> @benchmark [x^n for (n, x) in nx]
BenchmarkTools.Trial: 97 samples with 1 evaluation.
 Range (min … max):  51.343 ms … 59.751 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     51.445 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   51.942 ms ±  1.381 ms  ┊ GC (mean ± σ):  0.11% ± 0.36%

  █▃             ▁                                             
  ██▄▁▄█▄▁▁▁▁▁▁▁▁█▆▁▁▁▆▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  51.3 ms      Histogram: log(frequency) by time      59.2 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 3.

Baseline for the benchmark:

julia> @benchmark [x*n for (n, x) in nx]
BenchmarkTools.Trial: 3076 samples with 1 evaluation.
 Range (min … max):  1.493 ms …   3.959 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.543 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.620 ms ± 214.669 μs  ┊ GC (mean ± σ):  2.72% ± 6.65%

  ▂▆██▇▆▄▃▂▂▂▂▂                               ▁▄▃▂▁           ▁
  ██████████████▇▇██▇▇▆▇▆▅▄▆▆▇▅▅▄▆▅▅▅▆▄▄▄▅▄▆▆▅█████▇▇▇▅▅▅▄▁▄▅ █
  1.49 ms      Histogram: log(frequency) by time      2.22 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 3.

@oscardssmith (Member)

I also wonder whether it would be worth making it so that big powers go to the regular Float64^Float64 path. My benchmarking suggests that around n = 10000 the Float64 path is faster.

@KlausC (Contributor, author) commented Apr 3, 2024

> Can you also post the benchmarks for nx = rand(1:20, 10^6)?

Do you mean restricting the exponents to -20:20? Here we go:

julia> nx = [randpower(maxex = 20) for _ in 1:10^6];

julia> D = stat_ulps(nx);

julia> overview(D)
1000000
0 => 974132.0
1 => 25868.0
total: n: 1.0e6 - median: 0.013 - mean: 0.026 ± 0.159 ulps

julia> @benchmark [x^n for (n, x) in nx]
BenchmarkTools.Trial: 199 samples with 1 evaluation.
 Range (min … max):  24.223 ms … 31.170 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     24.711 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.151 ms ±  1.097 ms  ┊ GC (mean ± σ):  1.18% ± 1.73%

   ▂██▁    ▇                                                   
  ▃████▂▂▄██▇▅▃▃▁▂▁▁▁▁▁▁▁▂▃▄▃▄▁▁▂▁▁▁▃▃▁▁▁▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  24.2 ms         Histogram: frequency by time        30.1 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 4.

julia> VERSION
v"1.12.0-DEV.292"
julia> nx = [randpower(maxex = 20) for _ in 1:10^6];

julia> D = stat_ulps(nx);

julia> overview(D, 10)
1000000
0 => 974340.0
1 => 25660.0
total: n: 1.0e6 - median: 0.013 - mean: 0.026 ± 0.158 ulps

julia> @benchmark foo($nx)
BenchmarkTools.Trial: 240 samples with 1 evaluation.
 Range (min … max):  20.665 ms …  26.005 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     20.741 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.899 ms ± 577.308 μs  ┊ GC (mean ± σ):  0.27% ± 0.84%

  ▇█▅▃                                                          
  ████▆▁▁▁▁▅▁▁▅▆█▆▆▁▅▄▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▄▄▁▁▁▁▁▄▁▁▁▁▁▁▆ ▅
  20.7 ms       Histogram: log(frequency) by time      23.5 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

julia> VERSION
v"1.10.2"

@KlausC (Contributor, author) commented Apr 3, 2024

> whether it would be worth making it so that big powers go to the regular Float64^Float64 path.

I will check how performance versus accuracy works out.

@KlausC (Contributor, author) commented Apr 3, 2024

> I will check how performance versus accuracy works out.

@oscardssmith yes, that would be a better way. I wasn't aware that we have a double-precision implementation of log/exp behind the scenes. That has high accuracy for all integer exponents which can be represented as Float64, and superior performance except perhaps for exponents in roughly -10000 < n < 10000 (threshold to be evaluated). For the remaining integers, those with Int64(Float64(n)) != n, special handling is required.

This is a separate PR, though. If that is done, this PR is no longer required.
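
A rough sketch of what such a dispatch could look like. Purely illustrative: pow_int_kernel is a hypothetical stand-in for the existing integer-power kernel, and the 2^13 cutoff is the one the superseding PR later settled on.

# Route huge integer exponents through the Float64^Float64 path,
# splitting off the part of n that Float64 cannot represent exactly.
function pow_dispatch_sketch(x::Float64, n::Int64)
    -2^13 <= n <= 2^13 && return pow_int_kernel(x, n)  # small n: integer kernel
    nf = Float64(n)
    nf == 2.0^63 && (nf = prevfloat(nf))   # keep Int64(nf) representable
    if nf == n
        return x^nf                        # n exact in Float64: float path
    else
        d = n - Int64(nf)                  # small residual exponent, |d| <= 1023
        return x^nf * pow_int_kernel(x, d) # combine; costs a few extra ULPs
    end
end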

Nevertheless, it was a challenging task to squeeze the last ULP out of the algorithm, and much fun!
The essential algorithmic improvement was merging the hi and lo precision parts of the variables frequently enough that the lo part could not explode in size. For bases less than 1 with negative exponents, special handling was needed. That all came at the price of a 20-30% performance reduction compared with the existing implementation.
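
The merging step is in the spirit of standard double-word arithmetic; a minimal self-contained sketch of the idea (my illustration, not the PR's code):

# The running power is kept as an unevaluated sum hi + lo; renormalizing
# after every step keeps lo a tiny correction to hi instead of letting it grow.
function two_sum(a::Float64, b::Float64)
    hi = a + b
    t = hi - a
    lo = (a - (hi - t)) + (b - t)
    return hi, lo                     # hi + lo == a + b exactly
end

function mul_dw(hi::Float64, lo::Float64, x::Float64)
    p = hi * x
    e = fma(hi, x, -p)                # exact rounding error of hi * x
    return two_sum(p, fma(lo, x, e))  # fold lo back into the hi part
end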

Note that issue #53881 is still open.

@KlausC marked this pull request as ready for review April 3, 2024 21:26
@KlausC (Contributor, author) commented Apr 14, 2024

This PR is superseded by #53967 and closed.

@KlausC closed this Apr 14, 2024
oscardssmith pushed a commit that referenced this pull request Apr 22, 2024
Improve performance of `^(::Float64, n::Integer)` in the case of `abs(n)
> 2^13`.

While `pow_body` is unreliable for `abs(n) > 2^25`, this implementation provides errors of a few ULPs, while runtime is capped at that of the `Float64` implementation.

Fixes #53881
See also #53886.
Successfully merging this pull request may close these issues:

BUG: ^(::Float64, ::Union{Int, Float64}) incorrect for very large negative exponents