
fix Math.pow_body for huge powers #53886

Closed · wants to merge 20 commits

Conversation

@KlausC (Contributor) commented Mar 28, 2024

Fix Math.pow_body(a, n) for huge values of n.

Fixes #53881

@oscardssmith (Member)

Is my understanding correct that the theory behind this is that it moves the first iteration out of the loop? It's somewhat unclear to me why this makes a difference.

@KlausC (Contributor, author) commented Mar 28, 2024

> Is my understanding correct that the theory behind this is that it moves the first iteration out of the loop? It's somewhat unclear to me why this makes a difference.

Yes, I took the first execution out of the loop because I wanted to halve the size of n, so that n = -n does not break (for n == typemin(Int64)).
As a side effect, the inverse of x is no longer required, only that of x * x, which makes the difference for x = prevfloat(1.0):

julia> x
0.9999999999999999

julia> xx = x * x
0.9999999999999998

julia> (inv(x) - inv(big(x))) / eps(inv(x))
0.4999999999999999444888487687421668158425945563035703681048001261471687007391349

julia> (inv(xx) - inv(big(xx))) / eps(inv(xx))
-2.220446049250313573885329099314128483772878678075059462127716467837134251888541e-16
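
A minimal sketch of the restructured square-and-multiply loop, with illustrative names and omitting the extended-precision compensation that the real pow_body performs:

# Illustrative sketch only, not the PR's code. Peeling the lowest bit of n
# out of the loop means only n >> 1 is negated, so the negation cannot
# overflow for n == typemin(Int64), and a negative exponent only ever needs
# inv(x * x), never inv(x).
function pow_sketch(x::Float64, n::Integer)
    y = (n & 1) == 1 ? x : 1.0   # x^n == x^(n & 1) * (x * x)^(n >> 1)
    m = n >> 1                   # floor division by 2; |m| <= 2^62
    xx = x * x
    if m < 0
        xx = inv(xx)             # invert the square, not x itself
        m = -m                   # safe even when n == typemin(Int64)
    end
    while m > 0
        (m & 1) == 1 && (y *= xx)
        xx *= xx
        m >>= 1
    end
    return y
end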

@oscardssmith (Member)

Overall this looks good. I do want to do a little more testing with random giant exponents to make sure this fully fixes the issues.

@KlausC marked this pull request as draft March 29, 2024 03:56
@KlausC (Contributor, author) left a comment:

The previous test run failed for two 32-bit AMD CPUs. Reverted muladd to fma to check whether that was the reason.

That was indeed the reason. So on these architectures, muladd behaves worse than fma!
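
Presumably that is because muladd only promises to fuse when profitable: on hardware without an FMA instruction it may lower to a separate multiply and add, which rounds twice, whereas fma guarantees a single correctly rounded result (via software emulation if needed). A small demo of the difference:

# a * b is exactly 1 - 2^-54, which rounds to 1.0 in Float64, so the
# separately rounded multiply-add loses the answer entirely:
a = 1.0 + 2.0^-27
b = 1.0 - 2.0^-27
a * b - 1.0         # 0.0 (the product already rounded up to 1.0)
fma(a, b, -1.0)     # -5.551115123125783e-17 == -2.0^-54, correctly rounded
muladd(a, b, -1.0)  # either result, depending on whether the target CPU fuses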


@KlausC marked this pull request as ready for review April 2, 2024 07:10
@KlausC (Contributor, author) commented Apr 2, 2024

The current state of the PR is an essential improvement in accuracy for big exponents, while performance is maintained: only 150 ppm of the samples show more than 1 ULP error, whereas the existing implementation exceeds 1 ULP for more than 34% of the same sampled argument range.
The case of relatively small negative exponents needs more evaluation.

@KlausC marked this pull request as draft April 2, 2024 07:47
@KlausC (Contributor, author) commented Apr 3, 2024

The statistics are perfect now: 99.4% of the samples show 0 ULPs error, 0.6% show 1 ULP, and no sample shows more!

# Testing code:
using BenchmarkTools

# Measure the rounding error of p(x, n) in ULPs, using BigFloat as reference.
function ulps(x, n, p=^)
    a = p(x, n)
    b = big(x)^n
    u = abs(a - Float64(b)) / eps(a)
    return isfinite(u) ? BigInt(ceil(u)) : -1
end

# Draw a random (exponent, base) pair. The exponent magnitude is sampled
# between minex and maxex, biased towards small values; the base is then
# back-solved so that base^exponent lands near a random representable Float64.
function randpower(; minex=3, maxex=typemax(Int64))
    maxex = min(maxex, log2(floatmax()) / log2(nextfloat(1.0)))
    minex = max(minex, 1)
    l1, l2 = sqrt(log2(minex)), sqrt(log2(maxex))
    ab = NaN
    while !isfinite(ab)
        ab = reinterpret(Float64, rand(UInt64))
    end
    s = sign(ab)
    ab = log(abs(ab))
    lb = (rand() * (l2 - l1) + l1)^2
    b = Int64(floor(2.0^lb)) * rand([-1, 1])
    ab /= b
    a = exp(ab) * s
    c = a^b
    if !isfinite(c)    # retry with a tamer exponent if the power overflowed
        a *= s
        b = b ÷ 2
        c = a^b
    end
    return b, a
end

# Group the sample pairs by their ULP error.
function stat_ulps(nx::AbstractVector{<:Tuple{<:Integer,<:AbstractFloat}})
    D = Dict{Integer,Any}()
    for (b, a) in nx
        u = ulps(a, b)
        R = get!(() -> [], D, u)
        push!(R, (b, a))
    end
    for k in keys(D)
        D[k] = sort(collect(Set(D[k])); by = t -> abs(t[1]) + log2(abs(t[2]) + eps()) * eps())
    end
    return D
end

# Print a histogram of ULP errors plus median, mean, and standard deviation.
function overview(D::Dict{<:Integer,<:Any}, maxulps=53)
    s0 = s1 = s2 = 0.0
    m = NaN
    ky = sort!(collect(keys(D)))
    n = sum(length(D[k]) for k in ky); println(n)
    smax = 0
    for k in ky
        l = float(length(D[k]))
        k > maxulps && (k = maxulps + 1)
        if s0 + l > n / 2 && isnan(m)
            m = k - 0.5 + (n / 2 - s0) / l
        end
        s0 += l
        s1 += l * k
        s2 += l * k * k
        if k <= maxulps
            println("$k => $l")
        else
            smax += l
        end
    end
    if smax > 0
        println(">$maxulps => $smax")
    end
    u = round(s1 / s0; digits=3)
    v = round(sqrt(s2 / s0 - u * u); digits=3)
    u, m, v = Float64.(round.((u, m, v); digits=3))
    println("total: n: $s0 - median: $m - mean: $u ± $v ulps")
end

nx = [randpower() for _ in 1:10^6];
D = stat_ulps(nx);
overview(D)

@benchmark [x^n for (n, x) in nx]

Output with the version of this PR:

julia> nx = [randpower() for _ in 1:10^6];

julia> D = stat_ulps(nx);

julia> overview(D)
1000000
0 => 994416.0
1 => 5584.0
total: n: 1.0e6 - median: 0.003 - mean: 0.006 ± 0.074 ulps

julia> @benchmark [x^n for (n, x) in nx]
BenchmarkTools.Trial: 46 samples with 1 evaluation.
 Range (min … max):  108.233 ms … 117.334 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     108.887 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   109.416 ms ±   1.533 ms  ┊ GC (mean ± σ):  0.24% ± 0.44%

   █                                                             
  ▄█▆▁▁▁▇▅▃▃▁▁▁▁▃▇▃▁▁▃▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  108 ms           Histogram: frequency by time          117 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 4.

The output of the same commands on v1.10.2:

julia> nx = [randpower() for _ in 1:10^6];

julia> D = stat_ulps(nx);

julia> overview(D, 10)
1000000
0 => 627937.0
1 => 18924.0
2 => 6065.0
3 => 4011.0
4 => 3136.0
5 => 2520.0
6 => 2157.0
7 => 1774.0
8 => 1626.0
9 => 1494.0
10 => 1303.0
>10 => 329053.0
total: n: 1.0e6 - median: 0.296 - mean: 3.753 ± 5.142 ulps

julia> @benchmark [x^n for (n, x) in nx]
BenchmarkTools.Trial: 53 samples with 1 evaluation.
 Range (min … max):  94.458 ms … 101.908 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     94.580 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   95.161 ms ±   1.386 ms  ┊ GC (mean ± σ):  0.13% ± 0.44%

  ▃█▂                                                           
  ███▅▁▁▅▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▅▇▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▅ ▁
  94.5 ms       Histogram: log(frequency) by time      98.8 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 3.

@oscardssmith (Member)

That looks great! It's also really nice to have a second person who actually understands this code well. Can you also post the benchmarks for nx = rand(1:20, 10^6)? I do think in the real world most uses are with pretty small powers, so it would be good to make sure that those don't regress.

@KlausC (Contributor, author) commented Apr 3, 2024

In the realm of the old implementation, where exponents are less than 2^25, the results are comparable in both accuracy and performance:

The benchmark evaluates 10^6 powers at a mean of about 59 ns (new) versus 51 ns (old) per power.

New:

julia> nx = [randpower(maxex = 2^25) for _ in 1:10^6];

julia> D = stat_ulps(nx);

julia> overview(D)
1000000
0 => 994076.0
1 => 5924.0
total: n: 1.0e6 - median: 0.003 - mean: 0.006 ± 0.077 ulps

julia> @benchmark [x^n for (n, x) in nx]
BenchmarkTools.Trial: 84 samples with 1 evaluation.
 Range (min … max):  59.079 ms … 67.466 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     59.197 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.655 ms ±  1.073 ms  ┊ GC (mean ± σ):  0.46% ± 0.68%

  ▅█▂         ▂▄▂▁                                             
  ███▅▁▁▁▁▁▁▁▁████▅▅▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▇▅▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  59.1 ms      Histogram: log(frequency) by time      62.3 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 4.

For comparison, the old implementation:

julia> nx = [randpower(maxex = 2^25) for _ in 1:10^6];

julia> D = stat_ulps(nx);

julia> overview(D, 10)
1000000
0 => 993824.0
1 => 6176.0
total: n: 1.0e6 - median: 0.003 - mean: 0.006 ± 0.078 ulps

julia> @benchmark [x^n for (n, x) in nx]
BenchmarkTools.Trial: 97 samples with 1 evaluation.
 Range (min … max):  51.343 ms … 59.751 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     51.445 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   51.942 ms ±  1.381 ms  ┊ GC (mean ± σ):  0.11% ± 0.36%

  █▃             ▁                                             
  ██▄▁▄█▄▁▁▁▁▁▁▁▁█▆▁▁▁▆▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  51.3 ms      Histogram: log(frequency) by time      59.2 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 3.

Baseline for the benchmark:

julia> @benchmark [x*n for (n, x) in nx]
BenchmarkTools.Trial: 3076 samples with 1 evaluation.
 Range (min … max):  1.493 ms …   3.959 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.543 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.620 ms ± 214.669 μs  ┊ GC (mean ± σ):  2.72% ± 6.65%

  ▂▆██▇▆▄▃▂▂▂▂▂                               ▁▄▃▂▁           ▁
  ██████████████▇▇██▇▇▆▇▆▅▄▆▆▇▅▅▄▆▅▅▅▆▄▄▄▅▄▆▆▅█████▇▇▇▅▅▅▄▁▄▅ █
  1.49 ms      Histogram: log(frequency) by time      2.22 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 3.

@oscardssmith (Member)

I also wonder whether it would be worth making it so that big powers go to the regular Float64^Float64 path. My benchmarking suggests that around n = 10000 the Float64 path is faster.

@KlausC (Contributor, author) commented Apr 3, 2024

> Can you also post the benchmarks for nx = rand(1:20, 10^6)?

Do you mean restricting the exponents to -20:20? Here we go:

julia> nx = [randpower(maxex = 20) for _ in 1:10^6];

julia> D = stat_ulps(nx);

julia> overview(D)
1000000
0 => 974132.0
1 => 25868.0
total: n: 1.0e6 - median: 0.013 - mean: 0.026 ± 0.159 ulps

julia> @benchmark [x^n for (n, x) in nx]
BenchmarkTools.Trial: 199 samples with 1 evaluation.
 Range (min … max):  24.223 ms … 31.170 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     24.711 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.151 ms ±  1.097 ms  ┊ GC (mean ± σ):  1.18% ± 1.73%

   ▂██▁    ▇                                                   
  ▃████▂▂▄██▇▅▃▃▁▂▁▁▁▁▁▁▁▂▃▄▃▄▁▁▂▁▁▁▃▃▁▁▁▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  24.2 ms         Histogram: frequency by time        30.1 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 4.

julia> VERSION
v"1.12.0-DEV.292"
julia> nx = [randpower(maxex = 20) for _ in 1:10^6];

julia> D = stat_ulps(nx);

julia> overview(D, 10)
1000000
0 => 974340.0
1 => 25660.0
total: n: 1.0e6 - median: 0.013 - mean: 0.026 ± 0.158 ulps

julia> @benchmark foo($nx)
BenchmarkTools.Trial: 240 samples with 1 evaluation.
 Range (min … max):  20.665 ms …  26.005 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     20.741 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.899 ms ± 577.308 μs  ┊ GC (mean ± σ):  0.27% ± 0.84%

  ▇█▅▃                                                          
  ████▆▁▁▁▁▅▁▁▅▆█▆▆▁▅▄▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▄▄▁▁▁▁▁▄▁▁▁▁▁▁▆ ▅
  20.7 ms       Histogram: log(frequency) by time      23.5 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

julia> VERSION
v"1.10.2"

@KlausC (Contributor, author) commented Apr 3, 2024

> whether it would be worth making it so that big powers go to the regular Float64^Float64 path.

I will check how performance versus accuracy works out.

@KlausC (Contributor, author) commented Apr 3, 2024

> I will check how performance versus accuracy works out.

@oscardssmith yes, that would be a better way. I wasn't aware that we have a double-precision implementation of log/exp behind the scenes. That has high accuracy for all integer exponents which can be represented as Float64, and superior performance except perhaps for exponents in roughly -10000 < n < 10000 (threshold to be evaluated). For the remaining integers, those with Int64(Float64(n)) != n, special handling is required.

This is a separate PR, though. If that is done, this PR is no longer required.
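
A rough sketch of what such a dispatch could look like. Purely illustrative: pow_int_kernel is a hypothetical stand-in for the existing integer-power kernel, and the 2^13 cutoff is the one the superseding PR later settled on.

# Route huge integer exponents through the Float64^Float64 path,
# splitting off the part of n that Float64 cannot represent exactly.
function pow_dispatch_sketch(x::Float64, n::Int64)
    -2^13 <= n <= 2^13 && return pow_int_kernel(x, n)  # small n: integer kernel
    nf = Float64(n)
    nf == 2.0^63 && (nf = prevfloat(nf))   # keep Int64(nf) representable
    if nf == n
        return x^nf                        # n exact in Float64: float path
    else
        d = n - Int64(nf)                  # small residual exponent, |d| <= 1023
        return x^nf * pow_int_kernel(x, d) # combine; costs a few extra ULPs
    end
end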

Nevertheless, it was a challenging task to squeeze the last ULP out of the algorithm, and much fun!
The essential algorithmic improvement was merging the hi and lo precision parts of the variables frequently enough that the lo part could not explode in size. For bases less than 1 with negative exponents, special handling was needed. That all came at the price of a 20-30% performance reduction compared with the existing implementation.
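
The merging step is in the spirit of standard double-word arithmetic; a minimal self-contained sketch of the idea (my illustration, not the PR's code):

# The running power is kept as an unevaluated sum hi + lo; renormalizing
# after every step keeps lo a tiny correction to hi instead of letting it grow.
function two_sum(a::Float64, b::Float64)
    hi = a + b
    t = hi - a
    lo = (a - (hi - t)) + (b - t)
    return hi, lo                     # hi + lo == a + b exactly
end

function mul_dw(hi::Float64, lo::Float64, x::Float64)
    p = hi * x
    e = fma(hi, x, -p)                # exact rounding error of hi * x
    return two_sum(p, fma(lo, x, e))  # fold lo back into the hi part
end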

Note that issue #53881 is still open.

@KlausC marked this pull request as ready for review April 3, 2024 21:26
@KlausC (Contributor, author) commented Apr 14, 2024

This PR is superseded by #53967 and closed.

@KlausC closed this Apr 14, 2024
oscardssmith pushed a commit that referenced this pull request Apr 22, 2024
Improve performance of `^(::Float64, n::Integer)` in the case of `abs(n)
> 2^13`.

While `pow_body` is unreliable for `abs(n) > 2^25`, this implementation provides errors of a few ULPs, while runtime is capped at that of the `Float64` implementation.

Fixes #53881
See also #53886.
Successfully merging this pull request may close these issues:

BUG: ^(::Float64, ::Union{Int, Float64}) incorrect for very large negative exponents