math: FMA is slower than non-FMA calculation #36196
Which CPU is used for the benchmark? Please benchmark with GODEBUG=cpu.fma=off and see if this changes anything. If FMA instructions are used, I would expect a change in the FMA benchmark numbers. Please also write the benchmark as a Go benchmark, then run the test with -count=20 and store the results in a file. Use a quiet machine, e.g. with no browser or videos running. Afterwards use https://godoc.org/golang.org/x/tools/cmd/benchcmp to compare; this gives better information about how consistent the results are between runs.
Intel Core i5 4460

package fma_test

import (
	"math"
	"math/rand"
	"testing"
)

// Package-level variables keep the results live so the compiler cannot
// optimize the benchmarked work away.
var va, vb, vc, vd float64

func pure_fma_func() {
	vd = math.FMA(va, vb, vc) // fused multiply-add: no rounding of the intermediate product
}

func non_fma_func() {
	vd = va*vb + vc // separate multiply and add: the product is rounded to float64 first
}

func BenchmarkFMA(b *testing.B) {
	va = rand.Float64()
	vb = rand.Float64()
	vc = rand.Float64()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		pure_fma_func()
	}
}

func BenchmarkNonFMA(b *testing.B) {
	va = rand.Float64()
	vb = rand.Float64()
	vc = rand.Float64()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		non_fma_func()
	}
}
(old, new, and benchstat output attached)

Added (old, new, and benchstat output attached)
I missed that you are using Windows, on which Go does not support using GODEBUG to disable CPU features used by Go. I should consider printing a warning. Disassembly for me shows:
(listing omitted)
which looks wrong, since it is missing the VFMADD231SD instruction. Using
(details omitted)
Running with
(details omitted)
but both are slower than without FMA. Note that the FMA version has more precision, as it does not do the 64-bit rounding between the steps.
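To make the precision point concrete, here is a small standalone example (not from the thread; the constants are chosen purely for illustration) where the fused and unfused results differ:

package main

import (
	"fmt"
	"math"
)

func main() {
	// a*a is exactly 1 + 2^-26 + 2^-54, which needs more than 53 significand
	// bits. The plain multiply rounds the 2^-54 term away before the add,
	// so a*a + c collapses to 0, while math.FMA keeps it.
	a := 1.0 + 0x1p-27
	c := -(1.0 + 0x1p-26)

	fmt.Println(a*a + c)           // 0
	fmt.Println(math.FMA(a, a, c)) // 5.551115123125783e-17 (i.e. 2^-54)
}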
VFMADD231SD has a 5-cycle latency on Haswell and two can be executed in parallel; the same holds for MULSD. The added check and jump, as well as other factors, can indeed make the FMA slower. This seems to be 1 or 2 cycles here. This might be working as intended (WAI) due to a slight overhead for runtime dispatch within the loop.
To verify that this is the runtime dispatch overhead of determining whether the CPU supports FMA, I changed the compiler in
(details omitted)
As noted, even if equally fast, FMA has the advantage of not rounding the intermediate step. As long as the built Go binary needs to support both FMA-capable and non-FMA-capable CPUs, there will be some overhead. Ideally that check could be moved outside the loop, but we do not have that currently. For the latter I thought we already had a general bug to move the checks.
The closest bug I could find related to this is #34950, which is intended to mark ops requiring feature detection so that the check isn't optimized away. However, there likely needs to be a separate issue to track the hoisting optimization for loop invariants. This will likely become even more important if vector intrinsics land in the stdlib someday.
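To illustrate what hoisting the feature check would mean, here is a rough user-level sketch (an analogy only, with hypothetical names like dotPerCall and dotHoisted: in real code the check is emitted by the compiler inside every math.FMA call and is invisible in Go source, and the runtime's actual non-FMA fallback is a software FMA that keeps full precision, not the plain multiply-add shown in the second loop):

package main

import (
	"fmt"
	"math"

	"golang.org/x/sys/cpu" // used here only to read a feature bit once
)

// hasFMA is read once; the point of the sketch is that the branch happens
// outside the loop instead of on every iteration.
var hasFMA = cpu.X86.HasFMA

// Today (conceptually): each math.FMA call re-checks the feature bit inside the loop.
func dotPerCall(a, b []float64) float64 {
	s := 0.0
	for i := range a {
		s = math.FMA(a[i], b[i], s)
	}
	return s
}

// Hoisted (conceptually): check once, then run one of two specialized loops.
func dotHoisted(a, b []float64) float64 {
	if hasFMA {
		s := 0.0
		for i := range a {
			s = math.FMA(a[i], b[i], s)
		}
		return s
	}
	s := 0.0
	for i := range a {
		s += a[i] * b[i] // illustration only; differs in rounding from a true FMA
	}
	return s
}

func main() {
	a := []float64{1, 2, 3}
	b := []float64{4, 5, 6}
	fmt.Println(dotPerCall(a, b), dotHoisted(a, b))
}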
Change https://golang.org/cl/212360 mentions this issue.
Running @josharian's CL (212360) gives a ~16% improvement on my Windows laptop, which had the same overhead mentioned above.
(old.txt, new.txt, and benchstat old.txt new.txt output attached)
Are you sure the compiler isn't optimising your code away?
@davecheney Just checked, and it is not. Disassembling the binary with
(listing omitted)
although objdump doesn't know about the VFMA* instructions yet, so that roughly translates to (in Intel syntax):
(listing omitted)
Have a look at what the benchmark loop is compiling to.
Ahh, ignore me, I missed that va, vb, etc. were package-level declarations.
/cc @griesemer @rsc
As investigated above, the slowdown is caused by dynamically checking on every iteration that the FMA CPU capability is present. math.FMA is still useful even if slower, as it has more precision than doing the computation with explicit intermediate results. What could be improved when FMA operations are executed in a loop is hoisting the CPU feature check and/or load out of the loop (up to even creating two loops) if the loop body is small. I would suggest we create a new generic CPU feature detection issue for that and close this issue.
Created the general CPU feature detection issue. Closing this one.
Before using some CPU instructions, we must check for their presence. We use global variables in the runtime package to record features. Prior to this CL, we issued a regular memory load for these features. The downside to this is that, because it is a regular memory load, it cannot be hoisted out of loops or otherwise reordered with other loads.

This CL introduces a new intrinsic just for checking cpu features. It still ends up resulting in a memory load, but that memory load can now be floated to the entry block and rematerialized as needed.

One downside is that the regular load could be combined with the comparison into a CMPBconstload+NE. This new intrinsic cannot; it generates MOVB+TESTB+NE. (It is possible that MOVBQZX+TESTQ+NE would be better.)

This CL does only amd64. It is easy to extend to other architectures.

For the benchmark in #36196, on my machine, this offers a mild speedup.

name      old time/op  new time/op  delta
FMA-8     1.39ns ± 6%  1.29ns ± 9%  -7.19%  (p=0.000 n=97+96)
NonFMA-8  2.03ns ±11%  2.04ns ±12%    ~     (p=0.618 n=99+98)

Updates #15808
Updates #36196

Change-Id: I75e2fcfcf5a6df1bdb80657a7143bed69fca6deb
Reviewed-on: https://go-review.googlesource.com/c/go/+/212360
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Giovanni Bajo <rasky@develer.com>
Change https://golang.org/cl/227238 mentions this issue.
In the commit message of CL 212360, I wrote:

> This new intrinsic ... generates MOVB+TESTB+NE.
> (It is possible that MOVBQZX+TESTQ+NE would be better.)

I should have tested. MOVBQZX+TESTQ+NE does in fact appear to be better.

For the benchmark in #36196, on my machine:

name      old time/op  new time/op  delta
FMA-8     0.86ns ± 6%  0.70ns ± 5%  -18.79%  (p=0.000 n=98+97)
NonFMA-8  0.61ns ± 5%  0.60ns ± 4%   -0.74%  (p=0.001 n=100+97)

Interestingly, these are both considerably faster than the measurements I took a couple of months ago (1.4ns/2ns). It appears that CL 219131 (clearing VZEROUPPER in asyncPreempt) helped a lot. And FMA is now once again slower than NonFMA, although this change helps it regain some ground.

Updates #15808
Updates #36351
Updates #36196

Change-Id: I8a326289a963b1939aaa7eaa2fab2ec536467c7d
Reviewed-on: https://go-review.googlesource.com/c/go/+/227238
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
What version of Go are you using (go version)?
(output omitted)

Does this issue reproduce with the latest release?
Yes

What operating system and processor architecture are you using (go env)?
(go env output omitted)

What did you do?
(code omitted) And go run.

What did you expect to see?
math.FMA is faster than non-FMA code.

What did you see instead?
I confirmed my CPU has SIMD FMA. Is this the overhead of a function call?
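To check the function-call question empirically, one could add inline variants to the benchmark file above (a sketch, not part of the original report; since the wrappers are tiny, the compiler will most likely inline them anyway, so the numbers should barely move):

func BenchmarkFMAInline(b *testing.B) {
	va, vb, vc = rand.Float64(), rand.Float64(), rand.Float64()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		vd = math.FMA(va, vb, vc) // same work as pure_fma_func, without the wrapper call
	}
}

func BenchmarkNonFMAInline(b *testing.B) {
	va, vb, vc = rand.Float64(), rand.Float64(), rand.Float64()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		vd = va*vb + vc // same work as non_fma_func, without the wrapper call
	}
}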