Every function is implemented in generic Go and, where justified, also optimized for AMD64 (using SSE2 instructions).
The AMD64 implementation uses MOVUPS/MOVUPD instructions when all strides are equal to 1, so it runs fast on Nehalem, Sandy Bridge, and newer processors but relatively slowly on older processors.
Every implemented function has its own unit test and benchmark.
Level 1
Sdsdot, Sdot, Ddot, Snrm2, Dnrm2, Sasum, Dasum, Isamax, Idamax, Sswap, Dswap, Scopy, Dcopy, Saxpy, Daxpy, Sscal, Dscal, Srotg, Drotg, Srot, Drot
Level 2
not implemented
Level 3
not implemented
#### Example benchmarks
Function | Generic Go | Optimized for AMD64 |
---|---|---|
Ddot | 2825 ns/op | 895 ns/op |
Dnrm2 | 2787 ns/op | 597 ns/op |
Dasum | 3145 ns/op | 560 ns/op |
Sdsdot | 3133 ns/op | 1733 ns/op |
Sdot | 2832 ns/op | 508 ns/op |