Loop optimizations investigation #428
See bc8f2de for how to inspect the machine instructions. Initial impressions:
Pushed a commit with updated jmh runs. @tnielens You asked a few weeks ago about those macro-based inlined ops. It turns out there are some jmh benchmarks in the tree for them. In at least some of these benchmarks the macros are on par with the raw arrays, while the regular ops are much slower; in some of the other runs all three implementations were similar. I am also not sure how to interpret the dot product results. I expected the manual unroll to be the slowest, as it is not supposed to trigger auto-vectorization (following the linked presentation).

JDK 17.0.2, older Intel CPU with AVX1
Awesome! Great post you linked: https://blogs.oracle.com/javamagazine/post/java-hotspot-hsdis-disassembler Good to know about the existing benchmarks. I want to have a look at them.
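For reference, a typical way to dump the JIT's output once hsdis is installed (standard HotSpot diagnostic flags; the method filter and the sbt-jmh invocation below are only illustrative):

```
# print assembly for everything the JIT compiles (requires the hsdis library)
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly ...

# or restrict the disassembly to one method of interest
java -XX:+UnlockDiagnosticVMOptions "-XX:CompileCommand=print,*VecSclrElemOp.apply" ...

# with sbt-jmh on Linux, the perfasm profiler attributes hot cycles to assembly lines
sbt "jmh:run -prof perfasm .*Bench.*"
```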
Some remarks about virtual call deoptimization. This can be observed with this benchmark:

```scala
import org.openjdk.jmh.annotations._
import org.saddle._

@State(Scope.Benchmark)
// important: @Fork(0) runs every benchmark method in one shared JVM,
// so the type profiles of the shared call sites pollute each other
@Fork(0)
@Threads(1)
class FourInPlaceScalarOpBench {
  @Param(Array("100"))
  var size: Int = _
  var m1: Mat[Double] = _
  var b: Double = _

  @Setup(Level.Iteration)
  def setup() = {
    m1 = mat.randn(size, size)
    b = scala.util.Random.nextDouble()
  }

  @Benchmark
  def saddleVirtual1Add(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 += b
    m1
  }

  @Benchmark
  def saddleVirtual1Mul(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 *= b
    m1
  }

  @Benchmark
  def saddleVirtual1Sub(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 -= b
    m1
  }

  // identical to the first set; re-measures the same ops after the call
  // sites have been profiled with all three operations
  @Benchmark
  def saddleVirtual2Add(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 += b
    m1
  }

  @Benchmark
  def saddleVirtual2Mul(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 *= b
    m1
  }

  @Benchmark
  def saddleVirtual2Sub(): Mat[Double] = {
    import org.saddle.ops.BinOps._
    m1 -= b
    m1
  }
}
```

Results:

Once the add, mul, and sub implementations have all been profiled in the shared JVM, the virtual `BinOp` call site becomes megamorphic and the JIT stops inlining it; the duplicated `saddleVirtual2*` methods re-measure the same ops in that polluted state. This behavior is also described in one of the JMH samples.
Very interesting, thank you! Unfortunately the Op is very likely to be strongly polymorphic in real-world client code, i.e. that virtual call deoptimization will occur unless it is inlined by the macro. I will open a PR to document this. Another possibility is to rework the saddle types to eliminate the Op interface. However, this would likely lead to the same inlined code as in the macro module. I would not 100% oppose simplifying this part of saddle, that is, removing the whole BinOp abstraction machinery.
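To illustrate the concern (a minimal sketch with made-up names, not saddle's actual types): once a virtual call site has seen three or more receiver classes, HotSpot treats it as megamorphic and stops inlining it, leaving a virtual dispatch inside the hot loop.

```scala
trait Op { def apply(x: Double, y: Double): Double }
object AddOp extends Op { def apply(x: Double, y: Double): Double = x + y }
object MulOp extends Op { def apply(x: Double, y: Double): Double = x * y }
object SubOp extends Op { def apply(x: Double, y: Double): Double = x - y }

def mapInPlace(a: Array[Double], b: Double, op: Op): Unit = {
  var i = 0
  while (i < a.length) {
    a(i) = op(a(i), b) // single call site; after AddOp, MulOp and SubOp have all
    i += 1             // passed through, the JIT profiles 3 receivers and gives up inlining
  }
}
```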
Adding extra `@inline` annotations helps as well:

```scala
// BinOpMat
@inline
implicit def MatSclrElmOpIpDDD[Op <: ScalarOp](implicit
    op: BinOp[Op, Double, Double, Double]
) // (return type and body elided in this excerpt)

final class MatSclrElemOp[
    OP <: ScalarOp,
    @spec(Boolean, Int, Long, Double) A,
    @spec(Boolean, Int, Long, Double) B,
    @spec(Boolean, Int, Long, Double) C: ST
](val op: BinOp[OP, A, B, C])
    extends BinOp[OP, Mat[A], B, Mat[C]] {
  @inline def apply(v1: Mat[A], v2: B) = {
    ...

// NumericOps
@inline
final def +=[B](
    other: B
)(implicit op: BinOpInPlace[Add, This, B]): Unit =
  op(this, other)

@inline
final def -=[B](
    other: B
)(implicit op: BinOpInPlace[Subtract, This, B]): Unit =
  op(this, other)

// Results
```
That said, the scala optimizer inlining must be enabled explicitly in user code for these annotations to take effect. Details are here, and there are some caveats with binary compatibility. Libraries based on saddle should not enable inlining.
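To make this concrete, a sketch of the sbt setting an end application (not a library) would add; the flag spelling follows the one used elsewhere in this thread and assumes a recent Scala 2.13 compiler:

```scala
// build.sbt of the final application; libraries depending on saddle should not set this
scalacOptions += "-opt:inline:org.saddle.**" // lets the optimizer inline saddle's @inline-annotated methods
```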
Thank you for the investigation. If I understand correctly, that means that if we sprinkle enough annotations, and the end user of the final application adds -opt:inline:** (and potentially -opt:local), then the megamorphic call sites will be inlined. I would vote for this solution in favor of the macros. The only thing is that we need a test suite to ensure this actually happens. What is your suggestion on how we should proceed? Should we first write a benchmarking suite where all operations and methods are checked against a hand-written array-based implementation and use that as a guide? (I don't propose to include that in the CI.)
I've been trying out other options. Here is an attempt at using `@inline` annotations only internally, shielding the user from that complexity. It goes a bit in the direction of the macro-based implementation.

```scala
// Binary element-wise in-place operation on one Mat and one scalar
@inline
private def matSclrElemOpIpBody[
    OP <: ScalarOp,
    @spec(Int, Long, Double) A,
    @spec(Int, Long, Double) B
](v1: Mat[A], v2: B, op: BinOp[OP, A, B, A]): Unit = {
  val sz = v1.length
  val v1a = v1.toArray
  var i = 0
  while (i < sz) {
    v1a(i) = op(v1a(i), v2)
    i += 1
  }
}

implicit def matSclrElemOpIpAdd[
    @spec(Int, Long, Double) A,
    @spec(Int, Long, Double) B
](implicit op: BinOp[Add, A, B, A]): BinOpInPlace[Add, Mat[A], B] =
  matSclrElemOpIpBody(_, _, op)

implicit def matSclrElemOpIpSub[
    @spec(Int, Long, Double) A,
    @spec(Int, Long, Double) B
](implicit op: BinOp[Subtract, A, B, A]): BinOpInPlace[Subtract, Mat[A], B] =
  matSclrElemOpIpBody(_, _, op)

implicit def matSclrElemOpIpMul[
    @spec(Int, Long, Double) A,
    @spec(Int, Long, Double) B
](implicit op: BinOp[Multiply, A, B, A]): BinOpInPlace[Multiply, Mat[A], B] =
  matSclrElemOpIpBody(_, _, op)

implicit def matSclrElemOpIpDiv[
    @spec(Int, Long, Double) A,
    @spec(Int, Long, Double) B
](implicit op: BinOp[Divide, A, B, A]): BinOpInPlace[Divide, Mat[A], B] =
  matSclrElemOpIpBody(_, _, op)

// ... etc.
```

The idea is to inline the body of the while loop once per operation to avoid megamorphic call sites. Specialization takes care of duplicating the loop further for each combination of specialized types.
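A hypothetical call site, to show what this buys (assuming the `NumericOps` `+=` shown earlier picks up these implicits):

```scala
import org.saddle._

val m: Mat[Double] = mat.randn(100, 100)
m += 2.0 // resolves matSclrElemOpIpAdd; its while loop was inlined once for Add,
         // so the op call inside the loop stays monomorphic for the JIT
```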
In this latest attempt, do you compile saddle with `-opt:inline:org.saddle.**`?
Yes. That's already part of the build config. I didn't need to add anything.
Out of the three alternatives (macros, inlining enabled by the end user, inlining within saddle), which one do you prefer?
I'm hesitant on this. Both @inline and macros are dropped in scala3. I don't think we should set out on a large refactoring and rework important bits on top of deprecated features. I'm not so keen on my @inline-based proposals above either, because they rely on JVM optimizations or JIT options that saddle can't test properly. I plan to spend some time understanding what breeze did. They seem to have used macros in the past and now have a scalameta-based solution for scala 2/3 cross building.
@pityka Here is an extract of the code generation output of DenseVectorOps.scala:

```scala
trait DenseVector_Vector_ExpandOps extends VectorOps with DenseVector_TraversalOps with DenseVector_SlicingOps {

  implicit val impl_OpMulInner_DV_V_eq_S_Int: breeze.linalg.operators.OpMulInner.Impl2[DenseVector[Int], Vector[Int], Int] = {
    new breeze.linalg.operators.OpMulInner.Impl2[DenseVector[Int], Vector[Int], Int] {
      def apply(a: DenseVector[Int], b: Vector[Int]) = {
        require(b.length == a.length, "Vectors must be the same length!")
        val ad = a.data
        var aoff = a.offset
        var result: Int = 0
        cforRange(0 until a.length) { i =>
          result += ad(aoff) * b(i)
          aoff += a.stride
        }
        result
      }
      implicitly[BinaryRegistry[Vector[Int], Vector[Int], OpMulInner.type, Int]].register(this)
    }
  }

  implicit val impl_OpMulInner_DV_V_eq_S_Double: breeze.linalg.operators.OpMulInner.Impl2[DenseVector[Double], Vector[Double], Double] = {
    new breeze.linalg.operators.OpMulInner.Impl2[DenseVector[Double], Vector[Double], Double] {
      def apply(a: DenseVector[Double], b: Vector[Double]) = {
        require(b.length == a.length, "Vectors must be the same length!")
        val ad = a.data
        var aoff = a.offset
        var result: Double = 0.0d
        cforRange(0 until a.length) { i =>
          result += ad(aoff) * b(i)
          aoff += a.stride
        }
        result
      }
      implicitly[BinaryRegistry[Vector[Double], Vector[Double], OpMulInner.type, Double]].register(this)
    }
  }

  // ...
  // goes on much longer
```

If saddle wants to refactor its BinOps implicit instances, I'd consider using the `@expand` annotation from breeze, which is available as a separate dependency. That would provide a unified solution for scala 2 and 3, with performance on par with the current macro-based one.
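For a flavor of the source side, a rough sketch of the `@expand` pattern as used in breeze's own DenseVectorOps (reconstructed from memory, so annotation arguments may differ between breeze versions; the body is elided): one generic definition is stamped out at compile time into a concrete, fully typed implicit per combination of element type and operator, yielding output like the extract above.

```scala
import breeze.macros.expand

@expand
implicit def impl_Op_DV_S_eq_DV[
    @expand.args(Int, Double, Float, Long) T,
    @expand.args(OpAdd, OpSub, OpMulScalar, OpDiv) Op <: OpType
](implicit
    @expand.sequence[Op]((_ + _), (_ - _), (_ * _), (_ / _))
    op: Op.Impl2[T, T, T]
): Op.Impl2[DenseVector[T], T, DenseVector[T]] = {
  // loop body written once, generically; @expand substitutes the concrete
  // operator from @expand.sequence and the concrete element type for T
  ...
}
```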
Thanks! I agree this is a good way forward. I can probably find some time for this during the summer, or let me know if you plan to contribute this refactor to saddle. No rush from my side; I don't plan to use scala3 this year.
Another possibility is to do nothing and hope that scala3 will one day get a TASTy-based whole-program optimizer. It is hard to believe that ad hoc template polymorphism (which is what the scalameta route amounts to) would become the recommended way. If we are in a pinch, then out of the routes considered it is the best one, but globally it is a step backward.
I found some rumors that the JVM will get automatic specialisation in the medium term. That should be weighed when considering the code generation route.
HotSpot has support for auto-vectorization. That is, the JVM will identify certain looping patterns over arrays and generate vectorized (SIMD) instructions.
Investigate whether saddle triggers these optimizations by inspecting the produced x86_64 assembly. Good candidates for the investigation are the binary operations implemented in `VecSclrElemOp`, `VecVecElemOp`, etc.
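As a baseline for that comparison, here is a hedged sketch of the loop shape HotSpot's superword pass is known to vectorize: a simple counted loop over an array with unit stride and no virtual calls in the body. Comparing its assembly against the BinOp-based loops would show whether the abstraction defeats SIMD.

```scala
// Baseline loop: with a modern HotSpot JIT this shape typically compiles to
// packed SIMD instructions (e.g. vaddpd on AVX-capable CPUs).
def addScalarInPlace(a: Array[Double], b: Double): Unit = {
  var i = 0
  while (i < a.length) {
    a(i) += b
    i += 1
  }
}
```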