-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Emit more efficient code for simple for-comprehensions #8880
Comments
A simple way to remove the foreach is using case class MyRange(start: Int, end: Int) {
inline def foreach(inline f: Int => Unit) =
var i = start
val e = end
while (i < e) {
f(i)
i += 1
}
}
object App {
val count: Int = 4
for (i <- MyRange(0, count)) {
Console.println("Number: " + i)
}
} Which generates the code val count: Int = 4
{
val MyRange_this: MyRange = MyRange.apply(0, App.count)
{
var i: Int = MyRange_this.start
val e: Int = MyRange_this.end
while(i < e) {
Console.println("Number: ".+(i))
i = i.+(1)
}
}:Unit
} We could also try to use opaque types to avoid the creation of the object in some cases. |
We cannot do this in practice currently because |
Ouch, that sounds like a pretty big semantic landmine... almost at the level of surprise factor as the old (deprecated) Are there legitimate uses of |
I gave that a shot but pretty quickly ran into compiler issues. It seems that inline and opaque types are mutually exclusive (see #6802 ) |
@dhoepelman What if you worked around it by defining extension methods outside of the companion object in a trait, then implement the trait as given in companion object? object X {
given ExtensionOps
}
trait ExtensionOps {
inline final def map(...)...
} |
With extension methods, I got pretty close. object Ranges {
opaque type ZeroBasedRange = Int
object ZeroBasedRange {
given ExtensionOps
def apply(end: Int): ZeroBasedRange = end
}
trait ExtensionOps {
inline final def (range: ZeroBasedRange) foreach(inline f: Int => Unit) = {
var i = 0
val e = range.asInstanceOf[Int]
while (i < e) {
f(i)
i += 1
}
}
}
}
object App {
final val count: Int = 4
for (i <- Ranges.ZeroBasedRange(count)) {
Console.println("Number: " + i)
}
} This compiles into this: private App$() {
MODULE$ = this;
this.count = 4;
final Ranges$ module$ = Ranges$.MODULE$;
final int range = Ranges.ZeroBasedRange$.MODULE$.apply(this.count());
for (int i = 0, e = range; i < e; ++i) {
Console$.MODULE$.println((Object)("Number: " + i));
}
final BoxedUnit unit = BoxedUnit.UNIT;
} Apart from that one apply function call (which I'd assume hotspot would optimize out), this seems to be optimal code. Of course, it only works for ranges starting with 0. |
I wanted to see if there's a way to generalize this to Ranges that don't start at 0. The only way I could think of is by packing two Ints into a Long like this: object Ranges {
opaque type AnyRange = Long
object AnyRange {
given AnyRangeExtensionOps
def apply(start: Int, end: Int): AnyRange = start.toLong << 32 | end.toLong
}
trait AnyRangeExtensionOps {
inline final def (range: AnyRange) foreach(inline f: Int => Unit) = {
var i = (range.asInstanceOf[Long] >> 32).toInt
var e = range.asInstanceOf[Long].toInt
while (i < e) {
f(i)
i += 1
}
}
}
}
object App {
final val count: Int = 4
for (i <- Ranges.AnyRange(0, count)) {
Console.println("Number: " + i)
}
} This produces relatively efficient code: private App$() {
MODULE$ = this;
this.count = 4;
final Ranges$ module$ = Ranges$.MODULE$;
final long range = Ranges.AnyRange$.MODULE$.apply(0, this.count());
for (int i = (int)(range >> 32), e = (int)range; i < e; ++i) {
Console$.MODULE$.println((Object)("Number: " + i));
}
final BoxedUnit unit = BoxedUnit.UNIT;
}
// AnyRange$.apply:
public long apply(final int start, final int end) {
return (long)start << 32 | (long)end;
}
While there's some bit-shifting, we at least don't pay any cost for object creation. |
Note that using bitwise tricks like that can potentially confuse the JIT and prevent it from eliding bounds checks. |
You can use inline matches to create and then pattern-match on a phantom compile-time only case class - this should allow you to omit the structure completely when it's known at compile-time and fallback to accessing the structure fields otherwise. |
Any idea how to accomplish this with opaque types? Also, I'd assume that while it isn't ideal, the current Scala/Dotty solution of creating a Range instance would have the same issue? |
Would you mind giving me a pointer to how this could be done? I looked up phantom classes, but it seems like they were replaced by unused, which was then renamed to ghost. And I can't find any current document about that now :-) |
No, but I don't think that's a good solution anyway compared to getting inlining working right.
Depends what code the JIT manages to inline. |
@danlehmann final case class MyRange(from: Int, to: Int)
inline def (from: Int) to (to: Int): MyRange = MyRange(from, to)
inline def (inline myRange: MyRange).foreach(inline f: Int => Unit): Unit = {
inline myRange match {
case MyRange(from, to) => while(...) { ... }
}
}
def test = for (i <- 0 to 5) { println(i) } see http://dotty.epfl.ch/docs/reference/metaprogramming/inline.html#inline-matches |
This is perfect, thanks a lot. This creates close to optimal code without any objects or even function calls. This could be quite useful in case the inliner isn't able to resolve the general case by the time Scala 3.0 ships. |
It's possible to add an inline method overriding package example
import scala.quoted.unsafe.UnsafeExpr
object MyPredef {
type MyRange <: Range.Inclusive & InlineForeach
inline def (inline from: Int) to (inline to: Int): MyRange = new Range.Inclusive(from, to, 1).asInstanceOf[MyRange]
sealed trait InlineForeach { this: MyRange =>
inline final def foreach(inline f: Int => Unit): Unit = ${ macroImpl('this)('f) }
}
import quoted._
def macroImpl(self: Expr[MyRange])(f: Expr[Int => Unit])(using qctx: QuoteContext): Expr[Unit] = {
UnsafeExpr.underlyingArgument(self) match {
case '{new Range.Inclusive($from, $to, 1).asInstanceOf[MyRange]} =>
'{ inlineForeach($from, $to + 1, $f) }
}
}
inline def inlineForeach(inline from: Int, inline end: Int, inline f: Int => Unit): Unit = {
var i = from
while (i < end) {
f(i)
i += 1
}
}
}
import MyPredef._
object App {
def main(args: Array[String]): Unit = {
for (i <- 1 to 5) do println(i)
}
} Although this version with tasty-reflect generates less optimal code than an inline match does: public void main(String[] args) {
new Inclusive(1, 5, 1);
for(int i = 1; i < 6; ++i) {
((ix) -> {
.MODULE$.println(BoxesRunTime.boxToInteger(ix));
}).apply(BoxesRunTime.boxToInteger(i));
}
BoxedUnit var10000 = BoxedUnit.UNIT;
} No idea how to eliminate the lambda and the redundant |
The only sound way to demand the elimination of the prefix is using an extension method where you define the prefix as inline (or maybe by name). Otherwise, we can only rely on a backend optimizer (including JIT). The use of |
Hmm, is there a way to unpack |
You would need to define the inline method as inline def (inline self: MyRange).foreach(inline f: Int => Unit): Unit = ${ macroImpl('self)('f) } |
@nicolasstucki That wouldn't work because there's already a member method |
Indeed. Maybe, in this case, we should Why do you need to have it fall back to |
Implicit conversions tend to fail in many cases where variance works well. Because I'm trying to show something that could work as part of |
We have inline def, why not have inline class? Similar to inline def, an inline class would be inlined at call site. All def in an inline class would also have to be inline. def efficientLoop =
for x <- Range(0,10,1) do
println(x)
inline class Range(val start: Int, val end: Int, val step: Int):
inline def foreach(f: Int => Any): Unit =
var i = start
while i < end do
f(i)
i += step would translate to something like def efficientLoop =
val r$start = 0
val r$end = 10
val r$step = 1
var i = r$start
while i < r$end do
println(i)
i += r$step Of course, if you want to store a
If the aforementioned limitation is too big or seems too weird, note that we don't even need I've found a very old SIP about inline classes 😄 However, it is a little different, because it compiles methods to actual methods in the companion object. Note that this is also different from Alternative: special-case |
The Scala 2 backend optimizer (which hasn't been ported to Scala 3 so far) does this sort of best-effort inlining: https://www.lightbend.com/blog/scala-inliner-optimizer, but in practice it doesn't help a lot on top of what the JIT does, and it clashes with separate compilation so only end-user applications can enable it safely. |
It seems that this optimizer cannot eliminate allocations. def main(): Unit = {
for (x <- 1 until 10) {
println(x)
}
} turns, with public void main() {
boolean bool = true;
int until$extension_end = 10;
Range.Exclusive exclusive = new Range.Exclusive(bool, until$extension_end, 1);
if (!exclusive.isEmpty()) {
int foreach$mVc$sp_i = exclusive.start();
while (true) {
Integer integer = Integer.valueOf(foreach$mVc$sp_i);
Object println_x = integer = null;
Console$.MODULE$.println(println_x);
println_x = null;
if (foreach$mVc$sp_i != ((Range)exclusive).scala$collection$immutable$Range$$lastElement) {
foreach$mVc$sp_i += exclusive.step();
continue;
}
break;
}
}
} The foreach is inlined to a while loop, which is much better than what the dotty backend currently produces, but still far from perfect. In the case of for-loops it helps quite a lot, however. I guess that porting the backend is planned but still TODO? |
First of all, I do not think it is worth bothering with an inlining optimizer in Scala 3. JVM will do a better job at it, given the runtime characteristics. What you should do instead, is make sure that the compiler generates JVM-friendly code. In lampepfl/dotty-feature-requests#308 I give exactly this argument. If you use a bare for loop, JVM can do many optimizations. It can unroll your loops, it can unswtich your loop, It can special case first and last iteration, it can split loops in two or more. I think you are throwing all these great optimizations out of the window by generating a code that uses foreach with lambda. Desugaring of for compehension in the compiler is already governned by a spec (you desugar for into filter and map). Since the compiler already does something special to desugar the for comprehension, why not do this as well? |
If you want the compiler to generate such JIT-compiler-friendly code in way more cases that just Of course, the spec could be modified to special-case that simple for-loop and declare that it translates to a (1) Actually, what would be great isn't just method inlining but scalar replacement, i.e. removing the object allocation completely, as I mentioned earlier. The Hotspot's JIT compiler does not do this. GraalVM does this a bit, because it has an escape analysis phase, but not everyone runs on GraalVM. |
Currently, the desugaring of for comprehension is purely syntactic (it happens before typechecking, on the untyped AST) but to decide whether we can safely optimize a for-comprehension into a for loop, we'd need to typecheck it first (otherwise we can't know if I don't think there's anything fundamental that prevents us from having a type-directed desugaring of for comprehensions, but it will require a fair amount of work. It might be worth it since it should make it easier to solve other long-standing issues with for-comprehensions (in particular, #2573 and re-implementing https://github.com/oleg-py/better-monadic-for#final-map-optimization) |
A general purpose optimizer with inlining, scalar replacement and constant folding can optimizing away the range I know this is possible because Scala.js does it. |
I've been using Scala for years to write performance-critical code. One the pain points has always been the sub-par performance or seemingly benign for-comprehensions.
While it is clear that complex for-comprehensions emit inefficient code, a very common use-case is using them as a replacement for simple for-loops.
As an example, the two loops below do the same thing. The for-comprehension is cleaner (i is scoped more narrowly, no need for var). But one look at the decompiled code shows you just how much more inefficient the for-comprehension is.
Minimized code
Output
I compiled using dotty-0.24.0-RC1. This is the decompiled output:
Expectation
Is there some way to allow using for-comprehensions with a guarantee that performant code is being produced? Maybe it would be possible to introduce something like @scala.annotation.for (analogous to @scala.annotation.switch)?
For me, such an improvement would have a big impact: In my project, I spent hours translating for-comprehensions into simple while loops to reap (measurable) performance benefits. But it is unfortunate that in the process I had to make my code less readable and more error prone. Scala 2.13's inliner helped somewhat, but the results weren't reliable, so I mostly ended up avoiding for-comprehensions.
The text was updated successfully, but these errors were encountered: