
The benchmark penalizes languages that need to start a runtime environment (e.g. anything on the JVM or .Net) #220

Open
PEZ opened this issue Dec 6, 2024 · 17 comments



PEZ commented Dec 6, 2024

Hi,

Thanks for creating this project, @bddicken! 🙏

I have been running benchmarks in some different ways over the last few days. I notice that a lot of the difference between Java's performance and C's is about JVM start times. For Clojure the penalty is much bigger, because Clojure also needs to start on top of the JVM before the benchmarks can start crunching. To “see” the start time I created a hello-world benchmark, where the programs only print “Hello World!”. Here's a run on my Mac M4:

Benchmarking C
Benchmark 1: ./c/code 40
  Time (mean ± σ):       2.2 ms ±   0.2 ms    [User: 0.7 ms, System: 1.2 ms]
  Range (min … max):     2.0 ms …   2.5 ms    10 runs
 

Benchmarking Java
Benchmark 1: java jvm.code 40
  Time (mean ± σ):      38.7 ms ±   0.9 ms    [User: 14.1 ms, System: 21.4 ms]
  Range (min … max):    37.3 ms …  40.2 ms    10 runs
 

Benchmarking Clojure
Benchmark 1: java -cp clojure/classes:src:/Users/pez/.m2/repository/org/clojure/clojure/1.12.0/clojure-1.12.0.jar:/Users/pez/.m2/repository/org/clojure/core.specs.alpha/0.4.74/core.specs.alpha-0.4.74.jar:/Users/pez/.m2/repository/org/clojure/spec.alpha/0.5.238/spec.alpha-0.5.238.jar code 40
  Time (mean ± σ):     275.0 ms ±   4.3 ms    [User: 410.4 ms, System: 43.4 ms]
  Range (min … max):   270.0 ms … 285.0 ms    10 runs
 

Benchmarking Babashka
Benchmark 1: bb -cp clojure -m code 40
  Time (mean ± σ):      11.9 ms ±   0.4 ms    [User: 3.3 ms, System: 6.1 ms]
  Range (min … max):    11.1 ms …  12.4 ms    10 runs

Basically the JVM takes 40 ms to start, and then Clojure takes another 240 ms. Compare that with the benchmark times for e.g. C, Java, and Clojure, which run at 250, 350, and 550 ms, respectively, on the same machine.
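To make the subtraction concrete, here is a small back-of-the-envelope sketch (in Java, using the numbers quoted above; note it assumes the startup cost is a constant that simply adds to the workload time, which is itself an approximation):

```java
// Back-of-the-envelope: subtract each language's hello-world (startup) mean
// from its benchmark mean, using the numbers quoted in this thread.
// Assumption: startup cost is additive and constant across runs.
public class StartupAdjust {
    static double adjusted(double benchMs, double helloMs) {
        return benchMs - helloMs;
    }

    public static void main(String[] args) {
        System.out.printf("C:       ~%.1f ms%n", adjusted(250, 2.2));
        System.out.printf("Java:    ~%.1f ms%n", adjusted(350, 38.7));
        System.out.printf("Clojure: ~%.1f ms%n", adjusted(550, 275.0));
    }
}
```

By this rough estimate the adjusted compute times land around 248, 311, and 275 ms, which narrows the gap considerably.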

I can submit the hello-world benchmark as a PR, if you're interested. Start times can be very important for shell-scripting purposes and the like. In the Clojure world we have Babashka, which is interpreted Clojure. It starts super quickly, but of course can't compete with compiled languages in the other benchmarks.

Some questions:

  • Is this something you have thought about?
  • Would you be interested in me exploring ways to adapt the benchmark to give these VM-hosted languages a fairer run? Some ideas that have popped into my mind so far:
    • Include the hello-world benchmark and then subtract its times from the times in the other benchmarks.
    • Add some way to use different benchmarking tools. So for the JVM, use some tool that runs the programs from inside the JVM.
    • Make the programs take a second argument specifying a number of runs. So the fibonacci 40 would be run x times, and with a significantly high x, the startup times would play a much smaller role.
    • Extend the runs by running more loops and higher fibs.
    • Other than that I have no more ideas right now 😄 But I am willing to start sourcing ideas and doing experiments and such.
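The repeat-count idea could be sketched like this (a hypothetical Java wrapper; the class name and argument handling are illustrative, not code from the repo):

```java
// Hypothetical sketch of the "second argument = number of runs" idea:
// the JVM starts once, but the workload runs `runs` times, so startup
// cost is amortized across the whole measurement.
public class Fib {
    static long fib(long n) {
        return n <= 1 ? n : fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        long n = Long.parseLong(args[0]);
        long runs = args.length > 1 ? Long.parseLong(args[1]) : 1;
        long result = 0;
        for (long i = 0; i < runs; i++) {
            result = fib(n); // same work repeated; startup is paid only once
        }
        System.out.println(result);
    }
}
```

With, say, `java Fib 40 100`, the ~40 ms JVM startup becomes noise relative to the total runtime.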

WDYT?


PEZ commented Dec 6, 2024

See also:

Where there is some more discussion and benchmark results for Clojure and Java.


AndrewBabbitt97 commented Dec 19, 2024

TBH, for .NET the default and only option should just be NativeAOT with optimizations set for speed. For any of these kinds of benchmarks, running non-AOT is just putting a handicap on the language/runtime.


PEZ commented Dec 19, 2024

@AndrewBabbitt97 I'm beginning to think the same.


giordano commented Dec 21, 2024

The same applies to Julia, which is a just-in-time compiled language: you're mainly measuring startup time; the time to run the function is a fraction of it. I slightly modified the code to be able to measure the runtime:

diff --git a/levenshtein/julia/code.jl b/levenshtein/julia/code.jl
index 04ab029..af88189 100644
--- a/levenshtein/julia/code.jl
+++ b/levenshtein/julia/code.jl
@@ -45,8 +45,8 @@ function levenshtein_distance(s1::String, s2::String)::Int
     return prev_row[m + 1]
 end
 
-function main()
-    if length(ARGS) < 2
+function main(strings::Vector{String})
+    if length(strings) < 2
         println("Please provide at least two strings as arguments.")
         exit(1)
     end
@@ -55,10 +55,10 @@ function main()
     times = 0
 
     # Compare all pairs of strings
-    for i in 1:length(ARGS)
-        for j in 1:length(ARGS)
+    for i in eachindex(strings)
+        for j in eachindex(strings)
             if i != j
-                distance = levenshtein_distance(ARGS[i], ARGS[j])
+                distance = levenshtein_distance(strings[i], strings[j])
                 if min_distance == -1 || distance < min_distance
                     min_distance = distance
                 end
@@ -67,11 +67,12 @@ function main()
         end
     end
 
-    println("times: ", times)
-    println("min_distance: ", min_distance)
+    return times, min_distance
 end
 
 # Run main only if script is run directly
 if abspath(PROGRAM_FILE) == @__FILE__
-    main()
+    times, min_distance = main(ARGS)
+    println("times: ", times)
+    println("min_distance: ", min_distance)
 end

then what you're measuring is

$ input=($(cat ../input.txt))
$ cmd="julia code.jl ${input[@]}"
$ hyperfine --shell=none "${cmd}"
Benchmark 1: julia code.jl aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb cccccccccccccccccccccccccccccccccccccccccccccccccccc ddddddddddddddd eeeeeeeeeeeeeeeee ffffffffffffff ggggggggggg hhhhhhhhhhhhhhhhhhhhhhhhhhhhh iiiiiiiiiiiiiiiiiiiiiiiiiiiii jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk llllllllllllllllllllllllllllllllllllllllllllll mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn ooooooooooooooooooooooooooooooooo pppppppppppppppppppppppppppppppppppppppppp qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq rrrrrrrrrrrrrrrrrrrrrrrrrrrrr ssssssssssssssssssssssssssssss tttttttttttttttttttttttttt uuuuuuuuuuuuuuuuuuuuuuuuu vvvvvvvvvvvvvvvvvvvvvvvvvvvvv wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm 1234567890 QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm1234567890 mnxbckfjquwhduywetdftfqrcfgzbxmcsdlsppweuouhftvjcsmnbsdvuqtfvcjbxmcnbkaoqwerpweiurigjhdfjbmsbjgqiuyeiruyoisbnsvuyfguweygrhoaiosjdkjhfbsjhdvfkqjnoi gaywgeudgaewytfvajsdvfjahsdvfjhsdfjhgsfjhqowiueryiuweytruiyytfyqvhgvzxncbsidfjdflsdfpwoeiriuwheuyfvydtvqyuygbhskdjfbkajbuywgduywegfvsnmwehriygwuerygwerdfjhsdfsdkfhbsjdhfjuavufywefhuwlihfoqwehqrwheiruyuyqgruwieqrhsdjhfbjwhvdfisdufhikquhweifgbwfisdhfksjdhfiwuh ASDFHWERFIWSHHSDFJWERPWFOAHDSDFXMCVNWMERKJQEOQIWRUQIHJSBFSDFKJSDFOWIERIOU werkjhiauwryisydgcfkjsdfkjsjdhfguiyqgtfsdfSDGFDSFGwsrhqwigdeyiwefDFBdkqedfgdFHGHKJiFHTYRGFsefjhwhsgdfjhhjsdfDWEfsdfWEFfbgFGjYuioIOpvbnVBNSDFadfsSDFwegsdgfAWFDFGDFghjTyIGHJREGFsddqwdsdfweaWQAZZSDGgnlpHmHJMgkOYUTDFGSFSDFSDFDHGDFGSDFDGRFjhbjshdgfjhgsdfSDGDFG 
kjsdfkhgwieyfguwsyefgsdfSDFGJSDAKFDSAFIRFYUDSFHSBVXCVBNMSDKFJWOYSDFHKADSDnfsdfjbjshdbfkwjbfksdbfhwbkyguygyfshbcdmnxbvcmnsdkfsdfsdflspflwoekorwueiruygwuygjshbvnbvzcjsuhfdiuhsdfghkwjhdfiwegfjdhsgfbksjhfksdhgfhsdhfghfgfgfsjdjfhkwjehoueuq abcdefghijklmnopqrstuvwxyz thistringisnottoolong howcanwemakeabetterstringforcomparing absfeiihdkkeiurmnslkspoiqwrjlnjna jsfjlqpowiuewriugbsdmfasgdfqsfwwiewurmxbcaslkdjp qwerdsdftffygfuhbuhiok sfhdgafsdhafsdjaweqwueytuqwyefgagsdgfasdfajfgsieufowerpoafposdfmnxbcvjlkhhgdfkjasfguysdfuayfgjhvf ygiuwoqupqowei sjxvkzjha hfgauyiuyrwer sdf dsfgsdfdfg dfgdswefdsdgd ffhfghsdfghsdfeuryiwuyeriudhfmnsbmnbvsgdfkuweifgwefgisdgf sdfskdfhgksugfweriuwersdkfjbsewoir sfdqrieyft sdfnueyrtuwyerowierpoipmvmcxnvmnbsdnfbajeygruywgefugsdbaqwriuweuywiuyer dsfjgfuywegfugsufgqfgdjsdhfjhsdgfhsgdfiweoruoweruwer sdfhyairuywoerfuetuopiufudsvxcbvmsdflksfwpeoriiueief dsfygweurytwueyr sdfshfgwegfisfisdgfowuyeridfs sdfhsdfgsudyfguwyef sdfjsgdfysgdfuwgueyfs dfsgdfusgdfiweoriuwqoiru ygdifuwuoerupweurpquerowehmnxcbvmnxbvjcquqyerypeiproweur fhsdgfygsdfuygsudfgufrpesifgsdfshdfbuwyveugfud fefgiuyegsiufygdfigreoquyarcfdbnmnsbfdlsdfoiweprisdfqbyefg sdfugwuefgusdygfuygwdufytwuqytueryt efsgdvfuysgfdusydfuywguyeryoqurwreiueieieiiii
  Time (mean ± σ):     192.0 ms ±   7.8 ms    [User: 369.2 ms, System: 30.7 ms]

but the actual run time spent in the main function is

% julia -L code.jl -q
julia> using BenchmarkTools

julia> @benchmark main(strings) setup=(strings=String.(split(readchomp("../input.txt"), ' ')))
BenchmarkTools.Trial: 195 samples with 1 evaluation.
 Range (min … max):  24.934 ms …  28.091 ms  ┊ GC (min … max): 0.00% … 8.00%
 Time  (median):     25.646 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.715 ms ± 379.302 μs  ┊ GC (mean ± σ):  0.33% ± 0.98%

                ▂▂▃▂█▅█  ▁
  ▄▁▁▃▄▃▃▃▄▁▄▆▆▇███████▇██▅▄▅▅▄▅▆▄▅▄▅▃▄▁▁▄▁▁▃▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃ ▃
  24.9 ms         Histogram: frequency by time         27.1 ms <

 Memory estimate: 2.07 MiB, allocs estimate: 15628.

Edit: just noticed this issue overlaps with #149.


Gkodkod commented Dec 31, 2024

Hey @PEZ

Regarding the LinkedIn post you wrote today and our conversation there:
(screenshot from the LinkedIn conversation)

My prime suspect for why the PGO-enabled compile hurts the Clojure native build for the fibonacci program is the C2 JIT compiler's Escape Analysis. In theory it should detect that the variables in the recursion can't escape, but instead it produces some code that causes the heap to be used, and that in turn wakes the GC at some point.

I'm sorry I can't really describe it well, because I don't quite understand it well enough. Here's Escape Analysis explained:

https://lnkd.in/ddNnb95r

I think that during one, or only a few, runs through fibonacci(40), C2 is never asked to do any escape analysis (or it is possibly not in play at all). I'm not sure what happens during the PGO-enabled compile, but I'm thinking the profiling data puts C2 escape analysis into play.

I can trigger the slowdown by doing something like:

(dotimes [i 20] (println i) (time (-main "40")))

Then for the first 10 or so runs it will run at, say, 300 ms, and then they start to run in 700 ms instead.

Anyway, what I would really want to do is write the implementation in such a way that the escape analysis succeeds (if, indeed, this is the correct problem), so that I can enable PGO without hurting performance (in theory there is some performance to unlock, right?). If someone wants to have a look and experiment with it, please do. I am not knowledgeable enough to formulate new ideas of things to try.

Let me try to explain... and suggest ideas to try

The core problem is likely related to how the C2 JIT compiler performs Escape Analysis during the Profile-Guided Optimization (PGO) process:

  1. In the naive recursive Fibonacci implementation, the function creates many small, short-lived objects during recursive calls.

  2. Normally, the JIT compiler's Escape Analysis should:

  • Detect these objects don't "escape" the method
  • Eliminate heap allocations
  • Replace object allocations with scalar operations
  • Keep everything in registers or on the stack

  3. However, during PGO, something goes wrong:

  • The profiling data might trigger a different optimization path
  • Instead of eliminating allocations, it forces heap allocations
  • This causes unnecessary garbage collection overhead

  4. The performance degradation occurs because:

  • Heap allocations are created where they shouldn't be
  • The garbage collector gets triggered more frequently
  • Register/stack optimizations are bypassed
The key observation is the inconsistent behavior across multiple runs, where performance suddenly degrades after a few iterations. Not sure if my explanation is clear enough, yet I got some ideas for you. I'll provide a few alternative Clojure Fibonacci implementations that might help circumvent the Escape Analysis issue:

  1. Primitive-Focused Recursive Implementation:
;;;; Key optimizations:
;;;; 1. Use ^long type hint
;;;; 2. Use unchecked- operations to avoid boxing
;;;;  3. Explicit long arithmetic
;;;;  4. Minimal object creation
(defn fibonacci-primitive [^long n]
  (if (< n 2)
    (long n)
    (+ (fibonacci-primitive (unchecked-dec n))
       (fibonacci-primitive (unchecked-dec (unchecked-dec n))))))

Pros:

  • Closest to the original naive recursive implementation
  • Type hints help JVM understand primitive types
  • Minimal overhead

Cons:

  • Exponential time complexity O(2^n)
  • High risk of stack overflow for large n
  • May still create intermediate objects
  2. Tail-Recursive with Primitives:
;;;; Key optimizations:
;;;; 1. Tail recursion (easier for JIT to optimize)
;;;; 2. Primitive loop variables
;;;;  3. No intermediate object allocations
;;;;  4. Direct arithmetic
(defn fibonacci-tailrec [^long n]
  (loop [a (long 0)
         b (long 1)
         count (long n)]
    (if (zero? count)
      a
      (recur b 
             (unchecked-add a b) 
             (unchecked-dec count)))))

Pros:

  • Linear time complexity O(n)
  • Constant space complexity
  • JVM can optimize tail recursion
  • No intermediate object allocations

Cons:

  • Less readable than recursive approach
  • Requires understanding of tail-call optimization
  3. Memoized Functional Approach:
;;;; Key optimizations:
;;;; 1. Memoization prevents redundant calculations
;;;; 2. Primitive type hints
;;;;  3. Cached results reduce computational overhead
(def fibonacci-memo 
  (memoize 
   (fn [^long n]
     (if (< n 2)
       (long n)
       (+ (fibonacci-memo (unchecked-dec n))
          (fibonacci-memo (unchecked-dec (unchecked-dec n))))))))

Pros:

  • Caches previous calculations
  • Dramatically reduces computational complexity
  • Preserves functional programming style

Cons:

  • Uses memory to store memoized results
  • Overhead of memoization for small n
  • Potential memory leak if not bounded
  4. Transient Iterative Approach:
(defn fibonacci-transient [^long n]
  (loop [a (long 0)
         b (long 1)
         count (long n)]
    (if (zero? count)
      a
      (recur b 
             (unchecked-add a b) 
             (unchecked-dec count)))))

Pros:

  • Most predictable performance
  • Lowest memory overhead
  • Easiest for JIT to optimize
  • Linear time and space complexity

Cons:

  • Least "functional" approach
  • Imperative style
  • Less elegant than recursive solutions

These implementation suggestions, of course, need to be thoroughly benchmarked using hyperfine, like all the other languages. They use primitive type hints/arithmetic while minimizing object creation and preferring tail recursion or iteration. They also make escape analysis "obvious," reducing GC pressure and making it easier for the JIT to optimize. They should work well with Profile-Guided Optimization, I hope.

Let @bddicken and me know your findings, and a most happy New Year!


PEZ commented Dec 31, 2024

Thanks, @Gkodkod ❤️

I have spent a significant amount of time in this repo rejecting contributions that go for memoization, tail recursion, and iteration, since those are not in the spirit of the benchmark, which seeks to measure a language's function-call overhead. So we can't really go there with this one.

As for the primitive focused recursive approach, I think that's largely what we do already:

(defn- fibonacci ^long [^long n]
  ((fn fib ^long [^long n]
    (if (<= n 1)
      n
      (+ (long (fib (- n 1)))
         (long (fib (- n 2)))))) n))

Note that we actually recurse in a named lambda. This is because using a namespaced function for it makes the Clojure compiler produce a different kind of function call, which hurts performance badly. I think it is something with this “private” function where C2 goes wrong.

The type hints and casts are enough for the Clojure compiler to understand that it should do unchecked integer math. The above function, when compiled, decompiles to this:

// Decompiling class: code$fibonacci$fib__10924
import clojure.lang.*;

public final class code$fibonacci$fib__10924 extends AFunction implements LL
{
    @Override
    public final long invokePrim(final long n) {
        return (n <= 1L) ? n : (RT.longCast(this.invoke(Numbers.num(Numbers.unchecked_minus(n, 1L)))) + RT.longCast(this.invoke(Numbers.num(Numbers.unchecked_minus(n, 2L)))));
    }
    
    @Override
    public Object invoke(final Object o) {
        return ((IFn.LL)this).invokePrim(RT.uncheckedLongCast(o));
    }
}

As far as I know that's as good as it gets. And we are looking at the benchmark winning code, which I think agrees with that assessment.

I've also tried using a let binding for the function, and let it live inside -main, but it triggers the same PGO (and long-running) issue, and I opted for the current solution because it's a bit easier to see what's going on (at least for me).


Gkodkod commented Jan 1, 2025

@PEZ Thank you for explaining your thoughts about this so eloquently. I also appreciate the produced decompiled Java code with your winning code.

With that being said, allow me to compare this code to my suggested Primitive-Focused Recursive implementation with regard to performance differences, and of course, you can decide which way you want to go.

Current Code:

 (defn- fibonacci ^long [^long n]
  ((fn fib ^long [^long n]
    (if (<= n 1)
      n
      (+ (long (fib (- n 1)))
         (long (fib (- n 2)))))) n))

My Suggested Implementation:

(defn fibonacci-primitive [^long n]
  (if (< n 2)
    (long n)
    (+ (fibonacci-primitive (unchecked-dec n))
       (fibonacci-primitive (unchecked-dec (unchecked-dec n))))))

Key Differences:

1. Nested Anonymous Function vs Direct Recursion

  • Existing: Uses an inner anonymous function (fn fib ...)
  • Suggested: Direct recursive function definition

2. Explicit Type Casting

  • Existing: Uses (long ...) to cast recursive calls
  • Suggested: Relies more on type hints and unchecked- operations

3. Condition Check

  • Existing: (<= n 1)
  • Suggested: (< n 2) - slightly more idiomatic

4. Arithmetic Operations

  • Existing: Uses standard subtraction (- n 1)
  • Suggested: Uses unchecked-dec for potential performance gain

5. Complexity

  • Existing: Slightly more complex due to nested function
  • Suggested: More straightforward, direct recursion

The performance differences between the two are slight but might show up in the decompiled Java code:

  1. The existing code introduces an extra function call per recursion
  2. The extra (long ...) casting might create more intermediate objects
  3. My suggestion aims to minimize object creation and function call overhead, which as you mentioned, hurts performance badly.

My suggested implementation is likely to be more friendly to Escape Analysis because it could potentially produce fewer function calls, require less explicit type casting, give objects more direct lifetimes, and use unchecked- operations that hint at primitive behavior. Again, it is your call whether you would like to try testing it. I would love to see if it improves the actual test results and the decompiled Java, including the main, so we can further analyze them both.

I wish I had more time to spend on the startup-time issues you've mentioned, and I actually had some ideas, yet I cannot offer any advice without clearly understanding the issues from you. It is a brilliant idea to include it in your latest benchmark visualization. Additionally, allow me to once more thank you for the excellent time and work you've put into the whole repo, and especially the most brilliant benchmark-visualization repo you put together. It's so amazing and helpful for so many!!!

❤️❤️❤️ Gott nytt år till dig och hela familjen! ❤️❤️❤️


PEZ commented Jan 1, 2025

Thanks again! It's a new year here now. 🎉

The problem with using direct recursion is that it generates a different, less performant way of making the function call. Here are benchmarks:

Nested anonymous function

Benchmarking Clojure Native
Benchmark 1: ./clojure-native-image/code 40
  Time (mean ± σ):     307.6 ms ±   4.4 ms    [User: 302.6 ms, System: 3.5 ms]
  Range (min … max):   299.8 ms … 312.6 ms    7 runs

Direct recursion

Benchmarking Clojure Native
Benchmark 1: ./clojure-native-image/code 40
  Time (mean ± σ):      1.992 s ±  0.009 s    [User: 1.975 s, System: 0.010 s]
  Range (min … max):    1.982 s …  2.003 s    7 runs

We can see it in a decompile:

public final class code$fibonacci extends AFunction implements LO
{
    public static final Var __code_fibonacci;
    
    public static Object invokeStatic(final long n) {
        return (n < 2L) ? Numbers.num(RT.longCast(n)) : Numbers.unchecked_add(((LO)__code_fibonacci.getRawRoot()).invokePrim(n - 1L), ((LO)__code_fibonacci.getRawRoot()).invokePrim(n - 1L - 1L));
    }
    
    @Override
    public Object invoke(final Object o) {
        return invokeStatic(RT.uncheckedLongCast(o));
    }
    
    @Override
    public final Object invokePrim(final long n) {
        return invokeStatic(n);
    }
    
    static {
        __code_fibonacci = RT.var("code", "fibonacci");
    }
}

It's those .getRawRoot() calls that are costly.

However, the calculations do look more straightforward, so I will now try to apply those to the nested-lambda version.


PEZ commented Jan 1, 2025

Found a solution. It ain't pretty. 😄

(ns code
  (:gen-class))

(set! *unchecked-math* :warn-on-boxed)

(definterface IFib
  (^long fib [^long n]))

(deftype Fibonacci []
  IFib
  (fib [_  n]
    (if (or (zero? n)
            (== 1 n))
      (long n)
      (long (+ (.fib _ (- n 1))
               (.fib _ (- n 2)))))))

(def ^:private ^Fibonacci fibonacci (Fibonacci.))

(defn -main [& args]
  (let [u (long (parse-long (first args)))
        r (loop [i 1
                 sum 0]
            (if (< i u)
              (recur (inc i) (+ sum (long (.fib fibonacci i))))
              sum))]
    (println r)))

The decompile of the Fibonacci type shows that the Clojure compiler deals with this much better:

public final class Fibonacci implements IFib, IType
{
    public static IPersistentVector getBasis() {
        return Tuple.create();
    }
    
    public long fib(final long n) {
        final boolean or__5600__auto__12362 = Numbers.isZero(n);
        return (or__5600__auto__12362 ? or__5600__auto__12362 : Numbers.equiv(1L, n)) ? n : (((IFib)this).fib(n - 1L) + ((IFib)this).fib(n - 2L));
    }
}

And then PGO can be used without fear of deterioration. And it makes a significant difference:

(benchmark visualization: fibonacci, Clojure with PGO)


He-Pin commented Jan 1, 2025

scala-native/scala-native#4132

Clojure rocks. This project does turn up some interesting insights on performance.


PEZ commented Jan 1, 2025

It also seems my M1 loves Clojure extra much (not strange; I've been coding Clojure a lot on it for three years). On the M4, Clojure does not win, even if it still performs very well.


Gkodkod commented Jan 2, 2025

Happy New Year @bddicken, @PEZ, and congratulations on the Clojure optimization. Love your persistence, and so awesome of you!!!

It's very interesting, and I believe we should try to optimize the Java, Scala, C, C++, and other languages to the same degree, if possible. This would truly be descriptive of the concept that @bddicken originally created, rather than putting most of the stress on one language. It's also thought-provoking that some perform with very similar results.

In terms of the JVM start times, I've given it some thought after seeing your comment above, which I missed yesterday. The differences in startup times between languages like C and VM-hosted ones like Java or Clojure can really skew the benchmarks, especially for something like "Hello, World!" where the startup overhead dominates. It's frustrating because, in the real world, that startup cost often doesn't matter for long-running tasks, but it's huge for short-lived ones, like scripts.

I love that you’re looking into ways to make the benchmarks fairer. Personally, I think it’s worth exploring. There’s so much to unpack here. For example:

1. Including a startup benchmark and subtracting it: I like this idea—it makes sense for short-lived tasks. But startup time isn’t always constant (e.g., class loading and initialization complexity can vary), so it’s worth testing how accurate this adjustment would actually be.

2. Using tools that run inside the JVM: For Java and Clojure, running the benchmarks within the JVM, e.g. with JMH (Java Microbenchmark Harness), is a great idea. It would give a much clearer picture of their steady-state performance without startup times muddying the waters. Of course, it doesn't help if startup time is the very thing you're trying to measure: tools like JMH tackle JVM warm-up effects but don't directly address startup-time differences.

3. Running multiple iterations: This is probably my favorite approach. If you add a second argument to run Fibonacci 40, say, 1000 times, startup time becomes a much smaller factor. It’s a simple way to make the results more reflective of long-running use cases. For example, running Fibonacci 40 a million times will make startup time almost negligible. It reflects the typical usage pattern for long-running applications.

4. Extending loops or using more complex workloads: Similar to the above, this reduces the startup impact, but we'd need to be careful about non-linear algorithms like Fibonacci since they can skew the results in other ways. This could disproportionately favor certain languages. Maybe adding some real-world scenarios, like computational pipelines, file processing, or network I/O, could balance things out, but then we are entering another realm of library implementations of such scenarios.
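Ideas 2 and 3 above boil down to the same trick: warm up the hot path, then time only the steady state. A hand-rolled sketch (this is not JMH itself, and it lacks JMH's safeguards such as forked JVMs, blackholes, and proper statistics; the class and constants are illustrative):

```java
// Hand-rolled warm-up + steady-state timing sketch. JMH does this properly;
// this only illustrates the idea of excluding startup and JIT warm-up.
public class SteadyState {
    static long fib(long n) {
        return n <= 1 ? n : fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        long sink = 0;
        // Warm-up: give the JIT a chance to compile the hot path.
        for (int i = 0; i < 5; i++) sink += fib(30);
        // Measured iterations: report per-iteration time, startup excluded.
        int iters = 10;
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) sink += fib(30);
        long perIterMs = (System.nanoTime() - start) / iters / 1_000_000;
        // Print the sink so the JIT cannot eliminate the fib calls entirely.
        System.out.println("~" + perIterMs + " ms/iter (sink=" + sink + ")");
    }
}
```

The reported per-iteration time approximates what JMH calls steady-state throughput, which is exactly what the startup-dominated hyperfine numbers hide.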

I'm actually working on scaling and optimizing computational pipelines in Spark and getting lots of headaches, and thank God for coffee! 😄

Honestly, your work here is super valuable. The tech community—especially places like Reddit and others—has been calling out these benchmarking quirks for years, and it’s awesome to see someone taking it seriously. If you’re planning on submitting a PR, I think you’d get a lot of support. Just be sure to document your approach clearly and explain when adjustments (like subtracting startup times) are appropriate and when they’re not.

At the end of the day, startup time matters a lot for things like scripting, and tools like Babashka really shine there (first time for me hearing of it, and it sounds like something we'd call our grandma or use as a sweet nickname... or is that Babushka). 😄

For long-running processes, it’s a completely different story. These benchmarks could help show the full picture and make these comparisons more meaningful for different use cases. Exploring ways to make benchmarks more representative of real-world use cases is quite valuable, and perhaps we should explore that as well.

Again, love the visuals!!! ❤️

WDYT ? Ben ?


Gkodkod commented Jan 2, 2025

If you’re open to suggestions, here’s a plan of action that might help make the benchmarks even more meaningful:

1. Benchmark Categories: Split the benchmarks into startup-focused and execution-focused groups to reflect different scenarios. This way, you can highlight where each language shines (or struggles).

2. Report Breakdown: Include startup time as a separate metric alongside execution time for VM-hosted languages. This would give readers a clearer picture of what’s really happening, especially for short-lived vs. long-running tasks.

3. Collaborate with the Community: Platforms like Reddit and Hacker News are great for bouncing ideas off others. They often uncover edge cases or optimizations we might overlook. I've seen this firsthand with a performance benchmark comparing languages for concurrent task handling, originally created by a European PhD and then taken over by an anonymous .NET enthusiast. It got a lot of attention but also backlash because the results were flawed, and the developer resisted collaboration. The situation could've been a great opportunity to learn and improve, but instead, it became contentious. Here's the link if you're curious: Hacker News Discussion.

4. Use Real-World Workloads: Beyond "Hello, World!" or Fibonacci, include tasks like file I/O, network requests, or concurrent processing. These reflect scenarios developers actually encounter.
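To make point 2 concrete, here is a minimal sketch of what a startup-adjusted number could look like, assuming we treat a language's hello-world mean as its startup cost. The `adjusted_time` helper and the 1900.0 ms full-benchmark mean are made up for illustration; the 275.0 ms startup figure is the Clojure hello-world mean reported above.

```shell
# Hypothetical helper: estimate compute-only time by subtracting a
# language's hello-world (startup) mean from its full benchmark mean.
adjusted_time() {
  # $1 = startup mean in ms, $2 = full benchmark mean in ms
  awk -v s="$1" -v b="$2" 'BEGIN { printf "%.1f", b - s }'
}

# Clojure: 275.0 ms startup (hello-world run above); 1900.0 ms is a
# hypothetical full-benchmark mean, just for illustration
adjusted_time 275.0 1900.0   # prints 1625.0
```

Subtraction like this is fragile (startup and compute can overlap), so reporting the two numbers side by side, as point 2 suggests, is probably the safer presentation.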

BTW, thanks so much for testing my suggestion yesterday... I'm stunned by what you came up with today... Kudos, and WOW on the OOP version!!!

I think these steps could really elevate the benchmarks and make them valuable to a wide audience, especially in ML and data science as well. Would love to see where we take this.

@PEZ
Collaborator Author

PEZ commented Jan 2, 2025

I believe we should try to optimize the Java, Scala, C, C++, and other languages to the same degree

To clarify my changes: they were only needed because the Clojure compiler happened to hit a problem with C2 Escape Analysis. I don't think any other language has a similar problem, so no similar work is needed for them.

For any language that we compile with GraalVM native-image, we can easily add PGO. I can do that for Java, but for Scala it is a bit too opaque to me how the native compile is done, so someone with more Scala knowledge than I have needs to take a look.

Since my experiment with subtracting start times failed so horribly, I really don't see a path forward there. As I can't spend the time to get language-specific benchmarking tools up and running for each language, and I don't have the patience to wait much longer for the benchmarks to run, my current approach has been to go for each language's native compile. It brings the start time problem down significantly.

In that effort I have investigated if we can do native builds for Julia, but I don't think we can. It can package standalone executables, but they will not be native compiles, like the JVM and .Net builds are. I have also tried, but not yet succeeded in compiling C# and F# with AOT on my machine.

As for reporting start time separately, that's what the hello-world benchmark is about. I think just publishing a few reports with hello-world results will help. I'm considering removing the start-time-mode from my visualizer and just adding hello-world side by side with the other benchmarks.
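For reference, a hello-world benchmark entry in the same shape as the other programs might look like the sketch below. The class name `HelloWorld` is just for illustration (the repo's programs are classes named `code` in per-language packages); it accepts the same numeric argument as the other benchmarks but ignores it, so the measured time is essentially runtime startup.

```java
// Illustrative hello-world benchmark entry: takes the usual numeric
// argument but ignores it, so a hyperfine run of this program
// measures little more than runtime startup.
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}
```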

PEZ added a commit that referenced this issue Jan 2, 2025
Java: Enable native-image PGO compile as mentioned in #220 comments
@PEZ
Collaborator Author

PEZ commented Jan 2, 2025

Java PGO compiles are enabled now.

It makes Java move in among the top tier of 1B loops:

image

It still doesn't outperform the Clojure Fibonacci:

Java PGO Fibonacci

(Probably we can take the disassembled Clojure compile and base the Java contribution on that, to make them equally good, haha.)

@Gkodkod
Contributor

Gkodkod commented Jan 3, 2025

@PEZ Excited to catch up next Sunday and looking forward to having more fun face-to-face with @bddicken! I'm a bit under the weather at the moment but hope to feel better by then. In the meantime, I’ll leave some comments here.

I’m also catching up on over 140 new updates from others (just fetched)—it’s heartwarming to see so much progress! I also realized I have some code on my branch that I forgot to check in for your review.

I absolutely love what you’ve done so far—you’ve practically "read my mind" regarding Java. I saved your optimized Clojure code and have been reflecting on it further.

As for Native and PGO, I’ve come up with this idea (apologies, my bed is calling, so I couldn’t prepare a full PR). Perhaps you could give it a try? I’m curious if it could further improve our results:

package jvm;

/**
 * Optimization Techniques:
 * Used @CompilerHint annotation to provide optimization hints
 * Added warm-up phase to help JIT compiler
 * Included performance timing
 * Maintained original recursive approach
 */

public class code {
    // Inline-friendly recursive Fibonacci implementation
    private static int fibonacci(int n) {
        if (n <= 1) return n;
        return fibonacci(n - 1) + fibonacci(n - 2);
    }

    // JIT and Native Image optimization hint
    @CompilerHint(inline = true)
    private static long computeFibonacciSum(int upperBound) {
        long result = 0;
        for (int i = 1; i < upperBound; i++) {
            result += fibonacci(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Input validation
        if (args.length == 0) {
            System.err.println("Please provide an upper bound as an argument.");
            System.exit(1);
        }

        try {
            int upperBound = Integer.parseInt(args[0]);
            
            // Warm-up phase for PGO
            for (int warmup = 0; warmup < 1000; warmup++) {
                fibonacci(10);
            }

            // Actual computation with timing
            long startTime = System.nanoTime();
            long result = computeFibonacciSum(upperBound);
            long endTime = System.nanoTime();

            System.out.println("Result: " + result);
            System.out.printf("Computation time: %.3f ms%n", 
                (endTime - startTime) / 1_000_000.0);
        } catch (NumberFormatException e) {
            System.err.println("Invalid input. Please provide a valid integer.");
            System.exit(1);
        }
    }

    // Annotation to help JIT and Native Image optimization
    @interface CompilerHint {
        boolean inline() default false;
    }
}

Bash Script - run.sh

#!/bin/bash
# Native-image build with PGO
# Run this script from the fibonacci/jvm directory
# Prerequisites:
# - Install Oracle GraalVM (PGO is an Oracle GraalVM feature, not in the CE edition)
# - Set JAVA_HOME to the GraalVM directory

# Compile the Java source (the class is jvm.code, so the classpath
# root is the parent directory)
javac code.java

# Build an instrumented image and run it once to collect a profile;
# the run writes default.iprof in the current directory
native-image --pgo-instrument -cp .. jvm.code -o code-instrumented
./code-instrumented 30

# Build the optimized image using the collected profile
native-image --pgo=default.iprof -cp .. jvm.code -o code

# Run the native executable
./code 30

Thank you for all the amazing progress and have a lovely weekend !

@PEZ
Collaborator Author

PEZ commented Jan 19, 2025

Here's a PR aimed at fixing this issue:

It is a start, with reference implementations of the tooling and all benchmarks for Clojure, Java, and C. I'm thinking we can probably crowdsource the remaining transition reasonably fast.
