
The benchmark penalizes languages that need to start a runtime environment (e.g. anything on the JVM or .Net) #220

Open
PEZ opened this issue Dec 6, 2024 · 17 comments



PEZ commented Dec 6, 2024

Hi,

Thanks for creating this project, @bddicken! 🙏

I have been running benchmarks in some different ways over the last few days. I notice that a lot of the difference between Java's performance and C's is about JVM start times. For Clojure the penalty is much bigger, because Clojure also needs to start on top of the JVM before the benchmarks can start crunching. To “see” the start time I created a hello-world benchmark, where the programs only print “Hello World!”. Here's a run on my Mac M4:

Benchmarking C
Benchmark 1: ./c/code 40
  Time (mean ± σ):       2.2 ms ±   0.2 ms    [User: 0.7 ms, System: 1.2 ms]
  Range (min … max):     2.0 ms …   2.5 ms    10 runs
 

Benchmarking Java
Benchmark 1: java jvm.code 40
  Time (mean ± σ):      38.7 ms ±   0.9 ms    [User: 14.1 ms, System: 21.4 ms]
  Range (min … max):    37.3 ms …  40.2 ms    10 runs
 

Benchmarking Clojure
Benchmark 1: java -cp clojure/classes:src:/Users/pez/.m2/repository/org/clojure/clojure/1.12.0/clojure-1.12.0.jar:/Users/pez/.m2/repository/org/clojure/core.specs.alpha/0.4.74/core.specs.alpha-0.4.74.jar:/Users/pez/.m2/repository/org/clojure/spec.alpha/0.5.238/spec.alpha-0.5.238.jar code 40
  Time (mean ± σ):     275.0 ms ±   4.3 ms    [User: 410.4 ms, System: 43.4 ms]
  Range (min … max):   270.0 ms … 285.0 ms    10 runs
 

Benchmarking Babashka
Benchmark 1: bb -cp clojure -m code 40
  Time (mean ± σ):      11.9 ms ±   0.4 ms    [User: 3.3 ms, System: 6.1 ms]
  Range (min … max):    11.1 ms …  12.4 ms    10 runs

Basically the JVM takes 40 ms to start, and then Clojure takes another 240 ms. Compare that with the benchmark times for e.g. C, Java, and Clojure, which run at 250, 350, and 550 ms, respectively, on the same machine.
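To make the subtraction concrete, here is a small back-of-the-envelope sketch (in Java, using the numbers quoted above; note it assumes the startup cost is a constant that simply adds to the workload time, which is itself an approximation):

```java
// Back-of-the-envelope: subtract each language's hello-world (startup) mean
// from its benchmark mean, using the numbers quoted in this thread.
// Assumption: startup cost is additive and constant across runs.
public class StartupAdjust {
    static double adjusted(double benchMs, double helloMs) {
        return benchMs - helloMs;
    }

    public static void main(String[] args) {
        System.out.printf("C:       ~%.1f ms%n", adjusted(250, 2.2));
        System.out.printf("Java:    ~%.1f ms%n", adjusted(350, 38.7));
        System.out.printf("Clojure: ~%.1f ms%n", adjusted(550, 275.0));
    }
}
```

By this rough estimate the adjusted compute times land around 248, 311, and 275 ms, which narrows the gap considerably.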

I can submit the hello-world benchmark as a PR, if you're interested. Start times can be very important for shell-scripting purposes and the like. In the Clojure world we have Babashka, which is interpreted Clojure. It starts super quickly, but of course can't compete with compiled languages in the other benchmarks.

Some questions:

  • Is this something you have thought about?
  • Would you be interested in me exploring ways to adapt the benchmark to give these VM-hosted languages a fairer run? Some ideas that have popped into my mind so far:
    • Include the hello-world benchmark and then subtract its times from the times in the other benchmarks.
    • Add some way to use different benchmarking tools. So for the JVM, use some tool that runs the programs from inside the JVM.
    • Make the programs take a second argument specifying a number of runs. So the fibonacci 40 would be run x times, and with a significantly high x, the startup times would play a much smaller role.
    • Extend the runs by running more loops and higher fibs.
    • Other than that I have no more ideas right now 😄 But I am willing to start sourcing ideas and doing experiments and such.
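The repeat-count idea could be sketched like this (a hypothetical Java wrapper; the class name and argument handling are illustrative, not code from the repo):

```java
// Hypothetical sketch of the "second argument = number of runs" idea:
// the JVM starts once, but the workload runs `runs` times, so startup
// cost is amortized across the whole measurement.
public class Fib {
    static long fib(long n) {
        return n <= 1 ? n : fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        long n = Long.parseLong(args[0]);
        long runs = args.length > 1 ? Long.parseLong(args[1]) : 1;
        long result = 0;
        for (long i = 0; i < runs; i++) {
            result = fib(n); // same work repeated; startup is paid only once
        }
        System.out.println(result);
    }
}
```

With, say, `java Fib 40 100`, the ~40 ms JVM startup becomes noise relative to the total runtime.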

WDYT?


PEZ commented Dec 6, 2024

See also:

Where there is some more discussion and benchmark results for Clojure and Java.


AndrewBabbitt97 commented Dec 19, 2024

TBH, for .NET the default and only option should just be NativeAOT with optimizations set for speed. For any of these kinds of benchmarks, running non-AOT is just putting a handicap on the language/runtime.


PEZ commented Dec 19, 2024

@AndrewBabbitt97 I'm beginning to think the same.


giordano commented Dec 21, 2024

The same applies to Julia, which is a just-in-time compiled language: you're mainly measuring startup time; the time to run the function is a fraction of it. I slightly modified the code to be able to measure the runtime:

diff --git a/levenshtein/julia/code.jl b/levenshtein/julia/code.jl
index 04ab029..af88189 100644
--- a/levenshtein/julia/code.jl
+++ b/levenshtein/julia/code.jl
@@ -45,8 +45,8 @@ function levenshtein_distance(s1::String, s2::String)::Int
     return prev_row[m + 1]
 end
 
-function main()
-    if length(ARGS) < 2
+function main(strings::Vector{String})
+    if length(strings) < 2
         println("Please provide at least two strings as arguments.")
         exit(1)
     end
@@ -55,10 +55,10 @@ function main()
     times = 0
 
     # Compare all pairs of strings
-    for i in 1:length(ARGS)
-        for j in 1:length(ARGS)
+    for i in eachindex(strings)
+        for j in eachindex(strings)
             if i != j
-                distance = levenshtein_distance(ARGS[i], ARGS[j])
+                distance = levenshtein_distance(strings[i], strings[j])
                 if min_distance == -1 || distance < min_distance
                     min_distance = distance
                 end
@@ -67,11 +67,12 @@ function main()
         end
     end
 
-    println("times: ", times)
-    println("min_distance: ", min_distance)
+    return times, min_distance
 end
 
 # Run main only if script is run directly
 if abspath(PROGRAM_FILE) == @__FILE__
-    main()
+    times, min_distance = main(ARGS)
+    println("times: ", times)
+    println("min_distance: ", min_distance)
 end

then what you're measuring is

$ input=($(cat ../input.txt))
$ cmd="julia code.jl ${input[@]}"
$ hyperfine --shell=none "${cmd}"
Benchmark 1: julia code.jl aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb cccccccccccccccccccccccccccccccccccccccccccccccccccc ddddddddddddddd eeeeeeeeeeeeeeeee ffffffffffffff ggggggggggg hhhhhhhhhhhhhhhhhhhhhhhhhhhhh iiiiiiiiiiiiiiiiiiiiiiiiiiiii jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk llllllllllllllllllllllllllllllllllllllllllllll mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn ooooooooooooooooooooooooooooooooo pppppppppppppppppppppppppppppppppppppppppp qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq rrrrrrrrrrrrrrrrrrrrrrrrrrrrr ssssssssssssssssssssssssssssss tttttttttttttttttttttttttt uuuuuuuuuuuuuuuuuuuuuuuuu vvvvvvvvvvvvvvvvvvvvvvvvvvvvv wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm 1234567890 QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm1234567890 mnxbckfjquwhduywetdftfqrcfgzbxmcsdlsppweuouhftvjcsmnbsdvuqtfvcjbxmcnbkaoqwerpweiurigjhdfjbmsbjgqiuyeiruyoisbnsvuyfguweygrhoaiosjdkjhfbsjhdvfkqjnoi gaywgeudgaewytfvajsdvfjahsdvfjhsdfjhgsfjhqowiueryiuweytruiyytfyqvhgvzxncbsidfjdflsdfpwoeiriuwheuyfvydtvqyuygbhskdjfbkajbuywgduywegfvsnmwehriygwuerygwerdfjhsdfsdkfhbsjdhfjuavufywefhuwlihfoqwehqrwheiruyuyqgruwieqrhsdjhfbjwhvdfisdufhikquhweifgbwfisdhfksjdhfiwuh ASDFHWERFIWSHHSDFJWERPWFOAHDSDFXMCVNWMERKJQEOQIWRUQIHJSBFSDFKJSDFOWIERIOU werkjhiauwryisydgcfkjsdfkjsjdhfguiyqgtfsdfSDGFDSFGwsrhqwigdeyiwefDFBdkqedfgdFHGHKJiFHTYRGFsefjhwhsgdfjhhjsdfDWEfsdfWEFfbgFGjYuioIOpvbnVBNSDFadfsSDFwegsdgfAWFDFGDFghjTyIGHJREGFsddqwdsdfweaWQAZZSDGgnlpHmHJMgkOYUTDFGSFSDFSDFDHGDFGSDFDGRFjhbjshdgfjhgsdfSDGDFG 
kjsdfkhgwieyfguwsyefgsdfSDFGJSDAKFDSAFIRFYUDSFHSBVXCVBNMSDKFJWOYSDFHKADSDnfsdfjbjshdbfkwjbfksdbfhwbkyguygyfshbcdmnxbvcmnsdkfsdfsdflspflwoekorwueiruygwuygjshbvnbvzcjsuhfdiuhsdfghkwjhdfiwegfjdhsgfbksjhfksdhgfhsdhfghfgfgfsjdjfhkwjehoueuq abcdefghijklmnopqrstuvwxyz thistringisnottoolong howcanwemakeabetterstringforcomparing absfeiihdkkeiurmnslkspoiqwrjlnjna jsfjlqpowiuewriugbsdmfasgdfqsfwwiewurmxbcaslkdjp qwerdsdftffygfuhbuhiok sfhdgafsdhafsdjaweqwueytuqwyefgagsdgfasdfajfgsieufowerpoafposdfmnxbcvjlkhhgdfkjasfguysdfuayfgjhvf ygiuwoqupqowei sjxvkzjha hfgauyiuyrwer sdf dsfgsdfdfg dfgdswefdsdgd ffhfghsdfghsdfeuryiwuyeriudhfmnsbmnbvsgdfkuweifgwefgisdgf sdfskdfhgksugfweriuwersdkfjbsewoir sfdqrieyft sdfnueyrtuwyerowierpoipmvmcxnvmnbsdnfbajeygruywgefugsdbaqwriuweuywiuyer dsfjgfuywegfugsufgqfgdjsdhfjhsdgfhsgdfiweoruoweruwer sdfhyairuywoerfuetuopiufudsvxcbvmsdflksfwpeoriiueief dsfygweurytwueyr sdfshfgwegfisfisdgfowuyeridfs sdfhsdfgsudyfguwyef sdfjsgdfysgdfuwgueyfs dfsgdfusgdfiweoriuwqoiru ygdifuwuoerupweurpquerowehmnxcbvmnxbvjcquqyerypeiproweur fhsdgfygsdfuygsudfgufrpesifgsdfshdfbuwyveugfud fefgiuyegsiufygdfigreoquyarcfdbnmnsbfdlsdfoiweprisdfqbyefg sdfugwuefgusdygfuygwdufytwuqytueryt efsgdvfuysgfdusydfuywguyeryoqurwreiueieieiiii
  Time (mean ± σ):     192.0 ms ±   7.8 ms    [User: 369.2 ms, System: 30.7 ms]

but the actual run time spent in the main function is

% julia -L code.jl -q
julia> using BenchmarkTools

julia> @benchmark main(strings) setup=(strings=String.(split(readchomp("../input.txt"), ' ')))
BenchmarkTools.Trial: 195 samples with 1 evaluation.
 Range (min … max):  24.934 ms …  28.091 ms  ┊ GC (min … max): 0.00% … 8.00%
 Time  (median):     25.646 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.715 ms ± 379.302 μs  ┊ GC (mean ± σ):  0.33% ± 0.98%

                ▂▂▃▂█▅█  ▁
  ▄▁▁▃▄▃▃▃▄▁▄▆▆▇███████▇██▅▄▅▅▄▅▆▄▅▄▅▃▄▁▁▄▁▁▃▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃ ▃
  24.9 ms         Histogram: frequency by time         27.1 ms <

 Memory estimate: 2.07 MiB, allocs estimate: 15628.

Edit: just noticed this issue overlaps with #149.


Gkodkod commented Dec 31, 2024

Hey @PEZ

Regarding the LinkedIn post you wrote today and our conversation there:
(screenshot from the LinkedIn conversation)

My prime suspect for why the PGO-enabled compile hurts the Clojure native build for the fibonacci program is the C2 JIT compiler's Escape Analysis. In theory it should detect that the variables in the recursion can't escape, but instead it produces some code that causes the heap to be used, and that in turn wakes the GC at some point.

I'm sorry I can't really describe it well, because I don't quite understand it well enough. Here's Escape Analysis explained:

https://lnkd.in/ddNnb95r

I think that during one, or only a few, runs through fibonacci(40), C2 is never asked to do any escape analysis (or it is possibly not in play at all). I'm not sure what happens during the PGO-enabled compile, but I'm thinking the profiling data puts C2 escape analysis into play.

I can trigger the slowdown by doing something like:

(dotimes [i 20] (println i) (time (-main "40")))

Then for the first 10 or so runs it will run at, say, 300 ms, and then they start to run in 700 ms instead.

Anyway, what I would really want to do is write the implementation in such a way that the escape analysis succeeds (if, indeed, this is the correct problem), so that I can enable PGO without hurting performance (in theory there is some performance to unlock, right?). If someone wants to have a look and experiment with it, please do. I am not knowledgeable enough to formulate new ideas of things to try.

Let me try to explain... and suggest ideas to try

The core problem is likely related to how the C2 JIT compiler performs Escape Analysis during the Profile-Guided Optimization (PGO) process:

  1. In the naive recursive Fibonacci implementation, the function creates many small, short-lived objects during recursive calls.

  2. Normally, the JIT compiler's Escape Analysis should:

  • Detect these objects don't "escape" the method
  • Eliminate heap allocations
  • Replace object allocations with scalar operations
  • Keep everything in registers or on the stack

  3. However, during PGO, something goes wrong:

  • The profiling data might trigger a different optimization path
  • Instead of eliminating allocations, it forces heap allocations
  • This causes unnecessary garbage collection overhead

  4. The performance degradation occurs because:

  • Heap allocations are created where they shouldn't be
  • The garbage collector gets triggered more frequently
  • Register/stack optimizations are bypassed
The key observation is the inconsistent behavior across multiple runs, where performance suddenly degrades after a few iterations. Not sure if my explanation is clear enough, yet I got some ideas for you. I'll provide a few alternative Clojure Fibonacci implementations that might help circumvent the Escape Analysis issue:

  1. Primitive-Focused Recursive Implementation:
;;;; Key optimizations:
;;;; 1. Use ^long type hint
;;;; 2. Use unchecked- operations to avoid boxing
;;;;  3. Explicit long arithmetic
;;;;  4. Minimal object creation
(defn fibonacci-primitive [^long n]
  (if (< n 2)
    (long n)
    (+ (fibonacci-primitive (unchecked-dec n))
       (fibonacci-primitive (unchecked-dec (unchecked-dec n))))))

Pros:

  • Closest to the original naive recursive implementation
  • Type hints help JVM understand primitive types
  • Minimal overhead

Cons:

  • Exponential time complexity O(2^n)
  • High risk of stack overflow for large n
  • May still create intermediate objects
  2. Tail-Recursive with Primitives:
;;;; Key optimizations:
;;;; 1. Tail recursion (easier for JIT to optimize)
;;;; 2. Primitive loop variables
;;;;  3. No intermediate object allocations
;;;;  4. Direct arithmetic
(defn fibonacci-tailrec [^long n]
  (loop [a (long 0)
         b (long 1)
         count (long n)]
    (if (zero? count)
      a
      (recur b 
             (unchecked-add a b) 
             (unchecked-dec count)))))

Pros:

  • Linear time complexity O(n)
  • Constant space complexity
  • JVM can optimize tail recursion
  • No intermediate object allocations

Cons:

  • Less readable than recursive approach
  • Requires understanding of tail-call optimization
  3. Memoized Functional Approach:
;;;; Key optimizations:
;;;; 1. Memoization prevents redundant calculations
;;;; 2. Primitive type hints
;;;;  3. Cached results reduce computational overhead
(def fibonacci-memo 
  (memoize 
   (fn [^long n]
     (if (< n 2)
       (long n)
       (+ (fibonacci-memo (unchecked-dec n))
          (fibonacci-memo (unchecked-dec (unchecked-dec n))))))))

Pros:

  • Caches previous calculations
  • Dramatically reduces computational complexity
  • Preserves functional programming style

Cons:

  • Uses memory to store memoized results
  • Overhead of memoization for small n
  • Potential memory leak if not bounded
  4. Transient Iterative Approach:
(defn fibonacci-transient [^long n]
  (loop [a (long 0)
         b (long 1)
         count (long n)]
    (if (zero? count)
      a
      (recur b 
             (unchecked-add a b) 
             (unchecked-dec count)))))

Pros:

  • Most predictable performance
  • Lowest memory overhead
  • Easiest for JIT to optimize
  • Linear time and space complexity

Cons:

  • Least "functional" approach
  • Imperative style
  • Less elegant than recursive solutions

These implementation suggestions, of course, need to be thoroughly benchmarked using hyperfine, like all the other languages. They use primitive type hints/arithmetic while minimizing object creation and preferring tail recursion or iteration. They also make escape analysis "obvious," reducing GC pressure and making it easier for the JIT to optimize. They should work well with Profile-Guided Optimization, I hope.

Let @bddicken and me know your findings, and a most happy New Year!


PEZ commented Dec 31, 2024

Thanks, @Gkodkod ❤️

I have spent a significant amount of time in this repo rejecting contributions that go for memoization, tail recursion, and iteration, since those are not in the spirit of the benchmark, which seeks to measure a language's function-call overhead. So we can't really go there with this one.

As for the primitive focused recursive approach, I think that's largely what we do already:

(defn- fibonacci ^long [^long n]
  ((fn fib ^long [^long n]
    (if (<= n 1)
      n
      (+ (long (fib (- n 1)))
         (long (fib (- n 2)))))) n))

Note that we actually recurse in a named lambda. This is because using a namespaced function for it makes the Clojure compiler produce a different kind of function call, which hurts performance badly. I think it is something with this “private” function where C2 goes wrong.

The type hints and casts are enough for the Clojure compiler to understand that it should do unchecked integer math. The above function, when compiled, decompiles to this:

// Decompiling class: code$fibonacci$fib__10924
import clojure.lang.*;

public final class code$fibonacci$fib__10924 extends AFunction implements LL
{
    @Override
    public final long invokePrim(final long n) {
        return (n <= 1L) ? n : (RT.longCast(this.invoke(Numbers.num(Numbers.unchecked_minus(n, 1L)))) + RT.longCast(this.invoke(Numbers.num(Numbers.unchecked_minus(n, 2L)))));
    }
    
    @Override
    public Object invoke(final Object o) {
        return ((IFn.LL)this).invokePrim(RT.uncheckedLongCast(o));
    }
}

As far as I know that's as good as it gets. And we are looking at the benchmark winning code, which I think agrees with that assessment.

I've also tried using a let binding for the function, and let it live inside -main, but it triggers the same PGO (and long-running) issue, and I opted for the current solution because it's a bit easier to see what's going on (at least for me).


Gkodkod commented Jan 1, 2025

@PEZ Thank you for explaining your thoughts about this so eloquently. I also appreciate the produced decompiled Java code with your winning code.

With that being said, allow me to compare this code to my suggested Primitive-Focused Recursive implementation with regard to performance differences, and of course, you can decide which way you want to go.

Current Code:

 (defn- fibonacci ^long [^long n]
  ((fn fib ^long [^long n]
    (if (<= n 1)
      n
      (+ (long (fib (- n 1)))
         (long (fib (- n 2)))))) n))

My Suggested Implementation:

(defn fibonacci-primitive [^long n]
  (if (< n 2)
    (long n)
    (+ (fibonacci-primitive (unchecked-dec n))
       (fibonacci-primitive (unchecked-dec (unchecked-dec n))))))

Key Differences:

1. Nested Anonymous Function vs Direct Recursion

  • Existing: Uses an inner anonymous function (fn fib ...)
  • Suggested: Direct recursive function definition

2. Explicit Type Casting

  • Existing: Uses (long ...) to cast recursive calls
  • Suggested: Relies more on type hints and unchecked- operations

3. Condition Check

  • Existing: (<= n 1)
  • Suggested: (< n 2) - slightly more idiomatic

4. Arithmetic Operations

  • Existing: Uses standard subtraction (- n 1)
  • Suggested: Uses unchecked-dec for potential performance gain

5. Complexity

  • Existing: Slightly more complex due to nested function
  • Suggested: More straightforward, direct recursion

The performance differences between the two are slight but might show up in the decompiled Java code:

  1. The existing code introduces an extra function call per recursion
  2. The extra (long ...) casting might create more intermediate objects
  3. My suggestion aims to minimize object creation and function call overhead, which as you mentioned, hurts performance badly.

My suggested implementation is likely to be more friendly to Escape Analysis because it could potentially produce fewer function calls, require less explicit type casting, give objects more direct lifetimes, and use unchecked- operations that hint at primitive behavior. Again, it is your call whether you would like to try testing it. I would love to see if it improves the actual test results and the decompiled Java, including the main, so we can further analyze them both.

I wish I had more time to spend on the startup-time issues you've mentioned, and I actually had some ideas, yet I cannot offer any advice without clearly understanding the issues from you. It is a brilliant idea to include it in your latest benchmark visualization. Additionally, allow me to once more thank you for the excellent time and work you've put into the whole repo, and especially the most brilliant benchmark-visualization repo you put together. It's so amazing and helpful for so many!!!

❤️❤️❤️ Gott nytt år till dig och hela familjen! ❤️❤️❤️


PEZ commented Jan 1, 2025

Thanks again! It's a new year here now. 🎉

The problem with using direct recursion is that it generates a different, less performant way of making the function call. Here are benchmarks:

Nested anonymous function

Benchmarking Clojure Native
Benchmark 1: ./clojure-native-image/code 40
  Time (mean ± σ):     307.6 ms ±   4.4 ms    [User: 302.6 ms, System: 3.5 ms]
  Range (min … max):   299.8 ms … 312.6 ms    7 runs

Direct recursion

Benchmarking Clojure Native
Benchmark 1: ./clojure-native-image/code 40
  Time (mean ± σ):      1.992 s ±  0.009 s    [User: 1.975 s, System: 0.010 s]
  Range (min … max):    1.982 s …  2.003 s    7 runs

We can see it in a decompile:

public final class code$fibonacci extends AFunction implements LO
{
    public static final Var __code_fibonacci;
    
    public static Object invokeStatic(final long n) {
        return (n < 2L) ? Numbers.num(RT.longCast(n)) : Numbers.unchecked_add(((LO)__code_fibonacci.getRawRoot()).invokePrim(n - 1L), ((LO)__code_fibonacci.getRawRoot()).invokePrim(n - 1L - 1L));
    }
    
    @Override
    public Object invoke(final Object o) {
        return invokeStatic(RT.uncheckedLongCast(o));
    }
    
    @Override
    public final Object invokePrim(final long n) {
        return invokeStatic(n);
    }
    
    static {
        __code_fibonacci = RT.var("code", "fibonacci");
    }
}

It's those .getRawRoot() calls that are costly.

However, the calculations do look more straightforward, so I will now try to apply those to the nested-lambda version.


PEZ commented Jan 1, 2025

Found a solution. It ain't pretty. 😄

(ns code
  (:gen-class))

(set! *unchecked-math* :warn-on-boxed)

(definterface IFib
  (^long fib [^long n]))

(deftype Fibonacci []
  IFib
  (fib [_  n]
    (if (or (zero? n)
            (== 1 n))
      (long n)
      (long (+ (.fib _ (- n 1))
               (.fib _ (- n 2)))))))

(def ^:private ^Fibonacci fibonacci (Fibonacci.))

(defn -main [& args]
  (let [u (long (parse-long (first args)))
        r (loop [i 1
                 sum 0]
            (if (< i u)
              (recur (inc i) (+ sum (long (.fib fibonacci i))))
              sum))]
    (println r)))

The decompile of the Fibonacci type shows that the Clojure compiler deals with this much better:

public final class Fibonacci implements IFib, IType
{
    public static IPersistentVector getBasis() {
        return Tuple.create();
    }
    
    public long fib(final long n) {
        final boolean or__5600__auto__12362 = Numbers.isZero(n);
        return (or__5600__auto__12362 ? or__5600__auto__12362 : Numbers.equiv(1L, n)) ? n : (((IFib)this).fib(n - 1L) + ((IFib)this).fib(n - 2L));
    }
}

And then PGO can be used without fear of deterioration. And it makes a significant difference:

(benchmark visualization: fibonacci, Clojure with PGO)


He-Pin commented Jan 1, 2025

scala-native/scala-native#4132

Clojure rocks. This project does turn up some interesting insights on performance.


PEZ commented Jan 1, 2025

It also seems my M1 loves Clojure extra much (not strange; I've been coding Clojure a lot on it for three years). On the M4, Clojure does not win, even if it still performs very well.


Gkodkod commented Jan 2, 2025

Happy New Year @bddicken, @PEZ, and congratulations on the Clojure optimization. Love your persistence, and so awesome of you!!!

It's very interesting, and I believe we should try to optimize the Java, Scala, C, C++, and other languages to the same degree, if possible. This would truly be descriptive of the concept that @bddicken originally created, rather than putting most of the stress on one language. It's also thought-provoking that some perform with very similar results.

In terms of the JVM start times, I've given it some thought after seeing your comment above, which I missed yesterday. The differences in startup times between languages like C and VM-hosted ones like Java or Clojure can really skew the benchmarks, especially for something like "Hello, World!" where the startup overhead dominates. It's frustrating because, in the real world, that startup cost often doesn't matter for long-running tasks, but it's huge for short-lived ones, like scripts.

I love that you’re looking into ways to make the benchmarks fairer. Personally, I think it’s worth exploring. There’s so much to unpack here. For example:

1. Including a startup benchmark and subtracting it: I like this idea—it makes sense for short-lived tasks. But startup time isn’t always constant (e.g., class loading and initialization complexity can vary), so it’s worth testing how accurate this adjustment would actually be.

2. Using tools that run inside the JVM: For Java and Clojure, running the benchmarks within the JVM, e.g. with JMH (Java Microbenchmark Harness), is a great idea. It would give a much clearer picture of their steady-state performance without startup times muddying the waters. Of course, it doesn't help if startup time is the very thing you're trying to measure: tools like JMH tackle JVM warm-up effects but don't directly address startup-time differences.

3. Running multiple iterations: This is probably my favorite approach. If you add a second argument to run Fibonacci 40, say, 1000 times, startup time becomes a much smaller factor. It’s a simple way to make the results more reflective of long-running use cases. For example, running Fibonacci 40 a million times will make startup time almost negligible. It reflects the typical usage pattern for long-running applications.

4. Extending loops or using more complex workloads: Similar to the above, this reduces the startup impact, but we'd need to be careful about non-linear algorithms like Fibonacci since they can skew the results in other ways. This could disproportionately favor certain languages. Maybe adding some real-world scenarios, like computational pipelines, file processing, or network I/O, could balance things out, but then we are entering another realm of library implementations of such scenarios.
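Ideas 2 and 3 above boil down to the same trick: warm up the hot path, then time only the steady state. A hand-rolled sketch (this is not JMH itself, and it lacks JMH's safeguards such as forked JVMs, blackholes, and proper statistics; the class and constants are illustrative):

```java
// Hand-rolled warm-up + steady-state timing sketch. JMH does this properly;
// this only illustrates the idea of excluding startup and JIT warm-up.
public class SteadyState {
    static long fib(long n) {
        return n <= 1 ? n : fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        long sink = 0;
        // Warm-up: give the JIT a chance to compile the hot path.
        for (int i = 0; i < 5; i++) sink += fib(30);
        // Measured iterations: report per-iteration time, startup excluded.
        int iters = 10;
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) sink += fib(30);
        long perIterMs = (System.nanoTime() - start) / iters / 1_000_000;
        // Print the sink so the JIT cannot eliminate the fib calls entirely.
        System.out.println("~" + perIterMs + " ms/iter (sink=" + sink + ")");
    }
}
```

The reported per-iteration time approximates what JMH calls steady-state throughput, which is exactly what the startup-dominated hyperfine numbers hide.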

I'm actually working on scaling and optimizing computational pipelines in Spark and getting lots of headaches, and thank God for coffee! 😄

Honestly, your work here is super valuable. The tech community—especially places like Reddit and others—has been calling out these benchmarking quirks for years, and it’s awesome to see someone taking it seriously. If you’re planning on submitting a PR, I think you’d get a lot of support. Just be sure to document your approach clearly and explain when adjustments (like subtracting startup times) are appropriate and when they’re not.

At the end of the day, startup time matters a lot for things like scripting, and tools like Babashka really shine there (first time for me hearing of it, and it sounds like something we'd call our grandma or use as a sweet nickname... or is that Babushka). 😄

For long-running processes, it’s a completely different story. These benchmarks could help show the full picture and make these comparisons more meaningful for different use cases. Exploring ways to make benchmarks more representative of real-world use cases is quite valuable, and perhaps we should explore that as well.

Again, love the visuals!!! ❤️

WDYT ? Ben ?


Gkodkod commented Jan 2, 2025

If you’re open to suggestions, here’s a plan of action that might help make the benchmarks even more meaningful:

1. Benchmark Categories: Split the benchmarks into startup-focused and execution-focused groups to reflect different scenarios. This way, you can highlight where each language shines (or struggles).

2. Report Breakdown: Include startup time as a separate metric alongside execution time for VM-hosted languages. This would give readers a clearer picture of what’s really happening, especially for short-lived vs. long-running tasks.

3. Collaborate with the Community: Platforms like Reddit and Hacker News are great for bouncing ideas off others. They often uncover edge cases or optimizations we might overlook. I've seen this firsthand with a performance benchmark comparing languages for concurrent task handling, originally created by a European PhD and then taken over by an anonymous .NET enthusiast. It got a lot of attention but also backlash because the results were flawed, and the developer resisted collaboration. The situation could've been a great opportunity to learn and improve, but instead, it became contentious. Here's the link if you're curious: Hacker News Discussion.

4. Use Real-World Workloads: Beyond "Hello, World!" or Fibonacci, include tasks like file I/O, network requests, or concurrent processing. These reflect scenarios developers actually encounter.
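To make point 2 concrete, here is a minimal sketch of what a startup-adjusted number could look like, assuming we treat a language's hello-world mean as its startup cost. The `adjusted_time` helper and the 1900.0 ms full-benchmark mean are made up for illustration; the 275.0 ms startup figure is the Clojure hello-world mean reported above.

```shell
# Hypothetical helper: estimate compute-only time by subtracting a
# language's hello-world (startup) mean from its full benchmark mean.
adjusted_time() {
  # $1 = startup mean in ms, $2 = full benchmark mean in ms
  awk -v s="$1" -v b="$2" 'BEGIN { printf "%.1f", b - s }'
}

# Clojure: 275.0 ms startup (hello-world run above); 1900.0 ms is a
# hypothetical full-benchmark mean, just for illustration
adjusted_time 275.0 1900.0   # prints 1625.0
```

Subtraction like this is fragile (startup and compute can overlap), so reporting the two numbers side by side, as point 2 suggests, is probably the safer presentation.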

BTW, thanks so much for testing my suggestion yesterday... I'm stunned by what you came up with today... Kudos, and WOW on the OOP version!!!

I think these steps could really elevate the benchmarks and make them valuable to a wide audience, especially in ML and data science as well. Would love to see where we take this.

@PEZ
Collaborator Author

PEZ commented Jan 2, 2025

I believe we should try to optimize the Java, Scala, C, C++, and other languages to the same degree

To clarify my changes: they were only needed because the Clojure compiler happened to hit a problem with C2 Escape Analysis. I don't think any other language has a similar problem, so no similar work is needed for them.

For any language that we compile with GraalVM native-image, we can easily add PGO. I can do that for Java, but for Scala it is a bit too opaque to me how the native compile is done, so someone with more Scala knowledge than I have needs to take a look.

Since my experiment with subtracting start times failed so horribly, I really don't see a path forward there. As I can't spend the time to get language-specific benchmarking tools up and running for each language, and I don't have the patience to wait much longer for the benchmarks to run, my current approach has been to go for each language's native compile. It brings the start time problem down significantly.

In that effort I have investigated if we can do native builds for Julia, but I don't think we can. It can package standalone executables, but they will not be native compiles, like the JVM and .Net builds are. I have also tried, but not yet succeeded in compiling C# and F# with AOT on my machine.

As for reporting start time separately, that's what the hello-world benchmark is about. I think just publishing a few reports with hello-world results will help. I'm considering removing the start-time-mode from my visualizer and just adding hello-world side by side with the other benchmarks.
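For reference, a hello-world benchmark entry in the same shape as the other programs might look like the sketch below. The class name `HelloWorld` is just for illustration (the repo's programs are classes named `code` in per-language packages); it accepts the same numeric argument as the other benchmarks but ignores it, so the measured time is essentially runtime startup.

```java
// Illustrative hello-world benchmark entry: takes the usual numeric
// argument but ignores it, so a hyperfine run of this program
// measures little more than runtime startup.
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}
```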

PEZ added a commit that referenced this issue Jan 2, 2025
Java: Enable native-image PGO compile as mentioned in #220 comments
@PEZ
Collaborator Author

PEZ commented Jan 2, 2025

Java PGO compiles are enabled now.

It makes Java move in among the top tier of 1B loops:

image

It still doesn't outperform the Clojure Fibonacci:

Java PGO Fibonacci

(Probably we can take the disassembled Clojure compile and base the Java contribution on that, to make them equally good, haha.)

@Gkodkod
Contributor

Gkodkod commented Jan 3, 2025

@PEZ Excited to catch up next Sunday and looking forward to having more fun face-to-face with @bddicken! I'm a bit under the weather at the moment but hope to feel better by then. In the meantime, I’ll leave some comments here.

I’m also catching up on over 140 new updates from others (just fetched)—it’s heartwarming to see so much progress! I also realized I have some code on my branch that I forgot to check in for your review.

I absolutely love what you’ve done so far—you’ve practically "read my mind" regarding Java. I saved your optimized Clojure code and have been reflecting on it further.

As for Native and PGO, I’ve come up with this idea (apologies, my bed is calling, so I couldn’t prepare a full PR). Perhaps you could give it a try? I’m curious if it could further improve our results:

package jvm;

/**
 * Optimization Techniques:
 * Used @CompilerHint annotation to provide optimization hints
 * Added warm-up phase to help JIT compiler
 * Included performance timing
 * Maintained original recursive approach
 */

public class code {
    // Inline-friendly recursive Fibonacci implementation
    private static int fibonacci(int n) {
        if (n <= 1) return n;
        return fibonacci(n - 1) + fibonacci(n - 2);
    }

    // JIT and Native Image optimization hint
    @CompilerHint(inline = true)
    private static long computeFibonacciSum(int upperBound) {
        long result = 0;
        for (int i = 1; i < upperBound; i++) {
            result += fibonacci(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Input validation
        if (args.length == 0) {
            System.err.println("Please provide an upper bound as an argument.");
            System.exit(1);
        }

        try {
            int upperBound = Integer.parseInt(args[0]);
            
            // Warm-up phase for PGO
            for (int warmup = 0; warmup < 1000; warmup++) {
                fibonacci(10);
            }

            // Actual computation with timing
            long startTime = System.nanoTime();
            long result = computeFibonacciSum(upperBound);
            long endTime = System.nanoTime();

            System.out.println("Result: " + result);
            System.out.printf("Computation time: %.3f ms%n", 
                (endTime - startTime) / 1_000_000.0);
        } catch (NumberFormatException e) {
            System.err.println("Invalid input. Please provide a valid integer.");
            System.exit(1);
        }
    }

    // Annotation to help JIT and Native Image optimization
    @interface CompilerHint {
        boolean inline() default false;
    }
}

Bash Script - run.sh

#!/bin/bash
# Native-image build with PGO
# Run this script from the fibonacci/jvm directory
# Prerequisites:
# - Install Oracle GraalVM (PGO is an Oracle GraalVM feature, not in the CE edition)
# - Set JAVA_HOME to the GraalVM directory

# Compile the Java source (the class is jvm.code, so the classpath
# root is the parent directory)
javac code.java

# Build an instrumented image and run it once to collect a profile;
# the run writes default.iprof in the current directory
native-image --pgo-instrument -cp .. jvm.code -o code-instrumented
./code-instrumented 30

# Build the optimized image using the collected profile
native-image --pgo=default.iprof -cp .. jvm.code -o code

# Run the native executable
./code 30

Thank you for all the amazing progress and have a lovely weekend !

@PEZ
Collaborator Author

PEZ commented Jan 19, 2025

Here's a PR aimed at fixing this issue:

It is a start, with reference implementations of the tooling and all benchmarks for Clojure, Java, and C. I'm thinking we can probably crowdsource the remaining transition reasonably fast.
