The benchmark penalizes languages that need to start a runtime environment (e.g. anything on the JVM or .Net) #220
Comments
See also: [link] where there is some more discussion and benchmark results for Clojure and Java. |
TBH, for .NET the default and only option should just be NativeAOT with optimizations set for speed. For any of these kinds of benchmarks, running non-AOT just puts a handicap on the language/runtime.
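For reference, a sketch of what that could look like when publishing (the runtime identifier and project are placeholders):

```bash
# Publish a .NET project as a NativeAOT binary, optimized for speed.
# PublishAot and OptimizationPreference are standard MSBuild properties.
dotnet publish -c Release -r linux-x64 \
  -p:PublishAot=true \
  -p:OptimizationPreference=Speed
```
|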
@AndrewBabbitt97 I'm beginning to think the same. |
The same applies to Julia, which is a just-in-time-compiled language: you're mainly measuring startup time; the time to run the function is a fraction of it. I slightly modified the code to be able to measure the runtime separately:

```diff
diff --git a/levenshtein/julia/code.jl b/levenshtein/julia/code.jl
index 04ab029..af88189 100644
--- a/levenshtein/julia/code.jl
+++ b/levenshtein/julia/code.jl
@@ -45,8 +45,8 @@ function levenshtein_distance(s1::String, s2::String)::Int
     return prev_row[m + 1]
 end
 
-function main()
-    if length(ARGS) < 2
+function main(strings::Vector{String})
+    if length(strings) < 2
         println("Please provide at least two strings as arguments.")
         exit(1)
     end
@@ -55,10 +55,10 @@ function main()
     times = 0
 
     # Compare all pairs of strings
-    for i in 1:length(ARGS)
-        for j in 1:length(ARGS)
+    for i in eachindex(strings)
+        for j in eachindex(strings)
             if i != j
-                distance = levenshtein_distance(ARGS[i], ARGS[j])
+                distance = levenshtein_distance(strings[i], strings[j])
                 if min_distance == -1 || distance < min_distance
                     min_distance = distance
                 end
@@ -67,11 +67,12 @@ function main()
         end
     end
 
-    println("times: ", times)
-    println("min_distance: ", min_distance)
+    return times, min_distance
 end
 
 # Run main only if script is run directly
 if abspath(PROGRAM_FILE) == @__FILE__
-    main()
+    times, min_distance = main(ARGS)
+    println("times: ", times)
+    println("min_distance: ", min_distance)
 end
```

Then what you're measuring is:

```console
$ input=($(cat ../input.txt))
$ cmd="julia code.jl ${input[@]}"
$ hyperfine --shell=none "${cmd}"
Benchmark 1: julia code.jl aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb cccccccccccccccccccccccccccccccccccccccccccccccccccc ddddddddddddddd eeeeeeeeeeeeeeeee ffffffffffffff ggggggggggg hhhhhhhhhhhhhhhhhhhhhhhhhhhhh iiiiiiiiiiiiiiiiiiiiiiiiiiiii jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk llllllllllllllllllllllllllllllllllllllllllllll mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn ooooooooooooooooooooooooooooooooo pppppppppppppppppppppppppppppppppppppppppp qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq rrrrrrrrrrrrrrrrrrrrrrrrrrrrr ssssssssssssssssssssssssssssss tttttttttttttttttttttttttt uuuuuuuuuuuuuuuuuuuuuuuuu vvvvvvvvvvvvvvvvvvvvvvvvvvvvv wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm 1234567890 QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm1234567890 mnxbckfjquwhduywetdftfqrcfgzbxmcsdlsppweuouhftvjcsmnbsdvuqtfvcjbxmcnbkaoqwerpweiurigjhdfjbmsbjgqiuyeiruyoisbnsvuyfguweygrhoaiosjdkjhfbsjhdvfkqjnoi gaywgeudgaewytfvajsdvfjahsdvfjhsdfjhgsfjhqowiueryiuweytruiyytfyqvhgvzxncbsidfjdflsdfpwoeiriuwheuyfvydtvqyuygbhskdjfbkajbuywgduywegfvsnmwehriygwuerygwerdfjhsdfsdkfhbsjdhfjuavufywefhuwlihfoqwehqrwheiruyuyqgruwieqrhsdjhfbjwhvdfisdufhikquhweifgbwfisdhfksjdhfiwuh ASDFHWERFIWSHHSDFJWERPWFOAHDSDFXMCVNWMERKJQEOQIWRUQIHJSBFSDFKJSDFOWIERIOU werkjhiauwryisydgcfkjsdfkjsjdhfguiyqgtfsdfSDGFDSFGwsrhqwigdeyiwefDFBdkqedfgdFHGHKJiFHTYRGFsefjhwhsgdfjhhjsdfDWEfsdfWEFfbgFGjYuioIOpvbnVBNSDFadfsSDFwegsdgfAWFDFGDFghjTyIGHJREGFsddqwdsdfweaWQAZZSDGgnlpHmHJMgkOYUTDFGSFSDFSDFDHGDFGSDFDGRFjhbjshdgfjhgsdfSDGDFG kjsdfkhgwieyfguwsyefgsdfSDFGJSDAKFDSAFIRFYUDSFHSBVXCVBNMSDKFJWOYSDFHKADSDnfsdfjbjshdbfkwjbfksdbfhwbkyguygyfshbcdmnxbvcmnsdkfsdfsdflspflwoekorwueiruygwuygjshbvnbvzcjsuhfdiuhsdfghkwjhdfiwegfjdhsgfbksjhfksdhgfhsdhfghfgfgfsjdjfhkwjehoueuq abcdefghijklmnopqrstuvwxyz thistringisnottoolong howcanwemakeabetterstringforcomparing absfeiihdkkeiurmnslkspoiqwrjlnjna jsfjlqpowiuewriugbsdmfasgdfqsfwwiewurmxbcaslkdjp qwerdsdftffygfuhbuhiok sfhdgafsdhafsdjaweqwueytuqwyefgagsdgfasdfajfgsieufowerpoafposdfmnxbcvjlkhhgdfkjasfguysdfuayfgjhvf ygiuwoqupqowei sjxvkzjha hfgauyiuyrwer sdf dsfgsdfdfg dfgdswefdsdgd ffhfghsdfghsdfeuryiwuyeriudhfmnsbmnbvsgdfkuweifgwefgisdgf sdfskdfhgksugfweriuwersdkfjbsewoir sfdqrieyft sdfnueyrtuwyerowierpoipmvmcxnvmnbsdnfbajeygruywgefugsdbaqwriuweuywiuyer dsfjgfuywegfugsufgqfgdjsdhfjhsdgfhsgdfiweoruoweruwer sdfhyairuywoerfuetuopiufudsvxcbvmsdflksfwpeoriiueief dsfygweurytwueyr sdfshfgwegfisfisdgfowuyeridfs sdfhsdfgsudyfguwyef sdfjsgdfysgdfuwgueyfs dfsgdfusgdfiweoriuwqoiru ygdifuwuoerupweurpquerowehmnxcbvmnxbvjcquqyerypeiproweur fhsdgfygsdfuygsudfgufrpesifgsdfshdfbuwyveugfud fefgiuyegsiufygdfigreoquyarcfdbnmnsbfdlsdfoiweprisdfqbyefg sdfugwuefgusdygfuygwdufytwuqytueryt efsgdvfuysgfdusydfuywguyeryoqurwreiueieieiiii
  Time (mean ± σ):     192.0 ms ±   7.8 ms    [User: 369.2 ms, System: 30.7 ms]
```

but the actual run time spent in the `levenshtein_distance` function is only a fraction of that.
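One way to put a number on the startup share (a rough sketch, not part of the original measurement): time an empty Julia program and compare it against the full run above.

```bash
# Estimate pure Julia startup cost by timing an empty program.
hyperfine --shell=none 'julia -e ""'
# The difference between this mean and the ~192 ms above approximates
# the time actually spent computing the distances.
```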
Edit: just noticed this issue overlaps with #149. |
Hey @PEZ, in regard to the LinkedIn post you wrote today and our conversation there:
Let me try to explain, and suggest some ideas to try. The core problem is likely related to how the C2 JIT compiler performs Escape Analysis during the Profile-Guided Optimization (PGO) process.
The key observation is the inconsistent behavior across multiple runs, where performance suddenly degrades after a few iterations. Not sure if my explanation is clear enough, yet I have some ideas for you. I'll suggest a few alternative Clojure Fibonacci implementations that might help circumvent the Escape Analysis issue:
(Four implementation sketches followed here, each listed with its pros and cons.)
These implementation suggestions, of course, need to be benchmarked properly, using hyperfine like all the other languages. They use primitive-type hints and arithmetic while minimizing object creation, and prefer tail recursion or iteration. They also make escape analysis "obvious," reducing GC pressure while making it easier for the JIT to optimize. They should work well with Profile-Guided Optimization, I hope. Let @bddicken and me know your findings, Master, and a most Happy New Year! |
Thanks, @Gkodkod ❤️ I have spent a significant amount of time in this repo rejecting contributions that go for memoization, tail recursion, and iteration, since that is not in the spirit of the benchmark, which seeks to measure a language's function-call overhead. So we can't really go there with this one. As for the primitive-focused recursive approach, I think that's largely what we do already:

```clojure
(defn- fibonacci ^long [^long n]
  ((fn fib ^long [^long n]
     (if (<= n 1)
       n
       (+ (long (fib (- n 1)))
          (long (fib (- n 2)))))) n))
```

Note that we actually recurse in a named lambda. This is because using a namespaced function for it makes the Clojure compiler produce a different kind of function call, which hurts performance badly. I think it is something with this "private" function where C2 goes wrong. The type hints and casts are enough for the Clojure compiler to understand that it should do unchecked integer math. The above function, when compiled, decompiles to this:

```java
// Decompiling class: code$fibonacci$fib__10924
import clojure.lang.*;

public final class code$fibonacci$fib__10924 extends AFunction implements LL
{
    @Override
    public final long invokePrim(final long n) {
        return (n <= 1L) ? n : (RT.longCast(this.invoke(Numbers.num(Numbers.unchecked_minus(n, 1L)))) + RT.longCast(this.invoke(Numbers.num(Numbers.unchecked_minus(n, 2L)))));
    }

    @Override
    public Object invoke(final Object o) {
        return ((IFn.LL)this).invokePrim(RT.uncheckedLongCast(o));
    }
}
```

As far as I know, that's as good as it gets. And we are looking at the benchmark-winning code, which I think agrees with that assessment. I've also tried using a
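For reference, the decompiles above can be reproduced along these lines (a sketch; it assumes a deps.edn with both `src` and `classes` on the classpath, and CFR is just one of several Java decompilers that would work):

```bash
# AOT-compile the namespace, then decompile the generated class file.
mkdir -p classes
clojure -M -e "(compile 'code)"
# The numeric suffix in the class name varies between compiles.
java -jar cfr.jar 'classes/code$fibonacci$fib__10924.class'
```
|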
@PEZ Thank you for explaining your thoughts about this so eloquently. I also appreciate the decompiled Java produced from your winning code. With that said, allow me to compare this code to my suggested primitive-focused recursive implementation with regard to performance differences; of course, you can decide which way you want to go.

Current code: (see above)

My suggested implementation: (direct recursion with primitive hints)

Key differences:
1. Nested anonymous function vs direct recursion
2. Explicit type casting
3. Condition check
4. Arithmetic operations
5. Complexity

The performance implications between the two are slight and might make a difference in the decompiled Java code. My suggested implementation is likely to be friendlier to Escape Analysis because it could potentially produce fewer function calls, less explicit type casting, and more direct object lifetimes, and it uses unchecked- operations which hint at primitive behavior. Again, it is your call whether you would like to try testing it. I would love to see if it improves the actual test results and the decompiled Java, including the main, so I can further analyze them both. I wish I had more time to spend on the startup-time issues you've mentioned; I actually had some ideas, yet I cannot offer any advice without clearly understanding the issues from you. It is a brilliant idea to include it in your latest benchmark visualization. Additionally, allow me to once more thank you for the excellent time and work you've put into the whole repo, and especially the most brilliant benchmark visualization repo you put together. It's so amazing and helpful for so many!!! ❤️❤️❤️ Happy New Year to you and the whole family! ❤️❤️❤️ |
Thanks again! It's a new year here now. 🎉 The problem with using direct recursion is that it generates a different, less performant, way of doing the function call. Here are benchmarks:

Nested anonymous function:

Direct recursion:
We can see it in a decompile:

```java
public final class code$fibonacci extends AFunction implements LO
{
    public static final Var __code_fibonacci;

    public static Object invokeStatic(final long n) {
        return (n < 2L) ? Numbers.num(RT.longCast(n)) : Numbers.unchecked_add(((LO)__code_fibonacci.getRawRoot()).invokePrim(n - 1L), ((LO)__code_fibonacci.getRawRoot()).invokePrim(n - 1L - 1L));
    }

    @Override
    public Object invoke(final Object o) {
        return invokeStatic(RT.uncheckedLongCast(o));
    }

    @Override
    public final Object invokePrim(final long n) {
        return invokeStatic(n);
    }

    static {
        __code_fibonacci = RT.var("code", "fibonacci");
    }
}
```

It's those `getRawRoot()` lookups through the Var on every call that hurt. However, the calculations do look more straightforward, so I will now try to apply those to the nested lambda version. |
Found a solution. It ain't pretty. 😄

```clojure
(ns code
  (:gen-class))

(set! *unchecked-math* :warn-on-boxed)

(definterface IFib
  (^long fib [^long n]))

(deftype Fibonacci []
  IFib
  (fib [_ n]
    (if (or (zero? n)
            (== 1 n))
      (long n)
      (long (+ (.fib _ (- n 1))
               (.fib _ (- n 2)))))))

(def ^:private ^Fibonacci fibonacci (Fibonacci.))

(defn -main [& args]
  (let [u (long (parse-long (first args)))
        r (loop [i 1
                 sum 0]
            (if (< i u)
              (recur (inc i) (+ sum (long (.fib fibonacci i))))
              sum))]
    (println r)))
```

The decompile of the `Fibonacci` type looks like this:

```java
public final class Fibonacci implements IFib, IType
{
    public static IPersistentVector getBasis() {
        return Tuple.create();
    }

    public long fib(final long n) {
        final boolean or__5600__auto__12362 = Numbers.isZero(n);
        return (or__5600__auto__12362 ? or__5600__auto__12362 : Numbers.equiv(1L, n)) ? n : (((IFib)this).fib(n - 1L) + ((IFib)this).fib(n - 2L));
    }
}
```

And then PGO can be used without fear of deterioration. And it makes a significant difference.
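A sketch of how that difference can be checked with hyperfine (the binary names are placeholders):

```bash
# Compare the PGO build against the plain native-image build.
hyperfine --shell=none './fibonacci-nopgo 40' './fibonacci-pgo 40'
```
|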
scala-native/scala-native#4132. Clojure rocks; this project does turn up some interesting findings on performance. |
It also seems my M1 loves Clojure extra much (not strange, I've been coding Clojure a lot on it for three years). On M4 Clojure does not win, even if it still performs very well. |
Happy New Year @bddicken, @PEZ, and congratulations on the Clojure optimization. Love your persistence, and so awesome of you!!! It's very interesting, and I believe we should try to optimize the Java, Scala, C, C++, and other languages to the same degree, if possible. This will truly be descriptive of the concept that @bddicken originally created, rather than putting most of the stress on one language. It's also thought-provoking that some perform with very similar results.

In terms of the JVM start times, I've given it some thought after seeing your comment above, which I missed yesterday. The differences in startup times between languages like C and VM-hosted ones like Java or Clojure can really skew the benchmarks, especially for something like "Hello, World!" where the startup overhead dominates. It's frustrating because, in the real world, that startup cost often doesn't matter for long-running tasks, but it's huge for short-lived ones, like scripts. I love that you're looking into ways to make the benchmarks fairer. Personally, I think it's worth exploring. There's so much to unpack here. For example:

1. Including a startup benchmark and subtracting it: I like this idea; it makes sense for short-lived tasks. But startup time isn't always constant (e.g., class loading and initialization complexity can vary), so it's worth testing how accurate this adjustment would actually be (see the sketch after this list).
2. Using tools that run inside the JVM: For Java and Clojure, running the benchmarks within the JVM, e.g. with JMH (Java Microbenchmark Harness), would give a much clearer picture of their steady-state performance without startup times muddying the waters. Tools like JMH tackle JVM warm-up effects but don't directly address startup time differences, so this shifts the focus away from startup time, which is sometimes the very thing you're trying to measure.
3. Running multiple iterations: This is probably my favorite approach. If you add a second argument to run Fibonacci 40, say, 1000 times, startup time becomes a much smaller factor. It's a simple way to make the results more reflective of long-running use cases: running Fibonacci 40 a million times makes startup time almost negligible and reflects the typical usage pattern of long-running applications.
4. Extending loops or using more complex workloads: Similar to the above, this reduces the startup impact, but we'd need to be careful with non-linear algorithms like Fibonacci since they can skew the results in other ways and could disproportionately favor certain languages. Maybe adding some real-world scenarios, like computational pipelines, file processing, or network I/O, could balance things out, but then we are entering another realm of library implementations of such scenarios. I'm actually working on scaling and optimizing computational pipelines in Spark, getting lots of headaches, and thank God for coffee! 😄

Honestly, your work here is super valuable. The tech community, especially places like Reddit, has been calling out these benchmarking quirks for years, and it's awesome to see someone taking it seriously. If you're planning on submitting a PR, I think you'd get a lot of support.
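For point 1, here's a rough sketch of how the subtraction could be automated with hyperfine's JSON export (the `./hello-world` and `./fibonacci` commands are placeholders):

```bash
#!/usr/bin/env bash
# Estimate "pure" benchmark time by subtracting hello-world startup time.
set -euo pipefail

hyperfine --shell=none --export-json startup.json './hello-world'
hyperfine --shell=none --export-json bench.json './fibonacci 40'

# hyperfine reports means in seconds in its JSON export.
startup=$(jq '.results[0].mean' startup.json)
total=$(jq '.results[0].mean' bench.json)

echo "adjusted mean: $(echo "$total - $startup" | bc -l) s"
```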
Just be sure to document your approach clearly and explain when adjustments (like subtracting startup times) are appropriate and when they're not. At the end of the day, startup time matters a lot for things like scripting, and tools like Babashka really shine there (first time I hear of it, and it sounds like something we'd call our grandma or use as a sweet nickname, or is that Babushka?). 😄 For long-running processes, it's a completely different story. These benchmarks could help show the full picture and make these comparisons more meaningful for different use cases. Exploring ways to make benchmarks more representative of real-world use cases is quite valuable, and perhaps we should explore that as well. Again, love the visuals!!! ❤️ WDYT, Ben? |
If you're open to suggestions, here's a plan of action that might help make the benchmarks even more meaningful:

1. Benchmark Categories: Split the benchmarks into startup-focused and execution-focused groups to reflect different scenarios. This way, you can highlight where each language shines (or struggles).
2. Report Breakdown: Include startup time as a separate metric alongside execution time for VM-hosted languages. This would give readers a clearer picture of what's really happening, especially for short-lived vs. long-running tasks.
3. Collaborate with the Community: Platforms like Reddit and Hacker News are great for bouncing ideas off others. They often uncover edge cases or optimizations we might overlook. I've seen this firsthand with a performance benchmark comparing languages for concurrent task handling, originally created by a European PhD and then taken over by an anonymous .NET enthusiast. It got a lot of attention but also backlash because the results were flawed, and the developer resisted collaboration. The situation could've been a great opportunity to learn and improve, but instead it became contentious. Here's the link if you're curious: Hacker News Discussion.
4. Use Real-World Workloads: Beyond "Hello, World!" or Fibonacci, include tasks like file I/O, network requests, or concurrent processing. These reflect scenarios developers actually encounter.

BTW, thanks so much for testing my suggestion yesterday. I'm stunned by what you came up with today. Kudos, and WOW on the OOP version!!! I think these steps could really elevate the benchmarks and make them valuable to a wide audience, especially in ML and data science as well. Would love to see where we take this. |
To clarify my changes: they were only needed because the Clojure compiler happened to hit a problem with C2 Escape Analysis. I don't think any other language has a similar problem, so no similar work is needed for them. For any language that we compile with GraalVM native-image, we can easily add PGO. I can do that for Java, but for Scala it is a bit too opaque to me how the native compile is done, so someone with more knowledge about Scala than I have needs to have a look. Since my experiment with subtracting start times failed so horribly, I really don't see a path forward there. As I can't spend the time to get language-specific benchmarking tools up and running for each language, and I don't have the patience to wait much longer for the benchmarks to run, my current approach has been to go for each language's native compile. It brings the start-time problem down significantly. In that effort I have investigated whether we can do native builds for Julia, but I don't think we can. It can package standalone executables, but they will not be native compiles the way the JVM and .NET builds are. I have also tried, but not yet succeeded, in compiling C# and F# with AOT on my machine. As for reporting start time separately, that's what the hello-world benchmark is about. I think just publishing a few reports with hello-world results will help. I'm considering removing the start-time mode from my visualizer and just adding hello-world side by side with the other benchmarks. |
Java: Enable native-image PGO compile as mentioned in #220 comments
@PEZ Excited to catch up next Sunday and looking forward to having more fun face-to-face with @bddicken! I'm a bit under the weather at the moment but hope to feel better by then. In the meantime, I’ll leave some comments here. I’m also catching up on over 140 new updates from others (just fetched)—it’s heartwarming to see so much progress! I also realized I have some code on my branch that I forgot to check in for your review. I absolutely love what you’ve done so far—you’ve practically "read my mind" regarding Java. I saved your optimized Clojure code and have been reflecting on it further. As for Native and PGO, I’ve come up with this idea (apologies, my bed is calling, so I couldn’t prepare a full PR). Perhaps you could give it a try? I’m curious if it could further improve our results:
Bash script, `run.sh` (sketched below):
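Something along these lines, assuming Oracle GraalVM's native-image with its `--pgo-instrument`/`--pgo` flags (PGO is not available in the Community Edition); the jar name and workload argument are placeholders:

```bash
#!/usr/bin/env bash
# run.sh (sketch): GraalVM native-image build with Profile-Guided Optimization.
set -euo pipefail

# 1. Build an instrumented binary that records an execution profile.
native-image --pgo-instrument -jar code.jar -o code-instrumented

# 2. Run it on a representative workload; this writes default.iprof.
./code-instrumented 40

# 3. Rebuild with the profile fed back in and optimizations for speed.
native-image --pgo=default.iprof -O3 -jar code.jar -o code

./code 40
```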
Thank you for all the amazing progress, and have a lovely weekend! |
Here's a PR aimed at fixing this issue. It is a start, with reference implementations of tooling and all benchmarks for Clojure, Java, and C. I'm thinking that we can probably crowdsource the remaining transition reasonably fast. |
Hi,
Thanks for creating this project, @bddicken! 🙏
I have been running benchmarks in some different ways over the last few days. I notice that a lot of the difference between the Java performance and C is about JVM start times. And for Clojure the penalty is much bigger, because then Clojure also needs to start before the benchmarks can start crunching. To "see" the start time I created a `hello-world` benchmark, where the programs only print "Hello World!". Here's a run on my Mac M4: basically, the JVM takes 40 ms to start, and then Clojure takes another 240 ms. Compare that with the benchmark times for e.g. C, Java, and Clojure, which run at 250, 350, and 550 ms, respectively, on the same machine.
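For reference, a sketch of how such a startup comparison can be run with hyperfine (the program names are placeholders):

```bash
# Compare bare startup cost across languages.
hyperfine --shell=none --warmup 3 \
  './hello-c' \
  'java HelloWorld' \
  'java -jar hello-clojure.jar'
```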
I can submit the hello-world benchmark as a PR, if you're interested. Start times can be very important for e.g. shell-scripting purposes and such. In the Clojure world we have Babashka, which is interpreted Clojure. It starts super quickly, but of course can't compete with compiled languages in the other benchmarks.
Some questions: what if the programs ran the benchmark `x` times, with a significantly high `x`? Then the startup times would play a much lesser role.

WDYT?