Replace entire suite with one that has already been golfed to death #157
There's also Dave Plummer's prime sieve benchmark.
Thanks for the feedback @headius. I'm sure you know far more about benchmarking and compilers than I do, so it means a lot to get your input like this. I'll reply to a few of the things you brought up:
To be honest, this kinda snowballed (as you might have seen on Twitter). I started by getting just a modest number of PRs, and then after I published a round of results I got over 100, and then things just kinda kept rolling. I actually think it's pretty cool to be able to crowdsource a language comparison like this, even if it isn't perfectly academic, which it certainly is not (I worked in academia for 7 years!).
We actually do warmup runs and multiple timed executions, using hyperfine. If you have ideas for how to configure hyperfine differently, please make a PR.
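(For reference, a minimal sketch of how warmup and repeat counts can be passed to hyperfine. The `--warmup` and `--runs` flags are hyperfine's; the benchmarked command and the counts shown here are placeholders, not this repository's actual settings.)

```sh
# Illustrative only: the --warmup runs are executed and discarded,
# then the --runs timed executions are measured and summarized.
# "./run-benchmark java" is a hypothetical command, not this repo's script.
hyperfine --warmup 3 --runs 10 './run-benchmark java'
```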
I don't necessarily view this as a bad thing. There are times when this "doesn't matter," for example in a long-running process like a webapp server designed to stay running for weeks or months at a time. However, there are real-world scenarios where the cost of startup time matters, for example a job that gets repeatedly spun up via cron a million times per day. Including startup time doesn't make a comparison invalid; it just has to be understood what is being tested.
I'm curious: how much work has this caused?
One such case is Lambda serverless functions. There, the biggest benefit goes to executables with fast startup times.
Hyperfine does nothing to warm up the actual language runtimes. It only warms up the OS by ensuring that filesystem caches have loaded that runtime and the code it will run. Many (I would say most) non-precompiled languages depend on runtime optimization of code within one process. They need time to profile code, optimize it, JIT-compile it to native code, and tune other aspects at runtime like GC heap size and strategies. Hyperfine does nothing to help this situation, and by running benchmarks once, in a cold process, you see almost none of those runtime optimizations.
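(To make this concrete, here is a minimal, hypothetical Java sketch of in-process warmup. The fib workload and iteration counts are illustrative and not taken from this suite; the point is only that repeated iterations inside one JVM eventually run JIT-compiled code, while a single cold-process run never gets there.)

```java
public class WarmupDemo {
    // Small workload standing in for a benchmark body (illustrative only).
    static long fib(int n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        // Time the same workload several times in one process.
        // Early iterations include interpretation and JIT compilation;
        // later iterations run the optimized code and are typically faster.
        for (int i = 1; i <= 10; i++) {
            long start = System.nanoTime();
            long result = fib(32);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("iteration %2d: %d ms (fib=%d)%n", i, elapsedMs, result);
        }
        // Timing a whole cold process once instead folds startup and the
        // slower, unoptimized early work into the single number reported.
    }
}
```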
Then your benchmark is not a language performance benchmark, it's a language runtime startup benchmark being marketed as a language performance benchmark. An argument made by the Benchmarks Game folks goes like this: if it runs long enough, the cost of startup should fade away because runtime optimizations have time to run, and indeed that suite also does not do any in-process warmup before taking measurements. But timing a full process, including the warmup and optimization phase, will always penalize languages that are more dynamic and require runtime optimizations to reach the speed of pre-optimized code. As a result it is a useless metric for measuring the real-world, steady-state performance of a language and its runtime. Java JIT compilers can easily match or exceed the performance of compiled C or C++ code. They do this by profiling and optimizing at runtime, and rarely achieve maximum performance in the first couple of minutes of execution.
When I first replied to your posts, I'd had one question from a Ruby community member about why JRuby didn't perform well and how to fix it so that it performed the way it ought to (for a blog post about the Ruby performance numbers). Since then I've heard from four other people who needed the same explanations about the flaws in this suite. I expect to be answering these questions for years because of the sudden interest in this suite. Every time one of these meaningless benchmarks goes viral, I have to steel myself for years of answering the same questions.
Are there many Lambda serverless functions running useless loops or calculating fibonacci numbers? And what percentage of the world's computation is done by Lambda-style deployments versus long-running servers handling millions of requests per day? This argument is not unreasonable, but Lambda-style "serverless" deployments are already tailored to languages and runtimes that can start up quickly and are not penalized by restarting. That basically removes every dynamic language and every JVM language from contention, since they almost all depend on runtime profiling to achieve their best performance. This benchmark suite has been sold as a comparison of language performance, not a measure of language runtime startup performance. If it were sold as the latter I wouldn't care, and I wouldn't be answering questions every other day about why Ruby implementations (JRuby in particular, which is both a dynamic language and a JVM language) don't perform as well as Rust or Go or Java with ahead-of-time compilation (which is a valid case for comparing startup, but not for comparing language performance).
And yet you didn't do any research to see whether there are other crowdsourced, cross-language benchmark suites? I have to ask the question: given that this suite does not contain any real-world representative benchmarks, that it heavily favors languages that optimize once at compile time rather than dynamically at runtime, and that you know it isn't "perfectly academic", then why keep it up? There are better collections of benchmarks (many of them heavily crowd-sourced and testing real-world-ish workloads), with better harnesses (many tailored to actually compare performance in a long-running process). Why create yet another benchmark suite that has to spend years in a debunk cycle? I don't blame you for this going viral, but what's done is done. The "academic" thing to do would be to delete it, or at least add a note about its inaccuracy with links to issues like this one that describe the problems. Your visualization is novel... I would like to see it applied to other suites. But to continue promoting this project as a "language performance" comparison is academically dishonest at best.
And colorful animations are always a good way to attract attention. That attention-grabbing probably makes it more difficult to compare, or even just read, the measurements.
Well, we do try to find examples where in-process JMH measurements are obviously much, much faster for the benchmarks game's tiny, tiny programs, and we don't succeed. Presumably I'm doing it wrong.
@igouy For Java benchmarks of that length, this isn't too surprising. Baseline JVM startup should be well under a second, and none of those benchmarks are large enough to cost much time to load and start. Java will usually optimize within the first couple of seconds for such small code. For heavier runtimes like JRuby, however, the effect would be much more pronounced. We often take a couple of seconds just to start up, and the early iterations will always be slower than later ones, due to the "two-tier JIT" combination of JRuby and the JVM. For a benchmark that takes a few seconds, three to five iterations will be needed to reach steady state in most cases.
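(For readers unfamiliar with JMH, here is a minimal sketch of the kind of in-process measurement being discussed above. The workload and the fork/warmup/measurement settings are illustrative, not the ones used for the numbers quoted in this thread; the relevant point is that JMH runs warmup iterations inside the same JVM and excludes them, along with JVM startup, from the reported score.)

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(1)                        // one forked JVM for the measurement
@Warmup(iterations = 5)         // unmeasured iterations that let the JIT settle
@Measurement(iterations = 5)    // measured iterations taken after warmup
public class FibBench {

    // Illustrative workload; JMH times only the annotated method, so
    // JVM startup and class loading never appear in the reported score.
    @Benchmark
    public long fib() {
        return fib(30);
    }

    private static long fib(int n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }
}
```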
0.0033 mean; 55.3357 mean; 52.561. The 1st and 5th measurements were the only ones below 54s.
I appreciate the discussion here, but I do not intend to replace the benchmark at this time. Happy to field other ideas about ways to improve, new programs we could use for comparisons, etc.
@bddicken Just use the benchmarks from the Benchmarks Game. They've been golfed and refined for decades.
Isn't it ordinary to re-invent the round wheel as a hexagon?
I do not believe this suite provides anything better than the many existing benchmark suites out there. There are currently two benchmarks: loops and fibonacci. Neither provides any useful performance metric for any tested language. The benchmark harness itself just runs each implementation once from the command line, which prevents many optimizing runtimes from optimizing any code and includes startup time for all runtimes.
There are many other benchmark suites that have already iterated over similar benchmarks (plus include several that do real work), such as the Benchmarks Game: https://benchmarksgame-team.pages.debian.net/benchmarksgame/index.html
Individual language implementations also frequently have more complete suites of representative microbenchmarks, such as those for PyPy and Ruby's YJIT.
And there are many existing harnesses for benchmarks that can deliver more meaningful results. They do things like run multiple times, run repeatedly in a single process, and isolate the execution of the runtime and the benchmark harness from timing results.
By having yet another microbenchmark suite that doesn't measure anything useful, and does so in the most inaccurate way, this repository is creating work for language and runtime implementers, who now have to answer questions about "why aren't you fast" and submit PRs to fix benchmarks that are useless to begin with.
I would recommend replacing this suite with one that has valuable benchmarks run in a meaningful way, or else just delete it.