Replace entire suite with one that has already been golfed to death #157
There's also Dave Plummer's prime sieve benchmark.
Thanks for the feedback @headius. I'm sure you know far more about benchmarking and compilers than I do, so it means a lot to get your input like this. I'll reply to a few of the things you brought up:
To be honest, this kinda snowballed (as you might have seen on Twitter). I started by getting just a modest number of PRs, and then after I published a round of results I got over 100, and then things just kinda kept rolling. I actually think it's pretty cool to be able to crowdsource a language comparison like this, even if it isn't perfectly academic, which it certainly is not (I worked in academia for 7 years!).
We actually do warmup runs and multiple timed executions, using hyperfine. If you have ideas for how to configure hyperfine differently, please make a PR.
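(For reference, a minimal sketch of how warmup and repeat counts can be passed to hyperfine. The `--warmup` and `--runs` flags are hyperfine's; the benchmarked command and the counts shown here are placeholders, not this repository's actual settings.)

```sh
# Illustrative only: the --warmup runs are executed and discarded,
# then the --runs timed executions are measured and summarized.
# "./run-benchmark java" is a hypothetical command, not this repo's script.
hyperfine --warmup 3 --runs 10 './run-benchmark java'
```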
I don't necessarily view this as a bad thing. There are times when this "doesn't matter," for example in a long-running process like a webapp server designed to stay running for weeks or months at a time. However, there are real-world scenarios where the cost of startup time matters, for example a job that gets repeatedly spun up via cron a million times per day. Including startup time doesn't make a comparison invalid; it just has to be understood what is being tested.
I'm curious: how much work has this caused?
One such case is Lambda serverless functions. There, the biggest benefit goes to executables with fast startup times.
Hyperfine does nothing to warm up the actual language runtimes. It only warms up the OS by ensuring that filesystem caches have loaded that runtime and the code it will run. Many (I would say most) non-precompiled languages depend on runtime optimization of code within one process. They need time to profile code, optimize it, JIT-compile it to native code, and tune other aspects at runtime like GC heap size and strategies. Hyperfine does nothing to help this situation, and by running benchmarks once, in a cold process, you see almost none of those runtime optimizations.
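(To make this concrete, here is a minimal, hypothetical Java sketch of in-process warmup. The fib workload and iteration counts are illustrative and not taken from this suite; the point is only that repeated iterations inside one JVM eventually run JIT-compiled code, while a single cold-process run never gets there.)

```java
public class WarmupDemo {
    // Small workload standing in for a benchmark body (illustrative only).
    static long fib(int n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        // Time the same workload several times in one process.
        // Early iterations include interpretation and JIT compilation;
        // later iterations run the optimized code and are typically faster.
        for (int i = 1; i <= 10; i++) {
            long start = System.nanoTime();
            long result = fib(32);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("iteration %2d: %d ms (fib=%d)%n", i, elapsedMs, result);
        }
        // Timing a whole cold process once instead folds startup and the
        // slower, unoptimized early work into the single number reported.
    }
}
```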
Then your benchmark is not a language performance benchmark, it's a language runtime startup benchmark being marketed as a language performance benchmark. An argument made by the Benchmarks Game folks goes like this: if it runs long enough, the cost of startup should fade away because runtime optimizations have time to run, and indeed that suite also does not do any in-process warmup before taking measurements. But timing a full process, including the warmup and optimization phase, will always penalize languages that are more dynamic and require runtime optimizations to reach the speed of pre-optimized code. As a result it is a useless metric for measuring the real-world, steady-state performance of a language and its runtime. Java JIT compilers can easily match or exceed the performance of compiled C or C++ code. They do this by profiling and optimizing at runtime, and rarely achieve maximum performance in the first couple of minutes of execution.
When I first replied to your posts, I'd had one question from a Ruby community member about why JRuby didn't perform well and how to fix it so that it performed the way it ought to (for a blog post about the Ruby performance numbers). Since then I've heard from four other people who needed the same explanations about the flaws in this suite. I expect to be answering these questions for years because of the sudden interest in this suite. Every time one of these meaningless benchmarks goes viral, I have to steel myself for years of answering the same questions.
Are there many Lambda serverless functions running useless loops or calculating fibonacci numbers? And what percentage of the world's computation is done by Lambda-style deployments versus long-running servers handling millions of requests per day? This argument is not unreasonable, but Lambda-style "serverless" deployments are already tailored to languages and runtimes that can start up quickly and are not penalized by restarting. That basically removes every dynamic language and every JVM language from contention, since they almost all depend on runtime profiling to achieve their best performance. This benchmark suite has been sold as a comparison of language performance, not a measure of language runtime startup performance. If it were sold as the latter I wouldn't care, and I wouldn't be answering questions every other day about why Ruby implementations (JRuby in particular, which is both a dynamic language and a JVM language) don't perform as well as Rust or Go or Java with ahead-of-time compilation (which is a valid case for comparing startup, but not for comparing language performance).
And yet you didn't do any research to see whether there are other crowdsourced, cross-language benchmark suites? I have to ask the question: given that this suite does not contain any real-world representative benchmarks, that it heavily favors languages that optimize once at compile time rather than dynamically at runtime, and that you know it isn't "perfectly academic", then why keep it up? There are better collections of benchmarks (many of them heavily crowd-sourced and testing real-world-ish workloads), with better harnesses (many tailored to actually compare performance in a long-running process). Why create yet another benchmark suite that has to spend years in a debunk cycle? I don't blame you for this going viral, but what's done is done. The "academic" thing to do would be to delete it, or at least add a note about its inaccuracy with links to issues like this one that describe the problems. Your visualization is novel... I would like to see it applied to other suites. But to continue promoting this project as a "language performance" comparison is academically dishonest at best.
And colorful animations are always a good way to attract attention. That attention-grabbing probably makes it more difficult to compare, or even just read, the measurements.
Well, we do try to find examples where in-process JMH measurements are obviously much, much faster for the benchmarks game's tiny, tiny programs, and we don't succeed. Presumably I'm doing it wrong.
@igouy For Java benchmarks of that length, this isn't too surprising. Baseline JVM startup should be well under a second, and none of those benchmarks are large enough to cost much time to load and start. Java will usually optimize within the first couple of seconds for such small code. For heavier runtimes like JRuby, however, the effect would be much more pronounced. We often take a couple of seconds just to start up, and the early iterations will always be slower than later ones, due to the "two-tier JIT" combination of JRuby and the JVM. For a benchmark that takes a few seconds, three to five iterations will be needed to reach steady state in most cases.
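(For readers unfamiliar with JMH, here is a minimal sketch of the kind of in-process measurement being discussed above. The workload and the fork/warmup/measurement settings are illustrative, not the ones used for the numbers quoted in this thread; the relevant point is that JMH runs warmup iterations inside the same JVM and excludes them, along with JVM startup, from the reported score.)

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(1)                        // one forked JVM for the measurement
@Warmup(iterations = 5)         // unmeasured iterations that let the JIT settle
@Measurement(iterations = 5)    // measured iterations taken after warmup
public class FibBench {

    // Illustrative workload; JMH times only the annotated method, so
    // JVM startup and class loading never appear in the reported score.
    @Benchmark
    public long fib() {
        return fib(30);
    }

    private static long fib(int n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }
}
```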
0.0033 mean; 55.3357 mean; 52.561. The 1st and 5th measurements were the only ones below 54s.
I appreciate the discussion here, but I do not intend to replace the benchmark at this time. Happy to field other ideas about ways to improve, new programs we could use for comparisons, etc.
@bddicken Just use the benchmarks from the Benchmarks Game. They've been golfed and refined for decades.
Isn't it ordinary to re-invent the round wheel as a hexagon?
I do not believe this suite provides anything better than the many existing benchmark suites out there. There are currently two benchmarks: loops and fibonacci. Neither provides any useful performance metric for any tested language. The benchmark harness itself just runs each implementation once from the command line, which prevents many optimizing runtimes from optimizing any code and includes startup time for all runtimes.
There are many other benchmark suites that have already iterated over similar benchmarks (plus include several that do real work), such as the Benchmarks Game: https://benchmarksgame-team.pages.debian.net/benchmarksgame/index.html
Individual language implementations also frequently have more complete suites of representative microbenchmarks, such as those for PyPy and Ruby's YJIT.
And there are many existing harnesses for benchmarks that can deliver more meaningful results. They do things like run multiple times, run repeatedly in a single process, and isolate the execution of the runtime and the benchmark harness from timing results.
By having yet another microbenchmark suite that doesn't measure anything useful, and does so in the most inaccurate way, this repository is creating work for language and runtime implementers, who now have to answer questions about "why aren't you fast" and submit PRs to fix benchmarks that are useless to begin with.
I would recommend replacing this suite with one that has valuable benchmarks run in a meaningful way, or else just delete it.