Refactor benchmark tools for statistical significance #7094
Conversation
I'm not sure who to cc for this one.
@nodejs/benchmarking
In theory, this sounds fantastic to me! In practice, there's so much about benchmarking that I'm ignorant about, I have to defer to others.
Very nice. @mscdex @bnoordhuis ... any thoughts on this?
var s;
for (var i = 0; i < n; i++) {
  s = '01234567890';
  s[1] = 'a';
Perhaps this line was added to prevent v8 from optimizing the for-loop away or something (since `s` wouldn't have been referenced)?
Perhaps. With `use strict` it is definitely broken. Looking at the original commit (12a169e, 6 years ago) it seems like it was just a misunderstanding of how strings work. The commit appears to compare strings and buffers, which is not a fair comparison in this case, as strings are immutable.
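As an illustration of that point (a minimal sketch, not code from this PR): string index assignment is silently ignored in sloppy mode and throws under `'use strict'`.

```js
'use strict';

// Minimal sketch: strings are immutable, so index assignment never
// changes them. In sloppy mode the assignment is silently ignored;
// in strict mode it throws a TypeError.
let s = '01234567890';
try {
  s[1] = 'a';
} catch (err) {
  console.log(err.name); // 'TypeError'
}
console.log(s);          // still '01234567890'
```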
Just briefly looking over it, it mostly seems to look ok except for a few nits. I did spot a typo in the
//
// Parse arguments
//
const cli = CLI(`usage: ./node benchmark/compare.js <type> ...
There should probably be some explanation (in the help text) about what `<type>` should be exactly...
are you referring to `<type>`?
Yes, markdown cut that part out.
IIRC there was another one of these in another commit. Just something to watch out for.
They should all be fixed. Unless you are talking about the R scripts, but they only take `--` arguments.
btw, the first few times I tried running the new `compare.js` I found it very confusing that the `type` argument needed to appear before the arguments starting with `--` (i.e. `compare.js --new bla --old blah http` did not work). I almost never use CLIs with that argument order, and just showing this usage text wasn't exactly helpful, either.
You don't need to change the behaviour, but maybe add a note here about that and for the other scripts where it applies?
Can you elaborate on that note? I think it is very specific. This is the message you get now:
usage: ./node benchmark/compare.js <type> ...
Run each benchmark in the <type> directory many times using two diffrent
node versions. More than one <type> directory can be specified. The output is
formatted as csv, which can be processed using for example 'compare.R'.
--new ./new-node-binary new node binary (required)
--old ./old-node-binary old node binary (required)
--runs 30 number of samples
--filter pattern string to filter benchmark scripts
--set variable=value set benchmark variable (can be repeated)
I chose this order because it could be implemented with less code.
I will try to change the argument order; it appears to cause a lot of confusion for many people, but I would love to understand why.
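For concreteness, an invocation matching the usage text above could look like the following; the binary paths and the csv file name are placeholders, and running `compare.R` through `Rscript` is an assumption about how the R script is invoked:

```sh
# Per the usage text above: <type> comes first, then the -- options
# (binary paths and file name are placeholders).
./node benchmark/compare.js http --new ./node-new --old ./node-old --runs 30 > compare.csv

# The csv can then be processed with the R script (exact invocation assumed).
Rscript benchmark/compare.R < compare.csv
```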
I will try to change the argument order; it appears to cause a lot of confusion for many people, but I would love to understand why.
If I had to guess, I'd say it's because that's the order usually suggested in man pages and `--help` texts, and maybe because the positional arguments are the ones one is most likely to spend more time editing before hitting enter… idk, maybe there's more to it.
Oh, I understand the order is confusing (it is fixed now). But this is the third comment I have received about a missing note, and unless I'm misunderstanding the comment, there is a note just one line below.
@mscdex thanks. Updated as suggested.
ping
## How to write a benchmark test
After generating the csv, a comparens table can be created using the `scatter.R`
s/comparens/comparison ?
/cc @nodejs/collaborators
@mscdex What's the semver status of this? Major?
@ChALkeR I don't know how benchmarks are covered when it comes to that kind of thing. I would guess they are treated like tests or docs since they are not a part of the runtime?
I'll go for major; it makes things easier and less complicated. One thing that is not clear from the document is how the statistical significance is achieved.
@mscdex Thanks for the suggestions, I will update the documentation tomorrow. @mcollina It runs each benchmark a given number of times (30 by default, set with `--runs`).
I think the compare documentation is fairly clear on this. But do tell me how I can improve it.
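To sketch what the significance test amounts to (this is not the actual `compare.R`; the csv column names `filename`, `configuration`, `binary` and `rate` are assumptions): each benchmark is sampled many times with both binaries, and the two groups of rates are compared, for example with a t-test per benchmark configuration.

```r
# Sketch only: compare the per-benchmark samples of the two binaries
# with a two-sided t-test; small p-values indicate a significant change.
library(plyr)

dat <- read.csv('compare.csv', strip.white = TRUE)

result <- ddply(dat, c('filename', 'configuration'), function(samples) {
  test <- t.test(rate ~ binary, data = samples)
  data.frame(p.value = test$p.value)
})

print(result)
```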
Thanks for the review. Landed in ee2843b edbed3f 0f9bfaa f3463cf3061931b5c94ba9c753c1d75ee4d2b712 1f64ceba89a074f9e23196d019d56f00cdd4577a 01fbf656a3874d189cadeced08266a26ea526491 de9b44c0889d2264436277848762f1ebf868aa57 6e745d7a7586b12b894537192726bf2b999a456d 693e7be399e4c0964b5bbceaee6e8326c7c02a42
Uh, you might want to back these commits out of master for now, the linter complains about the brace style.
As in force push?
@AndreasMadsen I'd do that for now. Could you fix that, and maybe do a CI or linter run before re-landing? ;)
This removes the need for parsing stdout from the benchmarks. If the process wasn't executed by fork, it will just print like it used to. This also fixes the parsing of CLI arguments, by inferring the type from the options object instead of the value content.

Only two benchmarks had to be changed:

* http/http_server_for_chunky_client.js: this previously used a spawn, now it uses a fork and relays the messages using common.sendResult.
* misc/v8-bench.js: this utilized that v8/benchmark/run.js called global.print and reformatted the input. It now interfaces directly with the benchmark runner global.BenchmarkSuite.

PR-URL: #7094
Reviewed-By: Trevor Norris <trev.norris@gmail.com>
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
Reviewed-By: Brian White <mscdex@mscdex.net>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
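The reporting pattern this commit describes could be sketched roughly as follows (a simplified illustration, not the actual `benchmark/common.js`; the result fields are assumed):

```js
// Simplified sketch: report a result over IPC when the benchmark was
// started with child_process.fork(), otherwise print it like before.
function sendResult(data) {
  if (process.send) {
    // Started via fork(): relay the result to the parent process,
    // so the parent no longer has to parse stdout.
    process.send(data);
  } else {
    // Started directly: keep the old human-readable output.
    console.log(`${data.name} ${data.conf}: ${data.rate}`);
  }
}
```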
Previously bench.end would call process.exit(0), however this is rather confusing and indeed a few benchmarks had code that assumed otherwise. This adds process.exit(0) to the benchmarks that need it.

PR-URL: #7094
Reviewed-By: Trevor Norris <trev.norris@gmail.com>
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
Reviewed-By: Brian White <mscdex@mscdex.net>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
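A minimal sketch of the pattern this commit describes (simplified, not copied from the tree): `bench.end()` no longer exits the process, so a benchmark that keeps the event loop alive has to exit explicitly.

```js
'use strict';

// Simplified sketch of a benchmark that must now exit explicitly, because
// bench.end() no longer calls process.exit(0) on its behalf.
const common = require('../common.js');
const http = require('http');

const bench = common.createBenchmark(main, { n: [1000] });

function main(conf) {
  const server = http.createServer((req, res) => res.end('ok'));
  server.listen(0, () => {
    bench.start();
    // ... issue conf.n requests against the server here ...
    bench.end(conf.n);
    process.exit(0); // the open server would otherwise keep the process alive
  });
}
```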
The data sampling is done in node and the data processing is done in R. Only plyr was added as an R dependency and it is fairly standard.

PR-URL: #7094
Reviewed-By: Trevor Norris <trev.norris@gmail.com>
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
Reviewed-By: Brian White <mscdex@mscdex.net>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Previously this was a tool in `plot.R`. It is now a more complete tool which executes the benchmarks many times and creates a boxplot.

PR-URL: #7094
Reviewed-By: Trevor Norris <trev.norris@gmail.com>
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
Reviewed-By: Brian White <mscdex@mscdex.net>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
PR-URL: #7094
Reviewed-By: Trevor Norris <trev.norris@gmail.com>
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
Reviewed-By: Brian White <mscdex@mscdex.net>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Strings were never mutable; it is not clear what this benchmark attempts to do. This did work at some point, but only because the benchmark wasn't using strict mode.

PR-URL: #7094
Reviewed-By: Trevor Norris <trev.norris@gmail.com>
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
Reviewed-By: Brian White <mscdex@mscdex.net>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Thanks for the quick eye. I have force pushed and updated the PR. I wish I knew how it happened.
Well, yeah, I've had the, ahem, pleasure of breaking master by not having run CI again before landing myself in the recent past. :) Anyway, CI looked good before it went all 502 (FreeBSD failure is unrelated and only the Windows tests were remaining), I'd say you can land this. Thanks!
Labelled this semver-major because that's what has been suggested above, and #7890 shows that people obviously were using APIs of the old benchmarking scripts.
Sounds good. This is obviously not backward compatible and it is quite easy to use the new tools on an old node version. Also I don't really want to backport this ;)
Enable `brace-style` in ESLint.

Ref: #7094 (comment)
PR-URL: #8348
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Myles Borins <myles.borins@gmail.com>
Checklist
Affected core subsystem(s)
benchmark
Description of change
I have been rather confused about the benchmark suite and I don't think it is as user friendly as the rest of node core. This PR attempts to remove most of the confusion I was facing when I started using it. Primarily it:

* rewrites `compare.js` such that comparing node versions and getting statistical significance is easy.
* changes the `plot.R` tool (now called `scatter`) to show a scatter plot with confidence bars.
* removes the implicit `process.exit(0)` from `bench.end()`; benchmarks that need it now call `process.exit(0)` after `bench.end()`.
* sends results using `process.send` to avoid most parsing (the benchmark CLI arguments haven't changed).

The specifics are documented in the commit messages. Please also see the new README, as quite a lot has changed (be sure to check my spelling!).
Note that some benchmarks take a very long time to complete, e.g. `timers/timers.js type=depth thousands=500` takes 11.25 min. Thus running it 30 times for statistical significance is unreasonable. I suspect the only reason why it is set to so many iterations is to get a small variance, but with the new `compare` tool the variance can be estimated instead of being reduced. Thus we can reduce the number of iterations and still get the information we need. But I suggest we do that in another pull request, as it is a very different discussion.

Motivation (long story): I wanted to benchmark the effect of some `async_wrap` changes. I went to the `benchmark/` directory and read the README. However I quickly discovered that it was primarily about running benchmarks a single time and how to write benchmarks. Most importantly, it didn't explain how to compare two node versions. This is now documented in the new README.

I then had to search for the tools myself and discovered the large number of benchmark files which were not put into categorized directories. I assumed they were somehow extra significant, but in reality they just appear to be unused. These files are now removed.
After discovering the compare tool and its CLI API, I was confused about what the `--red` and `--green` flags were and how the `node-binary1` and `node-binary2` arguments compared: should I write `./node-old ./node-new` or `./node-new ./node-old` if I wanted a positive improvement factor to signify an improvement? The new compare API is the one shown in the usage text quoted above.

After understanding `common.js`, it was still unclear whether the performance difference was statistically significant. I tried running the benchmark 5 times and got that 4/5 was an improvement; I was expecting it to have the same performance or be slower. (Spoiler: it wasn't significant.) The compare.js script now runs the benchmarks many times (30 by default) and there is an R script to analyse the csv results.

At this point I wanted to do a rewrite of the benchmark tools (not the benchmarks themselves) and changed a few other things in the process as well. I'm a mathematician so I care a lot about statistical significance :)