Refactor benchmark tools for statistical significance #7094

AndreasMadsen · 2016-06-01T15:18:38Z

Checklist

tests and code linting passes
a test and/or benchmark is included
documentation is changed or added
the commit message follows commit guidelines

Affected core subsystem(s)

benchmark

Description of change

I have been rather confused about the benchmark suite and I don't think it is as user friendly as the rest of node core. This PR attempts to remove most of the confusion I was facing when I started using it. Primarily it:

removes unused/undocumented files.
allows partially setting the benchmark variables using process arguments.
refactors compare.js so that comparing node versions and getting statistical significance is easy.
refactors the plot.R tool (now called scatter) to show a scatter plot with confidence bars.
refactors the cli tools so that the cli API is more homogeneous.
documents all the tools.
removes the implicit process.exit(0) after bench.end().
uses process.send to avoid most parsing (the benchmark cli arguments haven't changed).
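
To illustrate what "partially setting the benchmark variables" could look like, here is a minimal sketch. It is not the code from this PR: `applyOverrides` is a hypothetical helper, and the `breadth` value is made up for the example. It assumes each variable has a list of default values and that a `key=value` argument narrows one variable to a single value while the others keep their defaults. The value type is inferred from the defaults rather than from the argument text.

```javascript
'use strict';

// Hypothetical helper (not part of the benchmark suite): merge key=value
// overrides from the command line into a benchmark's default variables.
function applyOverrides(defaults, args) {
  const options = {};
  for (const key of Object.keys(defaults)) {
    options[key] = defaults[key].slice(); // copy, so defaults stay untouched
  }

  for (const arg of args) {
    const eq = arg.indexOf('=');
    if (eq === -1) throw new Error(`expected key=value, got "${arg}"`);

    const key = arg.slice(0, eq);
    const raw = arg.slice(eq + 1);
    if (!Object.prototype.hasOwnProperty.call(defaults, key)) {
      throw new Error(`unknown variable "${key}"`);
    }

    // Infer the type from the default values, not from the argument text.
    const value = typeof defaults[key][0] === 'number' ? Number(raw) : raw;
    options[key] = [value];
  }
  return options;
}

// Illustrative defaults; only "type" is overridden, "thousands" is kept.
const defaults = { type: ['depth', 'breadth'], thousands: [500] };
const options = applyOverrides(defaults, ['type=depth']);
console.log(options);
```

Variables that are not mentioned on the command line still run with every default value, so only the overridden variable is pinned.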

The specifics are documented in the commit messages. Please also see the new README, as quite a lot has changed (be sure to check my spelling!).

Note that some benchmarks take a very long time to complete, e.g. timers/timers.js type=depth thousands=500 takes 11.25 min. Thus running it 30 times for statistical significance is unreasonable. I suspect the only reason why it is set to so many iterations is to get a small variance, but with the new compare tool the variance can be estimated instead of being reduced. Thus we can reduce the number of iterations and still get the information we need. But I suggest we do that in another pull request, as it is a very different discussion.

Motivation (long story): I wanted to benchmark the effect of some async_wrap changes. I went to the benchmark/ directory and read the README. However I quickly discovered that it was primarily about running benchmarks a single time and how to write benchmarks. And most importantly it didn't explain how to compare two node versions. This is now documented in the new README.

I then had to search for the tools myself and discovered the large number of benchmark files which were not put into categorized directories. I assumed they were somehow extra significant, but in reality they just appear to be unused. These files are now removed.

After discovering the compare tool, which has the cli API

node benchmark/compare.js
            <node-binary1> <node-binary2> +
            [--html] [--red|-r] [--green|-g] +
            [-- <type> [testFilter]]

I was confused about what --red and --green did and how node-binary1 and node-binary2 were compared: should I write ./node-old ./node-new or ./node-new ./node-old if I wanted a positive improvement factor to signify an improvement? The new compare API is:

usage: ./node benchmark/compare.js <type> ...
  --new    ./new-node-binary  new node binary (required)
  --old    ./old-node-binary  old node binary (required)
  --runs   30                 number of samples
  --filter pattern            string to filter benchmark scripts
  --var    variable=value     set benchmark variable (can be repeated)

After understanding common.js, it was still unclear whether the performance difference was statistically significant. I tried running the benchmark 5 times and 4 out of 5 runs showed an improvement, although I was expecting it to have the same performance or be slower (spoiler: it wasn't significant). The compare.js script now runs the benchmarks many times (30 by default) and there is an R script to analyse the csv results.

At this point I wanted to do a rewrite of the benchmark tools (not the benchmarks themselves) and changed a few other things in the process as well. I'm a mathematician, so I care a lot about statistical significance :)

AndreasMadsen · 2016-06-04T07:02:38Z

I'm not sure who to cc for this one.
/cc @Trott as you appear to have made some recent benchmark changes.

Trott · 2016-06-05T04:37:28Z

@nodejs/benchmarking

Trott · 2016-06-05T04:39:41Z

In theory, this sounds fantastic to me! In practice, there's so much about benchmarking that I'm ignorant about, I have to defer to others.

jasnell · 2016-06-06T14:33:41Z

Very nice. @mscdex @bnoordhuis ... any thoughts on this?

mscdex · 2016-06-06T14:41:59Z

benchmark/misc/string-creation.js

-  var s;
-  for (var i = 0; i < n; i++) {
-    s = '01234567890';
-    s[1] = 'a';


Perhaps this line was added to prevent v8 from optimizing the for-loop away or something (since s wouldn't have been referenced)?

Perhaps. With use strict it is definitely broken. Looking at the original commit (12a169e, 6 years ago) it seems like it was just a misunderstanding of how strings work. The commit appears to compare strings and buffers, which are not comparable in this case as strings are immutable.
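
For reference, a small sketch of the behaviour discussed here (written for this thread, not taken from the benchmark): assigning to an index of a string primitive is silently ignored in sloppy mode and throws a TypeError in strict mode. The sloppy variant is built with the Function constructor only so that it stays non-strict regardless of how the file is loaded.

```javascript
// Strict-mode variant: writing to an index of a string primitive throws.
function mutateStrict() {
  'use strict';
  const s = '01234567890';
  s[1] = 'a';
  return s;
}

// Sloppy-mode variant: the write is silently ignored and the string is
// returned unchanged.
const mutateSloppy = new Function(`
  var s = '01234567890';
  s[1] = 'a';
  return s;
`);

let strictError = null;
try {
  mutateStrict();
} catch (err) {
  strictError = err;
}

console.log(strictError instanceof TypeError); // true
console.log(mutateSloppy());                   // 01234567890
```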

mscdex · 2016-06-06T15:09:38Z

Just briefly looking over it, it mostly seems to look ok except for a few nits.

I did spot a typo in the "benchmark: add script for creating scatter plot" commit message body.

mscdex · 2016-06-06T15:10:25Z

benchmark/compare.js

+//
+// Parse arguments
+//
+const cli = CLI(`usage: ./node benchmark/compare.js <type> ...


There should probably be some explanation (in the help text) about what <type> should be exactly...

are you referring to <type>?

Yes, markdown cut that part out.

iirc there was another one of these in another commit. just to watch out for it.

They should all be fixed. Unless you are talking about the R scripts, but they only take -- arguments.

btw, the first few times I tried running the new tool I found it very confusing that the type argument needed to appear before the arguments starting with -- (i.e. compare.js --new bla --old blah http did not work). I almost never use CLIs with that argument order, and just showing this usage text wasn’t exactly helpful, either.

You don’t need to change the behaviour, but maybe add a note here about that and for the other scripts where it applies?

Can you elaborate on that note, I think it is very specific, this is the message you get now.

usage: ./node benchmark/compare.js <type> ...
  Run each benchmark in the <type> directory many times using two diffrent
  node versions. More than one <type> directory can be specified.
  The output is formatted as csv, which can be processed using for
  example 'compare.R'.

  --new    ./new-node-binary  new node binary (required)
  --old    ./old-node-binary  old node binary (required)
  --runs   30                 number of samples
  --filter pattern            string to filter benchmark scripts
  --set    variable=value     set benchmark variable (can be repeated)

I chose this order because it could be implemented using less code.

I will try and change the argument order, this appears to cause a lot of confusion for many people, but I would love to understand why.

> I will try and change the argument order, this appears to cause a lot of confusion for many people, but I would love to understand why.

If I had to guess, I’d say it’s because that’s the order usually suggested in man pages and --help texts, and maybe because the positional arguments are the ones one is most likely to spend more time editing before hitting enter… idk, maybe there’s more to it.

Oh I understand the order is confusing (it is fixed now). But this is the third comment I got about a missing note, but unless I'm misunderstanding the comment, there is a note just one line below.

AndreasMadsen · 2016-06-06T18:13:18Z

@mscdex thanks. Updated as suggested.

AndreasMadsen · 2016-06-11T14:20:32Z

ping

mscdex · 2016-06-11T14:25:51Z

benchmark/README.md

 ```

-## How to write a benchmark test
+After generating the csv, a comparens table can be created using the `scatter.R`


s/comparens/comparison ?

mscdex · 2016-06-11T14:48:25Z

/cc @nodejs/collaborators

ChALkeR · 2016-06-11T15:18:41Z

@mscdex What's the semver status of this? Major?

mscdex · 2016-06-11T15:24:41Z

@ChALkeR I don't know how benchmarks are covered when it comes to that kind of thing. I would guess they are treated like tests or docs since they are not a part of the runtime?

mcollina · 2016-06-14T17:31:54Z

I'll go for major, it makes things easier and less complicated.

One thing that is not clear from the document is how the statistical significance is achieved.

AndreasMadsen · 2016-06-14T19:22:23Z

@mscdex Thanks for the suggestions, I will update the documentation tomorrow.

@mcollina It runs each benchmark a given number of times (--runs) using the new and old node binaries that are provided to compare.js. Using the R script it then ...

... makes an independent/unpaired 2-group t-test, with the null hypothesis that the performance is the same for both versions. The significant field will show a star if the p-value is less than 0.05.

I think the compare documentation is fairly clear on this. But do tell me how I can improve it.
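
As a rough illustration of the test described above, here is a sketch of the statistic for an unpaired two-sample t-test. It assumes the Welch (unequal variance) variant, which is what R's `t.test` uses by default; whether the R script in this PR relies on that default is an assumption. `welch` and the sample rates are hypothetical, and the p-value is left to R because it needs the t-distribution CDF.

```javascript
'use strict';

function mean(xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs) {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1);
}

// Hypothetical helper: Welch's t statistic and the Welch-Satterthwaite
// degrees of freedom for two independent samples of benchmark rates.
function welch(oldRates, newRates) {
  const vOld = sampleVariance(oldRates) / oldRates.length;
  const vNew = sampleVariance(newRates) / newRates.length;

  const t = (mean(newRates) - mean(oldRates)) / Math.sqrt(vOld + vNew);
  const df = (vOld + vNew) ** 2 /
    (vOld ** 2 / (oldRates.length - 1) + vNew ** 2 / (newRates.length - 1));

  return { t, df };
}

// Made-up rates, chosen only to keep the arithmetic easy to follow.
const result = welch([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]);
console.log(result); // { t: 1, df: 8 }
```

A positive `t` here means the new binary had the higher mean rate; the null hypothesis of equal performance is rejected only when the corresponding p-value falls below the chosen threshold (0.05 in the description above).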

AndreasMadsen · 2016-07-26T11:00:14Z

Thanks for the review. Landed in ee2843b edbed3f 0f9bfaa f3463cf3061931b5c94ba9c753c1d75ee4d2b712 1f64ceba89a074f9e23196d019d56f00cdd4577a 01fbf656a3874d189cadeced08266a26ea526491 de9b44c0889d2264436277848762f1ebf868aa57 6e745d7a7586b12b894537192726bf2b999a456d 693e7be399e4c0964b5bbceaee6e8326c7c02a42

addaleax · 2016-07-26T11:08:32Z

Uh, you might want to back these commits out of master for now, the linter complains about benchmark/_cli.js

AndreasMadsen · 2016-07-26T11:12:29Z

As in force push?

addaleax · 2016-07-26T11:12:44Z

@AndreasMadsen I’d do that for now. Could you fix that, and maybe do a CI or linter run before re-landing? ;)

This removes the need for parsing stdout from the benchmarks. If the process wasn't executed by fork, it will just print like it used to.

This also fixes the parsing of CLI arguments, by inferring the type from the options object instead of the value content.

Only two benchmarks had to be changed:

* http/http_server_for_chunky_client.js this previously used a spawn now it uses a fork and relays the messages using common.sendResult.
* misc/v8-bench.js this utilized that v8/benchmark/run.js called global.print and reformatted the input. It now interfaces directly with the benchmark runner global.BenchmarkSuite.

PR-URL: #7094 Reviewed-By: Trevor Norris <trev.norris@gmail.com> Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com> Reviewed-By: Brian White <mscdex@mscdex.net> Reviewed-By: Anna Henningsen <anna@addaleax.net>
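
The fork-or-print behaviour described in this commit message can be sketched roughly as follows. This is not the code in common.js: `report`, `formatResult`, the result shape and the printed format are all assumptions made for the example. The only Node.js fact it relies on is that `process.send` is defined only when the process was started with an IPC channel, for example via `child_process.fork`.

```javascript
'use strict';

// Hypothetical formatter for the human-readable fallback output.
function formatResult(result) {
  const conf = Object.keys(result.conf)
    .map((key) => `${key}=${result.conf[key]}`)
    .join(' ');
  return `${result.name} ${conf}: ${result.rate}`;
}

// Hypothetical reporter: send structured data to the parent when an IPC
// channel exists, otherwise print a line to stdout.
function report(result, send = process.send) {
  if (typeof send === 'function') {
    send.call(process, result); // the parent gets an object, no parsing needed
    return 'ipc';
  }
  console.log(formatResult(result));
  return 'stdout';
}

// Illustrative values only; the rate is made up.
report({
  name: 'timers/timers.js',
  conf: { type: 'depth', thousands: 500 },
  rate: 1234.5
});
```

On the parent side, a runner that forks the benchmark would listen for the `message` event on the child instead of parsing its stdout.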

Previously bench.end would call process.exit(0). However, this is rather confusing and indeed a few benchmarks had code that assumed otherwise. This adds process.exit(0) to the benchmarks that need it. PR-URL: #7094 Reviewed-By: Trevor Norris <trev.norris@gmail.com> Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com> Reviewed-By: Brian White <mscdex@mscdex.net> Reviewed-By: Anna Henningsen <anna@addaleax.net>

The data sampling is done in node and the data processing is done in R. Only plyr was added as an R dependency and it is fairly standard. PR-URL: #7094 Reviewed-By: Trevor Norris <trev.norris@gmail.com> Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com> Reviewed-By: Brian White <mscdex@mscdex.net> Reviewed-By: Anna Henningsen <anna@addaleax.net>

Previously this was a tool in `plot.R`. It is now a more complete tool which executes the benchmarks many times and creates a boxplot. PR-URL: #7094 Reviewed-By: Trevor Norris <trev.norris@gmail.com> Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com> Reviewed-By: Brian White <mscdex@mscdex.net> Reviewed-By: Anna Henningsen <anna@addaleax.net>

PR-URL: #7094 Reviewed-By: Trevor Norris <trev.norris@gmail.com> Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com> Reviewed-By: Brian White <mscdex@mscdex.net> Reviewed-By: Anna Henningsen <anna@addaleax.net>

Strings were never mutable, so it is not clear what this benchmark attempts to do. This did work at some point, but only because the benchmark wasn't using strict mode. PR-URL: #7094 Reviewed-By: Trevor Norris <trev.norris@gmail.com> Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com> Reviewed-By: Brian White <mscdex@mscdex.net> Reviewed-By: Anna Henningsen <anna@addaleax.net>

AndreasMadsen · 2016-07-26T11:25:23Z

Thanks for the quick eye. I have force pushed and updated the PR. I wish I knew how it happened.

CI: https://ci.nodejs.org/job/node-test-pull-request/3422/

addaleax · 2016-07-26T11:57:12Z

Well, yeah, I’ve had the, ahem, pleasure of breaking master by not having run CI again before landing myself in the recent past. :)

Anyway, CI looked good before it went all 502 (FreeBSD failure is unrelated and only the Windows tests were remaining), I’d say you can land this. Thanks!

AndreasMadsen · 2016-07-26T12:27:28Z

Landed in: ee2843b edbed3f 0f9bfaa f99471b 8bb59fd 855009a 0c0f34e 6edef1d d525e6c

addaleax · 2016-07-27T08:43:00Z

Labelled this semver-major because that’s what has been suggested above, and #7890 shows that people obviously were using APIs of the old benchmarking scripts.

AndreasMadsen · 2016-07-27T09:27:20Z

Sounds good. This is obviously not backward compatible and it is quite easy to use the new tools on an old node version.

Also I don't really want to backport this ;)

AndreasMadsen added the benchmark Issues and PRs related to the benchmark subsystem. label Jun 1, 2016

mscdex reviewed Jun 6, 2016

mscdex reviewed Jun 11, 2016

AndreasMadsen closed this Jul 26, 2016

addaleax reopened this Jul 26, 2016

AndreasMadsen added 6 commits July 26, 2016 13:21

benchmark: update docs after refactor

6edef1d

PR-URL: #7094 Reviewed-By: Trevor Norris <trev.norris@gmail.com> Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com> Reviewed-By: Brian White <mscdex@mscdex.net> Reviewed-By: Anna Henningsen <anna@addaleax.net>

AndreasMadsen merged commit d525e6c into nodejs:master Jul 26, 2016

addaleax added the semver-major PRs that contain breaking changes and should be released in the next major version. label Jul 27, 2016

Trott added a commit to Trott/io.js that referenced this pull request Aug 31, 2016

tools: enforce JS brace style with linting

d1e2db2

Enable `brace-style` in ESLint. Ref: nodejs#7094 (comment)

MylesBorins pushed a commit that referenced this pull request Sep 4, 2016

tools: enforce JS brace style with linting

9bddd79

Enable `brace-style` in ESLint. Ref: #7094 (comment) PR-URL: #8348 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Myles Borins <myles.borins@gmail.com>

MylesBorins pushed a commit that referenced this pull request Sep 28, 2016

tools: enforce JS brace style with linting

fa3ec3d

Enable `brace-style` in ESLint. Ref: #7094 (comment) PR-URL: #8348 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Myles Borins <myles.borins@gmail.com>

rvagg pushed a commit that referenced this pull request Oct 18, 2016

tools: enforce JS brace style with linting

ab56c05

Enable `brace-style` in ESLint. Ref: #7094 (comment) PR-URL: #8348 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Myles Borins <myles.borins@gmail.com>

MylesBorins pushed a commit that referenced this pull request Oct 26, 2016

tools: enforce JS brace style with linting

d1a50b3

Enable `brace-style` in ESLint. Ref: #7094 (comment) PR-URL: #8348 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Myles Borins <myles.borins@gmail.com>

This was referenced May 16, 2017

benchmark: fix CLI arguments check in common.js #12429

Closed

benchmark: improve cli error message #12421

Closed

Revert "benchmark: fix CLI arguments check in common.js" #12474

Closed

gibfahn mentioned this pull request Jun 15, 2017

Auditing for 6.11.1 nodejs/Release#230

Closed

3 tasks

refack mentioned this pull request Aug 9, 2017

Install of Rscript on new benchmarking machines nodejs/build#821

Closed

joyeecheung mentioned this pull request Oct 20, 2017

Audit benchmarks on v6.x nodejs/benchmarking#162

Closed
