diff --git a/README.md b/README.md index ebbb357..8b6e8da 100644 --- a/README.md +++ b/README.md @@ -1,60 +1,95 @@ -# tachometer [![Build Status](https://github.com/Polymer/tachometer/actions/workflows/tests.yml/badge.svg)](https://github.com/Polymer/tachometer/actions/workflows/tests.yml) [![NPM package](https://img.shields.io/npm/v/tachometer.svg)](https://npmjs.org/package/tachometer) +# tachometer [![Build Status](https://github.com/Polymer/tachometer/actions/workflows/tests.yaml/badge.svg?branch=main)](https://github.com/Polymer/tachometer/actions/workflows/tests.yaml?query=branch%3Amain) [![NPM package](https://img.shields.io/npm/v/tachometer.svg)](https://npmjs.org/package/tachometer) > tachometer is a tool for running benchmarks in web browsers. It uses repeated -> sampling and statistics to reliably identify even the smallest differences in -> timing. +> sampling and statistics to reliably identify even tiny differences in runtime. + +###### [Install](#install) | [Usage](#usage) | [Why?](#why) | [Example](#example) | [Features](#features) | [Sampling](#sampling) | [Measurement modes](#measurement-modes) | [Interpreting results](#interpreting-results) | [Swap NPM dependencies](#swap-npm-dependencies) | [JavaScript module imports](#javascript-module-imports) | [Browsers](#browsers) | [Performance traces](#performance-traces) | [Remote control](#remote-control) | [Config file](#config-file) | [CLI usage](#cli-usage) + +## Install + +```sh +npm i tachometer +``` + +## Usage + +```sh +npx tachometer bench1.html [bench2.html ...] +``` ## Why? -Benchmarking is _hard_. Even if you run the exact same JavaScript, on the same -browser, on the same machine, on the same day, you will likely get a -significantly different result every time you measure. For this reason, at first -pass, it is often very difficult to say anything meaningful about the -performance of a script. 
+Even if you run the same JavaScript, on the same browser, on the same machine, +on the same day, you'll still get a different result every time. But if you take +enough _repeated samples_ and apply the right statistics, you can reliably +identify even tiny differences in runtime. -But there is signal in the noise. Scripts do have true underlying performance -characteristics on average. By taking enough _repeated samples_ and applying the -right statistics, we can reliably identify small differences and quantify our -confidence in them. +## Example -## Quick Start +Let's test two approaches for adding elements to a page. First create two HTML +files: -1. Install tachometer from NPM. +`inner.html` - ```sh - $ npm i tachometer - ``` +```html +<script type="module"> + import * as bench from '/bench.js'; + bench.start(); + for (let i = 0; i < 100; i++) { + document.body.innerHTML += '<button></button>'; + } + bench.stop(); +</script> +``` -2. Create a simple `forloop.html` micro benchmark that times a `for` loop. - tachometer benchmarks are HTML files that import and call `bench.start()` and - `bench.stop()`. Note that when you are measuring [first contentful - paint](#first-contentful-paint-fcp), you don't need to call these functions. - - ```html - <html> - <body> - <script type="module"> - import * as bench from '/bench.js'; - bench.start(); - for (let i = 0; i < 1000; i++) {} - bench.stop(); - </script> - </body> - </html> - ``` +`append.html` -3. Launch tachometer, which will launch Chrome and execute the benchmark 50 - times. 
+```html +<script type="module"> + import * as bench from '/bench.js'; + bench.start(); + for (let i = 0; i < 100; i++) { + document.body.append(document.createElement('button')); + } + bench.stop(); +</script> +``` - ```sh - $ tach forloop.html - ``` +Now run tachometer: - Along with some other information, tachometer will show you a range of - plausible values for how long this benchmark takes to run (more precisely, a - _95% confidence interval_, which is explained [below]()). +```sh +npx tachometer append.html inner.html +``` + +Tachometer opens Chrome and loads each HTML file, measuring the time between +`bench.start()` and `bench.stop()`. It round-robins between the two files, +running each at least 50 times. - <img src="./images/screen1.png"> +``` +[==============================================------------] 79/100 chrome append.html +``` + +After a few seconds, the results are ready: + +``` +┌─────────────┬─────────────────┬─────────────────┬─────────────────┐ +│ Benchmark │ Avg time │ vs inner.html │ vs append.html │ +├─────────────┼─────────────────┼─────────────────┼─────────────────┤ +│ inner.html │ 7.23ms - 8.54ms │ │ slower │ +│ │ │ - │ 851% - 1091% │ +│ │ │ │ 6.49ms - 7.80ms │ +├─────────────┼─────────────────┼─────────────────┼─────────────────┤ +│ append.html │ 0.68ms - 0.79ms │ faster │ │ +│ │ │ 90% - 92% │ - │ +│ │ │ 6.49ms - 7.80ms │ │ +└─────────────┴─────────────────┴─────────────────┴─────────────────┘ +``` + +This tells us that using the `document.body.append` approach instead of the +`innerHTML` approach would be between 90% and 92% faster on average. The ranges +tachometer reports are 95% confidence intervals for the percent change from one +benchmark to another. See _[Interpreting results](#interpreting-results)_ for more +information. ## Features @@ -77,6 +112,91 @@ confidence in them. - [_Remote control_](#remote-control) browsers running on different machines using remote WebDriver. 
+## Sampling + +### Minimum sample size + +By default, a **minimum of 50 samples** are taken from **each** benchmark. You +can change the minimum sample size with the `--sample-size` flag or the +`sampleSize` JSON config option. + +### Auto sampling + +After the initial 50 samples, tachometer will continue taking samples until +there is a clear statistically significant difference between all benchmarks, +for **up to 3 minutes**. + +You can change this duration with the `--timeout` flag or the `timeout` JSON +config option, measured in minutes. Set `--timeout=0` to disable auto sampling +entirely. Set `--timeout=60` to sample for up to an hour. + +### Horizons + +You can also configure which statistical conditions tachometer should check for +when deciding when to stop auto sampling by configuring _horizons_. + +To set horizons from the command-line, use the `--horizon` flag with a +comma-delimited list: + +```sh +--horizon=0%,10% +``` + +To set horizons from a JSON config file, use the `horizons` property with an +array of strings (including if there is only one condition): + +```json +{ + "horizons": ["0%", "10%"] +} +``` + +A horizon can be thought of as a point of interest on the number-line of either +absolute milliseconds, or relative percent change. By setting a horizon, you are +asking tachometer to try to shrink the confidence interval until it is +unambiguously placed on one side or the other of that horizon. + +| Example horizon | Question | +| ------------------- | ---------------------------------------------------------- | +| `0%` | Is A faster or slower than B _at all_? (The **default**) | +| `10%` | Is A faster or slower than B by at least 10%? | +| `+10%` | Is A slower than B by at least 10%? | +| `-10%` | Is A faster than B by at least 10%? | +| `-10%`, `+10%` | (Same as `10%`) | +| `0%`, `10%`, `100%` | Is A at all, a little, or a lot slower or faster than B? | +| `0.5ms` | Is A faster or slower than B by at least 0.5 milliseconds? 
| + +In the following example, we have set `--horizon=10%`, meaning we are interested +in knowing whether A differs from B by at least 10% in either direction. The +sample size automatically increases until the confidence interval is narrow +enough to place the estimated difference squarely on one side or the other of +both horizons. + +``` + <-------------------------------> n=50 ❌ -10% ❌ +10% + <------------------> n=100 ✔️ -10% ❌ +10% + <-----> n=200 ✔️ -10% ✔️ +10% + + |---------|---------|---------|---------| difference in runtime +-20% -10% 0 +10% +20% + +n = sample size +<---> = confidence interval for percent difference of mean runtimes +✔️ = resolved horizon +❌ = unresolved horizon +``` + +In this example, by `n=50` we are not sure whether A is faster or slower than B +by more than 10%. By `n=100` we have ruled out that B is _faster_ than A by more +than 10%, but we're still not sure if it's _slower_ by more than 10%. By `n=200` +we have also ruled out that B is slower than A by more than 10%, so we stop +sampling. Note that we still don't know which is _absolutely_ faster, we just +know that whatever the difference is, it is neither faster nor slower than 10% +(and if we did want to know, we could add `0` to our horizons). + +Note that, if the _actual_ difference is very close to a horizon, then it is +likely that the horizon will never be met, and the timeout will expire. + ## Measurement modes Tachometer supports four modes of time interval measurements, controlled with @@ -173,20 +293,21 @@ renders any DOM content. Currently, only Chrome supports the performance timeline entry. In this mode, calling the `start()` and `stop()` functions is not required, and has no effect. 
-## Average runtime +## Interpreting results -When you execute just one benchmark, you'll get a single result: the **_average -runtime_** of the benchmark, presented as a _95% confidence interval_ (see -[below](#confidence-intervals) for interpretation) for the number of -milliseconds that elapsed between `bench.start()` and `bench.stop()`. +### Average runtime + +The first column of output is the **_average runtime_** of the benchmark. This +is a _95% confidence interval_ for the number of milliseconds that elapsed +during the benchmark. When you run only one benchmark, this is the only output. <img src="./images/screen1.png"></img> -## Difference table +### Difference table -When you run multiple benchmarks together in the same session, you'll get an NxN -table summarizing all of the _differences_ in runtimes, both in _absolute_ and -_relative_ terms (percent-change). +When you run multiple benchmarks together, you'll get an NxN table summarizing +all of the _differences_ in runtimes, both in _absolute_ and _relative_ terms +(percent-change). In this example screenshot we're comparing `for` loops, each running with a different number of iterations (1, 1000, 1001, and 3000): @@ -206,6 +327,33 @@ This table tells us: which is faster, because the difference was too small. 1000 iterations could be as much as 13% faster, or as much as 21% slower, than 1001 iterations. +## Confidence intervals + +Loosely speaking, a confidence interval is a range of plausible values for a +parameter like runtime, and the _confidence level_ (which tachometer always +fixes at _95%_) corresponds to the degree of confidence we have that the interval +contains the _true value_ of that parameter. See +[Wikipedia](https://en.wikipedia.org/wiki/Confidence_interval#Meaning_and_interpretation) +for more information about confidence intervals. 
+ +``` + <-------------> Wider confidence interval + High variance and/or low sample size + + <---> Narrower confidence interval + Low variance and/or high sample size + + |---------|---------|---------|---------| +-1% -0.5% 0% +0.5% +1% +``` + +The way tachometer shrinks confidence intervals is by **increasing the sample +size**. The [central limit +theorem](https://en.wikipedia.org/wiki/Central_limit_theorem) means that, even +when we have high variance data, and even when that data is not normally +distributed, as we take more and more samples, we'll be able to calculate a more +and more precise estimate of the true mean of the data. + ## Swap NPM dependencies Tachometer has specialized support for swapping in custom versions of any NPM @@ -295,122 +443,6 @@ When you specify a dependency to swap, the following happens: > to install it is the same. To _always_ do a fresh `npm install`, set the > `--force-clean-npm-install` flag. -## Confidence intervals - -The most important concept needed to interpret results from tachometer is the -**_confidence interval_**. Loosely speaking, a confidence interval is a range of -_plausible values_ for a parameter (e.g. runtime), and the _confidence level_ -(which we fix at _95%_) corresponds to the degree of confidence we have that -interval contains the _true value_ of that parameter. - -> More precisely, the 95% confidence level describes the _long-run proportion of -> confidence intervals that will contain the true value_. Hypothetically, if you -> run tachometer over and over again in the same configuration, then while you'll -> get a slightly different confidence interval every time, it should be the case -> that _95% of those confidence intervals will contain the true value_. See -> [Wikipedia](https://en.wikipedia.org/wiki/Confidence_interval#Meaning_and_interpretation) -> for more information on interpreting confidence intervals. - -The _width_ of a confidence interval determines the range of values it includes. 
-Narrower confidence intervals give you a more precise estimate of what the true -value might be. In general, we want narrower confidence intervals. - -``` - <-------------> Wider confidence interval - High variance and/or low sample size - - <---> Narrower confidence interval - Low variance and/or high sample size - - |---------|---------|---------|---------| --1% -0.5% 0% +0.5% +1% -``` - -Three knobs can shrink our confidence intervals: - -1. Dropping the chosen confidence level. _This is not a good idea!_ We want our - results to be _consistently reported with high confidence_, so we always use - 95% confidence intervals. - -2. Decreasing the variation in the benchmark timing measurements. _This is hard - to do_. A great many factors lead to variation in timing measurements, most - of which are very difficult to control, including some that are - [intentionally built - in](https://developers.google.com/web/updates/2018/02/meltdown-spectre#high-resolution_timers)! - -3. Increasing the sample size. The [central limit - theorem](https://en.wikipedia.org/wiki/Central_limit_theorem) means that, - even when we have high variance data, and even when that data is not normally - distributed, as we take more and more samples, we'll be able to calculate a - more and more precise estimate of the true mean of the data. _Increasing the - sample size is the main knob we have._ - -## Sample size - -By default, a minimum of 50 samples are taken from each benchmark. The -preliminary results from these samples may or may not be precise enough to allow -you to to draw a statistically significant conclusion. - -> For example, if you are interested in knowing which of A and B are faster, but -> you find that the confidence interval for the percent change between the mean -> runtimes of A and B _includes zero_ (e.g. `[-3.08%, +2.97%]`), then it is -> clearly not possible to draw a conclusion about whether A is faster than B or -> vice-versa. 
- -## Auto sampling - -After the initial 50 samples, tachometer will continue drawing samples until -either certain stopping conditions that you specify are met, or until a timeout -expires (3 minutes by default). - -The stopping conditions for auto-sampling are specified in terms of -**_horizons_**. A horizon can be thought of as a _point of interest_ on the -number-line of either absolute or relative differences in runtime. By setting a -horizon, you are asking tachometer to try to _shrink the confidence interval -until it is unambiguously placed on one side or the other of that horizon_. - -| Example horizon | Question | -| --------------- | ---------------------------------------------------------- | -| `0%` | Is X faster or slower than Y _at all_? | -| `10%` | Is X faster or slower than Y by at least 10%? | -| `+10%` | Is X slower than Y by at least 10%? | -| `-10%` | Is X faster than Y by at least 10%? | -| `-10%,+10%` | (Same as `10%`) | -| `0%,10%,100%` | Is X at all, a little, or a lot slower or faster than Y? | -| `0.5ms` | Is X faster or slower than Y by at least 0.5 milliseconds? | - -In the following visual example, we have set `--horizon=10%` meaning that we are -interested in knowing whether A differs from B by at least 10% in either -direction. The sample size automatically increases until the confidence interval -is narrow enough to place the estimated difference squarely on one side or the -other of both horizons. - -``` - <-------------------------------> n=50 ❌ -10% ❌ +10% - <------------------> n=100 ✔️ -10% ❌ +10% - <-----> n=200 ✔️ -10% ✔️ +10% - - |---------|---------|---------|---------| difference in runtime --20% -10% 0 +10% +20% - -n = sample size -<---> = confidence interval for percent difference of mean runtimes -✔️ = resolved horizon -❌ = unresolved horizon -``` - -In the example, by `n=50` we are not sure whether A is faster or slower than B -by more than 10%. 
By `n=100` we have ruled out that B is _faster_ than A by more -than 10%, but we're still not sure if it's _slower_ by more than 10%. By `n=200` -we have also ruled out that B is slower than A by more than 10%, so we stop -sampling. Note that we still don't know which is _absolutely_ faster, we just -know that whatever the difference is, it is neither faster nor slower than 10% -(and if we did want to know, we could add `0` to our horizons). - -Note that, if the actual difference is very close to a horizon, then it is -likely that the precision stopping condition will never be met, and the timeout -will expire. - ## JavaScript module imports JavaScript module imports with _bare module specifiers_ (e.g. `import {foo} from 'mylib';`) will be automatically transformed to browser-compatible _path_ @@ -526,7 +558,7 @@ For example, using the standard location of the default user profile on macOS: } ``` -### Performance traces +## Performance traces Once you determine that something is slower or faster in comparison to something else, investigating why is natural next step. To assist in determining why, @@ -673,7 +705,7 @@ Defaults are the same as the corresponding command-line flags. 
"root": "./benchmarks", "sampleSize": 50, "timeout": 3, - "autoSampleConditions": ["0%", "1%"], + "horizons": ["0%", "1%"], "benchmarks": [ { "name": "foo", @@ -738,7 +770,7 @@ Which is equivalent to: } ``` -## Usage +## CLI usage Run a benchmark from a local file: @@ -775,7 +807,7 @@ tach http://example.com | `--browser` / `-b` | `chrome` | Which browsers to launch in automatic mode, comma-delimited (chrome, firefox, safari, edge, ie) ([details](#browsers)) | | `--window-size` | `1024,768` | "width,height" in pixels of the browser windows that will be created | | `--sample-size` / `-n` | `50` | Minimum number of times to run each benchmark ([details](#sample-size)] | -| `--horizon` | `10%` | The degrees of difference to try and resolve when auto-sampling ("N%" or "Nms", comma-delimited) ([details](#auto-sampling)) | +| `--horizon` | `0%` | The degrees of difference to try and resolve when auto-sampling ("N%" or "Nms", comma-delimited) ([details](#auto-sampling)) | | `--timeout` | `3` | The maximum number of minutes to spend auto-sampling ([details](#auto-sampling)) | | `--measure` | `callback` | Which time interval to measure (`callback`, `global`, `fcp`) ([details](#measurement-modes)) | | `--measurement-expression` | `window.tachometerResult` | JS expression to poll for on page to retrieve measurement result when `measure` setting is set to `global` | diff --git a/package.json b/package.json index 19e63d1..f3fe11b 100644 --- a/package.json +++ b/package.json @@ -17,7 +17,9 @@ "lint": "eslint .", "format": "prettier --write .", "format:check": "prettier --check .", - "test": "npm run build && mocha 'lib/test/**/*.js'" + "test": "npm run test:unit && npm run test:e2e", + "test:unit": "mocha \"lib/test/**/*_test.js\" --grep \".*e2e.*\" --invert", + "test:e2e": "mocha \"lib/test/**/*_test.js\" --grep \".*e2e.*\"" }, "repository": { "type": "git",
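Review note (not part of the patch): the new "Horizons" section of the README says auto sampling stops once the confidence interval sits unambiguously on one side of every horizon. A minimal TypeScript sketch of that stopping check may help readers of this change. The names `ConfidenceInterval`, `horizonResolved`, and `allResolved` are hypothetical and are not taken from tachometer's source, and the interval bounds are made-up values chosen to mirror the `n=50` / `n=100` / `n=200` diagram.

```typescript
// Hypothetical types and helpers for illustration only; these names are
// assumptions and do not come from tachometer's implementation.
interface ConfidenceInterval {
  low: number; // lower bound of the percent difference, e.g. -0.06 for -6%
  high: number; // upper bound of the percent difference, e.g. 0.13 for +13%
}

// A horizon counts as resolved when it lies strictly outside the interval,
// i.e. the whole interval is on one side of it.
function horizonResolved(ci: ConfidenceInterval, horizon: number): boolean {
  return horizon < ci.low || horizon > ci.high;
}

// Sampling could stop once every configured horizon is resolved.
function allResolved(ci: ConfidenceInterval, horizons: number[]): boolean {
  return horizons.every((h) => horizonResolved(ci, h));
}

// --horizon=10% is described as equivalent to the pair -10%, +10%.
const horizons = [-0.1, 0.1];

// Illustrative intervals that narrow as the sample size grows.
const n50 = allResolved({low: -0.14, high: 0.18}, horizons);
const n100 = allResolved({low: -0.06, high: 0.13}, horizons);
const n200 = allResolved({low: -0.02, high: 0.04}, horizons);

console.log(n50, n100, n200); // false false true
```

With these assumed intervals, only the narrowest one excludes both horizons, which matches the README's description of sampling stopping at `n=200`.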