Performance Tests: Only run 1 round of tests during PR commits #45761
Conversation
Size Change: 0 B | Total Size: 1.3 MB
Force-pushed from 4fa0b8c to 000ff26
.github/workflows/performance.yml (Outdated)
@@ -61,7 +61,7 @@ jobs:
      WP_VERSION=$(awk -F ': ' '/^Tested up to/{print $2}' readme.txt)
      IFS=. read -ra WP_VERSION_ARRAY <<< "$WP_VERSION"
      WP_MAJOR="${WP_VERSION_ARRAY[0]}.${WP_VERSION_ARRAY[1]}"
-     ./bin/plugin/cli.js perf $GITHUB_SHA debd225d007f4e441ceec80fbd6fa96653f94737 --tests-branch $GITHUB_SHA --wp-version "$WP_MAJOR"
+     ./bin/plugin/cli.js perf $GITHUB_SHA debd225d007f4e441ceec80fbd6fa96653f94737 --tests-branch $GITHUB_SHA --wp-version "$WP_MAJOR" --rounds 3
I wouldn't mind switching that one to "1" as well to see the impact it has on the graph (more or less variance)
Happy to do so, though I figured: why make that change here when we don't need to? I'll update, but I've also been working on simulating the change, so it may be good to compare the simulation to the real effect.
I think right now there's an issue with the wordpress-develop GitHub repo causing our failing tests.
Force-pushed from c1f4b4f to 5ff5a96
Force-pushed from 5ff5a96 to 4061516
What, how, why?
In the performance tests CI workflow we have been running every test suite three times for each branch under test. The goal of this work, introduced in #33710, was to reduce variation in the reported data from the tests.
Unfortunately, after measuring the data produced by our test runs, and after running experiments that ran the test suites thirty times over, we found that the overall variation is explained primarily by noise in the GitHub Actions container running our jobs. If running the test suites three times each reduces the variation in the results, that reduction isn't detectable above the variation introduced by factors beyond our control.
Because these additional rounds extend the perf-test runtime by around twenty minutes on each PR, we're reducing the number of rounds to a single pass for PR commits. This will free up compute resources and remove the performance tests as a bottleneck in the PR workflow.
Additional work can and should be done to further reduce variance in the testing results, but this practical step removes an obstacle to developer iteration speed without reducing the quality of the numbers being reported.
The following charts summarize two metrics from sixty runs of the performance test suite, looking at the raw data returned from the test runner. In the graphs we can see that even at a sampling size of 60 we still have large variation in the metrics: the median of the `firstBlock` metric still differs by 160 ms between the two test groups, even though they are testing the same code. Statistically speaking we cannot assert that these distributions are unequal, but they are close enough, with statistical significance at the P = 0.154 level for `type` and the P = 0.232 level for `firstBlock`, that we should be careful about making any assertions based on these measurements.

With three rounds, which is what we've been using, the spread of the distributions is higher but similarly unusable for detecting changes. That is to say, we're uncomfortably close to identifying a false-positive performance impact even after sixty rounds; imagine how hard it is at only three rounds to distinguish a real performance change from normal sampling variation due to factors outside our control (e.g. the GitHub Actions container).
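To make the kind of comparison described above concrete, here is a minimal sketch in TypeScript of a two-sample Mann-Whitney U test using the normal approximation. The choice of test and the sample timings are illustrative assumptions on my part; the actual analysis behind the P-values quoted above may have used a different method.

```typescript
// Minimal two-sample Mann-Whitney U test (normal approximation, no tie
// correction). It asks: could these two sets of timings plausibly have
// come from the same distribution?

// Sum of the 1-based ranks assigned to sample `a` after pooling both samples.
function rankSum(a: number[], b: number[]): number {
  const pooled = [...a.map((v) => ({ v, g: 0 })), ...b.map((v) => ({ v, g: 1 }))]
    .sort((x, y) => x.v - y.v);
  const ranks = new Array<number>(pooled.length).fill(0);
  let i = 0;
  while (i < pooled.length) {
    let j = i;
    while (j + 1 < pooled.length && pooled[j + 1].v === pooled[i].v) j++;
    const avg = (i + j + 2) / 2; // average rank across a run of ties
    for (let k = i; k <= j; k++) ranks[k] = avg;
    i = j + 1;
  }
  return pooled.reduce((sum, p, idx) => (p.g === 0 ? sum + ranks[idx] : sum), 0);
}

// Abramowitz-Stegun approximation of the complementary error function (x >= 0).
function erfc(x: number): number {
  const t = 1 / (1 + 0.3275911 * x);
  const poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  return poly * Math.exp(-x * x);
}

// Two-sided p-value for the hypothesis that `a` and `b` share a distribution.
function mannWhitneyP(a: number[], b: number[]): number {
  const n1 = a.length;
  const n2 = b.length;
  const u1 = rankSum(a, b) - (n1 * (n1 + 1)) / 2;
  const mu = (n1 * n2) / 2;
  const sigma = Math.sqrt((n1 * n2 * (n1 + n2 + 1)) / 12);
  const z = Math.abs(u1 - mu) / sigma;
  return erfc(z / Math.SQRT2); // P(|Z| > z) for a standard normal Z
}

// Made-up firstBlock timings (ms) from two jobs running identical code:
const groupA = [1210, 1342, 1188, 1460, 1295, 1530, 1251, 1402];
const groupB = [1255, 1310, 1480, 1220, 1390, 1505, 1330, 1275];
console.log(mannWhitneyP(groupA, groupB).toFixed(3)); // large p: no detectable difference
```

At a small p (conventionally below 0.05) we would claim the groups differ; the point above is that two groups running identical code got uncomfortably close to that threshold.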
All of this is to say that we aren't giving up much by dropping back down to a single round. Major performance changes will still stand out from the noise, but changes of the magnitude we typically see in the project history will remain hidden in it.
If we wanted to have confidence in performance changes given the data we have about our runtime environment, we might have to run the tests through a few hundred rounds. Unfortunately, thirty rounds is about as high as we can push it before the test runner is cancelled at six hours of runtime. The data I collected with sixty rounds comes from two separate test runs, each taking over four hours, and a third test run which was killed at six hours and whose data was unavailable for these measurements.
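To put a rough number on the "few hundred rounds" intuition, here is a back-of-the-envelope Monte Carlo sketch. The noise model (independent normal samples with a 150 ms standard deviation around a 1300 ms true value) is an assumption chosen purely for illustration, not something measured from our CI.

```typescript
// Simulate how much the reported median of a perf metric wanders around
// the true value as the number of rounds grows.

// Draw one sample from a normal distribution via the Box-Muller transform.
function randNormal(mean: number, sd: number): number {
  const u = Math.random() || Number.MIN_VALUE; // avoid log(0)
  const v = Math.random();
  return mean + sd * Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

const TRUE_MS = 1300; // assumed true metric value
const NOISE_SD = 150; // assumed per-round environmental noise

for (const rounds of [1, 3, 30, 300]) {
  // Simulate 2000 independent CI jobs, each reporting the median of `rounds` runs.
  const medians = Array.from({ length: 2000 }, () =>
    median(Array.from({ length: rounds }, () => randNormal(TRUE_MS, NOISE_SD)))
  );
  // RMS deviation of the reported median from the true value.
  const spread = Math.sqrt(
    medians.reduce((s, m) => s + (m - TRUE_MS) ** 2, 0) / medians.length
  );
  console.log(`${rounds} round(s): reported median wanders about ±${spread.toFixed(0)} ms`);
}
```

Under this assumed model, going from one round to three only trims the wobble from roughly ±150 ms to ±100 ms, thirty rounds still leaves about ±35 ms, and only in the hundreds of rounds does it approach the ±10 ms scale of changes we'd actually like to detect.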
Apart from that, we would need to find a way to gain more control over the runtime environment so that it's more deterministic. This probably implies running a self-hosted test runner. Since that isn't an option currently under discussion, I propose this change as a pragmatic mitigation of the performance-test bottleneck.
Tests
Because this changes only the test runner for the performance tests, ensure that the tests continue to pass and that the results look reasonable.
Please audit the code and look for ways this change introduces unwanted complexity; ask yourself whether any parts of it are unclear or inflexible given how we know we want to use this work in the near future.