Performance Regression Testing Revamp Stage 1 #4602
# Performance Regression Testing

This directory includes dbt project setups that are known performance bottlenecks, and a test runner written in Rust which runs specific dbt commands on each of the projects. Orchestration is done via the GitHub Actions workflow in `/.github/workflows/performance.yml`. The workflow is scheduled to run every night, but it can also be triggered manually.

This test suite samples the performance characteristics of individual commits against performance models for prior releases. Performance is measured in project-command pairs, which are assumed to conform to a normal distribution. The sampling and comparison are efficient enough to run against PRs.

The GitHub workflow hardcodes our baseline branch for performance metrics as `0.20.latest`. As future versions become faster, this branch will be updated to hold us to those new standards.
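The per-pair modeling described above can be sketched in a few lines. This is an illustrative sketch rather than the actual runner code, and the sample runtimes below are made up:

```rust
// Illustrative sketch (not the actual runner code): fit a normal model
// (mean and sample standard deviation) to sampled runtimes for one
// project-command pair.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

fn stddev(xs: &[f64]) -> f64 {
    let m = mean(xs);
    // sample variance (divide by n - 1)
    let var = xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() - 1) as f64;
    var.sqrt()
}

fn main() {
    // hypothetical `dbt parse` runtimes in seconds
    let samples = [49.5, 50.1, 49.9, 50.3, 49.7];
    println!("mean = {:.2}s, stddev = {:.2}s", mean(&samples), stddev(&samples));
}
```

These two fitted values are all the comparison step needs in order to decide whether a later observation is suspiciously slow.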
## Adding a new dbt project

Performance baselines measured during our release process are committed to this directory via GitHub Action. (TODO make the file and name it here)

## Investigating Regressions

If your commit has failed one of the performance regression tests, it does not necessarily contain a performance regression. However, the observed runtime was so much slower than the expected value that it is unlikely to be random noise. Any commit between the release being compared against and this failing commit might contain the cause. Start by investigating the failing commit and work your way backwards.
## The Statistics

Particle physicists need to be confident in declaring new discoveries, snack manufacturers need to be sure each snack is within the regulated margin of error for nutrition facts, and weight-rated climbing gear needs to be produced so that you can trust your life to every unit that comes off the line. All of these use cases rely on the same kind of math: sigma-based p-values. This section will peel apart that math with the help of a physicist and walk through how we apply this approach to performance regression testing in this test suite.

You are likely familiar with forming a hypothesis of the form "A and B are correlated", which is known as _the research hypothesis_. It follows that the hypothesis "A and B are not correlated" is also relevant, and it is known as _the null hypothesis_. When looking at data, we commonly use a _p-value_ to determine the significance of the data. Formally, a _p-value_ is the probability of obtaining data at least as extreme as the ones observed, if the null hypothesis is true. To refine this definition, the experimental particle physicist [Dr. Tommaso Dorigo](https://userswww.pd.infn.it/~dorigo/#about) has an excellent [glossary](https://www.science20.com/quantum_diaries_survivor/fundamental_glossary_higgs_broadcast-85365) of these terms that helps clarify: "'Extreme' is quite tricky instead: it depends on what is your 'alternate hypothesis' of reference, and what kind of departure it would produce on the studied statistic derived from the data. So 'extreme' will mean 'departing from the typical values expected for the null hypothesis, toward the values expected from the alternate hypothesis.'" In the context of performance regression testing, our research hypothesis is that "after commit A, the codebase includes a performance regression", which means we expect the runtime of our measured processes to be _slower_ than the expected value, not faster.
Given this definition of p-value, we need to explicitly call out the common tendency to apply _probability inversion_ to our observations. To quote [Dr. Tommaso Dorigo](https://www.science20.com/quantum_diaries_survivor/fundamental_glossary_higgs_broadcast-85365) again, "If your ability on the long jump puts you in the 99.99% percentile, that does not mean that you are a kangaroo, and neither can one infer that the probability that you belong to the human race is 0.01%." Using our previously defined terms, the p-value is _not_ the probability that the null hypothesis _is true_.

This brings us to calculating sigma values. Sigma refers to the standard deviation of a statistical model, which is used as a measurement of how far away an observed value is from the expected value. When we say that we have a "3 sigma result", we are saying that if the null hypothesis is true, this is a particularly unlikely observation; it is not a claim about the probability that the null hypothesis is true. Exactly how unlikely depends on the expected values under our research hypothesis. In the context of performance regression testing, if the null hypothesis is false, we expect the results to be _slower_ than the expected value, not _slower or faster_. Looking at the normal distribution below, we can see that we only care about one _half_ of the distribution: the half where the values are slower than the expected value. This means that when we calculate the p-value, we do not include both sides of the normal distribution.

![normal distribution](./images/normal.svg)
Because of this, the following table describes the significance of each sigma level for our _one-sided_ hypothesis:

| σ   | p-value        | scientific significance |
| --- | -------------- | ----------------------- |
| 1 σ | 1 in 6         |                         |
| 2 σ | 1 in 44        |                         |
| 3 σ | 1 in 741       | evidence                |
| 4 σ | 1 in 31,574    |                         |
| 5 σ | 1 in 3,486,914 | discovery               |
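The sigma-to-p-value mapping in the table can be reproduced numerically. The sketch below uses the Abramowitz and Stegun polynomial approximation of the error function (absolute error at most about 1.5e-7, which is accurate through roughly 4 sigma; the 5 sigma row would need a more precise tail method):

```rust
// One-sided upper-tail p-value P(Z > sigma) for a standard normal,
// using the Abramowitz & Stegun 7.1.26 approximation of erf.
fn erf(x: f64) -> f64 {
    let sign = if x < 0.0 { -1.0 } else { 1.0 };
    let x = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * x);
    // Horner evaluation of the degree-5 polynomial in t
    let poly = t * (0.254829592
        + t * (-0.284496736
        + t * (1.421413741
        + t * (-1.453152027
        + t * 1.061405429))));
    sign * (1.0 - poly * (-x * x).exp())
}

fn one_sided_p(sigma: f64) -> f64 {
    // P(Z > sigma) = 0.5 * erfc(sigma / sqrt(2))
    0.5 * (1.0 - erf(sigma / std::f64::consts::SQRT_2))
}

fn main() {
    for s in 1..=4 {
        let p = one_sided_p(s as f64);
        println!("{} sigma: p = {:.3e} (1 in {:.0})", s, p, 1.0 / p);
    }
}
```

Running this recovers the "1 in 6", "1 in 44", and "1 in 741" figures from the table (the one-sided values are roughly double the odds of the more familiar two-sided ones).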
When detecting performance regressions that trigger alerts, block PRs, or delay releases, we want to be conservative enough that detections are infrequently triggered by noise, but not so conservative as to miss most actual regressions. This test suite uses a 3 sigma standard, so only about 1 in every 700 runs is expected to fail the performance regression test suite due to expected variance in our measurements.
### Concrete Example

The following example data was collected locally on a MacBook Pro using the same tools included in this repository.

> **Reviewer:** Wouldn't we want to collect the sample mean + stddev on the "same machine" as what will be running in CI? I know that's technically impossible, since GHA is a cloud service using VMs — but it's still possible to match the same basic architecture and memory characteristics, right? Versus a macbook pro running locally
>
> **Author:** you're totally correct. The macbook numbers are all I have right now so I used those to give concrete numbers to the abstract concepts. Your comment here tells me that I should replace these with github action numbers once I have them though.
In dbt v1.0.1, we have the following mean and standard deviation when parsing a dbt project with 2000 models:

μ (mean): 49.82 seconds<br/>
σ (stddev): 0.5212 seconds<br/>
The 2-sided 3 sigma range can be calculated from these two values via:

x < μ - 3σ or x > μ + 3σ<br/>
x < 49.82 - 3 * 0.5212 or x > 49.82 + 3 * 0.5212<br/>
x < 48.26 or x > 51.38<br/>

It follows that the 1-sided 3 sigma range for performance regressions is just:<br/>
x > 51.38
If we then sample a single `dbt parse` of the same project, using a commit slated to go into dbt v1.0.2, on the same MacBook Pro under the same conditions, and we observe a 52s parse time, then this observation is so unlikely in the absence of a code-induced performance regression that we should investigate whether any of the commits between this failure and the commit where the initial distribution was measured contains a regression.
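The worked numbers above can be checked with a few lines. The μ and σ values are taken from the example, and the 52s observation is the hypothetical regression:

```rust
// One-sided 3 sigma threshold from the example's fitted model:
// any observation slower than mu + 3 * sigma is flagged.
fn regression_cutoff(mu: f64, sigma: f64) -> f64 {
    mu + 3.0 * sigma
}

fn main() {
    let cutoff = regression_cutoff(49.82, 0.5212);
    println!("cutoff = {:.2}s", cutoff); // prints "cutoff = 51.38s"
    let observed = 52.0;
    println!("flag regression: {}", observed > cutoff);
}
```

Because the test is one-sided, an observation of 48s, although equally far from the mean, would not be flagged.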
Observations with 3 sigma significance that are _not_ performance regressions could be due to observing genuinely unlikely values (about 1 in every 741 observations), or to variation in the instruments we use to take these measurements, such as GitHub Actions runners. At this time we do not measure the variation in these instruments, so it is not accounted for in our calculations.
## Expanding the Tests

Regression tests run pre-defined dbt commands across a set of source-committed dbt projects that are known to cause performance bottlenecks. This collection of projects and commands should expand over time, reflecting user feedback about poorly performing projects, so that future versions are protected against regressions in these scenarios.
### Adding a new dbt project

Just make a new directory under `performance/projects/`. It will automatically be picked up by the tests.

### Adding a new dbt command

In `runner/src/measure.rs::measure`, add a metric to the `metrics` Vec. The GitHub Action will handle recompilation if you don't have the Rust toolchain installed.
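As a sketch only (the real metric type in `runner/src/measure.rs` may differ, and the field names here are hypothetical), adding a command amounts to pushing one more entry onto the `metrics` Vec:

```rust
// Hypothetical shape of a metric entry; the actual struct in
// runner/src/measure.rs may differ.
struct Metric {
    name: String, // label used to identify the measurement in output
    cmd: String,  // the dbt command being timed
}

fn metrics() -> Vec<Metric> {
    vec![
        Metric {
            name: "parse".to_string(),
            cmd: "dbt parse".to_string(),
        },
        // a new command would be one more entry here
    ]
}

fn main() {
    for m in metrics() {
        println!("{}: {}", m.name, m.cmd);
    }
}
```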
## Future work

- add more projects to test different configurations that have been known bottlenecks
- add more dbt commands to measure
- possibly use the uploaded json artifacts to store these results so they can be graphed over time
- read new metrics from a file so no one has to edit Rust source to add them to the suite
- instead of building the Rust runner every time, publish and pull down the latest version
- instead of manually setting the baseline version of dbt to test, pull down the latest stable version as the baseline
> **Reviewer:** Sweet! Confirming that we'll have this data saved to / available from S3?
>
> **Author:** Not at first. The first version of this test suite is going to add a commit to the repository with a json file for the full modeling of each release. The samples are printed in GitHub Actions but ultimately not saved. If we want any of the above to be available in S3, we can track that work with a separate ticket.