A GitHub Action to perform automated performance regression testing on pull requests #6065
This is a great contribution, thanks for the detailed explanation. For the time that it takes, I believe ~15 minutes per run should be acceptable (for reference, other data.table GHA runs take 5-10 minutes). If it ends up being too much time, I think we can reduce the time it takes later by creating a docker container with all of the required packages (which should substantially reduce the installation time). I wonder if that sounds reasonable to @jangorecki who is most familiar with our CI setup?
15 mins is fine, it needn't be a binding check. That way, simple PRs can be merged quickly without waiting for the slower check to finish. If anything, I wonder if it should take longer to really draw out the types of performance regressions we're worried about. E.g. the duckdb benchmarks take at least that long to run a few queries.
yes, agree, 15 min is fine
Agree that really large data/timings can be used to see some of the performance regressions we're worried about, but actually the really remarkable thing about atime is that it allows us to detect a large number of relevant performance regressions by looking at a range of small N. The trick is that atime keeps increasing N until we get to a time limit; the default is 0.01 seconds, and that is actually sufficient to see differences in time/memory in historical regressions. It also keeps the overall check time very small (0.01 seconds per timing * 10 timings per version * 7 versions = 0.7 seconds per performance test). If there are any performance regressions that require more time to test/see, we can increase the time limit on a per-test basis (say from 0.01 to 0.1 or 1 seconds for some tests). And if we need even more time than that, we can try to think of other ways of testing (but that probably wouldn't be feasible on GitHub Actions).
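(For concreteness, here is a minimal sketch of the mechanism described above, assuming the atime::atime_versions() interface; the clone path, data sizes, test expression, and SHAs below are illustrative placeholders, not the actual data.table tests.)

```r
library(atime)
# Compare one expression across several versions of the package: for each
# version, N is grown until the median timing exceeds seconds.limit, so the
# total cost stays near seconds.limit * times * (number of versions).
res <- atime::atime_versions(
  pkg.path = "~/data.table",                 # local git clone (placeholder path)
  N = 10^seq(1, 7),                          # candidate data sizes to try
  setup = {                                  # evaluated for each N, not timed
    DT <- data.table::data.table(g = rep(1:5, length.out = N), x = rnorm(N))
  },
  expr = data.table:::`[.data.table`(DT, , .(s = sum(x)), by = g),
  seconds.limit = 0.01,                      # raise per test if a regression needs more time to show up
  times = 10,                                # timings per version and N
  "Before"     = "<sha before the regression>",  # placeholder SHAs
  "Regression" = "<sha that introduced it>",
  "Fixed"      = "<sha that fixed it>")
plot(res)  # median line with quartiles band for each version, over N
```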
I would suggest the marketplace version. |
And when you make the PR, please make it from the Rdatatable repo (not your fork); you should have permission to create a new branch.
Yup, I will create a new one by branching off from master here (I already did that once before). Now that you mentioned this, I'd like to add that since this workflow is going to be on the master branch, it will only get triggered for pull requests to the master branch from any other branch.
Got to read everything fully now, thanks so much for the extremely well-documented issue+examples! I think we are well into MVP stage for this and look forward to the PR! We can continue to improve more as we go along. I can't say I understand well the tradeoffs of Marketplace vs. local version, so I defer to @tdhock. @tdhock are those confidence bands on the {atime} output graphs? One thing I have in mind is whether we'll be able to easily discern statistical noise from these plots.
Yes, they are confidence bands; in fact the y axis says so ("median line, quartiles band"), and that is over 10 timings.
@Anirban166 can you explain the benefits/drawbacks of the marketplace/local approaches?
Also, thanks to @DorisAmoakohene who has created a database of historical regressions that we can use to seed the initial performance tests: https://github.com/DorisAmoakohene/PerformanceTest_data.table
Thanks Michael! Happy to hear that you found it well-documented.
Perfect! I'll be making the PR soon.
Agreed, and updates to the action aside, we can keep adding to the test cases via separate PRs.
Using a version of my action from the Marketplace means that we don't have to edit the workflow here to integrate any changes to the code (be it bug fixes or features extending this MVP), aside from updating the version number for new releases as and when necessary. Maintenance might thus be easier. Another slight benefit that I can think of is the small size of the workflow code and its narrowed-down scope, which may be helpful if we were to integrate this into another workflow or make it a part of one.

On the other hand, using a self-contained/local version might be better if someone wants to customize the logic here directly (it becomes independent of me maintaining the repository for my action). Updating the action would then be solely restricted to this repository. (And I would presume anyone who modifies it in the future would first run tests in their own fork of data.table anyway.)

Imo the difference between the two approaches is quite subtle, and it's more of a preference or choice-based thing at the end instead of having concrete pros and cons. (I for one have no strong preference towards either one, and that's why I asked for opinions to make a decision between two options that both work but are hard to choose between.)
Thanks, that's pretty clear. I agree with Toby that the marketplace is the correct approach, especially given you've already built generality beyond data.table into the design.
In an effort to potentially address #4687, I created a GitHub Action to facilitate performance testing of the incoming changes that are introduced via pull requests to data.table. My motivation in doing so was to help ensure that data.table maintains its code efficiency or high-performance standards (core values of the project!) as PRs keep coming and getting integrated frequently (meaning they need to be monitored for performance regressions, and an automatic way to do that would be ideal, noh?).

Onto more details:
- The action runs performance tests (defined in inst/atime/tests.R) on different versions of data.table. These 'tests' are based on historical regressions that have been documented/discussed via issues and PRs here (a sketch of what such a test entry might look like follows after this list).
- It then posts a comment on the pull request (using the GITHUB_TOKEN I provide to authenticate itself and interact with the GitHub API within the scope of this step that I defined in my GHA workflow).
- The comment includes a plot comparing the different data.table versions, the commit (SHA) that generated it, the link to download the artifact containing all of the atime-generated results for that run, the time elapsed from the job's start to the completion of the installation steps, and the time it took to run all the tests.
- The different versions of data.table (as defined within atime) include the ones labelled in the plots below: base, merge-base, CRAN, Before, Regression, Fixed, and HEAD.
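(The issue doesn't reproduce the contents of inst/atime/tests.R here, so purely to illustrate the kind of test case meant above, a hypothetical entry might look roughly like the following; the field names, expression, and SHAs are my assumptions, not the actual file.)

```r
# Hypothetical sketch of one historical-regression test case in the spirit of
# inst/atime/tests.R: quoted setup/expr plus named commits to compare.
test.list <- list(
  "Hypothetical grouped-summary regression test" = list(
    setup = quote({
      # N is supplied by atime, which grows it until the time limit is hit
      DT <- data.table::data.table(id = rep(1:100, length.out = N), val = rnorm(N))
    }),
    expr = quote(data.table:::`[.data.table`(DT, , .(m = mean(val)), by = id)),
    Before     = "<sha of the last fast commit>",        # placeholder SHAs
    Regression = "<sha that introduced the slowdown>",
    Fixed      = "<sha of the commit that fixed it>"))
```

Labels like base, HEAD, merge-base, and CRAN would presumably be filled in by the action from the PR context rather than written into each test itself.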
To demonstrate, I've created two examples on my fork of data.table that can be compared against each other. Both of them replicate PRs that are associated with historical data.table regressions (one introducing a regression, one fixing it) and showcase the expected results in their respective plots:
- For the test case titled 'Test regression fixed in #5463', the first PR (which replicates the regression-introducing PR) shows base, merge-base, CRAN, Fixed, and Before as fast while Regression and HEAD are slow; in the second PR, which should not be affected, all commits are fast with the expected exception of Regression.
- For 'Test regression fixed in #4440', the first PR illustrates the reverse case, with all commits fast except the one associated with Regression (as expected, since that PR should not affect this test case); in the second PR, which replicates the regression-fixing PR, Fixed, HEAD, CRAN, and Before are fast while Regression, merge-base, and base are slow, as we would expect.
Note that the first PR's GHA workflow uses my action from the Marketplace, while the second one is run via a self-contained version of my workflow, i.e., with the entire logic in the file under .github/workflows/ within the target branch. (I made them different on purpose to illustrate the two ways!)

So a question that I have here is: which of these two approaches would be preferable if I make a PR to integrate my workflow? The self-contained approach allows for more direct customization, while the Marketplace approach requires comparatively less maintenance (as I update things for my action, only the version number needs to be changed here).
Another question I have is regarding the duration. The total time taken to run the CI job can be roughly seen as an aggregation of two fairly time-consuming tasks:
- Installing atime and its dependencies: roughly 12 minutes on average.
- Running all of the performance tests across the different versions.

Is this acceptable in terms of how long it takes to run the workflow? (I presume we don't always need to run the action; e.g., it could be skipped for changes that only deal with documentation.)
Finally, I want to mention that I made this action specifically for data.table (again, in hopes that this helps to detect regressions and maintain the quality of code contributions performance-wise!), but it also works on any other GitHub repository for an R package. For example, last month I made a PR between branches on my fork of binsegRcpp that introduces a performance regression (an intentionally placed Sys.sleep()) in one of the R files, and it was depicted in the commented plot. (Note that it was run using an older version of my action's code.)

Ready for feedback!