Continuous Benchmarking #4687

Closed · 9 tasks · jangorecki opened this issue Aug 25, 2020 · 9 comments · Fixed by #6078 · May be fixed by #4517

Comments

@jangorecki (Member) commented Aug 25, 2020

As already mentioned in multiple issues and over email/Slack, we need automated tests that will be able to track performance regressions.
This issue is meant to define the scope.

A related, useful project is planned in conbench. Once it is working, I think we should use it. Unfortunately, that does not seem likely to happen anytime soon, or even in the more distant future.
Either way, keeping the scope minimal should make it easier to move to conbench eventually.
Another piece of related work is my old macrobenchmarking project.
There is also the recent draft PR #4517.


Scope

Dimensions by which we will track timings

  • environment (allows looking up the hardware configuration)
  • R version
  • git sha of data.table (to look up the date and version)
  • benchmark script (probably fixed to benchmark.Rraw)
  • query
  • version of a query (in case we modify an existing query for some reason)
  • description

Dimensions that I propose to leave out of scope for now

  • operating system
  • Linux kernel version
  • compiler
  • compiler version
  • R compilation flags
  • data.table compilation flags
  • metrics (memory usage, etc.)
  • datatable.optimize option
  • number of threads (?)
  • multiple runs of single query (?)

Challenges

Store timings

In the current infrastructure we do not have any process that appends artifacts (timings, in the context of CB). Each CB run has to store its results somewhere and re-use them later on.

  • store timings in a CSV for simplicity (?); a sketch of what that could look like follows below
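
A minimal sketch of what appending to such a CSV could look like, using data.table itself. The file path, column names, and the CI_COMMIT_SHA environment variable are assumptions chosen to mirror the dimensions listed above, not a fixed design:

```r
library(data.table)

# hypothetical location of the accumulated timings; in practice this would be
# an artifact the CB pipeline fetches before and pushes after each run
timings_csv = "cb-timings.csv"

# one row per measured query, columns following the dimensions proposed above
new_row = data.table(
  timestamp     = format(Sys.time(), tz = "UTC"),
  environment   = "cb-machine-01",                          # used to look up hardware config
  r_version     = as.character(getRversion()),
  git_sha       = Sys.getenv("CI_COMMIT_SHA", unset = NA),  # or however the pipeline exposes it
  script        = "benchmark.Rraw",
  query         = "q1_groupby_sum",
  query_version = 1L,
  seconds       = 1.234                                     # measured timing
)

# append to the existing file, writing the header only on the first run
fwrite(new_row, timings_csv, append = file.exists(timings_csv))
```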

Signalling a regression

  • Should we compare only to the previous timings, or to average timings over a longer period (the last month, etc.)?
  • What is the tolerance threshold? For cheap queries, 5-25% variance will be common.
  • If the threshold is exceeded, we may want to run the benchmark a few more times and take an average before signalling a regression. Then we need a second threshold for that average. A possible check along these lines is sketched below.
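
One possible shape for such a check, comparing a new timing against the mean of recent history with a relative threshold. The 25% tolerance, the 30-run window, and the idea of a second, tighter threshold for the averaged re-run are illustrative assumptions:

```r
# TRUE if `new_seconds` looks like a regression relative to `history`,
# a numeric vector of past timings for the same query on the same environment
looks_like_regression = function(new_seconds, history,
                                 threshold = 0.25,   # single-run tolerance (assumed)
                                 n_recent = 30L) {   # roughly "last month" of daily runs
  if (length(history) == 0L) return(FALSE)           # nothing to compare against yet
  baseline = mean(tail(history, n_recent))
  (new_seconds - baseline) / baseline > threshold
}

# if this fires, the pipeline could re-run the query a few times and apply a
# second, tighter threshold (say 10%) to the averaged timing before signalling
```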

Environment

To reduce the number of false regression signals we need to use private, dedicated infrastructure.
Having a dedicated machine may not be feasible, so we need a mechanism for signalling to Jenkins (or another orchestration process) that a particular machine is in use in exclusive mode.
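
A minimal sketch of one possible exclusive-use mechanism, assuming a simple directory-based lock on the benchmark machine; the lock path and the way Jenkins would react to a busy machine are assumptions:

```r
# dir.create() returns FALSE if the lock already exists, giving a cheap
# "machine is busy" check that other jobs on the box can also respect
lock_path = "/var/tmp/datatable-cb.lock"   # assumed location

acquire_lock = function(path = lock_path) dir.create(path, showWarnings = FALSE)
release_lock = function(path = lock_path) unlink(path, recursive = TRUE)

if (acquire_lock()) {
  tryCatch(
    message("running benchmark.Rraw ..."),   # placeholder for the actual CB run
    finally = release_lock()
  )
} else {
  message("machine busy, postponing CB run") # Jenkins would re-queue the job
}
```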

Pipeline

In the most likely case of not having a dedicated machine, CB may end up being queued for quite a while (up to multiple days). Therefore it makes sense to have it in a separate pipeline rather than in our data.table GLCI. Such a CB pipeline could be scheduled to run daily or weekly instead of running on each commit.

  • We could eventually move publishing artifacts (package, website, docker images) to a dedicated daily pipeline. This is not strictly related to CB, but it would make publishing CB results easier as well.

Versioning

  • Should all elements of CB be included as part of the data.table project, or live in a separate project?
    • test script inst/tests/benchmark.Rraw
    • a new function benchmark(), meant to be used like test(), and benchmark.data.table(), to be used like test.data.table() (a rough sketch follows below)
    • any extra orchestration could live in .ci/
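
Purely as an illustration of that last benchmark()/benchmark.data.table() idea, a hypothetical helper mirroring how test() wraps a single case might look like this; the signature and behaviour are assumptions, not an actual data.table API:

```r
# hypothetical helper: evaluate `expr`, measure elapsed time, and record it
# under test number `num` (benchmark.data.table() would loop over benchmark.Rraw
# the same way test.data.table() loops over the test files)
benchmark = function(num, expr, description = "") {
  elapsed = system.time(expr)[["elapsed"]]   # forcing `expr` here times the query
  # a real implementation would append to the shared timings store
  # (see the CSV sketch above) rather than just return a row
  data.frame(num = num, description = description, seconds = elapsed)
}

# usage inside inst/tests/benchmark.Rraw, roughly in the style of test():
# benchmark(1.1, DT[, sum(v1), by = id1], "grouped sum, one key")
```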

Example test cases

@surenH2oai commented:

Thanks @jangorecki. This story needs to be broken down into tasks in GitHub to plan deliverables; otherwise it is hard to track progress.
Break it down as:

  • Automated provisioning of the environment for performance testing (via Jenkins, GitLab, etc. scripts and jobs)
  • Design and development of the regression DT framework
  • Automating the regression test (first test)
  • Automating additional tests into the framework

Ideally the framework should enable anyone to add new regression tests easily, so once the framework is in place, others can add the tests they need as the need arises.

@MichaelChirico (Member) commented:

I think a simple single-threaded vs. multi-threaded test should be in scope (maybe 1 thread vs. 2, 4, or 8 threads; not sure what machine we'd be testing on).
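
For what it's worth, a comparison along those lines can be very small; setDTthreads() and getDTthreads() are existing data.table functions, while the data size, query, and thread counts below are placeholders:

```r
library(data.table)

DT = data.table(id = sample(1e5, 1e7, TRUE), v = rnorm(1e7))

time_with_threads = function(nthreads) {
  old = getDTthreads()
  setDTthreads(nthreads)
  on.exit(setDTthreads(old))                       # restore the previous setting
  system.time(DT[, sum(v), by = id])[["elapsed"]]  # example grouped query
}

sapply(c(1L, 2L, 4L, 8L), time_with_threads)
```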

jangorecki linked a pull request Oct 9, 2020 that will close this issue
jangorecki mentioned this issue Oct 9, 2020
@assignUser commented Jan 19, 2022

Hey! If you are still interested in CB, I would like to mention {touchstone}, which will run benchmarks and comment the results on the PR after every push! We are getting ready to submit to rOpenSci and CRAN.
Now is a good time to test it; we would love some feedback from new users!

@tdhock (Member) commented Oct 3, 2023

Continuous benchmarking is one goal of the NSF POSE project, which @DorisAmoakohene is working on.

@assignUser commented Oct 4, 2023

touchstone development has stalled a bit (more or less because it is working for the projects it is used in), but I have tentative plans to add an option to export results to conbench, which is active and used by high-profile projects such as Apache Arrow and Meta's Velox. The data export is done via JSON, so it would be very easy to activate once a conbench server has been set up for dt (and once I have added the export function 😁).
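
For a rough idea of what a JSON export of a timing result could look like (using jsonlite; the field names are illustrative and not the actual conbench schema):

```r
library(jsonlite)

result = list(
  run_id  = "2023-10-04-abc123",   # illustrative identifiers, not a real schema
  git_sha = "abcdef0",
  query   = "q1_groupby_sum",
  seconds = 1.234,
  machine = "cb-machine-01"
)

# pretty-printed JSON that a conbench-style server could ingest
toJSON(result, auto_unbox = TRUE, pretty = TRUE)
```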

@tdhock (Member) commented Dec 20, 2023

@Anirban166 please see animint/animint2#101 (comment) for an explanation of how to create a fine-grained token with permission to write to a particular repo in the context of a GitHub Action for performance testing (the idea is to write performance-test plots, which show up in PRs, to the gh-pages branch of Rdatatable/performance_test_PR_comment_plots).

@Anirban166 (Member) commented Feb 27, 2024

With regard to the issue at hand, I've been working on identifying performance regressions from pull requests (taking historical regressions into account) via a GitHub Action that I created.

@assignUser commented:

@Anirban166 You might already be aware of this, but a common issue with these types of workflows (do something and comment the result on the PR) is that the GITHUB_TOKEN for pull_request differs depending on the source of the PR. If the PR comes from the same repo (e.g. your test PR), the token has full read/write permissions and has access to secrets. If the PR comes from a fork (which is the case for most PRs in OSS repos like dt), the token is restricted to read access and gets no secrets.

Commenting on a PR requires the pull-requests: write permission, which means the comment won't go through on fork PRs. There is a trigger (pull_request_target) that causes a workflow to run with elevated permissions (which should always be minimized with permissions: ... on a PR), but it has to be used with great caution and should never run code from the PR directly, for security reasons.

I really wish there were a separate pull_request_comment permission to avoid this issue, but currently that doesn't exist :(

@Anirban166 (Member) commented:

(quoting @assignUser's comment above on GITHUB_TOKEN permissions for fork PRs)

Ditto!
PR contributors (speaking of contributors at large, not data.table core members who have the appropriate access permissions) won't be able to authenticate with a GITHUB_TOKEN to comment on the PR, due to the lack of write permissions.

I'm aware, and that's exactly what I've been telling @tdhock over the past few weeks as well. (In fact, I brought this up to him even before I started on the action.)

I wish that too :(

At present, I'll be running the workflows on my repository until we come up with better ideas (and since the tests are to be updated from time to time, this does require some manual configuration). One option is for the user to create another branch (apart from their fork of data.table) and run the comparison between the two, but that would also require the user to take a few extra steps and have some working knowledge of setting up the action, so it's probably not the ideal approach if we were to leave it to them.
