Automation of benchmark comparison #292

fjetter · 2022-08-31T15:29:06Z

Requirements

Be able to compare two distinct git handles (commit, branch, etc.) of dask/dask and/or dask/distributed; needs to target forks
It is fine if it is merely possible to use a commit OR a branch OR a PR etc. instead of generic git refs
Tests are used from the current commit / PR of the coiled-runtime
Initial iteration must be extendable to compare N commits
One or many CI job(s) is spawned that runs all available benchmarks and collects a summary
The CI job collects all data that is also collected by the ordinary benchmark tests
The CI job produces a summary / report comparing these single point numbers

Nice to haves

Varying config options
Select a subset of the benchmarks to run
Varying environment
Varying test code for different git handles (if a prototype change breaks an API, for example)
We also want to collect memory sampling, performance reports, etc.

crusaderky · 2022-09-01T10:18:34Z

I went through the codebase. @jrbourbeau @ncclementi @ian-r-rose please correct me if I say something wrong or impossible.

The Tests github action does the following:

build the latest conda package for coiled-runtime
create coiled software environments for latest, 0.1.0, and 0.0.4, in cartesian product with different versions of python
run the full battery of tests against the coiled environments created at step 2
collect and compare results
destroy coiled software environments

If that's the case, my plan is fairly straightforward:

Create additional software environments
Create a new static HTML report to compare whatever was run in the battery of tests

Sample workflow

I want to test dask/distributed#6614.

I create a new branch of coiled-runtime, in my own fork: crusaderky/coiled-runtime/AB_6614

In the new branch, I create the following files. Each file is a bunch of kwargs to coiled.create_software_environment:

AB_parameters/6614_before.yaml

conda:
    - python=3.9
    - coiled-runtime=0.1.0
pip:
    - dask==2022.8.1
    - git+https://github.com/dask/distributed@f07f384eb0b8b6bad7518e8535e42fdba106ed4b

AB_parameters/6614_inf.yaml

conda:
    - python=3.9
    - coiled-runtime=0.1.0
pip:
    - dask==2022.8.1
    - git+https://github.com/dask/distributed@dd81b424971e81616e1a52fa09ce4698a5002d41

AB_parameters/6614_12.yaml

conda:
    - python=3.9
    - coiled-runtime=0.1.0
pip:
    - dask==2022.8.1
    - git+https://github.com/dask/distributed@dd81b424971e81616e1a52fa09ce4698a5002d41
environ:
    DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION: "1.2"

Notes

I could have used the nightly build of coiled-runtime instead
I did not install dask 2022.8.1 using conda, because it would have conflicted with coiled-runtime=0.1.0. However, using pip allows you to bypass the constraints imposed by the conda recipe.
I could have added a PR that hasn't been merged yet by simply creating a branch on my fork https://github.com/crusaderky/distributed

I then have to update .github/workflows/tests.yaml:

In the (not yet existing) section "additional_environments", I change

matrix:
  AB_environments: []

to

matrix:
  AB_environments:
    - 6614_before
    - 6614_inf
    - 6614_12

In the benchmarks, runtime, and stability sections, I update the matrix to whatever I'm interested in:
e.g. from

  benchmarks:
    ...
    strategy:
      fail-fast: false
      matrix:
        os: ["ubuntu-latest"]
        python-version: ["3.9"]
        runtime-version: ["latest", "0.0.4", "0.1.0"]

to

      matrix:
        os: ["ubuntu-latest"]
        python-version: ["3.9"]
        runtime-version: [AB_6614_before, AB_6614_inf, AB_6614_12]

Note: I would love to not need to touch tests.yaml and instead automatically generate those lists from a file. To my knowledge it's not possible but I will investigate.

Finally, I submit a PR named [DNM] A/B testing: distributed#6614 and read the report out of the PR's artifacts.
When I'm done I close the PR without merging.

The static report will show side by side the time, peak memory, and average memory for each test from each test suite that has run and has the measure.

Further improvements, out of scope for this ticket:

Requirements

All requirements in the OP are satisfied.

Nice to haves

Varying config options
Yep!
Select a subset of the benchmarks to run
You can completely disable benchmarks, runtime, or stability. You can additionally cherry-pick tests by tampering with the lines in tests.yaml that call pytest.
Varying environment
Yep!
Varying test code for different git handles (if a prototype change breaks an API, for example)
Yep!
We also want to collect memory sampling, performance reports, etc.
MemorySampler timeseries are definitely a possibility for a future iteration but are out of scope for now.
Automatically highlight major changes
Out of scope for now; something straightforward to implement in the future

Known issues and limitations

I don't know if it's straightforward, or even possible, to use the latest coiled-runtime (meaning, from the same PR) instead of the nightly build. This makes it harder to e.g. benchmark two different versions of a dependency where the pip package is not quite interchangeable with the conda one, like numpy or pandas.

fjetter · 2022-09-01T11:12:27Z

The Tests github action does the following:

the tests workflow is already pretty complicated. I wouldn't mind writing something new. If it actually is easier to develop within this file, I'm OK.
If we want to reuse steps, we can probably factor them out into a custom github action (afaik this can be done within the same repo) and use different yml files

Would love to get @ian-r-rose thoughts on this

I create a new branch of coiled-runtime, in my own fork: crusaderky/coiled-runtime/AB_6614

nit: if you are running on your fork, there will be no credentials. Right now, we have to run everything on the main repo. I don't think this impacts your proposal

Note: I would love to not need to touch tests.yaml and instead automatically generate those lists from a file. To my knowledge it's not possible but I will investigate.

Agreed but I think that's an OK workflow for now

fjetter · 2022-09-01T11:14:25Z

I don't know if it's straightforward, or even possible, to use the latest coiled-runtime (meaning, from the same PR) instead of the nightly build. This makes it harder to e.g. benchmark two different versions of a dependency where the pip package is not quite interchangeable with the conda one, like numpy or pandas.

Looks like the checkout action supports checking out anything you want. It just defaults to the current commit

https://github.com/actions/checkout

I suggest to do this in a follow up.

fjetter · 2022-09-01T11:16:19Z

Finally, I submit a PR named [DNM] A/B testing: distributed#6614 and read the report out of the PR's artifacts.
When I'm done I close the PR without merging.

Nice follow up: the CI job posts the final results to the PR and closes the PR automatically

ian-r-rose · 2022-09-01T14:50:58Z

I went through the codebase. @jrbourbeau @ncclementi @ian-r-rose please correct me if I say something wrong or impossible.

The Tests github action does the following:
1. build the `latest` conda package for `coiled-runtime`

2. create coiled software environments for `latest`, `0.1.0`, and `0.0.4`, in cartesian product with different versions of python

I would expect these to be much simpler once we merge #235 and live in a package_sync world. Then you would only have to worry about installing the correct environmnent on the CI instance, and the cluster will follow. I think it's probably not worth the time to generate new coiled software environments for every CI build right now.

3. run the full battery of tests against the coiled environments created at step 2

4. collect and compare results

No objection here :)

5. destroy coiled software environments

This would also be unnecessary in a package_sync situation.

If that's the case, my plan is fairly straightforward:

1. Create additional software environments

2. Create a new static HTML report to compare whatever was run in the battery of tests

On this front, I pushed some WIP that I have found useful for looking at regression in #294. You might be able to use that with minimal changes (it could probably use a bit of CLI work).

ian-r-rose · 2022-09-01T14:53:26Z

the tests workflow is already pretty complicated. I wouldn't mind writing something new. If it actually is easier to develop within this file, I'm OK.

I agree that it's quite complicated. However, there are two things that would simplify it dramatically:

We merge Package Sync #235, which already removes several hundred lines of yaml, and
We adopt the proposal in Simplify CI test workflow #279 and just have a single "test" job

Would that allay your concerns about having something in tests.yml?

ncclementi · 2022-09-01T14:59:02Z

@ian-r-rose beat me to it, but I was about to say just that. I think once Package sync is in and we simplify CI, things should get much simpler to work with.

ian-r-rose · 2022-09-01T15:19:09Z

@crusaderky I'd recommend holding off on making major time investments into changing the software environment setup process until #235 is in. Or building on top of that PR.

crusaderky · 2022-09-01T15:26:02Z

I started writing #296. I'll pause and wait for #235 to go in first. Do you have a rough ETA?

ian-r-rose · 2022-09-01T15:29:09Z

Do you have a rough ETA?

I'd like to get it in today or tomorrow. Before we merge it, do you have any concerns about using package_sync for this effort?

gjoseph92 · 2022-09-01T19:10:31Z

Varying test code for different git handles (if a prototype change breaks an API, for example)

How is this supported? Let's say I wanted to test dask/dask#9446 while it's still a prototype PR.

I want to run 3 cases:

Low cardinality, against the PR:

def test_cardinality(small_cluster):
    ...
    result_A = dfA.groupby("x", cardinality=0.1).mean()
    result_B = dfB.groupby("x", cardinality=0.3).mean()

High cardinality, against the PR:

def test_cardinality(small_cluster):
    ...
    result_A = dfA.groupby("x", cardinality=0.8).mean()
    result_B = dfB.groupby("x", cardinality=0.9).mean()

Default, against main (which doesn't support the cardinality argument):

def test_cardinality(small_cluster):
    ...
    result_A = dfA.groupby("x").mean()
    result_B = dfB.groupby("x").mean()

How would you express the fact that different test code needs to run in different cases? With this approach, it seems like I'd have to add some way into my dask fork to express the two different cardinalities via environment variables.

Or another case: what if I wanted to benchmark the above code, also on 3 different cluster sizes?

gjoseph92 · 2022-09-01T20:54:01Z

Another way of approaching this would be to have the ability to compare across different branches of the coiled-runtime repo itself.

You'd create a branch for each variation you want to test. In that branch, you can change the environment.yaml however you wish to use different dependencies for that case; package_sync will take care of installing them.

You also can change all code however you need to between branches. This allows you to:

set different cluster environment variables/configs in different cases
use different cluster sizes in different cases
use different driver test code in different cases (covering the cardinality example above)

Pros:

extremely flexible
lightweight

Cons:

extremely flexible
lightweight

The downside, of course, is that you don't have this nice YAML file describing exactly what's different between the cases, which could be nice for dashboarding. But inevitably, that YAML is going to have to grow and grow to encompass more types of things you want to compare, and eventually require its own documentation and tests. (The alternative is to push all that complexity into the forked code you're running, which I think is a bad idea. Maintaining a branch of a branch, which adds hacks to alter behavior based on environment variables, would be brittle and frustrating.)

Making the changes in different branches could become tedious, especially with a large parameter space. I could certainly see making a script which takes a YAML configuration file and sets up all the branches for you :)

There's a question of whether we want to have thousands of branches in this repo. (We certainly wouldn't open PRs for all of them—just pushing would have to be sufficient.) Creating a shared git config to ignore fetching refspecs following a pattern (like runbench/*) would alleviate any eventual git performance issue here.

The workflow would probably be something like:

Create and push your branches. Don't open PRs for them.
Create another branch off main named compare/<name>, add a bench-compare.txt file containing the list of branch names (or better, commit hashes) you want to compare. Open a PR for this. Two options for what this triggers:
1. This kicks off a workflow which runs the benchmarks for all the listed branches in parallel (in separate actions). It collects the results and renders the comparison dashboard
2. Pushing the other branches already triggered workflows for them (by matching the runbench/* branch naming pattern). This workflow waits until they're done, collects their artifacts, and renders the comparison dashboard.
  
  If it's feasible, I prefer this approach, because it lets you:
  1. quickly add new cases without re-running everything
  2. quickly fix an error in just one case
  3. instantly compare existing cases—or turn two different commits of the same case into different cases
  4. easily run the dashboard rendering script locally (just point it at the GHA URLs for your branches)
Update the branches as necessary. Change the commit hashes in bench-compare.txt and push updates to compare/<name>.
Close the PR without merging once satisfied.

crusaderky · 2022-09-02T11:18:43Z

How would you express the fact that different test code needs to run in different cases? With this approach, it seems like I'd have to add some way into my dask fork to express the two different cardinalities via environment variables.

You'd do like you always do when running tests against different versions of a dependency:

@pytest.mark.skipif(dask.__version__ < (whatever is generated by your commit))

Or another case: what if I wanted to benchmark the above code, also on 3 different cluster sizes?

Cluster size is hardcoded in the tests, and it varies from test to test, so you can't. In a future iteration we can revisit the tests so that they use an environment variable whenever sensible.

Another way of approaching this would be to have the ability to compare across different branches of the coiled-runtime repo itself.

This feels extremely complicated to me. So you'd have a "lead" PR which contains a list of git handles, and get everything from git twice?
It would also mean not having the performance comparison between various coiled runtime versions (not A/B testing), which is a really nice report to have.

fjetter · 2022-09-02T11:30:57Z

How is this supported? Let's say I wanted to test dask/dask#9446 while it's still a prototype PR.

E.g. Use a branch in dask that hard codes a value or uses environment variables / configs to read this out. Solving such a problem specifically for the problem at hand is much simpler than figuring out a generic infrastructure solution that can handle everything

Another way of approaching this would be to have the ability to compare across different branches of the coiled-runtime repo itself.

You'd create a branch for each variation you want to test. In that branch, you can change the environment.yaml however you wish to use different dependencies for that case; package_sync will take care of installing them.

I'm with Guido on this. Let's keep our first iteration simple. Once it is up and running we evaluate what features are still missing.

ian-r-rose mentioned this issue Sep 1, 2022

WIP barchart for comparing different commits rather than a timeseries. #294

Closed

crusaderky changed the title ~~[Draft/WIP] Automation of benchmark comparison~~ Automation of benchmark comparison Sep 1, 2022

crusaderky mentioned this issue Sep 1, 2022

[WIP] Run test suite on arbitrary coiled environments #296

Closed

crusaderky self-assigned this Sep 1, 2022

gjoseph92 mentioned this issue Sep 1, 2022

Consider splitting integration tests and benchmarks into different repo #298

Open

This was referenced Sep 6, 2022

A/B test report #307

Closed

A/B test reports #309

Merged

A/B test jobs #314

Merged

crusaderky closed this as completed in #309 Sep 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automation of benchmark comparison #292

Automation of benchmark comparison #292

fjetter commented Aug 31, 2022 •

edited by gjoseph92

Loading

crusaderky commented Sep 1, 2022 •

edited

Loading

fjetter commented Sep 1, 2022

fjetter commented Sep 1, 2022 •

edited

Loading

fjetter commented Sep 1, 2022

ian-r-rose commented Sep 1, 2022

ian-r-rose commented Sep 1, 2022

ncclementi commented Sep 1, 2022

ian-r-rose commented Sep 1, 2022

crusaderky commented Sep 1, 2022

ian-r-rose commented Sep 1, 2022

gjoseph92 commented Sep 1, 2022 •

edited

Loading

gjoseph92 commented Sep 1, 2022

crusaderky commented Sep 2, 2022 •

edited

Loading

fjetter commented Sep 2, 2022 •

edited

Loading

Automation of benchmark comparison #292

Automation of benchmark comparison #292

Comments

fjetter commented Aug 31, 2022 • edited by gjoseph92 Loading

Requirements

Nice to haves

crusaderky commented Sep 1, 2022 • edited Loading

Sample workflow

AB_parameters/6614_before.yaml

AB_parameters/6614_inf.yaml

AB_parameters/6614_12.yaml

Notes

Requirements

Nice to haves

Known issues and limitations

fjetter commented Sep 1, 2022

fjetter commented Sep 1, 2022 • edited Loading

fjetter commented Sep 1, 2022

ian-r-rose commented Sep 1, 2022

ian-r-rose commented Sep 1, 2022

ncclementi commented Sep 1, 2022

ian-r-rose commented Sep 1, 2022

crusaderky commented Sep 1, 2022

ian-r-rose commented Sep 1, 2022

gjoseph92 commented Sep 1, 2022 • edited Loading

gjoseph92 commented Sep 1, 2022

crusaderky commented Sep 2, 2022 • edited Loading

fjetter commented Sep 2, 2022 • edited Loading

fjetter commented Aug 31, 2022 •

edited by gjoseph92

Loading

crusaderky commented Sep 1, 2022 •

edited

Loading

fjetter commented Sep 1, 2022 •

edited

Loading

gjoseph92 commented Sep 1, 2022 •

edited

Loading

crusaderky commented Sep 2, 2022 •

edited

Loading

fjetter commented Sep 2, 2022 •

edited

Loading