Adding benchmark workflows #923

Merged: 2 commits merged into angularsen:master on Apr 30, 2021

Conversation

@lipchev (Collaborator) commented Apr 30, 2021

  • updated the UnitsNet.Benchmark project (updated NuGet packages & added target frameworks)
  • added BenchmarkCategories to the existing benchmarks
  • added a few scripts for local testing (individual, multi-platform + rplots)
  • added an automatic benchmark workflow that runs on pushes to [master] touching certain folders, targeting ["netcoreapp50", "netcoreapp21", "net472"]
  • added workflows triggered on workflow_dispatch for running a benchmark (with options for comparing against a baseline)

Closes #920

First, a few ground rules:

  1. As of this writing, manually triggered workflows (a.k.a. workflow_dispatch) are only visible from the master branch.
  2. Composite actions have some nasty limitations.
  3. The Continuous Benchmark action needs to run in a context with a commit in order to work.
  4. It publishes commits (with optional comments and alerts) targeting a given folder on a separate branch (typically gh-pages/benchmark/). Here is an example (all but the last benchmarks were run using a very low iteration count, a sort of DryRun, so the wild fluctuations are expected). Here is another (that's consistently slower).
  5. It does not support multi-line charts (out of the box).
  6. I read somewhere that the average performance of the virtual machines varies by about 10-15%.
  7. We can run the benchmarks for any framework from net472 onward (Windows required).
  8. We can easily configure all aspects of a benchmark using command-line arguments (there is also a dotnet benchmark tool, but it seems to do exactly the same thing, although I suspect it will also get some 'compare results' functionality at some point, which would likely remove the need for the next step). See the sketch after this list.
  9. There are a few result-comparing extensions out there, but the one I liked most is the ResultsComparer from dotnet/performance.
  10. Just a few days ago, I knew almost none of that.
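To make rules 3, 4 and 8 concrete, here is roughly what the run-and-publish part of a job looks like. This is a minimal sketch, not the exact YAML from this PR; step names, file paths and exporter flags are illustrative:

```yaml
# Sketch only - not the exact workflow from this PR; paths and flags are illustrative.
- name: Run benchmarks (BenchmarkDotNet is configured via command-line arguments)
  run: >
    dotnet run -c Release --project UnitsNet.Benchmark --
    --filter '*'
    --runtimes ${{ matrix.framework }}
    --exporters json
    --artifacts BenchmarkDotNet.Artifacts

# Continuous Benchmark: needs a commit in its context and pushes the chart data
# to a folder on a separate branch (typically gh-pages/benchmark/).
- name: Publish results to gh-pages
  uses: rhysd/github-action-benchmark@v1
  with:
    tool: 'benchmarkdotnet'
    # the report name follows the benchmark class; the exact file name may differ
    output-file-path: BenchmarkDotNet.Artifacts/results/UnitsNet.Benchmark.UnitsNetBenchmarks-report-full-compressed.json
    gh-pages-branch: gh-pages
    benchmark-data-dir-path: benchmark/${{ matrix.framework }}
    github-token: ${{ secrets.GITHUB_TOKEN }}
    auto-push: true
    comment-on-alert: true
```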

The workflows are well documented; I think you'll easily figure out what's going on. Here are the highlights (with a trigger sketch after the list):

  1. The automatic workflow runs the benchmarks (multiple frameworks), exporting each result to gh-pages/benchmarks (one sub-folder per framework).
  2. It is only supposed to run on the master branch & only on certain occasions (see here).
  3. For each folder (framework) we have the chart data, data.js & index.html (results are appended here), and the results folder containing the latest benchmark results as exported by BenchmarkDotNet (these are overwritten each time the script executes).
  4. We can then reference / compare results across branches & repositories (a fork doesn't need to have gh-pages in order to compare against upstream/gh-pages/benchmark/xxx/).
  5. This is done using the second action, which can be used to generate a report (the typical results folder) that can then be compared against a baseline (that step only runs if the baseline parameter points to a valid "*-report-full.json").
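The trigger behind highlights 1 and 2 is roughly this (a sketch; the real branch and path filters live in the workflow file):

```yaml
# Sketch only - the actual branch/path filters are defined in the workflow file.
on:
  push:
    branches: [ master ]
    paths:
      - 'UnitsNet/**'    # illustrative path filter; only certain folders trigger a run
```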

PS Some of the links point to lipchev/UnitsNet-Benchmarks, which is the repository I used for testing. Unfortunately, something got messed up during the rebase, so I made the PR from a new branch (of my main fork); as expected, this did not trigger a benchmark run, since I pushed to a non-master branch. As to why the manual workflow isn't showing up in the Actions menu, see ground rule 1.

PS2 You should still have access to the test repository (lipchev/UnitsNet-Benchmarks) if you want to test out the workflows. Feel free to commit to it if you like; I plan to delete it after this is merged.

lipchev added 2 commits on April 30, 2021 06:46 (the commit messages mirror the bullet list in the PR description above).
@angularsen (Owner)

What the hell did you do in these commits? 😆
[image]

@angularsen (Owner)

I mean, wow, great job and great outline! I love it ❤️
I say, pull the trigger!

Letting you merge it, in case there is anything else to do before it merges.

The only thing I'm curious about is how we best use this tool to monitor regressions over time.
If I understand correctly, this only runs on master, so we can't easily use it when reviewing PRs. Is there any way to alert us, to give a heads-up if something regresses significantly, say 50% slower on some tests?

Also, are there any limits on the free plan for actions we need to know about?

@lipchev merged commit 0f39bec into angularsen:master on Apr 30, 2021
@lipchev (Collaborator, Author) commented Apr 30, 2021

There, it's on: the action is running.

@lipchev (Collaborator, Author) commented Apr 30, 2021

Here are the free plan limits:

Job execution time - Each job in a workflow can run for up to 6 hours of execution time. If a job reaches this limit, the job is terminated and fails to complete.

Workflow run time - Each workflow run is limited to 72 hours. If a workflow run reaches this limit, the workflow run is cancelled.

I'm currently running the matrix in parallel, so there's no way we could reach 6 hours. I'm not yet sure whether running the benchmarks in parallel is any worse (I assume each job gets a random machine assignment).

Also, I've been conservative: only 3 frameworks, but we could just as well test only one (or maybe 5 or 6).
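For context, the parallelism comes from the job matrix: each framework becomes its own job on its own VM. A minimal sketch, assuming a layout like this:

```yaml
# Sketch only - each matrix entry runs as a separate job on its own VM.
jobs:
  benchmark:
    runs-on: windows-latest    # net472 requires Windows
    strategy:
      fail-fast: false         # illustrative: let the other frameworks finish if one job fails
      matrix:
        framework: [ netcoreapp50, netcoreapp21, net472 ]
```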

@lipchev (Collaborator, Author) commented Apr 30, 2021

As for testing the PRs: we wouldn't want to run the benchmarks for something like adding a new quantity. For the performance-affecting PRs, the person proposing the modification (like you and me, or anyone else, even without privileges) can run the 'run-benchmarks' workflow on their branch (targeting one framework at a time*) using the baseline that's available on your gh-pages branch (which they would not need to pull or anything).
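As an aside, here is a minimal sketch of how a manually triggered workflow can expose those options; the input names are illustrative, not necessarily the ones used in this PR:

```yaml
# Sketch only - input names are illustrative.
on:
  workflow_dispatch:
    inputs:
      framework:
        description: 'Target framework (netcoreapp50, netcoreapp21 or net472)'
        required: true
        default: 'netcoreapp50'
      baseline:
        description: 'Path/URL of a baseline *-report-full.json to compare against (optional)'
        required: false
```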

I'll try to give an example in a few minutes.

@angularsen (Owner)

Sounds perfect

@lipchev deleted the benchmark-workflows branch on April 30, 2021 22:03
@lipchev (Collaborator, Author) commented May 1, 2021

Ok, here is an example run from my newly created fork that returns a new list for each unknown abbreviation:
[image]

Here is the action log (net50)

And here are the results:

Runtime=.NET Core 5.0 Toolchain=netcoreapp50

|          Method |     Mean |    Error |   StdDev |  Gen 0 |  Gen 1 | Gen 2 | Allocated |
|---------------- |---------:|---------:|---------:|-------:|-------:|------:|----------:|
|           Parse | 49.42 μs | 0.980 μs | 1.048 μs | 1.7700 | 0.0610 |     - |  32.56 KB |
|   TryParseValid | 49.98 μs | 0.736 μs | 0.653 μs | 1.7700 | 0.0610 |     - |  32.54 KB |
| TryParseInvalid | 54.05 μs | 1.061 μs | 1.090 μs | 1.7090 | 0.0610 |     - |  32.16 KB |

...and the results of the comparison to the baseline* (an illustrative invocation of the comparison follows the output):
* (the comparison doesn't cover allocations, but if you open the baseline from the link above you should be able to spot the difference)

summary:
worse: 3, geomean: 1.072
total diff: 3

| Slower                                                | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| ----------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| UnitsNet.Benchmark.UnitsNetBenchmarks.TryParseInvalid |      1.10 |         48927.15 |         53800.22 |         |
| UnitsNet.Benchmark.UnitsNetBenchmarks.TryParseValid   |      1.08 |         46153.61 |         49879.57 |         |
| UnitsNet.Benchmark.UnitsNetBenchmarks.Parse           |      1.04 |         47511.59 |         49251.15 |         |

No Faster results for the provided threshold = 0.001% and noise filter = 0.3ns.
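(The output above is the format produced by the ResultsComparer tool from dotnet/performance; a comparison step along these lines would produce it. Paths are illustrative, and the threshold matches the one reported above:)

```yaml
# Sketch only - paths are illustrative; ResultsComparer is the tool from dotnet/performance.
- name: Compare against the baseline
  run: >
    dotnet run -c Release --project performance/src/tools/ResultsComparer --
    --base baseline/results
    --diff BenchmarkDotNet.Artifacts/results
    --threshold 0.001%
```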

@lipchev (Collaborator, Author) commented May 1, 2021

Ok I only needed to fix one little thing. This in turn triggered a re-run, so now we should get two points on the chart.

Overall the benchmarks appear to be quite noisy; some of the operations seem to be too short. For those cases Microsoft runs a bunch of operations in a loop. I'm thinking of stealing that test and putting it side by side with ours (sort of like a baseline for the IComparable interface, for which there is currently no benchmark).

@lipchev (Collaborator, Author) commented May 1, 2021

It's worse than I expected. I re-ran the workflow to see what a third point (of the same code base) would look like, and it looks like "the machine" was (almost) consistently slower or faster across each run (except for ToUnit on net472, where we seem to have an inflection, but that's probably due to the benchmark itself).

It would probably smooth out if I ran it a couple more times (at different times), but I'm not sure how useful that would be for comparing results between branches. Maybe if I ran the tests over a longer period, we would get a multi-modal distribution that, averaged out, would give us a sample of the performance of a typical VM.

Anyway, the workflows seem to be doing what they're supposed to. I'm sure an automatic comparison workflow could be devised that executes [on: pull_request], but I don't think it's worth it for now (the manual way doesn't take that many steps anyway).

This would be pretty cool to see run inside a stable self-hosted runner.

@angularsen (Owner)

I guess it might be tricky to get stable numbers from a shared pool of worker VMs.

This one observed 10-20% fluctuation.
https://github.com/rhysd/github-action-benchmark#stability-of-virtual-environment

@lipchev (Collaborator, Author) commented May 7, 2021

Yeah, pretty useless for any between-commit comparisons (it might look OK in the long run, though I think I'm going to disable the auto-comment feature).

Also, I hadn't foreseen it triggering on release, but yeah, I guess the project file is inside the UnitsNet folder :)

Anyway, we got lucky this time: the Intel Xeon CPU E5-2673 v3 2.40GHz appears to have almost the same performance as the Intel Xeon Platinum 8171M CPU 2.60GHz.

PS I'm playing with a little hack for re-scaling the results (only for the charts), which would probably smooth out the lines a little.

PS1 There is a bug in the QuantityFrom benchmark (I've fixed it locally).

PS2 There are at least two performance improvements I'm going to open PRs for soon (so this won't have been a complete waste of time...).

@angularsen (Owner)

You are on 🔥 🤩
