Redesign how benchmarks work #982
Comments
This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 30 days.
Not stale
I'm not sure how to solve this.
@zhangyiatmicrosoft do you want to look into this? In particular, we should get to a state where the benchmarks run automatically on all pull requests and report the performance impact of changes.
So, if I understand it correctly, the idea is to have the benchmarks run directly off of a production build (maplibregl.min.js) instead of the benchmarks' custom rollup config. I definitely agree that it would be a lot better. It would work for older versions too, because we have a production build on cdnjs etc. for each version, so that is cool.
@birkskyum see maplibre-gl-js/test/integration/query/query.test.ts, lines 169 to 185 at commit 921ee79.
We can inject just the maplibre module instead of building a benchmark bundle. I'm not sure how the current benchmarks deal with web workers though. It seems problematic that workers from different builds can compete with each other for CPU and may or may not spin down when the benchmark is done.
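To make the injection idea a bit more concrete, here is a rough sketch (not the current benchmark code) of loading a prebuilt bundle into a headless page with Puppeteer; the file paths and the `runBenchmarks` global are placeholders for whatever the harness would actually expose:

```ts
// Hypothetical sketch: inject a prebuilt maplibre-gl bundle into a headless
// page instead of rolling up a dedicated benchmark build.
import puppeteer from 'puppeteer';

async function benchmarkBundle(bundlePath: string): Promise<unknown> {
    const browser = await puppeteer.launch({headless: true});
    const page = await browser.newPage();
    // Inject the production build (a local file or a download from cdnjs).
    await page.addScriptTag({path: bundlePath});
    // Inject the benchmark harness separately so it stays build-agnostic.
    await page.addScriptTag({path: 'test/bench/harness.js'}); // placeholder path
    // Assume the harness exposes a global entry point that returns results.
    const results = await page.evaluate(() => (window as any).runBenchmarks());
    await browser.close();
    return results;
}

benchmarkBundle('dist/maplibre-gl.js')
    .then((results) => console.log(JSON.stringify(results, null, 2)));
```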
Link to bounty: |
What do you think about having CI run benchmarks against the commit the PR was based off of? I have something similar set up for planetiler: it builds the PR, builds the base that the PR was based off of, runs a performance test for each (one after the other), and posts the results in a comment on the PR, like onthegomap/planetiler#607 (comment). The one problem with that setup is that the performance of GitHub Actions runner instances varies wildly between runs, and even within the same run - they must get CPU throttled or something. But if the shorter benchmarks on the main and PR builds run sequentially, they should be close enough to compare to each other.

Although the images are nice, they aren't super practical for posting to a GitHub PR, since you'd need to post the image to some image hosting service or pollute the repo with branches for the images (see comment-webpage-screenshot). What do you think about compact text output like benchmark.js gives you (average time +/- variance)?
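For illustration, a minimal Node script along those lines could check out and benchmark the two refs sequentially on the same runner; the npm script names, environment variables, and JSON output shape below are assumptions, not existing tooling:

```ts
// Sketch: benchmark the PR head and its base sequentially on the same runner,
// so both runs see the same (possibly throttled) hardware.
import {execSync} from 'node:child_process';
import {readFileSync} from 'node:fs';

function benchRef(ref: string): Record<string, number> {
    execSync(`git checkout ${ref}`, {stdio: 'inherit'});
    // Assumed script names; replace with whatever the repo actually provides.
    execSync('npm ci && npm run build-prod', {stdio: 'inherit'});
    execSync('npm run benchmark -- --output results.json', {stdio: 'inherit'});
    return JSON.parse(readFileSync('results.json', 'utf8'));
}

const base = benchRef(process.env.BASE_SHA!); // commit the PR was based on
const head = benchRef(process.env.HEAD_SHA!); // the PR itself

// Emit a compact markdown table that CI could post as a PR comment.
console.log('| benchmark | base (ms) | head (ms) | change |');
console.log('|---|---|---|---|');
for (const name of Object.keys(base)) {
    const delta = ((head[name] - base[name]) / base[name]) * 100;
    console.log(`| ${name} | ${base[name].toFixed(1)} | ${head[name].toFixed(1)} | ${delta.toFixed(1)}% |`);
}
```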
I spoke offline with @HarelM about this; it sounds like something that runs as a unit test would be better, just so CI verifies it's not broken, but then you would need to run it manually to compare performance between two SHAs.
If the GitHub runners are that unreliable, the local results are probably the only meaningful ones anyway. One thing I'd love to have as part of a refactor is that the benchmarks take in a built maplibre-gl.js and benchmark it, instead of having their own rollup.config.json etc. to make a new build that then gets tested. Usually benchmarks don't make their own custom builds.
Yeah, I agree, local testing probably makes more sense. What if we had two sets of benchmarks: “integration benchmarks” testing end-to-end performance that you can point at a minified build (which could come from a local build or cdnjs), and micro benchmarks that are more like unit tests and run from source?
That sounds great to me. @HarelM, what do you think about it?
The main focus from my point of view is to be able to run things as part of the CI, hopefully against the CDN minified code. Things that are designed to be run locally are nice for debugging etc.
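As a sketch of what the source-level micro benchmarks mentioned above could look like, here is a small example using benchmark.js, which already prints the compact mean ± margin-of-error output discussed earlier; the benchmarked function is a stand-in rather than an actual maplibre internal:

```ts
// Minimal micro-benchmark sketch using benchmark.js.
import Benchmark from 'benchmark';

// Stand-in for an internal helper that a real micro benchmark would import from src/.
function project(lng: number, lat: number): [number, number] {
    const x = (180 + lng) / 360;
    const y = (180 - (180 / Math.PI) * Math.log(Math.tan(Math.PI / 4 + (lat * Math.PI) / 360))) / 360;
    return [x, y];
}

new Benchmark.Suite()
    .add('project', () => {
        project(11.39, 47.27);
    })
    .on('cycle', (event: any) => {
        // Prints e.g. "project x 12,345,678 ops/sec ±0.85% (92 runs sampled)"
        console.log(String(event.target));
    })
    .run();
```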
Stepping back a bit here, I think the pieces we want in place for a comprehensive performance strategy are: profiling tools to find hot spots, micro benchmarks for internal functions, end-to-end benchmarks for realistic scenarios, and performance metrics reported at runtime for end-user monitoring.
Assuming people are on board with this high-level approach, to fill the gaps I think we'd need to decide on a standard set of metrics to measure improvements by (time-to-first-tile, subsequent tile load times, and maybe something around UI-thread jank?) and a representative set of use cases (styles, tile schemas, devices) for the end-to-end benchmarks. Then we'd need to set up the e2e and micro benchmarking framework (this issue) and the profiling tools, and expose more detailed performance measurements to end-user applications that they can use for end-user performance measurement. I realize this is a bit broader than this issue, so let me know if there's a better place to discuss!
When you say "run on CI", do you mean verify they work on CI, or get the actual measurements? Based on how inconsistent the runner performance is, it might be hard to use actual measurements without a dedicated remote machine to run on (@wipfli has looked into something similar with his Hetzner box).
I would hope that even if the results between two runs on different machines are not consistent, running some tests on the same machine in the same run will provide meaningful results.
Got it. I was thinking that when you're making a performance-focused change, you can use the profiling tools to identify hot spots and the micro benchmarks to help improve them, then use the end-to-end benchmarks locally to show the impact of the change and help justify any added code/complexity if necessary. I have a few improvements like this I want to make now, and I'm trying to decide whether I should help set up some common tooling first or just do one-off work for those changes. When you're making a non-performance-related change, at the very least CI runs the benchmarks to make sure you don't break them. Assuming we get consistent enough performance on GitHub Actions runners, would you rather:
Before a release, someone can run the end-to-end benchmarks on the previous release and the pre-release to ensure there are no regressions, and ideally be able to include a performance-improvement headliner in the release notes. And post-release, clients can use the end-user performance metrics to get visibility into how things are performing in the wild, possibly identifying problems that can be profiled and turned into a more targeted issue.
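As a hedged illustration of the end-user monitoring idea, something like the following could record time-to-first-tile and map-load timings in an application using only the public API; the exact event payloads (e.g. the `tile` field on 'data' events) are assumptions that may differ between versions:

```ts
// Sketch of application-side performance monitoring via public map events.
import maplibregl from 'maplibre-gl';

const start = performance.now();
const map = new maplibregl.Map({
    container: 'map',
    style: 'https://demotiles.maplibre.org/style.json',
});

let firstTileReported = false;
map.on('data', (e: any) => {
    // Assumption: tile loads surface as 'data' events carrying a `tile` field.
    if (!firstTileReported && e.tile) {
        firstTileReported = true;
        console.log('time to first tile (ms):', performance.now() - start);
    }
});

map.on('load', () => {
    console.log('map load (ms):', performance.now() - start);
});

map.once('idle', () => {
    // 'idle' fires once the map has finished rendering the requested tiles.
    console.log('fully rendered (ms):', performance.now() - start);
});
```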
The current problem, the way I see it, is that it's not clear, to me at least, what "good" results vs. "bad" results are.
Yeah, I get that the current tests run each variation a bunch of times and try to show the difference in distribution, but a lot of the current output distracts from that. What if they had some kind of output like js-framework-bench uses? The colors, compact time +/- variance output, and descriptions of what the benchmarks do really help put what you're looking at in context (as long as you're not color blind). When run from the command line, maybe we could have the process exit with a non-zero code if there's a big enough regression, so it would show up as a failed check on the PR?

The overall approach that the benchmarks use doesn't seem bad - it's nice that the CLI version is already set up to run what you see in the browser in headless Chrome. It sounds like we just want to change how they display output, not require a separate build process/artifacts, and run them from CI?

Regarding the build process: I think end-to-end benchmarks can be run against a standard maplibre-gl distribution since they only use the external API. But for micro benchmarks that test an internal function, the options I can see are:
Any thoughts on those options?
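To illustrate the non-zero-exit idea from the comment above, a small comparison script could gate CI on a regression threshold; the results-file format (mean and standard deviation per benchmark, in milliseconds) is assumed rather than what the current runner emits:

```ts
// Sketch of a CI gate that exits non-zero when a benchmark regresses.
import {readFileSync} from 'node:fs';

type Stats = {mean: number; stdDev: number};

const base: Record<string, Stats> = JSON.parse(readFileSync('base-results.json', 'utf8'));
const head: Record<string, Stats> = JSON.parse(readFileSync('head-results.json', 'utf8'));

const THRESHOLD = 0.10; // flag regressions larger than 10%
let failed = false;

for (const [name, baseStats] of Object.entries(base)) {
    const headStats = head[name];
    if (!headStats) continue;
    const change = (headStats.mean - baseStats.mean) / baseStats.mean;
    // Ignore changes that fall within the combined noise of the two runs.
    const noise = (baseStats.stdDev + headStats.stdDev) / baseStats.mean;
    if (change > THRESHOLD && change > noise) {
        console.error(`${name}: ${(change * 100).toFixed(1)}% slower`);
        failed = true;
    }
}

process.exit(failed ? 1 : 0);
```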
Also, for the two other pieces that I mentioned (reporting more performance metrics at runtime, and profiling tools), can I just open up PRs to add those? Or do I need to submit a design/get more buy-in first?
If the implementation is small and doesn't change things much, you can simply open a PR. If the changes are drastic, then a design discussion is in order.
All these options are not great, but I don't have a better one.
For the end-to-end tests (time to first tile, map load time, tile load time with various data and styles), it seems like the things we measure should be available externally to anyone using the library, to enable applications to do end-user performance monitoring. For the micro benchmarks that test an internal function, I could go either way on exposing them through the public API, maybe in more of a hidden way (like prefixed with

Who is accepting the design proposal? A MapLibre steering meeting? Would it be sufficient for a design proposal to present the two options and get feedback on which people prefer?
I usually approve design proposals when it comes to the technical things; if you need more buy-in or would like to involve more people, you can use Slack or the TSC meetings. Whatever works for you :-)
Motivation
Benchmarks are currently designed in such a way that, for every change, the developer has to think about whether it introduces breaking changes that require code changes in the benchmark code.
In #447 it can be seen that there's a need to upload some build artifacts that will later be used to execute tests and compare timings.
The problem can also be seen in #979.
Design Alternatives
Make sure we don't break anything every time?
Not a very good option.
Design
In general, the way I see it, the maplibre-gl file of each version is used to run the tests, without bundling it with all kinds of other files.
Assuming the public API doesn't change much, this will allow testing different versions using the same styles/sources etc.
If the public API changes, a small wrapper can probably be used in the latest code to mitigate the breaking changes. Breaking changes will also require a major version change, so it would make sense to think about this at that point too.
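As a hypothetical sketch of such a wrapper, the benchmark harness could construct the map through a small adapter, so that a breaking rename in a future major version only requires updating one place; the renamed option below is invented purely for illustration:

```ts
// Hypothetical adapter between the benchmark harness and the library.
import maplibregl from 'maplibre-gl';

export interface BenchMapOptions {
    container: HTMLElement | string;
    style: string;
    interactive?: boolean;
}

export function createBenchMap(options: BenchMapOptions) {
    // Assumption: the library exposes a semver version string.
    const major = Number(((maplibregl as any).version ?? '0').split('.')[0]);
    if (major >= 5) {
        // Invented example: pretend a future major version renamed `interactive`.
        const {interactive, ...rest} = options;
        return new maplibregl.Map({...rest, userInteraction: interactive} as any);
    }
    return new maplibregl.Map(options as any);
}
```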
Any ideas are welcome :-)
Acceptance criteria