Redesign how benchmarks work #982

Open
HarelM opened this issue Feb 10, 2022 · 24 comments
Labels
benchmark (Related to testing performance), 💰 bounty L (Large Bounty, USD 500), enhancement (New feature or request), PR is more than welcomed (Extra attention is needed)

Comments

@HarelM (Collaborator) commented Feb 10, 2022

Motivation

Benchmarks are currently designed in such a way that, for every change, the developer has to consider whether it introduces breaking changes that require corresponding changes in the benchmark code.
In #447 it can be seen that there's a need to upload build artifacts that will later be used to execute tests and compare timings.

The same problem can be seen in #979.

Design Alternatives

Make sure we don't break anything every time?
Not a very good option.

Design

In general, the way I see it, the maplibre-gl file of each version is used to run the tests, without bundling it together with all kinds of other files.
Assuming the public API doesn't change much, this allows testing different versions using the same styles/sources etc.
If the public API does change, a small wrapper can probably be added to the latest code to mitigate the breaking changes. Such breaking changes also require a major version bump, so that would be a natural point to think about it as well.
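
To make this concrete, here is a minimal sketch (not existing benchmark code) of pointing a single, self-contained benchmark at any maplibre-gl build by injecting that build into a blank page with puppeteer. The demo style URL, the dist path and the unpkg URL are assumptions used for illustration:

import puppeteer from 'puppeteer';

// Measure the time from Map construction to the 'load' event for a given maplibre-gl build.
// The build can come from a local dist file or from a CDN for any published version.
async function timeMapLoad(script: {path?: string; url?: string}): Promise<number> {
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.setContent(
            '<!DOCTYPE html><html><body><div id="map" style="width:512px;height:512px"></div></body></html>');
        await page.addScriptTag(script);
        return await page.evaluate(() => new Promise<number>((resolve) => {
            const start = performance.now();
            // maplibregl is the global exposed by the injected UMD build.
            const map = new (window as any).maplibregl.Map({
                container: 'map',
                style: 'https://demotiles.maplibre.org/style.json', // assumed public demo style
                center: [0, 0],
                zoom: 2,
            });
            map.once('load', () => {
                resolve(performance.now() - start);
                map.remove(); // release the map and its resources between runs
            });
        }));
    } finally {
        await browser.close();
    }
}

// Compare the local build against a published release using the same benchmark code:
// const current = await timeMapLoad({path: 'dist/maplibre-gl.js'});
// const previous = await timeMapLoad({url: 'https://unpkg.com/maplibre-gl@2.4.0/dist/maplibre-gl.js'});

Because each build runs in its own browser instance, different builds never share workers or CPU within a measurement, and the benchmark code itself only touches the public API.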

Any ideas are welcome :-)

Acceptance criteria

  1. Benchmarks run as part of the CI.
  2. It's easy to test previous versions without the need to generate code per version and store it somewhere.
  3. It doesn't have to be as elaborate as the manual benchmark; it can be simpler and can co-exist with it.
  4. Before implementation starts, a design is presented and accepted.
@github-actions (bot):

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions bot added the stale label Aug 10, 2022
@birkskyum (Member):
Not stale

@HarelM (Collaborator, Author) commented Aug 21, 2022

I'm not sure how to solve this.
Seems like we need a proper design for this...

HarelM added the enhancement, good first issue and PR is more than welcomed labels Aug 21, 2022
birkskyum added the benchmark label Aug 30, 2022
@wipfli (Contributor) commented Feb 1, 2023

@zhangyiatmicrosoft do you want to look into this? In particular, we should get to a state where the benchmarks run automatically on all pull requests and report the performance impact of changes.

@birkskyum (Member) commented Feb 1, 2023

So, if I understand it correctly, the idea is to have the benchmarks run directly off a production build (maplibregl.min.js) instead of the benchmarks' custom rollup config. I definitely agree that would be a lot better.

It would also work backwards, because we have a production build on cdnjs etc. for each version, so that is cool.

@rotu (Contributor) commented Feb 2, 2023

@birkskyum see

await page.setContent(`<!DOCTYPE html>
<html lang="en">
<head>
<title>Query Test Page</title>
<meta charset='utf-8'>
<link rel="icon" href="about:blank">
<style>#map {
box-sizing:content-box;
width:${options.width}px;
height:${options.height}px;
}</style>
</head>
<body id='map'></body>
</html>`);
await page.addScriptTag({path: 'dist/maplibre-gl.js'});
await page.addStyleTag({path: 'dist/maplibre-gl.css'});
const actual = await page.evaluate(performQueryOnFixture, fixture);

We can inject just the maplibre module instead of building a benchmark bundle.

I'm not sure how the current benchmarks deal with web workers though. It seems problematic that workers from different builds can compete with each other for CPU and may or may not spin down when the benchmark is done.

@HarelM (Collaborator, Author) commented Feb 6, 2023

Link to bounty:
maplibre/maplibre#192

HarelM added the 💰 bounty M (Medium Bounty, USD 250) label and removed the good first issue label Feb 7, 2023
HarelM added the 💰 bounty L (Large Bounty, USD 500) label and removed the 💰 bounty M label Apr 19, 2023
@msbarry (Contributor) commented Jun 20, 2023

What do you think about having CI run benchmarks against the commit the PR was based off of?

I have something similar set up for planetiler: it builds the PR, builds the base commit the PR was based off of, runs a performance test for each (one after the other) and posts the results in a comment on the PR, like onthegomap/planetiler#607 (comment). The one problem with that setup is that the performance of GitHub Actions runner instances varies wildly between runs, and even within the same run - they must get CPU throttled or something. But if shorter benchmarks on the main and PR builds run sequentially, they should be close enough to compare to each other.

Although the images are nice, they aren't super practical for posting to a GitHub PR since you'd need to post the image to some image hosting service or pollute the repo with branches for the images (see comment-webpage-screenshot). What do you think about compact text output like benchmark.js gives you (average time +/- variance)?
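
For illustration, here is a rough sketch of the kind of compact text summary I mean, written as a standalone TypeScript helper (the result shape and the example benchmark name are assumptions, not the current benchmark output format):

interface BenchResult {
    name: string;
    samples: number[]; // raw timings in milliseconds
}

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

const stddev = (xs: number[]): number => {
    const m = mean(xs);
    return Math.sqrt(xs.reduce((a, b) => a + (b - m) ** 2, 0) / xs.length);
};

// One line per benchmark, e.g. "paint_circle   123.4ms ± 5.6ms  (+3.2% vs base)"
function summarize(pr: BenchResult[], base: BenchResult[]): string {
    return pr.map((r) => {
        const baseline = base.find((b) => b.name === r.name);
        const m = mean(r.samples);
        const delta = baseline ? ((m - mean(baseline.samples)) / mean(baseline.samples)) * 100 : null;
        const deltaText = delta === null ? 'new' : `${delta >= 0 ? '+' : ''}${delta.toFixed(1)}% vs base`;
        return `${r.name.padEnd(30)} ${m.toFixed(1)}ms ± ${stddev(r.samples).toFixed(1)}ms  (${deltaText})`;
    }).join('\n');
}

// console.log(summarize(prResults, baseResults));

Output like that is small enough to paste into a PR comment or read straight from the CI log, with no image hosting involved.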

@msbarry (Contributor) commented Jun 20, 2023

Spoke offline with @HarelM about this; it sounds like something that runs as a unit test would be better, just so CI verifies it's not broken, but then you need to run it manually to compare performance between two SHAs.

@birkskyum (Member):
If the GitHub runners are that unreliable, the local results are probably the only meaningful ones anyway.

One thing I'd love to have as part of a refactor is that the benchmark takes in a built maplibre-gl.js and benchmarks it, instead of having its own rollup.config.json etc. to make a new build that will then be tested. Usually benchmarks don't make their own custom builds.

@msbarry (Contributor) commented Jun 20, 2023

Yeah, I agree, local testing probably makes more sense. What if we had two sets of benchmarks: “integration benchmarks” testing the end-to-end performance, which you can point at a minified build (from a local build or cdnjs), and “microbenchmarks” that are more like unit tests and run from source?
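
To sketch what a microbenchmark run from source could look like (the imported function and its path are hypothetical placeholders; any internal function would do), something along these lines would be enough:

// Hypothetical: import the internal function straight from source, so no bundle is needed.
// import {classifyRings} from '../src/util/classify_rings';

function microbench(name: string, fn: () => void, iterations = 1000): void {
    // Warm up so JIT compilation doesn't dominate the measurement.
    for (let i = 0; i < 50; i++) fn();
    const samples: number[] = [];
    for (let i = 0; i < iterations; i++) {
        const start = performance.now();
        fn();
        samples.push(performance.now() - start);
    }
    samples.sort((a, b) => a - b);
    const median = samples[Math.floor(samples.length / 2)];
    console.log(`${name}: median ${median.toFixed(3)}ms over ${iterations} iterations`);
}

// microbench('classifyRings', () => classifyRings(testGeometry, 5));

Since this imports from source, it only ever measures the current checkout; comparing against another SHA means checking that SHA out and running the same microbenchmark again.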

@birkskyum (Member):
That sounds great to me. @HarelM, what do you think about it?

@HarelM (Collaborator, Author) commented Jun 20, 2023

The main focus from my point of view is to be able to run things as part of the CI, and hopefully against the CDN minified code. Things that are designed to be run locally are nice for debugging etc.

@msbarry (Contributor) commented Jun 21, 2023

Stepping back a bit here, I think the pieces we want in place for a comprehensive performance strategy are:

  1. End-user performance metrics: MapLibre should make it easy for applications to get high-level performance metrics like time-to-first-tile/speed index, and tile load/fetch/process times, that they can use (or not use) to see the impact of MapLibre performance improvements. This is driven by some things the application owner controls (CDN, tile server) but also by things MapLibre controls (bundle size, JS performance).
  2. End-to-end benchmarks to measure those high-level metrics on a standard setup with representative map data and styles, so we can be reasonably confident end-users will see improvements with a new release.
  3. Profiling tools to make it easier to debug performance issues and identify hot spots.
  4. Microbenchmarks to isolate hot spots so you can quickly test variations to improve them, which should generate gains visible in the profiling tools (3), which should improve the end-to-end benchmarks (2) and generate noticeable improvements for end-users in the wild (1).

Assuming people are on board with this high-level approach, to fill the gaps I think we'd need to decide on a standard set of metrics to measure improvements by (time-to-first-tile, subsequent tile load times, and maybe something around UI-thread jank?) and a representative set of use-cases (styles, tile schemas, devices) for end-to-end benchmarks. Then we'd need to set up the e2e and micro benchmarking framework (this issue), profiling tools, and expose more detailed performance measurements to end-user applications that they can use for end-user performance measurement.
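
As a small example of what piece (1) could look like from an application's point of view today, using only the existing public 'load' and 'idle' events (treating the first 'idle' as a proxy for "fully rendered" is an assumption on my part, and the style URL is just the demo style):

import maplibregl from 'maplibre-gl';

const start = performance.now();
const map = new maplibregl.Map({
    container: 'map',
    style: 'https://demotiles.maplibre.org/style.json', // assumed demo style
    center: [0, 0],
    zoom: 2,
});

// 'load' fires once the style and the initially visible tiles have been loaded and rendered.
map.once('load', () => {
    console.log(`map load: ${(performance.now() - start).toFixed(0)}ms`);
});

// The first 'idle' fires when no renders or tile loads are pending; used here
// as a rough stand-in for time-to-fully-rendered.
map.once('idle', () => {
    console.log(`first idle: ${(performance.now() - start).toFixed(0)}ms`);
});

Anything more fine-grained (per-tile fetch/parse times, time-to-first-tile proper) would need the library to expose additional measurements, which is part of what this issue is about.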

I realize this is a bit broader than this issue, let me know if there's a better place to discuss!

@msbarry (Contributor) commented Jun 21, 2023

> The main focus from my point of view is to be able to run things as part of the CI, and hopefully against the CDN minified code. Things that are designed to be run locally are nice for debugging etc.

When you say "run on CI" do you mean verify they work on CI? Or get the actual measurements? Based on how inconsistent the runner performance is, it might be hard to use actual measurements without a dedicated remote machine to run on (@wipfli has looked into something similar with his hetzner box)

@HarelM (Collaborator, Author) commented Jun 21, 2023

I would hope that even if the results of two runs on different machines are not consistent, running the tests on the same machine in the same run will provide meaningful results.
If that is not the case, I don't see how we can track progress and improvements in an automated way, which leaves the entire problem the way it is now: people sometimes run the benchmarks, it's hard to understand what good vs. bad results are, etc.
My main concern is the process itself, and how to make it easy to understand the results.
The distinctions above are great and I don't have anything against them, but I have a hard time understanding how they improve our current situation...

@msbarry (Contributor) commented Jun 21, 2023

Got it. I was thinking that when you're making a performance-focused change, you can use the profiling tools to identify hot spots and the microbenchmarks to help improve them, then use the end-to-end performance benchmarks locally to show the impact of the change and help justify any added code/complexity if necessary. I have a few improvements like this I want to make now, and I'm trying to decide if I should help set up some common tooling first or just do one-off stuff for those changes.

When you're making a non-performance-related change, at the very least CI runs the benchmarks to make sure you don't break them. Assuming we get consistent enough performance on GitHub Actions runners, would you rather:

  1. output the performance results, but you need to dig into the CI log output to see them?
  2. post a comment to the PR with performance test results like bundle size checker used to do?
  3. fail a CI job if performance degrades by more than x% like bundle size check does now?
  4. or report them in some other way?

Before a release someone can run the end-to-end benchmarks on the previous release and this pre-release to ensure there are no regressions, and ideally be able to include a performance improvement headliner in the release notes.

And post-release clients can use the end-user-performance metrics to get visibility into how things are performing in the wild, possibly identifying issues you can profile to submit a more targeted issue when there's a problem.

@HarelM (Collaborator, Author) commented Jun 22, 2023

The current problem, the way I see it, is that it's not clear, for me at least, what "good" results vs. "bad" results are.
I can look at the currently provided graphs/tables all day long, but without a clear definition of pass/fail, or improved by x%/degraded by x%, it's hard to say something meaningful.
And if we can summarize it in a clear and precise way, why not simply define it in the CI? :-)

@msbarry (Contributor) commented Jun 22, 2023

Yeah I get that the current tests are running each variation a bunch of times and trying to show the difference in distribution, but a lot of the current output distracts from that. What if they had some kind of output like js-framework-bench uses:

[screenshot of js-framework-bench results table]

The colors, compact time +/- variance output, and descriptions of what the benchmarks do really help put what you're looking at in context (as long as you're not color blind). When run from the command line, maybe we could have the process exit with a non-zero code if there's a big enough regression, so it would show up as a failed check on the PR?
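
A rough sketch of that, assuming we already have mean timings per benchmark for the PR build and the base build (the threshold value and the data shape are placeholders):

// Fail the CI job if any benchmark regressed by more than the allowed percentage.
const REGRESSION_THRESHOLD_PCT = 10; // placeholder value; would need tuning against runner noise

function checkForRegressions(prMeans: Record<string, number>, baseMeans: Record<string, number>): void {
    const regressions: string[] = [];
    for (const [name, prMs] of Object.entries(prMeans)) {
        const baseMs = baseMeans[name];
        if (baseMs === undefined) continue; // new benchmark, nothing to compare against
        const deltaPct = ((prMs - baseMs) / baseMs) * 100;
        if (deltaPct > REGRESSION_THRESHOLD_PCT) {
            regressions.push(`${name}: ${baseMs.toFixed(1)}ms -> ${prMs.toFixed(1)}ms (+${deltaPct.toFixed(1)}%)`);
        }
    }
    if (regressions.length > 0) {
        console.error('Performance regressions detected:\n' + regressions.join('\n'));
        process.exit(1); // non-zero exit shows up as a failed check on the PR
    }
}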

The overall approach that the benchmarks use doesn't seem bad - it's nice that the CLI version is already set up to run what you see in the browser in headless Chrome. It sounds like we just want to change how they display output, not require a separate build process/artifacts, and run them from CI?

Regarding the build process: I think the end-to-end benchmarks can be run against a standard maplibre-gl distribution since they only use the external API. But for microbenchmarks that test an internal function, the options I can see are:

  1. only run microbenchmarks against the current version of the code, but then you'd need to compare manually
  2. build/run microbenchmarks from source at the two git SHAs you want to compare, but the build process gets complicated
  3. require maplibre builds to expose the function that's being benchmarked through the external maplibre/Map APIs, but then it pollutes that API a bit, and we can't test against versions from before this change is made

Any thoughts on those options?

@msbarry (Contributor) commented Jun 22, 2023

Also, for the two other pieces that I mentioned (reporting more performance metrics at runtime, and profiling tools), can I just open up PRs to add those? Or do I need to submit a design/get more buy-in first?

@HarelM (Collaborator, Author) commented Jun 24, 2023

If the implementation is small and doesn't change things much, you can simply open a PR. If the changes are drastic, then a design discussion is in order.

@HarelM (Collaborator, Author) commented Jun 24, 2023

All these options are not great, but I don't have a better one.
The problem with the internal API is that we need to keep old versions of the compiled code. This is what we do right now, but it's a hard concept to grasp, and you can't make backwards-incompatible changes without also breaking the benchmark code.
Polluting the public API isn't great either.
The only valid solution I can think of is maybe firing some events at certain times; it's "less public", I think. Maybe adding a configuration flag to raise those events can keep the public footprint to a minimum.
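
To make the event idea a bit more concrete, here is a purely hypothetical sketch (none of these names exist in maplibre-gl today): an internal helper that emits standard User Timing marks/measures only when a flag is enabled, so benchmarks can read timings for internal stages without those stages becoming part of the public API:

// Hypothetical module inside maplibre-gl; the flag and stage names are invented for illustration.
let collectPerformanceEvents = false; // would be toggled via a (hypothetical) map option or setter

export function setPerformanceEventCollection(enabled: boolean): void {
    collectPerformanceEvents = enabled;
}

// Wrap an internal stage. Uses the standard User Timing API, so a benchmark (or devtools)
// can read the results back with performance.getEntriesByType('measure').
export function timeStage<T>(stageName: string, stage: () => T): T {
    if (!collectPerformanceEvents) return stage();
    const startMark = `maplibre:${stageName}:start`;
    performance.mark(startMark);
    const result = stage();
    performance.measure(`maplibre:${stageName}`, startMark);
    return result;
}

// Example (hypothetical) call site inside the library:
// const placed = timeStage('layout-symbols', () => performSymbolLayout(args));

A benchmark running against any build with the flag turned on could then read the measures, without the internal functions themselves being exported or the public Map API growing.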

@msbarry (Contributor) commented Jun 24, 2023

For the end-to-end tests (time to first tile, map load time, tile load time with various data and styles) it seems like the things we measure should be available externally to anyone using the library, to enable applications to do end-user performance monitoring.

But for the microbenchmarks that test an internal function, I could also go either way on exposing them through the public API, maybe in more of a hidden way (like prefixed with _), vs. building the microbenchmarks from source at the two git SHAs.

Who is accepting the design proposal? A maplibre steering meeting? Would it be sufficient for a design proposal to present the two options and get feedback on which one people prefer?

@HarelM (Collaborator, Author) commented Jun 27, 2023

I usually approve design proposals when it comes to technical things; if you need more buy-in or would like to involve more people, you can use Slack or the TSC meetings. Whatever works for you :-)
