Redesign how benchmarks work #982

Open
HarelM opened this issue Feb 10, 2022 · 24 comments
Labels
benchmark (Related to testing performance), 💰 bounty L (Large Bounty, USD 500), enhancement (New feature or request), PR is more than welcomed (Extra attention is needed)

Comments

@HarelM (Collaborator) commented Feb 10, 2022

Motivation

Benchmarks are currently designed in such a way that, for every change, the developer has to consider whether it introduces breaking changes that require corresponding changes in the benchmark code.
In #447 it can be seen that there's a need to upload build artifacts that will later be used to execute tests and compare timings.

The same problem can be seen in #979.

Design Alternatives

Make sure we don't break anything every time?
Not a very good option.

Design

In general, the way I see it, the maplibre-gl file of each version is used to run the tests, without bundling it together with all kinds of other files.
Assuming the public API doesn't change much, this allows testing different versions using the same styles/sources etc.
If the public API does change, a small wrapper can probably be added to the latest code to mitigate the breaking changes. Such breaking changes also require a major version bump, so that would be a natural point to think about it as well.
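
To make this concrete, here is a minimal sketch (not existing benchmark code) of pointing a single, self-contained benchmark at any maplibre-gl build by injecting that build into a blank page with puppeteer. The demo style URL, the dist path and the unpkg URL are assumptions used for illustration:

import puppeteer from 'puppeteer';

// Measure the time from Map construction to the 'load' event for a given maplibre-gl build.
// The build can come from a local dist file or from a CDN for any published version.
async function timeMapLoad(script: {path?: string; url?: string}): Promise<number> {
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.setContent(
            '<!DOCTYPE html><html><body><div id="map" style="width:512px;height:512px"></div></body></html>');
        await page.addScriptTag(script);
        return await page.evaluate(() => new Promise<number>((resolve) => {
            const start = performance.now();
            // maplibregl is the global exposed by the injected UMD build.
            const map = new (window as any).maplibregl.Map({
                container: 'map',
                style: 'https://demotiles.maplibre.org/style.json', // assumed public demo style
                center: [0, 0],
                zoom: 2,
            });
            map.once('load', () => {
                resolve(performance.now() - start);
                map.remove(); // release the map and its resources between runs
            });
        }));
    } finally {
        await browser.close();
    }
}

// Compare the local build against a published release using the same benchmark code:
// const current = await timeMapLoad({path: 'dist/maplibre-gl.js'});
// const previous = await timeMapLoad({url: 'https://unpkg.com/maplibre-gl@2.4.0/dist/maplibre-gl.js'});

Because each build runs in its own browser instance, different builds never share workers or CPU within a measurement, and the benchmark code itself only touches the public API.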

Any ideas are welcome :-)

Acceptance criteria

  1. Benchmarks run as part of the CI.
  2. It's easy to test previous versions without the need to generate code per version and store it somewhere.
  3. It doesn't have to be as elaborate as the manual benchmark; it can be simpler and can co-exist with it.
  4. Before implementation starts, a design is presented and accepted.
@github-actions (bot):

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions bot added the stale label Aug 10, 2022
@birkskyum (Member):
Not stale

@HarelM (Collaborator, Author) commented Aug 21, 2022

I'm not sure how to solve this.
Seems like we need a proper design for this...

HarelM added the enhancement, good first issue and PR is more than welcomed labels Aug 21, 2022
birkskyum added the benchmark label Aug 30, 2022
@wipfli (Contributor) commented Feb 1, 2023

@zhangyiatmicrosoft do you want to look into this? In particular, we should get to a state where the benchmarks run automatically on all pull requests and report the performance impact of changes.

@birkskyum (Member) commented Feb 1, 2023

So, if I understand it correctly, the idea is to have the benchmarks run directly off a production build (maplibregl.min.js) instead of the benchmarks' custom rollup config. I definitely agree that would be a lot better.

It would also work backwards, because we have a production build on cdnjs etc. for each version, so that is cool.

@rotu (Contributor) commented Feb 2, 2023

@birkskyum see

await page.setContent(`<!DOCTYPE html>
<html lang="en">
<head>
<title>Query Test Page</title>
<meta charset='utf-8'>
<link rel="icon" href="about:blank">
<style>#map {
box-sizing:content-box;
width:${options.width}px;
height:${options.height}px;
}</style>
</head>
<body id='map'></body>
</html>`);
await page.addScriptTag({path: 'dist/maplibre-gl.js'});
await page.addStyleTag({path: 'dist/maplibre-gl.css'});
const actual = await page.evaluate(performQueryOnFixture, fixture);

We can inject just the maplibre module instead of building a benchmark bundle.

I'm not sure how the current benchmarks deal with web workers though. It seems problematic that workers from different builds can compete with each other for CPU and may or may not spin down when the benchmark is done.

@HarelM (Collaborator, Author) commented Feb 6, 2023

Link to bounty:
maplibre/maplibre#192

HarelM added the 💰 bounty M (Medium Bounty, USD 250) label and removed the good first issue label Feb 7, 2023
HarelM added the 💰 bounty L (Large Bounty, USD 500) label and removed the 💰 bounty M label Apr 19, 2023
@msbarry (Contributor) commented Jun 20, 2023

What do you think about having CI run benchmarks against the commit the PR was based off of?

I have something similar set up for planetiler: it builds the PR, builds the base commit the PR was based off of, runs a performance test for each (one after the other) and posts the results in a comment on the PR, like onthegomap/planetiler#607 (comment). The one problem with that setup is that the performance of GitHub Actions runner instances varies wildly between runs, and even within the same run - they must get CPU throttled or something. But if shorter benchmarks on the main and PR builds run sequentially, they should be close enough to compare to each other.

Although the images are nice, they aren't super practical for posting to a GitHub PR since you'd need to post the image to some image hosting service or pollute the repo with branches for the images (see comment-webpage-screenshot). What do you think about compact text output like benchmark.js gives you (average time +/- variance)?
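
For illustration, here is a rough sketch of the kind of compact text summary I mean, written as a standalone TypeScript helper (the result shape and the example benchmark name are assumptions, not the current benchmark output format):

interface BenchResult {
    name: string;
    samples: number[]; // raw timings in milliseconds
}

const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

const stddev = (xs: number[]): number => {
    const m = mean(xs);
    return Math.sqrt(xs.reduce((a, b) => a + (b - m) ** 2, 0) / xs.length);
};

// One line per benchmark, e.g. "paint_circle   123.4ms ± 5.6ms  (+3.2% vs base)"
function summarize(pr: BenchResult[], base: BenchResult[]): string {
    return pr.map((r) => {
        const baseline = base.find((b) => b.name === r.name);
        const m = mean(r.samples);
        const delta = baseline ? ((m - mean(baseline.samples)) / mean(baseline.samples)) * 100 : null;
        const deltaText = delta === null ? 'new' : `${delta >= 0 ? '+' : ''}${delta.toFixed(1)}% vs base`;
        return `${r.name.padEnd(30)} ${m.toFixed(1)}ms ± ${stddev(r.samples).toFixed(1)}ms  (${deltaText})`;
    }).join('\n');
}

// console.log(summarize(prResults, baseResults));

Output like that is small enough to paste into a PR comment or read straight from the CI log, with no image hosting involved.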

@msbarry (Contributor) commented Jun 20, 2023

Spoke offline with @HarelM about this; it sounds like something that runs as a unit test would be better, just so CI verifies it's not broken, but then you need to run it manually to compare performance between two SHAs.

@birkskyum (Member):
If the GitHub runners are that unreliable, the local results are probably the only meaningful ones anyway.

One thing I'd love to have as part of a refactor is that the benchmark takes in a built maplibre-gl.js and benchmarks it, instead of having its own rollup.config.json etc. to make a new build that will then be tested. Usually benchmarks don't make their own custom builds.

@msbarry (Contributor) commented Jun 20, 2023

Yeah, I agree, local testing probably makes more sense. What if we had two sets of benchmarks: “integration benchmarks” testing the end-to-end performance, which you can point at a minified build (from a local build or cdnjs), and “microbenchmarks” that are more like unit tests and run from source?
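
To sketch what a microbenchmark run from source could look like (the imported function and its path are hypothetical placeholders; any internal function would do), something along these lines would be enough:

// Hypothetical: import the internal function straight from source, so no bundle is needed.
// import {classifyRings} from '../src/util/classify_rings';

function microbench(name: string, fn: () => void, iterations = 1000): void {
    // Warm up so JIT compilation doesn't dominate the measurement.
    for (let i = 0; i < 50; i++) fn();
    const samples: number[] = [];
    for (let i = 0; i < iterations; i++) {
        const start = performance.now();
        fn();
        samples.push(performance.now() - start);
    }
    samples.sort((a, b) => a - b);
    const median = samples[Math.floor(samples.length / 2)];
    console.log(`${name}: median ${median.toFixed(3)}ms over ${iterations} iterations`);
}

// microbench('classifyRings', () => classifyRings(testGeometry, 5));

Since this imports from source, it only ever measures the current checkout; comparing against another SHA means checking that SHA out and running the same microbenchmark again.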

@birkskyum (Member):
That sounds great to me. @HarelM, what do you think about it?

@HarelM (Collaborator, Author) commented Jun 20, 2023

The main focus from my point of view is to be able to run things as part of the CI, and hopefully against the CDN minified code. Things that are designed to be run locally are nice for debugging etc.

@msbarry (Contributor) commented Jun 21, 2023

Stepping back a bit here, I think the pieces we want in place for a comprehensive performance strategy are:

  1. End-user performance metrics: MapLibre should make it easy for applications to get high-level performance metrics like time-to-first-tile/speed index, and tile load/fetch/process times, that they can use (or not use) to see the impact of MapLibre performance improvements. This is driven by some things the application owner controls (CDN, tile server) but also by things MapLibre controls (bundle size, JS performance).
  2. End-to-end benchmarks to measure those high-level metrics on a standard setup with representative map data and styles, so we can be reasonably confident end-users will see improvements with a new release.
  3. Profiling tools to make it easier to debug performance issues and identify hot spots.
  4. Microbenchmarks to isolate hot spots so you can quickly test variations to improve them, which should generate gains visible in the profiling tools (3), which should improve the end-to-end benchmarks (2) and generate noticeable improvements for end-users in the wild (1).

Assuming people are on board with this high-level approach, to fill the gaps I think we'd need to decide on a standard set of metrics to measure improvements by (time-to-first-tile, subsequent tile load times, and maybe something around UI-thread jank?) and a representative set of use-cases (styles, tile schemas, devices) for end-to-end benchmarks. Then we'd need to set up the e2e and micro benchmarking framework (this issue), profiling tools, and expose more detailed performance measurements to end-user applications that they can use for end-user performance measurement.
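
As a small example of what piece (1) could look like from an application's point of view today, using only the existing public 'load' and 'idle' events (treating the first 'idle' as a proxy for "fully rendered" is an assumption on my part, and the style URL is just the demo style):

import maplibregl from 'maplibre-gl';

const start = performance.now();
const map = new maplibregl.Map({
    container: 'map',
    style: 'https://demotiles.maplibre.org/style.json', // assumed demo style
    center: [0, 0],
    zoom: 2,
});

// 'load' fires once the style and the initially visible tiles have been loaded and rendered.
map.once('load', () => {
    console.log(`map load: ${(performance.now() - start).toFixed(0)}ms`);
});

// The first 'idle' fires when no renders or tile loads are pending; used here
// as a rough stand-in for time-to-fully-rendered.
map.once('idle', () => {
    console.log(`first idle: ${(performance.now() - start).toFixed(0)}ms`);
});

Anything more fine-grained (per-tile fetch/parse times, time-to-first-tile proper) would need the library to expose additional measurements, which is part of what this issue is about.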

I realize this is a bit broader than this issue, let me know if there's a better place to discuss!

@msbarry (Contributor) commented Jun 21, 2023

> The main focus from my point of view is to be able to run things as part of the CI, and hopefully against the CDN minified code. Things that are designed to be run locally are nice for debugging etc.

When you say "run on CI" do you mean verify they work on CI? Or get the actual measurements? Based on how inconsistent the runner performance is, it might be hard to use actual measurements without a dedicated remote machine to run on (@wipfli has looked into something similar with his hetzner box)

@HarelM (Collaborator, Author) commented Jun 21, 2023

I would hope that even if the results of two runs on different machines are not consistent, running the tests on the same machine in the same run will provide meaningful results.
If that is not the case, I don't see how we can track progress and improvements in an automated way, which leaves the entire problem the way it is now: people sometimes run the benchmarks, it's hard to understand what good vs. bad results are, etc.
My main concern is the process itself, and how to make it easy to understand the results.
The distinctions above are great and I don't have anything against them, but I have a hard time understanding how they improve our current situation...

@msbarry (Contributor) commented Jun 21, 2023

Got it. I was thinking that when you're making a performance-focused change, you can use the profiling tools to identify hot spots and the microbenchmarks to help improve them, then use the end-to-end performance benchmarks locally to show the impact of the change and help justify any added code/complexity if necessary. I have a few improvements like this I want to make now, and I'm trying to decide if I should help set up some common tooling first or just do one-off stuff for those changes.

When you're making a non-performance-related change, at the very least CI runs the benchmarks to make sure you don't break them. Assuming we get consistent enough performance on GitHub Actions runners, would you rather:

  1. output the performance results, but you need to dig into the CI log output to see them?
  2. post a comment to the PR with performance test results like bundle size checker used to do?
  3. fail a CI job if performance degrades by more than x% like bundle size check does now?
  4. or report them in some other way?

Before a release someone can run the end-to-end benchmarks on the previous release and this pre-release to ensure there are no regressions, and ideally be able to include a performance improvement headliner in the release notes.

And post-release clients can use the end-user-performance metrics to get visibility into how things are performing in the wild, possibly identifying issues you can profile to submit a more targeted issue when there's a problem.

@HarelM (Collaborator, Author) commented Jun 22, 2023

The current problem, the way I see it, is that it's not clear, for me at least, what "good" results vs. "bad" results are.
I can look at the currently provided graphs/tables all day long, but without a clear definition of pass/fail, or improved by x%/degraded by x%, it's hard to say something meaningful.
And if we can summarize it in a clear and precise way, why not simply define it in the CI? :-)

@msbarry (Contributor) commented Jun 22, 2023

Yeah I get that the current tests are running each variation a bunch of times and trying to show the difference in distribution, but a lot of the current output distracts from that. What if they had some kind of output like js-framework-bench uses:

[screenshot of js-framework-bench results table]

The colors, compact time +/- variance output, and descriptions of what the benchmarks do really help put what you're looking at in context (as long as you're not color blind). When run from the command line, maybe we could have the process exit with a non-zero code if there's a big enough regression, so it would show up as a failed check on the PR?
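
A rough sketch of that, assuming we already have mean timings per benchmark for the PR build and the base build (the threshold value and the data shape are placeholders):

// Fail the CI job if any benchmark regressed by more than the allowed percentage.
const REGRESSION_THRESHOLD_PCT = 10; // placeholder value; would need tuning against runner noise

function checkForRegressions(prMeans: Record<string, number>, baseMeans: Record<string, number>): void {
    const regressions: string[] = [];
    for (const [name, prMs] of Object.entries(prMeans)) {
        const baseMs = baseMeans[name];
        if (baseMs === undefined) continue; // new benchmark, nothing to compare against
        const deltaPct = ((prMs - baseMs) / baseMs) * 100;
        if (deltaPct > REGRESSION_THRESHOLD_PCT) {
            regressions.push(`${name}: ${baseMs.toFixed(1)}ms -> ${prMs.toFixed(1)}ms (+${deltaPct.toFixed(1)}%)`);
        }
    }
    if (regressions.length > 0) {
        console.error('Performance regressions detected:\n' + regressions.join('\n'));
        process.exit(1); // non-zero exit shows up as a failed check on the PR
    }
}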

The overall approach that the benchmarks use doesn't seem bad - it's nice that the CLI version is already set up to run what you see in the browser in headless Chrome. It sounds like we just want to change how they display output, not require a separate build process/artifacts, and run them from CI?

Regarding the build process: I think the end-to-end benchmarks can be run against a standard maplibre-gl distribution since they only use the external API. But for microbenchmarks that test an internal function, the options I can see are:

  1. only run microbenchmarks against the current version of the code, but then you'd need to compare manually
  2. build/run microbenchmarks from source at the two git SHAs you want to compare, but the build process gets complicated
  3. require maplibre builds to expose the function that's being benchmarked through the external maplibre/Map APIs, but then it pollutes that API a bit, and we can't test against versions from before this change is made

Any thoughts on those options?

@msbarry (Contributor) commented Jun 22, 2023

Also, for the two other pieces that I mentioned (reporting more performance metrics at runtime, and profiling tools), can I just open up PRs to add those? Or do I need to submit a design/get more buy-in first?

@HarelM (Collaborator, Author) commented Jun 24, 2023

If the implementation is small and doesn't change things much, you can simply open a PR. If the changes are drastic, then a design discussion is in order.

@HarelM (Collaborator, Author) commented Jun 24, 2023

All these options are not great, but I don't have a better one.
The problem with the internal API is that we need to keep old versions of the compiled code. This is what we do right now, but it's a hard concept to grasp, and you can't make backwards-incompatible changes without also breaking the benchmark code.
Polluting the public API isn't great either.
The only valid solution I can think of is maybe firing some events at certain times; it's "less public", I think. Maybe adding a configuration flag to raise those events can keep the public footprint to a minimum.
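
To make the event idea a bit more concrete, here is a purely hypothetical sketch (none of these names exist in maplibre-gl today): an internal helper that emits standard User Timing marks/measures only when a flag is enabled, so benchmarks can read timings for internal stages without those stages becoming part of the public API:

// Hypothetical module inside maplibre-gl; the flag and stage names are invented for illustration.
let collectPerformanceEvents = false; // would be toggled via a (hypothetical) map option or setter

export function setPerformanceEventCollection(enabled: boolean): void {
    collectPerformanceEvents = enabled;
}

// Wrap an internal stage. Uses the standard User Timing API, so a benchmark (or devtools)
// can read the results back with performance.getEntriesByType('measure').
export function timeStage<T>(stageName: string, stage: () => T): T {
    if (!collectPerformanceEvents) return stage();
    const startMark = `maplibre:${stageName}:start`;
    performance.mark(startMark);
    const result = stage();
    performance.measure(`maplibre:${stageName}`, startMark);
    return result;
}

// Example (hypothetical) call site inside the library:
// const placed = timeStage('layout-symbols', () => performSymbolLayout(args));

A benchmark running against any build with the flag turned on could then read the measures, without the internal functions themselves being exported or the public Map API growing.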

@msbarry (Contributor) commented Jun 24, 2023

For the end-to-end tests (time to first tile, map load time, tile load time with various data and styles) it seems like the things we measure should be available externally to anyone using the library, to enable applications to do end-user performance monitoring.

But for the microbenchmarks that test an internal function, I could also go either way on exposing them through the public API, maybe in more of a hidden way (like prefixed with _), vs. building the microbenchmarks from source at the two git SHAs.

Who is accepting the design proposal? A maplibre steering meeting? Would it be sufficient for a design proposal to present the two options and get feedback on which one people prefer?

@HarelM (Collaborator, Author) commented Jun 27, 2023

I usually approve design proposals when it comes to technical things; if you need more buy-in or would like to involve more people, you can use Slack or the TSC meetings. Whatever works for you :-)
