
Create and maintain a benchmarking suite for Wasmtime #4

Merged (9 commits) on Jan 22, 2021

Conversation

@fitzgen (Member) commented Nov 18, 2020

Rendered RFC


Note: this is complementary to RFC #3. Where that RFC seeks consensus on whether we should establish benchmarking infrastructure at all, this RFC focuses on defining where we want our benchmarking story to end up at the end of the day: how we select benchmark programs, how we avoid measurement bias, how we perform sound and rigorous statistical analyses of the results, etc.


> Note that this approach is more coarsely grained than what Stabilizer
> achieves. Stabilizer randomized code layout at the function granularity, while
> this approach works at the object file granularity.
Contributor commented:

I believe lld has a dedicated option to specify the section order. In any case, a linker script should work. rustc puts every function in its own section, so randomizing the section order should be enough.
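For illustration only, here is a rough sketch of the coarser object-file-granularity randomization quoted above: shuffle the order in which object files are handed to the linker, seeded so a given layout can be reproduced (the `rand` crate and these function names are assumptions, not the RFC's actual implementation).

```rust
use rand::rngs::StdRng;
use rand::seq::SliceRandom;
use rand::SeedableRng;

/// Shuffle the link order of object files so each benchmark build gets a
/// different, but reproducible, code layout.
fn shuffle_link_order(mut objects: Vec<String>, seed: u64) -> Vec<String> {
    let mut rng = StdRng::seed_from_u64(seed);
    objects.shuffle(&mut rng);
    objects
}

fn main() {
    let objects = vec!["a.o".to_string(), "b.o".to_string(), "c.o".to_string()];
    // The shuffled list would be spliced into the final link invocation.
    println!("{:?}", shuffle_link_order(objects, 0xC0FFEE));
}
```

The finer-grained, section-level randomization suggested in this comment would instead feed the linker a randomized section/symbol ordering, e.g. via a generated linker script.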

statements like "we are 95% confident that change X produces a 5.5% (+/- 0.8%)
speed up compared to the original system". See the "Challenge of Summarizing
Results" and "Measuring Speedup" sections of *Rigorous Benchmarking in
Reasonable Time*<sup>[4](#ref-4)</sup> for further details.
Contributor commented:

Should there be more bench runs when the difference on a benchmark is less statistically significant?

@fitzgen (Member, Author) commented Nov 18, 2020

The referenced paper grapples with this question, and ultimately gives you a formula: you plug in the levels of variation at each layer (taken from some initial experiments) and a desired confidence interval, and it tells you how many iterations at each level you need to perform before you actually conduct the experiment. It's a good paper; I recommend reading it!

Adaptively re-running the benchmark at runtime for X minutes, or until a given confidence level is reached, is not recommended because it makes results from different runs (e.g. with and without your changes) subtly incomparable.
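To make the flavor of such an analysis concrete, here is a deliberately simplified sketch (a normal approximation with first-order error propagation, not the paper's multi-level method): given wall-time samples from a fixed number of iterations of the old and new builds, it reports a speedup with an approximate 95% confidence interval.

```rust
/// Mean and standard error of the mean for a set of samples.
fn mean_and_sem(samples: &[f64]) -> (f64, f64) {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let var = samples.iter().map(|&x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, (var / n).sqrt())
}

/// Speedup of `new` relative to `old` (>1.0 means faster) with an approximate
/// 95% confidence half-width, using first-order error propagation for a ratio.
fn speedup_with_ci(old: &[f64], new: &[f64]) -> (f64, f64) {
    let (m_old, se_old) = mean_and_sem(old);
    let (m_new, se_new) = mean_and_sem(new);
    let ratio = m_old / m_new;
    let rel_err = ((se_old / m_old).powi(2) + (se_new / m_new).powi(2)).sqrt();
    (ratio, 1.96 * ratio * rel_err)
}

fn main() {
    let old = [102.0, 99.5, 101.2, 100.8, 100.1]; // ms per iteration, old build
    let new = [95.1, 94.7, 95.9, 94.2, 95.5]; // ms per iteration, new build
    let (ratio, ci) = speedup_with_ci(&old, &new);
    println!("speedup: {ratio:.3}x ± {ci:.3}");
}
```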

For optimal developer ergonomics, it should be possible to run the benchmark
suite remotely by leaving a special comment on a github pull request:

> @bytecodealliance-bot bench
Contributor commented:

Who should be allowed to run it? Will there be a rate-limit per developer?

@fitzgen (Member, Author) commented:

This is discussed a bit in #3 as well. I expect that we will use an explicit allow list. I don't think we need to worry about rate limiting right off the bat, since the allow list will be people we all trust and can just ping if there are any issues.
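A tiny illustrative sketch of such a gate (the usernames, command string, and function are hypothetical, not an actual bot implementation):

```rust
/// Hypothetical allow list for the benchmarking bot; a real bot would maintain
/// this elsewhere and parse comments more carefully.
const ALLOWED_USERS: &[&str] = &["fitzgen", "abrown", "alexcrichton"];

fn may_trigger_benchmarks(commenter: &str, comment_body: &str) -> bool {
    comment_body.trim() == "@bytecodealliance-bot bench"
        && ALLOWED_USERS.contains(&commenter)
}

fn main() {
    assert!(may_trigger_benchmarks("fitzgen", "@bytecodealliance-bot bench"));
    assert!(!may_trigger_benchmarks("someone-else", "@bytecodealliance-bot bench"));
}
```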

@alexcrichton (Member) commented:

This looks fantastic to me, and thanks for taking the time to lay out milestones at the end; it was quite helpful to see what sort of work is necessary for each particular outcome! I'm pretty interested in how the nitty-gritty of a lot of this will work, but I think that's out of scope for this RFC and probably best handled in code review and such! (I also left a comment on #3 about stability which may apply more here, but that's probably all out of scope for these two RFCs.)

@jlb6740 commented Nov 19, 2020

@fitzgen This is fantastic! It not merely describes criteria for adding a benchmark to a future benchmark suite but really focuses on the details of a strategy for testing those workloads: the characteristics of the benchmarks, specific testing features for wasmtime/cranelift, and the characteristics of the runner. So there is a lot here and a lot to potentially comment on, but this is clearly a ton of good practices, with references, that we should probably adopt. I agree with @alexcrichton that the milestone description and graph are cool ... very nice to visualize the dependencies there. All of this is well done; below I'll just focus some comments on things I would think about:

From Motivation:

Evaluating the speedup of a proposed performance optimization highlights the need for developers to be able to quickly, and perhaps selectively, run a benchmark that targets the aspects of the compiler/runtime they are working on. This means being able to run offline before pushing a patch for review. This points to a need for a tight coupling between the distribution of wasmtime/cranelift and the runner, in addition to maybe a separate offline/remote/nightly runner. Testing performance regressions can be done in tiers: a much faster run that performs a smoke test as part of patch merging, and separately the more comprehensive daily test as mentioned.

At first I questioned the non-goal: while I see how it impacts decisions made for the runner and viewer, I wasn't quite sure about the term "general-purpose". I figure the benchmark suite would target the same aspects of wasmtime/cranelift that it would for any VM and runtime, right? Maybe the point here is that it is good to target the VM/runtime and not the underlying architecture (no inline assembly, for example)? Or that, if we make any modifications to the runner or workload, it is fine to only be concerned about them working with wasmtime? One thing I would add: I believe it is important to have multiple perspectives when judging how well the execution performed (meaning multiple baselines). Certainly the previous run and the previous several runs are a great baseline, but testing against native will also be important. A native baseline supports a third motivating factor for this infrastructure, which is to find gaps in performance that need optimization and improvement.

From Proposals:

Maybe this is second-tier reporting and not the default, but I would add a breakdown of wall time into user time and system time. Depending on the benchmark, kernel time can be dominant, so it is good to factor it in sooner rather than later during offline analysis. Also, for candidate benchmarks, some of the requirements that may not be so critical (adding the means for more than one associated input, for example) may force us to modify code rather than take it as-is.

Real, widely used programs are certainly desirable, but I personally don't have a problem with micro-benchmarks and synthetic benchmarks if they serve their purpose. Micro-benchmarks can be great at stress testing certain aspects of the VM to their limits in a way you simply aren't going to get from a real-world program. Furthermore, with real-world programs you aren't always enlightened about what makes them tick, what aspects you are really timing and testing. With micro-benchmarks it is no secret what you are testing, and you know when you should anticipate a change in performance. That said, I must say that the initial candidate list is very impressive, and these real-world programs certainly must be a priority.

From Rationale and Alternatives:

I agree with the drawback of not being able to add certain benchmarks because they assume JavaScript, but I will say there may be some things we can harvest from those benchmarks with little or no work at all. JetStream2, for example, has some benchmarks such as gcc_loops that can be compiled directly to wasm without modification and is in fact amenable to auto-vectorization. I've used it myself when looking at SIMD performance of Wasmtime, but there may be others there. Also, I wouldn't mind including shootout, or some subset of shootout, for the reasons I described; plus I think it helped to inform some aspects of Lucet development, and as the two are merged it may be a good reference.

From Open Questions:

The main thing I'd touch on here is the runner: can and should we evolve Sightglass's benchmark runner? My initial thought is that ultimately it may be best to have a Sightglass 2.0. The way the driver is set up now, it is deliberately not tightly coupled with any one VM, in order to allow adding any workload, without criteria, and any runtime. The original code is written in Rust and perhaps much of it can be reused, though some of the ideas that @abrown has mentioned may suggest gutting it, or even a completely fresh start with only Wasmtime and the requirements of this RFC in mind.

Also, not all workloads simply measure performance by time spent. Workloads can of course run for a fixed amount of time and then report the amount of work done in terms of some metric or score. Other workloads already have timers in place and print out the time of the important code for you. Do we want our infrastructure to allow workloads/benchmarks to report their own time, where we can (through a script) parse that data and still report it consistently with all the other benchmark reporting?

@fitzgen (Member, Author) commented Nov 30, 2020

@lars-t-hansen says (quoted with permission):

> I support this. I especially support the Motivation and avoiding a benchmarking race-to-the-bottom... Thanks for doing it!
>
> Everything looks really solid and I don't think I have anything to contribute at this point, except I think that if we could find some way for this to work with the JS shells its value would increase significantly. A discussion for another day or a derivative effort.
>
> Hyperthreading and NUMA are concerns that need to be addressed when running. (Whenever I run compilation time tests on my dev system I need to be sure to disable NUMA to get repeatable and meaningful results. Possibly systems in data centers that will run regression tests are more likely NUMA than developer desktop systems?)
>
> One gut feeling I have is that to regression test some types of changes you really want microbenchmarks, not applications, because you don't know that your applications focus enough on the feature you're trying to safeguard in the context where it is valuable. (For the same reason I have started adding test cases that check that a specific instruction is generated by the compiler at least in a specific controlled context, so as to check that an optimization works in at least that context.) But that's really additive to what you're proposing, not conflicting with it.

@fitzgen (Member, Author) commented Dec 10, 2020

Thanks for the comments @jlb6740! Replies inline below.

> Evaluating the speedup of a proposed performance optimization highlights the need for developers to be able to quickly, and perhaps selectively, run a benchmark that targets the aspects of the compiler/runtime they are working on. This means being able to run offline before pushing a patch for review. This points to a need for a tight coupling between the distribution of wasmtime/cranelift and the runner, in addition to maybe a separate offline/remote/nightly runner.

Yep, and I would expect that they would share 99% of their code. The main difference would be that a developer running locally would supply two commit hashes or branches to compare (e.g. main and my-optimization-branch) and the runner would build and run benchmarks for both, whereas in the offline/remote/nightly automation the runner would only build and run the latest main version of wasmtime.

> Testing performance regressions can be done in tiers: a much faster run that performs a smoke test as part of patch merging, and separately the more comprehensive daily test as mentioned.

Yes, exactly. This is the point of having multiple input workloads for our benchmark candidates: one small workload to get a quick smoke test, and one larger workload for in-depth data.

> Maybe this is second-tier reporting and not the default, but I would add a breakdown of wall time into user time and system time. Depending on the benchmark, kernel time can be dominant, so it is good to factor it in sooner rather than later during offline analysis.

Good idea, this is not only something we can record in the runner, but also something we can classify candidate benchmark programs by in our PCA.
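For example (a Unix-only sketch using the `libc` crate, purely illustrative rather than the runner's actual code), the runner could split a benchmark child process's CPU time into user and system components after waiting on it:

```rust
/// Return (user, system) CPU seconds accumulated by waited-for children,
/// via getrusage(2).
fn child_cpu_times() -> std::io::Result<(f64, f64)> {
    let mut usage: libc::rusage = unsafe { std::mem::zeroed() };
    if unsafe { libc::getrusage(libc::RUSAGE_CHILDREN, &mut usage) } != 0 {
        return Err(std::io::Error::last_os_error());
    }
    let secs = |tv: libc::timeval| tv.tv_sec as f64 + tv.tv_usec as f64 * 1e-6;
    Ok((secs(usage.ru_utime), secs(usage.ru_stime)))
}

fn main() -> std::io::Result<()> {
    // Stand-in for running a benchmark as a child process.
    std::process::Command::new("true").status()?;
    let (user, system) = child_cpu_times()?;
    println!("user: {user:.3}s, system: {system:.3}s");
    Ok(())
}
```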

> Also, for candidate benchmarks, some of the requirements that may not be so critical (adding the means for more than one associated input, for example) may force us to modify code rather than take it as-is.

Yeah, ideally we would only need to add bench.start and bench.end calls. As a practicality, it may make more sense to err on the side of avoiding benchmark programs that aren't well-factored from their workloads, because otherwise updating such programs in the future may become a real chore.
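To make the annotation concrete, a candidate program might look roughly like this when compiled to Wasm (a sketch only; the exact `bench.start`/`bench.end` import signatures are an assumption):

```rust
// Imports provided by the benchmark runner's host environment.
#[link(wasm_import_module = "bench")]
extern "C" {
    #[link_name = "start"]
    fn bench_start();
    #[link_name = "end"]
    fn bench_end();
}

fn main() {
    // Setup (reading inputs, allocation, etc.) stays outside the measured region.
    let input: Vec<u8> = std::fs::read("workload.dat").unwrap_or_default();

    unsafe { bench_start() };
    let checksum: u64 = input.iter().map(|&b| b as u64).sum(); // the measured work
    unsafe { bench_end() };

    // Printing the result lets the runner check that the program ran correctly.
    println!("{checksum}");
}
```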

> Real, widely used programs are certainly desirable, but I personally don't have a problem with micro-benchmarks and synthetic benchmarks if they serve their purpose. Micro-benchmarks can be great at stress testing certain aspects of the VM to their limits in a way you simply aren't going to get from a real-world program. Furthermore, with real-world programs you aren't always enlightened about what makes them tick, what aspects you are really timing and testing. With micro-benchmarks it is no secret what you are testing, and you know when you should anticipate a change in performance.

Regarding not knowing what makes the program tick: by requiring source code for candidate programs, we can hopefully avoid this to some degree. Agreed that micro-benchmarks are easier to quickly grok, though.

Micro-benchmarks have their place, and we can include some as we find blind spots and want to target specific things (like Lars also mentioned) but I think it makes sense, especially as we create the initial set of candidates, to focus on real programs.

> I agree with the drawback of not being able to add certain benchmarks because they assume JavaScript, but I will say there may be some things we can harvest from those benchmarks with little or no work at all. JetStream2, for example, has some benchmarks such as gcc_loops that can be compiled directly to wasm without modification and is in fact amenable to auto-vectorization. I've used it myself when looking at SIMD performance of Wasmtime, but there may be others there.

Ah, great! Yeah, we can definitely include that as a candidate then.

> The main thing I'd touch on here is the runner: can and should we evolve Sightglass's benchmark runner? My initial thought is that ultimately it may be best to have a Sightglass 2.0. The way the driver is set up now, it is deliberately not tightly coupled with any one VM, in order to allow adding any workload, without criteria, and any runtime. The original code is written in Rust and perhaps much of it can be reused, though some of the ideas that @abrown has mentioned may suggest gutting it, or even a completely fresh start with only Wasmtime and the requirements of this RFC in mind.

Yeah, it has become clear to me that we can replace just the runner with a 2.0 version, and then make it spit out data in a compatible format that the rest of sightglass expects. This way we still keep the server, UI, and graphs and all that working. I'll start prototyping this on top of @abrown's prototype of the new runner.
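As a rough illustration of "spit out data in a compatible format" (the field names and JSON shape here are hypothetical, not Sightglass's actual schema):

```rust
use serde::Serialize;

/// Hypothetical shape of one measurement record emitted by a new runner.
#[derive(Serialize)]
struct Measurement {
    benchmark: String,                  // e.g. "pulldown-cmark"
    phase: String,                      // e.g. "compilation" or "execution"
    wall_time_ns: u64,                  // wall-clock duration of the phase
    instructions_retired: Option<u64>,  // optional hardware counter
}

fn main() -> serde_json::Result<()> {
    let record = Measurement {
        benchmark: "example-benchmark".into(),
        phase: "execution".into(),
        wall_time_ns: 1_234_567,
        instructions_retired: None,
    };
    println!("{}", serde_json::to_string_pretty(&record)?);
    Ok(())
}
```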

> Also, not all workloads simply measure performance by time spent. Workloads can of course run for a fixed amount of time and then report the amount of work done in terms of some metric or score. Other workloads already have timers in place and print out the time of the important code for you. Do we want our infrastructure to allow workloads/benchmarks to report their own time, where we can (through a script) parse that data and still report it consistently with all the other benchmark reporting?

The downside with workloads like "run X for Y amount of time" is that it is both harder to verify that the program produced the correct result (and we don't want to measure a buggy/broken program!) and counters like instructions retired are not semi-deterministic anymore, which makes comparing executions more difficult. Additionally, it is a different kind of program that the runner will have to interface with and understand how to run and report measurements for, which is additional engineering work for us. Ultimately, I don't think we want to add these kinds of benchmark programs to the suite.

@abrown (Contributor) commented Jan 6, 2021

@penzin pointed me to https://github.com/binji/raw-wasm; since you suggest here that we add benchmarks that use some Wasm-related tools, it could provide some inputs to those tools.

@fitzgen (Member, Author) commented Jan 11, 2021

Motion to finalize with a disposition to merge

I'm proposing that we merge this RFC and I'll add the motion-to-finalize and disposition-merge tags to the PR. The PR has received numerous pieces of positive feedback and we have milestones for incremental implementation. Prototype work on the implementation has already begun.

Stakeholders sign-off

Tagging all employees of BA-affiliated companies who have committed to the wasmtime/cranelift repo in the last three months plus anyone who has given feedback on this PR as a stakeholder.

Fastly

IBM

Intel

Mozilla

on the Web. Doing that well requires including interactions with the rest of the
Web browser: JavaScript, rendering, and the DOM. Building and integrating a full
Web browser is overkill for our purposes, and represents significant additional
complexity that we would prefer to avoid.
Member commented:

Reading through this RFC again, one thing that occurs to me is that it doesn't clarify whether "microbenchmarks" are in scope or not. On one hand, we might say that micro-benchmarks are indeed in scope, specifically because we're not trying to build a general-purpose benchmark suite, but instead just something to collect data to help identify performance changes over time. On the other hand, some parts of this proposal talk about a desire for a representative corpus. Could you clarify the intended stance on microbenchmarks?

@fitzgen (Member, Author) commented:

I think my position is summarized in my response to Johnnie up thread:

> Micro-benchmarks have their place, and we can include some as we find blind spots and want to target specific things (like Lars also mentioned) but I think it makes sense, especially as we create the initial set of candidates, to focus on real programs.

Multiple people have mentioned the desire to run microbenchmarks, and if we get value out of having them, I don't want to stand in the way for ideological reasons. As long as the microbenchmarks use the same benchmarking API/protocol that the rest of the corpus uses, supporting them should be straightforward and shouldn't require any additional engineering effort.

@fitzgen (Member, Author) commented Jan 11, 2021

Entering the Final Comment Period

https://github.com/bytecodealliance/rfcs/blob/main/accepted/rfc-process.md#making-a-decision-merge-or-close

> Once any stakeholder from a different group has signed off, the RFC will move into a 10 calendar day final comment period (FCP), long enough to ensure that other stakeholders have at least a full business week to respond.

The FCP will end on January 22nd.

@lars-t-hansen commented:

I appear unable to approve this, but FWIW I have no objections.

@tschneidereit (Member) commented:

> I appear unable to approve this, but FWIW I have no objections.

Yeah, this is unfortunately tricky to fix without an rfc-bot doing it for us. For now, using gh reviews seems like the best path forward, with the champion setting the check marks. Apologies for the rough edges here!

@fitzgen (Member, Author) commented Jan 22, 2021

The FCP has elapsed without any objections being raised. I'm going to merge this. Thanks everybody!
