Create and maintain a benchmarking suite for Wasmtime #4
Conversation
0000-benchmark-suite.md
> Note that this approach is more coarsely grained than what Stabilizer
> achieves. Stabilizer randomized code layout at the function granularity, while
> this approach works at the object file granularity.
I believe lld has a dedicated option to specify the section order. In any case, a linker script should work. Every function is put in a different section by rustc, so randomizing section order should be enough.
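For illustration, here is a minimal sketch of that idea: shuffle the benchmark's symbol names into a fresh ordering file before each link, so every run gets a different code layout. The lld `--symbol-ordering-file` option is real; the `nm` pipeline and the rustc `-C link-arg` invocation in the comments are assumptions about how it would be wired up.

```rust
// Sketch only: emit a randomized symbol order for `ld.lld --symbol-ordering-file=...`.
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};
use std::io::{self, BufRead, Write};

fn main() -> io::Result<()> {
    // Read one symbol name per line from stdin, e.g. piped from `nm --just-symbols`.
    let mut symbols: Vec<String> = io::stdin().lock().lines().collect::<Result<_, _>>()?;

    // Fisher-Yates shuffle using a std-only xorshift PRNG seeded from RandomState.
    let mut seed = RandomState::new().build_hasher().finish() | 1;
    for i in (1..symbols.len()).rev() {
        seed ^= seed << 13;
        seed ^= seed >> 7;
        seed ^= seed << 17;
        symbols.swap(i, (seed as usize) % (i + 1));
    }

    // Write the shuffled order to stdout; redirect it to a file and pass that file
    // to the linker, e.g. `rustc ... -C link-arg=-Wl,--symbol-ordering-file=order.txt`
    // (assumed invocation).
    let mut out = io::stdout().lock();
    for sym in &symbols {
        writeln!(out, "{sym}")?;
    }
    Ok(())
}
```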
0000-benchmark-suite.md
statements like "we are 95% confident that change X produces a 5.5% (+/- 0.8%)
speed up compared to the original system". See the "Challenge of Summarizing
Results" and "Measuring Speedup" sections of *Rigorous Benchmarking in
Reasonable Time*<sup>[4](#ref-4)</sup> for further details.
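For illustration of what such a statement involves, here is a minimal, simplified sketch that turns two sets of per-iteration wall times into a speedup percentage with an approximate 95% confidence interval. It uses an assumed two-sample normal approximation via the delta method; the referenced paper's multi-level method is more rigorous.

```rust
fn mean_and_var(samples: &[f64]) -> (f64, f64) {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    // Unbiased sample variance.
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var)
}

/// Returns (speedup percent, confidence-interval half-width percent) at ~95% confidence.
fn speedup_ci(old: &[f64], new: &[f64]) -> (f64, f64) {
    let (m_old, v_old) = mean_and_var(old);
    let (m_new, v_new) = mean_and_var(new);
    let ratio = m_old / m_new; // > 1.0 means the change is faster
    // First-order (delta-method) variance of the ratio of two independent sample means.
    let var_ratio = ratio * ratio
        * (v_old / (old.len() as f64 * m_old * m_old)
            + v_new / (new.len() as f64 * m_new * m_new));
    ((ratio - 1.0) * 100.0, 1.96 * var_ratio.sqrt() * 100.0)
}

fn main() {
    // Hypothetical per-iteration wall times in milliseconds.
    let old = [103.0, 101.5, 102.2, 104.1, 101.9];
    let new = [96.8, 97.5, 96.1, 98.0, 97.2];
    let (speedup, pm) = speedup_ci(&old, &new);
    println!("speedup: {speedup:.1}% (+/- {pm:.1}%)");
}
```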
Should there be more bench runs when the difference on a benchmark is less statistically significant?
The referenced paper grapples with this question, and ultimately gives you a formula where you plug in the level of variation at each layer (taken from some initial experiments) and a desired confidence interval, and it tells you how many iterations at each level you need to perform before you actually conduct the experiment. It's a good paper, I recommend reading it!
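As a much-simplified, single-level illustration of that kind of planning (not the paper's multi-level formula), a pilot run's variance can be used to estimate how many iterations keep the confidence interval below a target width:

```rust
/// How many iterations are needed so that the ~95% confidence interval for the mean
/// has a half-width no larger than `target_half_width`, given a pilot run's sample
/// standard deviation? Uses the normal-approximation rule n >= (z * s / e)^2.
fn iterations_needed(sample_std_dev: f64, target_half_width: f64) -> usize {
    const Z_95: f64 = 1.96;
    ((Z_95 * sample_std_dev / target_half_width).powi(2)).ceil() as usize
}

fn main() {
    // E.g. a pilot run shows ~4 ms of run-to-run noise and we want the mean
    // pinned down to +/- 1 ms: roughly 62 iterations.
    println!("{}", iterations_needed(4.0, 1.0));
}
```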
It is not recommended to keep running the benchmark at runtime for X minutes, or until a given confidence level is reached, because doing so makes results from different runs (e.g. with and without your changes) subtly incomparable.
0000-benchmark-suite.md
For optimal developer ergonomics, it should be possible to run the benchmark
suite remotely by leaving a special comment on a github pull request:

> @bytecodealliance-bot bench
Who should be allowed to run it? Will there be a rate-limit per developer?
This is discussed a bit in #3 as well. I expect that we will use an explicit allow list. I don't think we need to worry about rate limiting right off the bat, since the allow list will be people we all trust and can just ping if there are any issues.
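A minimal sketch of that bot-side gate (the trigger phrase comes from the quoted RFC text above; the function name and allow-list entries are hypothetical):

```rust
use std::collections::HashSet;

const TRIGGER: &str = "@bytecodealliance-bot bench";

/// Only commenters on an explicit allow list may trigger a benchmark run.
fn should_run_bench(comment_body: &str, commenter: &str, allow_list: &HashSet<&str>) -> bool {
    comment_body.trim().starts_with(TRIGGER) && allow_list.contains(commenter)
}

fn main() {
    let allow_list: HashSet<&str> = ["trusted-dev-1", "trusted-dev-2"].into_iter().collect();
    assert!(should_run_bench("@bytecodealliance-bot bench", "trusted-dev-1", &allow_list));
    assert!(!should_run_bench("@bytecodealliance-bot bench", "drive-by-account", &allow_list));
}
```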
This looks fantastic to me, and thanks for taking the time to lay out milestones at the end, it was quite helpful to see what sort of work is necessary for each particular outcome! I'm pretty interested in how the nitty gritty of a lot of this will work, but that's out-of-scope for this RFC I think in that it's probably best to handle in code review and such! (I also left a comment on #3 about stability which may apply more here, but that's probably all out of scope of these two RFCs)
@fitzgen This is fantastic! It seems to not merely describe criteria for adding a benchmark to a future benchmark suite but really focuses on the details of a strategy for testing those workloads. This involves the characteristics of the benchmarks, specific testing features for wasmtime/cranelift, and the characteristics of the runner. So there is a lot here, and a lot to potentially comment on, but this is clearly a collection of good practices with references that we should probably adopt. Agree with @alexcrichton that the milestone description and graph are cool ... very nice to visualize the dependencies there. All of this is well done; below I'll just focus some comments on things I would think about:

From Motivation: Evaluating the speedup of a proposed performance optimization highlights the need for developers to be able to quickly, and perhaps selectively, run a benchmark that targets the aspects of the compiler/runtime they are working on. This means being able to run offline before pushing a patch for review, which points to a need for a tight coupling between the distribution of wasmtime/cranelift and the runner, in addition to maybe a separate offline/remote/nightly runner. Testing for performance regressions can be done in tiers: a much faster run performs a smoke test as part of patch merging, and separately there is the more comprehensive daily test as mentioned. At first I questioned the non-goal, because while I see how that impacts decisions made for the runner and viewer, I wasn't quite sure about the term "general-purpose". I figure the benchmark suite would target the same aspects of wasmtime/cranelift that it would in any VM and runtime, right? Maybe the point here is that it is good to target the VM/runtime and not the underlying architecture (no inline assembly, for example)? Or that if we make any modifications to the runner or workload, we only need to be concerned about them working with wasmtime? One thing I would add: I believe it is important to have multiple perspectives when judging how well the execution performed (meaning multiple baselines). Certainly the previous run and the previous several runs are a great baseline, but testing against native will also be important. A native baseline supports a third motivating factor for this infrastructure, which is to find gaps in performance that need optimization and improvement.

From Proposals: Maybe this is second-tier reporting and not the default, but I would add a breakdown of wall time into user time and system time. Depending on the benchmark, kernel time can be dominant, so it is good to factor it in sooner rather than later during offline analysis. Also, for candidate benchmarks, some of the requirements (adding the means for more than one associated input, for example) may prevent us from just taking code as-is without modification, and those requirements may not be so critical. Real, widely used programs are certainly desirable, but I personally don't have a problem with micro-benchmarks and synthetic benchmarks if they serve their purpose. Micro-benchmarks can be great at stress testing certain aspects of a VM to their limits in ways you simply aren't going to get from a real-world program. Furthermore, with real-world programs you aren't always enlightened about what makes them tick: what aspects are you really timing and testing? With micro-benchmarks it is no secret what you are testing, and you know when to anticipate a change in performance. That said, I must say the initial candidate list is very impressive, and these real-world programs are certainly a must-have priority.

From Rationale and Alternatives: I agree with the drawback of not being able to add certain benchmarks because they assume JavaScript, but I will say there may be some things we can harvest from those benchmarks with a little work, or no work at all. JetStream2, for example, has benchmarks such as gcc_loops that can be compiled directly to Wasm without modification and that are in fact amenable to auto-vectorization. I've used it myself when looking at SIMD performance of Wasmtime, and there may be others there. Also, I wouldn't mind including shootout, or some subset of shootout, for the reasons I described; plus I think it helped to inform some aspects of lucet development, and as the two are merged it may be a good reference.

From Open Questions: The main thing I'd touch on here is the runner. Can and should we evolve Sightglass's benchmark runner? My initial thought is that ultimately it may be best to have a Sightglass 2.0. The way the driver is set up now, it is really not intended to be tightly coupled with any one VM, in order to allow adding any workload, without criteria, and any runtime. The original code is written in Rust and perhaps much of it can be reused, though some of the ideas @abrown has mentioned may suggest gutting it, or even a completely fresh start with only Wasmtime and the requirements of this RFC in mind. Also, not all workloads measure performance simply by time spent. Workloads can of course run for a fixed amount of time and then report the amount of work done in terms of some metric or score. Other workloads have already put timers in place and print out the time of the important code for you. Do we want our infrastructure to allow workloads/benchmarks to report their own time, where we can (through a script) parse that data and still report it consistently with all the other benchmark reporting?
@lars-t-hansen says (quoted with permission):
Thanks for the comments @jlb6740! Replies inline below.
Yep, and I would expect that they would share 99% of their code. The main difference would be that a developer running locally would supply two commit hashes or branches to compare (e.g.
Yes, exactly. This is the point of having multiple input workloads for our benchmark candidates: one small workload to get a quick smoke test, and one larger workload for in-depth data.
Good idea, this is not only something we can record in the runner, but also something we can classify candidate benchmark programs by in our PCA.
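For example, a runner on a Unix host could record user vs. system CPU time for the benchmark process alongside wall-clock time via `wait4(2)`. A sketch, assuming the `libc` crate and a hypothetical `wasmtime run benchmark.wasm` invocation:

```rust
use std::process::Command;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let wall_start = Instant::now();
    let child = Command::new("wasmtime")
        .args(["run", "benchmark.wasm"]) // assumed invocation
        .spawn()?;

    // wait4 reaps the child and fills in its resource usage.
    let mut status: libc::c_int = 0;
    let mut usage: libc::rusage = unsafe { std::mem::zeroed() };
    let ret = unsafe { libc::wait4(child.id() as libc::pid_t, &mut status, 0, &mut usage) };
    assert!(ret > 0, "wait4 failed");

    let wall = wall_start.elapsed().as_secs_f64();
    let user = usage.ru_utime.tv_sec as f64 + usage.ru_utime.tv_usec as f64 / 1e6;
    let sys = usage.ru_stime.tv_sec as f64 + usage.ru_stime.tv_usec as f64 / 1e6;
    println!("wall: {wall:.3}s, user: {user:.3}s, system: {sys:.3}s");
    Ok(())
}
```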
Yeah, ideally we would only need to add
Regarding not knowing what makes the program tick: by requiring source code for candidate programs, we can hopefully avoid this to some degree. Agreed that micro-benchmarks are easier to quickly grok, though. Micro-benchmarks have their place, and we can include some as we find blind spots and want to target specific things (like Lars also mentioned) but I think it makes sense, especially as we create the initial set of candidates, to focus on real programs.
Ah, great! Yeah, we can definitely include that as a candidate then.
Yeah, it has become clear to me that we can replace just the runner with a 2.0 version, and then make it spit out data in a compatible format that the rest of sightglass expects. This way we still keep the server, UI, and graphs and all that working. I'll start prototyping this on top of @abrown's prototype of the new runner.
The downside with workloads like "run X for Y amount of time" is that it is both harder to verify that the program produced the correct result (and we don't want to measure a buggy/broken program!) and counters like instructions retired are not semi-deterministic anymore, which makes comparing executions more difficult. Additionally, it is a different kind of program that the runner will have to interface with and understand how to run and report measurements for, which is additional engineering work for us. Ultimately, I don't think we want to add these kinds of benchmark programs to the suite.
@penzin pointed me to https://github.com/binji/raw-wasm. Since you suggest here that we add benchmarks that use some Wasm-related tools, raw-wasm could provide some inputs to those tools.
This is apparently where we've decided all RFCs (including unmerged RFCs with open PRs) go.
Motion to finalize with a disposition to merge

I'm proposing that we merge this RFC and I'll add the

Stakeholders sign-off

Tagging all employees of BA-affiliated companies who have committed to the wasmtime/cranelift repo in the last three months, plus anyone who has given feedback on this PR, as stakeholders.

Fastly, IBM, Intel, Mozilla
on the Web. Doing that well requires including interactions with the rest of the
Web browser: JavaScript, rendering, and the DOM. Building and integrating a full
Web browser is overkill for our purposes, and represents significant additional
complexity that we would prefer to avoid.
Reading through this RFC again, one thing that occurs to me is that it doesn't clarify whether "microbenchmarks" are in scope or not. On one hand, we might say that micro-benchmarks are indeed in scope, specifically because we're not trying to build a general-purpose benchmark suite, but instead just something to collect data to help identify performance changes over time. On the other hand, some parts of this proposal talk about a desire for a representative corpus. Could you clarify the intended stance on microbenchmarks?
I think my position is summarized in my response to Johnnie up thread:
> Micro-benchmarks have their place, and we can include some as we find blind spots and want to target specific things (like Lars also mentioned) but I think it makes sense, especially as we create the initial set of candidates, to focus on real programs.
Multiple people have mentioned the desire to run microbenchmarks, and if we get value out of having them, I don't want to stand in the way for ideological reasons. As long as the microbenchmarks use the same benchmarking API/protocol that the rest of the corpus uses, supporting them should be straightforward and shouldn't require any additional engineering effort.
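A guest-side sketch of what such a shared protocol could look like, compiled to Wasm so that both micro- and macro-benchmarks wrap only the region they want measured. The `bench` import module and the `start`/`end` hook names here are assumptions for illustration, not an existing API.

```rust
#[link(wasm_import_module = "bench")]
extern "C" {
    // Host-provided hooks that start and stop the runner's measurements.
    #[link_name = "start"]
    fn bench_start();
    #[link_name = "end"]
    fn bench_end();
}

fn fib(n: u64) -> u64 {
    if n < 2 { n } else { fib(n - 1) + fib(n - 2) }
}

fn main() {
    // Setup (argument parsing, input loading) stays outside the timed region.
    let n = 30;
    unsafe { bench_start() };
    let result = fib(n);
    unsafe { bench_end() };
    // Printing the result lets the runner check that the program ran correctly.
    println!("fib({n}) = {result}");
}
```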
Entering Final Call Period
The FCP will end on January 22nd.
I appear unable to approve this, but FWIW I have no objections.
Yeah, this is unfortunately tricky to fix without an rfc-bot doing it for us. For now, using GitHub reviews seems like the best path forward, with the champion setting the check marks. Apologies for the rough edges here!
The FCP has elapsed without any objections being raised. I'm going to merge this. Thanks everybody!
Rendered RFC
Note: this is complementary to RFC #3. Where that RFC is seeking consensus on whether we should establish benchmarking infrastructure at all, this RFC focuses on defining where we want our benchmarking story to end up at the end of the day: how we select benchmark programs, how we avoid measurement bias, how we perform sound and rigorous statistical analyses of the results, etc.