-
Notifications
You must be signed in to change notification settings - Fork 300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alternative measurements other than wall time #130
Comments
It would be welcome, yes, but that has the potential to be a major project. That's why I've been putting it off for a while now. I'd be happy to help if you're volunteering to tackle it, though. I haven't fully figured out yet how I was planning to do this. There's a lot of things Criterion.rs currently does that don't make sense for measuring things like IO and memory usage. For example, the measurements of IO and memory usage per iteration are a lot more likely to be constants rather than distributions, so bootstrapping and reporting various statistics probably doesn't make sense. I think it would make sense to treat timing measurements separately from auxiliary measurements like IO and memory. The HTML reports and command-line output are also designed for timing information. So, for IO and memory I think it would make sense to (optionally) measure those in addition to the timing, not instead of it. You'd have to change the timing loops in the Bencher struct, as well as add space in that struct to store the results. Then you'd have to expose that information up through the code and provide it to the reports (HTML and CLI) which could then display it appropriately. These auxiliary measurements should be done outside the timing measurement to avoid changing the measured time. As for changing how the timing measurement itself works, that's tricky. Criterion.rs is designed to minimize the impact that the timing measurement itself has on the measured values, but it would still be preferable to ensure that there's no additional overhead added by this change and that could be difficult. The obvious way would be to make the Bencher struct generic over the timing measurement function, which would be a breaking change and force an 0.3 release. I'd prefer to avoid that, if possible. What do you think? |
Sorry for the late reply. I have been distracted by other stuffs at work lately. Thanks for the detailed explanation! On bootstrapping and distribution - I agree the current logic is designed for measuring wall time. But it should be fine for other measurements - it won't produce wrong numbers but just waste some extra time if unchanged. I think the main thing that is hard-coded about "time" and could affect the user experience of other measurements is the display of units like "ms", "us","ns". That sounds doable to be replaced by a user-provided value. I also agree that memory/IO measurements could be tricky because they're platform-dependent. They are ideally implemented as separate addons, instead of core criterion feature. If the code uses some complex caching, or non deterministic behavior like "try read data as much as possible, but stop after 100ms", it'd be non-constant. So I think it's fine to just use the existing code since it should also work in the constant case. Bootstrapping might be unnecessary, but it does not break the meaningfulness of the alternative measurement. It does make sense to have other values measured with timing. I think a more general purposed setting is to allow users to choose a list of things to measure. Like [CPU time, and a customized memory measurement function], instead of enforced wall time tracking. I'm still a bit busy with work stuff right now and might need some paperwork done to be able to actually contribute code here. I'll get back to you if I have some more detailed ideas after understanding the existing codebase. |
Another use-case (that I'm currently facing) is measuring hardware performance counters and GPU event timers. It would be really helpful if Criterion could provide support for user-specified timers/counters. Would it make sense for Bencher to take one start and one stop closure, with some user-defined state passed from start to stop? That should enable all of the use-cases mentioned in this thread, and provide enough flexibility for other use-cases as well. Providing default closures for regular CPU timers could deal with the backwards-compatibility. Perhaps the stop closure could return a value for statistics and plotting? |
I'd like to be able to have Criterion report cycle counts using the RDTSC/RDTSCP instructions on x64. Currently it looks like Criterion is using Would the closure approach suggested by @LutzCle work? I would be happy to work on it if so. |
Howdy folks. Thanks for your patience - I've been busy with work and another project for a while now, but I hope I'll be able to put that project to bed soon and make some time for Criterion.rs again. In the meantime though, I've been thinking about this problem. I think I have a plan to support custom measurements and custom reports in a clean way. I think I could create a Measurement trait with a handful of associated types for the intermediate state, the analysis, and the report, and then the benchmark programmer (or external crates) could define their own implementations. I'm on mobile, so I hope you'll forgive me if I don't provide a code example of what I mean. If anyone is really curious I can write it up in more detail later. Implementing this would have to be done in a major breaking change release unless I do something awkward to avoid that. Fortunately, I already intend to do exactly that sort of release in 0.3.0 - the move to a custom test framework (ie. |
That would be great, I'd be happy to help if it would be useful. I made a hacky change in my fork (https://github.com/hdevalence/criterion.rs/commit/0083dbd5e2aa06d40b812600961073a8b553fc38) that replaces |
Hey, folks. It's been a while, but I've finally gotten around to implementing my preliminary support for custom measurements. You can check it out on the @hdevalence - I'd appreciate it if you could try implementing a custom measurement based on RDTSC/P to validate that my design works for your use-case and that my documentation makes sense. If you'd like to publish that as a reusable crate later, once 0.3.0 is released, that would be great too; I'd be happy to add a link to it in the readme to help other folks who want CPU-counter level timing find it. |
Awesome! Right now I'm kind of busy but I should have time to try implementing a custom measurement at the beginning of July. |
Chiming in that this sounds like a great feature! I’ve got some tests that feel like they are being really affected by thermal constraints on my laptop (back-to-back runs can have a variation of ~14%, for example), so I was looking for something a bit more stable. I’ll add in a +1 for “instruction count” and/or “cycle count”. See a related thread on IRLO that describes why perf.rust-lang.org uses what it does. |
@shepmaster this is independent of anything to do with criterion or the content of this PR, but just in case it's useful to you or someone else reading this thread, it's possible on Linux to manage the Turbo Boost state using systemd, so that it's easy to temporarily disable it for testing: https://blog.christophersmart.com/2017/02/08/manage-intel-turbo-boost-with-systemd/ |
@bheisler @hdevalence Maybe you are interested. I just implemented a POSIX CPU time measurement today (README and introduction will be added tomorrow). It's surprising that implementing a measurement takes only minutes (though it has not been well tested). Should third party measurement be added into criterion (with a PR)? Or should all measurement be extracted out of criterion (as discussed, this will break backwards-compatibility)? Just like In my implementation |
@bheisler Migrate from 0.2 to 0.3 is painful. One of the reason is adding generic parameter everywhere 😢 (though this is reasonable). I have done this for a quite big project today. It changes thousands of lines of codes. Another pain point is some lost functions like |
The new generic parameter is unfortunately necessary; we need to allow rustc to inline the measurement code into the benchmarks. It looks like tikv built a lot of code on top of Criterion.rs and they want to use CPU time instead of wall-clock time so they'll have to update.
My plan (for now at least) is to allow external crates to provide different measurements and formatters. We can add them into Criterion.rs itself later if it makes sense.
As for |
@bheisler I noticed that the only way to apply multiple input on a function is loop if I didn't miss some important functions. I have to change from:
Into:
In this case, case.f have to be copied as it was used in a loop. The type of To resolve this, I use an ugly way. The implementation is:
After these step, I finally can use |
@hdevalence |
It seems criterion.rs has hard-coded measurement to be wall time. While I have been thinking about measuring CPU time (probably more stable), I/O bytes (doable under Linux by reading /proc), or memory usage. If the "measurement" function can be replaced, would that be a welcomed PR?
The text was updated successfully, but these errors were encountered: