Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative measurements other than wall time #130

Open
quark-zju opened this issue Feb 27, 2018 · 15 comments
Open

Alternative measurements other than wall time #130

quark-zju opened this issue Feb 27, 2018 · 15 comments
Labels
Bigger Project A large project, probably involving major internal changes as well as new API surface Enhancement

Comments

@quark-zju
Copy link

It seems criterion.rs has hard-coded measurement to be wall time. While I have been thinking about measuring CPU time (probably more stable), I/O bytes (doable under Linux by reading /proc), or memory usage. If the "measurement" function can be replaced, would that be a welcomed PR?

@bheisler
Copy link
Owner

It would be welcome, yes, but that has the potential to be a major project. That's why I've been putting it off for a while now. I'd be happy to help if you're volunteering to tackle it, though.

I haven't fully figured out yet how I was planning to do this. There's a lot of things Criterion.rs currently does that don't make sense for measuring things like IO and memory usage. For example, the measurements of IO and memory usage per iteration are a lot more likely to be constants rather than distributions, so bootstrapping and reporting various statistics probably doesn't make sense. I think it would make sense to treat timing measurements separately from auxiliary measurements like IO and memory. The HTML reports and command-line output are also designed for timing information.

So, for IO and memory I think it would make sense to (optionally) measure those in addition to the timing, not instead of it. You'd have to change the timing loops in the Bencher struct, as well as add space in that struct to store the results. Then you'd have to expose that information up through the code and provide it to the reports (HTML and CLI) which could then display it appropriately. These auxiliary measurements should be done outside the timing measurement to avoid changing the measured time.

As for changing how the timing measurement itself works, that's tricky. Criterion.rs is designed to minimize the impact that the timing measurement itself has on the measured values, but it would still be preferable to ensure that there's no additional overhead added by this change and that could be difficult. The obvious way would be to make the Bencher struct generic over the timing measurement function, which would be a breaking change and force an 0.3 release. I'd prefer to avoid that, if possible.

What do you think?

@quark-zju
Copy link
Author

Sorry for the late reply. I have been distracted by other stuffs at work lately.

Thanks for the detailed explanation!

On bootstrapping and distribution - I agree the current logic is designed for measuring wall time. But it should be fine for other measurements - it won't produce wrong numbers but just waste some extra time if unchanged.

I think the main thing that is hard-coded about "time" and could affect the user experience of other measurements is the display of units like "ms", "us","ns". That sounds doable to be replaced by a user-provided value.

I also agree that memory/IO measurements could be tricky because they're platform-dependent. They are ideally implemented as separate addons, instead of core criterion feature.

If the code uses some complex caching, or non deterministic behavior like "try read data as much as possible, but stop after 100ms", it'd be non-constant. So I think it's fine to just use the existing code since it should also work in the constant case. Bootstrapping might be unnecessary, but it does not break the meaningfulness of the alternative measurement.

It does make sense to have other values measured with timing. I think a more general purposed setting is to allow users to choose a list of things to measure. Like [CPU time, and a customized memory measurement function], instead of enforced wall time tracking.

I'm still a bit busy with work stuff right now and might need some paperwork done to be able to actually contribute code here. I'll get back to you if I have some more detailed ideas after understanding the existing codebase.

@LutzCle
Copy link

LutzCle commented Oct 25, 2018

Another use-case (that I'm currently facing) is measuring hardware performance counters and GPU event timers. It would be really helpful if Criterion could provide support for user-specified timers/counters.

Would it make sense for Bencher to take one start and one stop closure, with some user-defined state passed from start to stop? That should enable all of the use-cases mentioned in this thread, and provide enough flexibility for other use-cases as well.

Providing default closures for regular CPU timers could deal with the backwards-compatibility. Perhaps the stop closure could return a value for statistics and plotting?

@hdevalence
Copy link

I'd like to be able to have Criterion report cycle counts using the RDTSC/RDTSCP instructions on x64. Currently it looks like Criterion is using std::time::Instant, which ultimately calls gettime. It would be great to be able to report cycle counts directly, especially for microbenchmarks. There's already a Rust wrapper for RDTSC , so it seems like the hardest part would be connecting everything up?

Would the closure approach suggested by @LutzCle work? I would be happy to work on it if so.

@bheisler
Copy link
Owner

Howdy folks. Thanks for your patience - I've been busy with work and another project for a while now, but I hope I'll be able to put that project to bed soon and make some time for Criterion.rs again.

In the meantime though, I've been thinking about this problem. I think I have a plan to support custom measurements and custom reports in a clean way. I think I could create a Measurement trait with a handful of associated types for the intermediate state, the analysis, and the report, and then the benchmark programmer (or external crates) could define their own implementations. I'm on mobile, so I hope you'll forgive me if I don't provide a code example of what I mean. If anyone is really curious I can write it up in more detail later.

Implementing this would have to be done in a major breaking change release unless I do something awkward to avoid that. Fortunately, I already intend to do exactly that sort of release in 0.3.0 - the move to a custom test framework (ie. #[criterion]) will require users to migrate anyway, so I might as well make all of the breaking changes at once.

@hdevalence
Copy link

That would be great, I'd be happy to help if it would be useful.

I made a hacky change in my fork (https://github.com/hdevalence/criterion.rs/commit/0083dbd5e2aa06d40b812600961073a8b553fc38) that replaces Instant::now() with RDTSC/P (via the tsc_timer crate). This obviously isn't a good or complete solution but it seems to work well enough for me for the moment, so it might be useful for someone else in the meantime.

@bheisler bheisler added Enhancement Bigger Project A large project, probably involving major internal changes as well as new API surface labels Dec 5, 2018
@bheisler bheisler added this to the Version 0.3.0 milestone Dec 9, 2018
@bheisler
Copy link
Owner

Hey, folks. It's been a while, but I've finally gotten around to implementing my preliminary support for custom measurements. You can check it out on the v0.3.0-dev branch; specifically look at src/measurement.rs and the user-guide documentation in book/src/user_guide/custom_measurements.md. Fair warning, there may be force-pushes to that branch from time to time, since I don't expect anyone else to commit to it.

@hdevalence - I'd appreciate it if you could try implementing a custom measurement based on RDTSC/P to validate that my design works for your use-case and that my documentation makes sense. If you'd like to publish that as a reusable crate later, once 0.3.0 is released, that would be great too; I'd be happy to add a link to it in the readme to help other folks who want CPU-counter level timing find it.

@hdevalence
Copy link

Awesome! Right now I'm kind of busy but I should have time to try implementing a custom measurement at the beginning of July.

@shepmaster
Copy link

Chiming in that this sounds like a great feature!

I’ve got some tests that feel like they are being really affected by thermal constraints on my laptop (back-to-back runs can have a variation of ~14%, for example), so I was looking for something a bit more stable. I’ll add in a +1 for “instruction count” and/or “cycle count”.

See a related thread on IRLO that describes why perf.rust-lang.org uses what it does.

@hdevalence
Copy link

@shepmaster this is independent of anything to do with criterion or the content of this PR, but just in case it's useful to you or someone else reading this thread, it's possible on Linux to manage the Turbo Boost state using systemd, so that it's easy to temporarily disable it for testing: https://blog.christophersmart.com/2017/02/08/manage-intel-turbo-boost-with-systemd/

@bheisler bheisler removed this from the Version 0.3.0 milestone Aug 25, 2019
@YangKeao
Copy link
Contributor

YangKeao commented Aug 26, 2019

@bheisler @hdevalence Maybe you are interested. I just implemented a POSIX CPU time measurement today (README and introduction will be added tomorrow). It's surprising that implementing a measurement takes only minutes (though it has not been well tested).

Should third party measurement be added into criterion (with a PR)?

Or should all measurement be extracted out of criterion (as discussed, this will break backwards-compatibility)? Just like serde and serde-json etc...

In my implementation DurationFormatter was directly copied from criterion. Could such useful utils be public or be extracted out as a dependency?

@YangKeao
Copy link
Contributor

@bheisler Migrate from 0.2 to 0.3 is painful. One of the reason is adding generic parameter everywhere 😢 (though this is reasonable). I have done this for a quite big project today. It changes thousands of lines of codes.

Another pain point is some lost functions like bench_function_over_inputs. Without it, I have to provide a clonable closure (for multiple usage in loop and bench_with_input), which cannot be written in Box<dyn Fn(..) + Clone> as dyn syntax doesn't allow it.

@bheisler
Copy link
Owner

Migrate from 0.2 to 0.3 is painful. One of the reason is adding generic parameter everywhere 😢 (though this is reasonable). I have done this for a quite big project today. It changes thousands of lines of codes.

The new generic parameter is unfortunately necessary; we need to allow rustc to inline the measurement code into the benchmarks. It looks like tikv built a lot of code on top of Criterion.rs and they want to use CPU time instead of wall-clock time so they'll have to update.

Should third party measurement be added into criterion (with a PR)?

My plan (for now at least) is to allow external crates to provide different measurements and formatters. We can add them into Criterion.rs itself later if it makes sense.

Another pain point is some lost functions like bench_function_over_inputs. Without it, I have to provide a clonable closure (for multiple usage in loop and bench_with_input), which cannot be written in Box<dyn Fn(..) + Clone> as dyn syntax doesn't allow it.

As for bench_function_over_inputs, that's been deprecated for a while. I would recommend updating your code to use the BenchmarkGroup structure. I don't know why you would need to clone a closure, do you have an example?

@YangKeao
Copy link
Contributor

YangKeao commented Aug 27, 2019

@bheisler I noticed that the only way to apply multiple input on a function is loop if I didn't miss some important functions.

I have to change from:

for case in cases {
    c.bench_function_over_inputs(case.name, case.f, inputs.clone());
}

Into:

for case in cases {
    let mut group = c.benchmark_group(case.name);
    for input in inputs.iter() {
        group.bench_with_input(
            criterion::BenchmarkId::from_parameter(input),
            input,
            case.f,
        ); 
    }
    group.finish();
}

In this case, case.f have to be copied as it was used in a loop. The type of case.f used to be Box<dyn Fn(...) + 'static> and it's impossible to ensure it can clone.

To resolve this, I use an ugly way. The implementation is:

  1. For restricting Clone trait on closure, I have to use generic type. So creating a struct InnerBenchCase which has a generic type called F with restriction Fn(&mut criterion::Bencher<M>, &I) + Copy + 'static

  2. As I need multiple InnerBenchCase with different F (every closure's types differ) to store in a vector, a trait IBenchCase was used to provide a connection between these different InnerBenchCase. This trait provides method get_fn, which will copy the inner function and return.

  3. This point is not related with criterion. For some reasons, I have to implement Ord on bench case. However, impl<T> Ord for T where T:IBenchCase is not allowed. So I have to use a struct with dynamic dispatch to implement Ord.

pub struct BenchCase<I, M>
where
    M: criterion::measurement::Measurement + 'static,
{
    inner: Box<dyn IBenchCase<I = I, M = M>>,
}

After these step, I finally can use bench_with_input instead of bench_function_over_inputs.

@WildCryptoFox
Copy link

@hdevalence CyclesPerByte PR open #336. Only for x86 and x86_64, assuming the cpu feature is present as we'd panic anyway, so it is feature gated under cycles.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bigger Project A large project, probably involving major internal changes as well as new API surface Enhancement
Projects
None yet
Development

No branches or pull requests

7 participants