
Inadequate microbenchmarks #9

Closed
workingjubilee opened this issue Jul 8, 2023 · 5 comments

workingjubilee commented Jul 8, 2023

Your benchmarks are unrealistically simple and do not cover common concerns:

Your benchmarks are probably inadequate: you are microbenchmarking the most primitive operations, and those can still be annoyingly obvious to the optimizer if the result is, e.g., unused. You need more holistic tests that demonstrate Trc improves a semi-realistic workload. Strictly speaking, that isn't necessary for acceptance to std; I just think you should do it, in order to better support your claim.

btw since your implementation does two allocations in new() rather than just one, you should probably also include a benchmark for Trc::new(). (and perhaps also one for converting between SharedTrc and Trc.)
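A construction benchmark along these lines could be sketched with only the standard library (the `trc` crate's own types are not assumed here; `Arc<Box<u64>>` stands in as a hypothetical two-allocation layout, and `black_box` keeps the constructed values from being optimized away):

```rust
use std::hint::black_box;
use std::rc::Rc;
use std::sync::Arc;
use std::time::Instant;

// Time `n` constructions of whatever `f` builds, using black_box so the
// optimizer cannot delete the unused results. A Criterion bench would
// replace this loop, but the shape of the measurement is the same.
fn bench<T>(label: &str, n: u32, mut f: impl FnMut() -> T) {
    let start = Instant::now();
    for _ in 0..n {
        black_box(f());
    }
    println!("{label}: {:?} / iter", start.elapsed() / n);
}

fn main() {
    let n = 1_000_000u32;
    bench("Arc::new  (one allocation)", n, || Arc::new(0u64));
    bench("Rc::new   (one allocation)", n, || Rc::new(0u64));
    // Stand-in for a two-allocation new(), like the one described above.
    bench("Arc<Box>  (two allocations)", n, || Arc::new(Box::new(0u64)));
}
```

Comparing the one- and two-allocation rows gives a rough sense of what the extra allocation in `new()` costs per call.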

And you should make sure the tested functions actually generate different assembly for these types and operations when compiled. Otherwise, you may actually be microbenchmarking something else, like the quality of LLVM's optimizer at piercing certain abstractions. No, "black box" functions are not actually completely immune to the optimizer's gaze.
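To illustrate the point about unused results, a minimal sketch using `std::hint::black_box` (stable since Rust 1.66) — note that black_box is documented as a hint, not a guarantee, so inspecting the generated assembly remains the only way to be certain:

```rust
use std::hint::black_box;
use std::sync::Arc;
use std::time::{Duration, Instant};

// Time `n` clone/drop pairs. With `guard` set, black_box marks each clone
// as potentially observed; without it, the loop has no visible effect and
// the optimizer is free to delete it entirely.
fn clone_loop(arc: &Arc<u64>, n: u32, guard: bool) -> Duration {
    let t = Instant::now();
    for _ in 0..n {
        let c = arc.clone();
        if guard {
            black_box(&c);
        }
    }
    t.elapsed()
}

fn main() {
    let arc = Arc::new(42u64);
    let n = 1_000_000;
    println!("unguarded: {:?}", clone_loop(&arc, n, false));
    println!("guarded:   {:?}", clone_loop(&arc, n, true));
    assert_eq!(Arc::strong_count(&arc), 1); // every clone was dropped
}
```

If the two timings are wildly different, the unguarded loop was probably being optimized out rather than measured.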

@EricLBuehler EricLBuehler self-assigned this Jul 8, 2023
@EricLBuehler EricLBuehler added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers labels Jul 8, 2023

workingjubilee commented Jul 9, 2023

What platform do you run your benchmarks on?

Rust's Instant dispatches pretty directly to the underlying system time primitives, and that's the default type used for benchmarking stuff. You can see the code in Criterion which calls it here:

/// `WallTime` is the default measurement in Criterion.rs. It measures the elapsed time from the
/// beginning of a series of iterations to the end.
pub struct WallTime;
impl Measurement for WallTime {
    type Intermediate = Instant;
    type Value = Duration;

    fn start(&self) -> Self::Intermediate {
        Instant::now()
    }
    fn end(&self, i: Self::Intermediate) -> Self::Value {
        i.elapsed()
    }
    fn add(&self, v1: &Self::Value, v2: &Self::Value) -> Self::Value {
        *v1 + *v2
    }
    fn zero(&self) -> Self::Value {
        Duration::from_secs(0)
    }
    fn to_f64(&self, val: &Self::Value) -> f64 {
        val.as_nanos() as f64
    }
    fn formatter(&self) -> &dyn ValueFormatter {
        &DurationFormatter
    }
}

The problem is that depending on your CPU and OS, it may expose more or less granular time measurements, so the numbers you get may fail to capture the subtleties of timing because they may have a too-high floor for how much time can be measured. And some OS have... other quirks going on in their timing APIs, unfortunately.
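The effective floor can be probed directly: spin on `Instant::now()` until the reported time first advances, and record the smallest step ever observed. This is a rough empirical measurement of the platform's timer granularity, not an API guarantee:

```rust
use std::time::Instant;

// Probe the effective granularity of Instant on this platform by spinning
// until the clock visibly advances, and keeping the smallest step seen.
fn timer_granularity_ns() -> u128 {
    let mut min_step = u128::MAX;
    for _ in 0..100 {
        let start = Instant::now();
        let mut now = Instant::now();
        while now == start {
            now = Instant::now();
        }
        min_step = min_step.min((now - start).as_nanos());
    }
    min_step
}

fn main() {
    println!("smallest observable Instant step: {} ns", timer_granularity_ns());
}
```

If this prints something like 100 ns rather than single digits, individual sub-100 ns operations simply cannot be resolved without batching many iterations per measurement.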

An important benchmark would be to actually create an Arc or SharedTrc, as appropriate, spawn some threads, and test how the types fare under the contended clone-and-drop case, with the refcounts going up and down while the CPU has to manage both increments and decrements across multiple threads. You may find rayon helpful for this, but such a benchmark would need to consider many more than 100 operations, as merely spawning the threads and running the work queue with rayon would otherwise become the largest thing in the benchmark. Directly managing thread spawning might be preferable. There are also probably ways to use Criterion's bench_with_input to make such a benchmark test only the actual operations, not the thread spawning.
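The contended clone-and-drop case described above can be sketched with std alone, using Arc; the Trc/SharedTrc variant (API assumed, not shown) would swap in those types to get the comparison:

```rust
use std::sync::Arc;
use std::thread;

// Each thread repeatedly clones and drops the same shared pointer, so the
// strong count bounces between cores — the contended case described above.
fn contended_clone_drop(threads: usize, iters: usize) -> usize {
    let shared = Arc::new(0u64);
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let local = Arc::clone(&shared);
            thread::spawn(move || {
                for _ in 0..iters {
                    let c = local.clone();
                    std::hint::black_box(&c); // keep the clone observable
                    drop(c);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    Arc::strong_count(&shared)
}

fn main() {
    // All clones dropped: only the original reference should remain.
    assert_eq!(contended_clone_drop(8, 100_000), 1);
}
```

Wrapping the inner loop in a Criterion measurement (with the threads set up via bench_with_input) would keep the thread-spawn cost out of the timed region.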

Also, you may want to test something that can flush out the false sharing problem since that might be interesting if Arc and SharedTrc differ in performance there.
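A false-sharing probe could look something like the following sketch: two threads hammer two independent atomic counters, once with the counters adjacent on (very likely) the same cache line, and once with each counter padded out to its own 64-byte line (a common x86_64 line size, assumed here):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::{Duration, Instant};

// Force each counter onto its own cache line (64 bytes assumed).
#[repr(align(64))]
struct Padded(AtomicU64);

// Adjacent counters: almost certainly share a cache line.
static SHARED: [AtomicU64; 2] = [AtomicU64::new(0), AtomicU64::new(0)];
// Padded counters: one line each.
static PADDED: [Padded; 2] = [Padded(AtomicU64::new(0)), Padded(AtomicU64::new(0))];

// Two threads, each incrementing its own counter; only the cache-line
// placement differs between the two runs.
fn hammer(get: fn(usize) -> &'static AtomicU64, iters: u64) -> Duration {
    let t = Instant::now();
    thread::scope(|s| {
        for i in 0..2 {
            s.spawn(move || {
                let c = get(i);
                for _ in 0..iters {
                    c.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    t.elapsed()
}

fn main() {
    let iters = 5_000_000;
    let same = hammer(|i| &SHARED[i], iters);
    let padded = hammer(|i| &PADDED[i].0, iters);
    println!("same line: {same:?}, padded: {padded:?}");
}
```

A large gap between the two timings indicates false sharing is in play; running Arc and SharedTrc workloads with analogous layouts would show whether they differ here.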

EricLBuehler (Owner) commented:

Hi @workingjubilee, I am currently running my benchmarks in a WSL2 instance on my Windows machine. I have noticed that with pure Instant code the granularity is 100 ns. This is why I was averaging the times before I used Criterion.

As for a contended clone-and-drop benchmark, I will be sure to add that - I will also look into using bench_with_input. Do you think that using std::thread or rayon would be better for this? I may use both for more diversity of benchmarks.

The false sharing problem seems to affect both Arc and Trc. However, I would expect Trc to suffer more, as the data and the local refcount are kept separate. This is a design necessity, and I am not sure how it could be improved. Perhaps the data could be cached alongside the local refcount; this would also not interfere with get_mut.


workingjubilee commented Jul 9, 2023

std::thread means you may have to bump the operations to 1000 in order to get modestly useful benchmarks; rayon may be more like 10000.

A lot of people will be using rayon to do stuff, so it'd be good to have a test that uses that, yes.

You can run bare-metal Linux to get better precision for these benchmarks, I believe, as Windows doesn't expose as precise a timer, and WSL2 is Linux run in a hypervisor, so some of Windows and its quirks are still "visible" (it is ultimately basically Linux running inside a cut-down Windows).

EricLBuehler (Owner) commented:

Thank you for the advice - I have implemented those tests! See the readme for the results. Please let me know if you think this issue should be closed, or whether there are further criteria to meet.

I plan on adding bare-metal Linux benchmarks soon.

EricLBuehler (Owner) commented:

As of commit a37e0e6, I believe that these issues are fixed. Please let me know your thoughts.
