Inadequate microbenchmarks #9
What platform do you run your benchmarks on? `WallTime` is the default measurement in Criterion.rs. From Criterion's source:

```rust
/// `WallTime` is the default measurement in Criterion.rs. It measures the elapsed time from the
/// beginning of a series of iterations to the end.
pub struct WallTime;
impl Measurement for WallTime {
    type Intermediate = Instant;
    type Value = Duration;
    fn start(&self) -> Self::Intermediate {
        Instant::now()
    }
    fn end(&self, i: Self::Intermediate) -> Self::Value {
        i.elapsed()
    }
    fn add(&self, v1: &Self::Value, v2: &Self::Value) -> Self::Value {
        *v1 + *v2
    }
    fn zero(&self) -> Self::Value {
        Duration::from_secs(0)
    }
    fn to_f64(&self, val: &Self::Value) -> f64 {
        val.as_nanos() as f64
    }
    fn formatter(&self) -> &dyn ValueFormatter {
        &DurationFormatter
    }
}
```

The problem is that, depending on your CPU and OS, this may expose more or less granular time measurements, so the numbers you get may fail to capture the subtleties of timing: the floor on how little time can be measured may be too high. And some OSes have... other quirks going on in their timing APIs, unfortunately.

An important benchmark would be to actually create an Arc or SharedTrc (depending on the case), spawn some threads, and then test how the types fare when subjected to the contended clone-and-drop case, so that the refcounts are going up and down while the CPU has to manage both increments and decrements across multiple threads. You may find rayon helpful for this, but this would need to consider... many more than 100 operations, as merely spawning the threads and running the work queue with rayon would otherwise become the largest thing in the benchmark. Directly managing thread spawning might be preferable. There's also probably ways to use Criterion's

Also, you may want to test something that can flush out the false sharing problem, since that might be interesting if Arc and SharedTrc differ in performance there.
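The contended clone-and-drop case described above can be sketched with plain `std::thread` (no rayon). This is a minimal illustration, not Trc's actual benchmark harness; the function name and parameters are made up for this sketch:

```rust
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Clone and drop a shared pointer `ops` times on each of `threads` threads,
/// so the atomic refcount is contended across cores. Returns elapsed wall time.
fn contended_clone_drop<T: Send + Sync + 'static>(value: T, threads: usize, ops: usize) -> Duration {
    let shared = Arc::new(value);
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let local = Arc::clone(&shared);
            thread::spawn(move || {
                for _ in 0..ops {
                    // Clone increments the refcount; drop decrements it.
                    let c = Arc::clone(&local);
                    drop(c);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    start.elapsed()
}

fn main() {
    // Many thousands of operations per thread, so that thread spawning
    // does not dominate the measurement.
    let elapsed = contended_clone_drop(0u64, 4, 10_000);
    println!("4 threads x 10_000 clone/drop pairs took {:?}", elapsed);
}
```

The same body could be wrapped in a Criterion `bench_function` closure; timing the whole multi-thread run per iteration keeps spawn overhead amortized across the operation count.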
Hi @workingjubilee, I am currently running my benchmarks on a WSL2 instance on my Windows machine. I have noticed that with pure

As for a contended clone-and-drop benchmark, I will be sure to add that; I will also look into using rayon.

The false sharing problem seems to be something that both Arc and Trc have. However, I would think that Trc has more of it, as the data and local refcount are kept separate. This is a design necessity, though, and I am not sure how it could be improved. Perhaps the data could be cached with the local refcount; this would also not interfere with
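A common mitigation for false sharing is padding each independently updated counter out to its own cache line, which is what crossbeam's `CachePadded` wrapper does. A minimal sketch using only std (the 64-byte line size and the `LocalCount` layout are assumptions for illustration, not Trc's actual layout):

```rust
// Assume a 64-byte cache line; this is common on x86-64 but not guaranteed
// on every CPU.
#[repr(align(64))]
struct CachePadded<T>(T);

// Hypothetical per-thread state: padding the local count keeps writes to it
// from invalidating a cache line shared with a neighboring thread's count.
struct LocalCount {
    count: CachePadded<usize>,
}

fn main() {
    // The wrapper forces the alignment (and thus spacing) up to 64 bytes.
    assert_eq!(std::mem::align_of::<CachePadded<usize>>(), 64);
    let lc = LocalCount { count: CachePadded(1) };
    println!("count = {}", lc.count.0);
}
```

The trade-off is memory: every counter now occupies a full line, which is why it is usually applied only to hot, per-thread fields.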
Using `std::thread` means you may have to bump the operations to 1000 in order to get modestly useful benchmarks; with rayon it may be more like 10,000. A lot of people will be using rayon to do stuff, so it'd be good to have a test that uses it, yes. You can run bare-metal Linux in order to get better precision for these benchmarks, I believe: Windows doesn't expose as precise a timer, and WSL2 is Linux run in a hypervisor, so some of Windows and its quirks are still "visible" (it is ultimately running Linux inside a cut-down Windows).
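The timer-precision concern above can be checked empirically: repeatedly read the clock until it advances and record the smallest observed step. This is a rough probe of the platform's effective `Instant` granularity, not a rigorous measurement; the function name is made up for this sketch:

```rust
use std::time::{Duration, Instant};

/// Estimate the smallest nonzero interval `Instant` can observe on this
/// platform by spinning until the clock ticks, keeping the minimum step.
fn clock_granularity() -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..1000 {
        let t0 = Instant::now();
        let mut t1 = Instant::now();
        // Spin until the monotonic clock advances past t0.
        while t1 == t0 {
            t1 = Instant::now();
        }
        let step = t1 - t0;
        if step < best {
            best = step;
        }
    }
    best
}

fn main() {
    println!("observed clock granularity: {:?}", clock_granularity());
}
```

A benchmark whose per-iteration cost is near or below this granularity mostly measures the clock, which is why Criterion times batches of iterations rather than single calls.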
Thank you for the advice; I have implemented those tests! See the README for the results. Please let me know if you think this issue should be closed, or if there are further criteria. I plan on adding bare-metal Linux benchmarks soon.
As of commit a37e0e6, I believe that these issues are fixed. Please let me know your thoughts.
Your benchmarks are unrealistically simple and do not cover common concerns:
And you should make sure the tested functions are actually generating different assembly for these types and operations when compiled. Otherwise, you may actually be microbenchmarking something else, like the quality of LLVM's optimizer in piercing certain abstractions. No, "black box" functions are not completely immune to the optimizer's gaze.
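To illustrate the point about "black box" functions: `std::hint::black_box` (stable since Rust 1.66) only *hints* that a value is opaque, so the surrounding code should still be checked in the generated assembly (e.g. with `rustc --emit=asm` or the `cargo-show-asm` tool) rather than trusted blindly. A minimal sketch of wrapping the benchmarked operation:

```rust
use std::hint::black_box;
use std::sync::Arc;

/// One clone/drop pair, with black_box hints so the optimizer is less
/// likely to elide the refcount traffic entirely. This is only a hint,
/// not a guarantee: verify the assembly still contains the atomic ops.
fn clone_drop_once(shared: &Arc<u64>) {
    let c = black_box(Arc::clone(shared));
    drop(black_box(c));
}

fn main() {
    let a = Arc::new(7u64);
    clone_drop_once(&a);
    // The pair is balanced, so only the original reference remains.
    println!("strong_count = {}", Arc::strong_count(&a));
}
```

The same caveat applies to comparing Arc against SharedTrc: if both compile down to identical instructions for a given operation, the benchmark is measuring noise, not a design difference.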