Inadequate microbenchmarks #9
What platform do you run your benchmarks on? `WallTime` is the default measurement in Criterion.rs. From Criterion's source:

```rust
/// `WallTime` is the default measurement in Criterion.rs. It measures the elapsed time from the
/// beginning of a series of iterations to the end.
pub struct WallTime;
impl Measurement for WallTime {
    type Intermediate = Instant;
    type Value = Duration;
    fn start(&self) -> Self::Intermediate {
        Instant::now()
    }
    fn end(&self, i: Self::Intermediate) -> Self::Value {
        i.elapsed()
    }
    fn add(&self, v1: &Self::Value, v2: &Self::Value) -> Self::Value {
        *v1 + *v2
    }
    fn zero(&self) -> Self::Value {
        Duration::from_secs(0)
    }
    fn to_f64(&self, val: &Self::Value) -> f64 {
        val.as_nanos() as f64
    }
    fn formatter(&self) -> &dyn ValueFormatter {
        &DurationFormatter
    }
}
```

The problem is that, depending on your CPU and OS, this may expose more or less granular time measurements, so the numbers you get may fail to capture the subtleties of timing: the floor on how little time can be measured may be too high. And some OSes have... other quirks going on in their timing APIs, unfortunately.

An important benchmark would be to actually create an Arc or SharedTrc (depending on the case), spawn some threads, and then test how the types fare when subjected to the contended clone-and-drop case, so that the refcounts are going up and down while the CPU has to manage both increments and decrements across multiple threads. You may find rayon helpful for this, but this would need to consider... many more than 100 operations, as merely spawning the threads and running the work queue with rayon would otherwise become the largest thing in the benchmark. Directly managing thread spawning might be preferable. There's also probably ways to use Criterion's

Also, you may want to test something that can flush out the false sharing problem, since that might be interesting if Arc and SharedTrc differ in performance there.
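The contended clone-and-drop case described above can be sketched with plain `std::thread` (no rayon). This is a minimal illustration, not Trc's actual benchmark harness; the function name and parameters are made up for this sketch:

```rust
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Clone and drop a shared pointer `ops` times on each of `threads` threads,
/// so the atomic refcount is contended across cores. Returns elapsed wall time.
fn contended_clone_drop<T: Send + Sync + 'static>(value: T, threads: usize, ops: usize) -> Duration {
    let shared = Arc::new(value);
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let local = Arc::clone(&shared);
            thread::spawn(move || {
                for _ in 0..ops {
                    // Clone increments the refcount; drop decrements it.
                    let c = Arc::clone(&local);
                    drop(c);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    start.elapsed()
}

fn main() {
    // Many thousands of operations per thread, so that thread spawning
    // does not dominate the measurement.
    let elapsed = contended_clone_drop(0u64, 4, 10_000);
    println!("4 threads x 10_000 clone/drop pairs took {:?}", elapsed);
}
```

The same body could be wrapped in a Criterion `bench_function` closure; timing the whole multi-thread run per iteration keeps spawn overhead amortized across the operation count.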
Hi @workingjubilee, I am currently running my benchmarks on a WSL2 instance on my Windows machine. I have noticed that with pure

As for a contended clone-and-drop benchmark, I will be sure to add that; I will also look into using rayon.

The false sharing problem seems to be something that both Arc and Trc have. However, I would think that Trc has more of it, as the data and local refcount are kept separate. This is a design necessity, though, and I am not sure how it could be improved. Perhaps the data could be cached with the local refcount; this would also not interfere with
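A common mitigation for false sharing is padding each independently updated counter out to its own cache line, which is what crossbeam's `CachePadded` wrapper does. A minimal sketch using only std (the 64-byte line size and the `LocalCount` layout are assumptions for illustration, not Trc's actual layout):

```rust
// Assume a 64-byte cache line; this is common on x86-64 but not guaranteed
// on every CPU.
#[repr(align(64))]
struct CachePadded<T>(T);

// Hypothetical per-thread state: padding the local count keeps writes to it
// from invalidating a cache line shared with a neighboring thread's count.
struct LocalCount {
    count: CachePadded<usize>,
}

fn main() {
    // The wrapper forces the alignment (and thus spacing) up to 64 bytes.
    assert_eq!(std::mem::align_of::<CachePadded<usize>>(), 64);
    let lc = LocalCount { count: CachePadded(1) };
    println!("count = {}", lc.count.0);
}
```

The trade-off is memory: every counter now occupies a full line, which is why it is usually applied only to hot, per-thread fields.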
Using `std::thread` means you may have to bump the operations to 1000 in order to get modestly useful benchmarks; with rayon it may be more like 10,000. A lot of people will be using rayon to do stuff, so it'd be good to have a test that uses it, yes. You can run bare-metal Linux in order to get better precision for these benchmarks, I believe: Windows doesn't expose as precise a timer, and WSL2 is Linux run in a hypervisor, so some of Windows and its quirks are still "visible" (it is ultimately running Linux inside a cut-down Windows).
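The timer-precision concern above can be checked empirically: repeatedly read the clock until it advances and record the smallest observed step. This is a rough probe of the platform's effective `Instant` granularity, not a rigorous measurement; the function name is made up for this sketch:

```rust
use std::time::{Duration, Instant};

/// Estimate the smallest nonzero interval `Instant` can observe on this
/// platform by spinning until the clock ticks, keeping the minimum step.
fn clock_granularity() -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..1000 {
        let t0 = Instant::now();
        let mut t1 = Instant::now();
        // Spin until the monotonic clock advances past t0.
        while t1 == t0 {
            t1 = Instant::now();
        }
        let step = t1 - t0;
        if step < best {
            best = step;
        }
    }
    best
}

fn main() {
    println!("observed clock granularity: {:?}", clock_granularity());
}
```

A benchmark whose per-iteration cost is near or below this granularity mostly measures the clock, which is why Criterion times batches of iterations rather than single calls.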
Thank you for the advice; I have implemented those tests! See the README for the results. Please let me know if you think this issue should be closed, or if there are further criteria. I plan on adding bare-metal Linux benchmarks soon.
As of commit a37e0e6, I believe that these issues are fixed. Please let me know your thoughts.
Your benchmarks are unrealistically simple and do not cover common concerns:
And you should make sure the tested functions are actually generating different assembly for these types and operations when compiled. Otherwise, you may actually be microbenchmarking something else, like the quality of LLVM's optimizer in piercing certain abstractions. No, "black box" functions are not completely immune to the optimizer's gaze.
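To illustrate the point about "black box" functions: `std::hint::black_box` (stable since Rust 1.66) only *hints* that a value is opaque, so the surrounding code should still be checked in the generated assembly (e.g. with `rustc --emit=asm` or the `cargo-show-asm` tool) rather than trusted blindly. A minimal sketch of wrapping the benchmarked operation:

```rust
use std::hint::black_box;
use std::sync::Arc;

/// One clone/drop pair, with black_box hints so the optimizer is less
/// likely to elide the refcount traffic entirely. This is only a hint,
/// not a guarantee: verify the assembly still contains the atomic ops.
fn clone_drop_once(shared: &Arc<u64>) {
    let c = black_box(Arc::clone(shared));
    drop(black_box(c));
}

fn main() {
    let a = Arc::new(7u64);
    clone_drop_once(&a);
    // The pair is balanced, so only the original reference remains.
    println!("strong_count = {}", Arc::strong_count(&a));
}
```

The same caveat applies to comparing Arc against SharedTrc: if both compile down to identical instructions for a given operation, the benchmark is measuring noise, not a design difference.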