Fix harness to not call timer repeatedly in the measured loop. #38
Conversation
cc @lrhn
You should add a note to the changelog, bump the version, etc.
In general we assume that […]. Note that the change which is done here makes […]
I'll admit that the […]. You could use the warm-up phase to figure out how fast […]. (I would also never use a package just to write a benchmark, but probably because I'm not writing a package to begin with, so I'd have to create a package first.)
Then fix Golem!!!! Seriously, that's the worst reason I've ever heard. "The data set is bad, but if we fix it, it becomes inconsistent, so we're going to keep collecting bad data...." WAT?

This package is being advanced as best practice for benchmarking, and it's broken. Yes, it's true that on most platforms the underlying hardware provides fast (O(cycles)) access to timer data, and the compiler will inline the virtual instance call to the stopwatch methods. But why ship a package that relies on both of those if you don't have to? What if someone wants to do something crazy like benchmark on an interpreter? We're picking up new platforms left and right: are you 100% sure that none of those will ever require software assistance (maybe even a syscall) to get timing data? Why ship something that's just going to silently give bogus data for those cases when there are numerous straightforward ways to do better?
If the benchmark is written the right way, the collected data is not bad. As I mentioned before, one just needs to make sure that […]. I do agree that it is suboptimal, error-prone, and not documented anywhere, and yes, it would be nice to have this fixed. However, we also don't have the capacity to backfill the years of benchmarking data that were accumulated, so we need to make some sort of decision:
That's the perspective I was trying to communicate. Sorry if it came out unappreciative of your effort to fix the harness.
Totally agree, for handcrafted benchmarks written by someone who understands all the issues, and is aware of the limitations of the platforms they are running on. But that's not what this package is for, right?
Surely you must already have to deal with discontinuities in the data? What happens when we upgrade V8? Or buy new machines? Or install a new version of the OS, or...? I guess I don't really see how you can expect to both keep everything frozen so you get comparable data, and get good data.
Oh, no problem. I mean, I don't have to use the package, so... whatever :) And sorry if I came across strongly. But I do think this should be fixed.... :)
Ok, just to make this point concretely, I took the benchmark that @munificent wrote and added a second call to […]. I've uploaded my patch as a branch in case anyone wants to check my work (worth doing, this was a really quick hack). I hacked the output to include a % overhead, computed as the difference between the two rates as a percentage of the rate of the version with one call. Numbers are below:
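The patch itself isn't reproduced here, but the shape of the experiment can be sketched as follows. This is a hypothetical reconstruction, not the actual branch: the function names and the 2-second budget are illustrative. The idea is to time a loop that reads the stopwatch once per iteration, time the same loop with a second read added, and report the difference in rates as a percentage of the single-read rate.

```dart
// Hypothetical reconstruction of the experiment described above: time a
// trivial workload with one extra Stopwatch read per iteration, then with
// two, and report the cost of the second read as a percentage.
double rate(int extraTimerReads) {
  final watch = Stopwatch()..start();
  var iterations = 0;
  var sink = 0; // Keeps the loop body from being optimized away.
  while (watch.elapsedMicroseconds < 2000000) {
    sink ^= iterations; // Stand-in for the benchmark's run() method.
    for (var i = 0; i < extraTimerReads; i++) {
      sink ^= watch.elapsedMicroseconds;
    }
    iterations++;
  }
  return iterations / watch.elapsedMicroseconds; // Iterations per microsecond.
}

void main() {
  final oneCall = rate(1);
  final twoCalls = rate(2);
  final overheadPercent = (oneCall - twoCalls) / oneCall * 100;
  print('timer overhead: ${overheadPercent.toStringAsFixed(1)}%');
}
```

On platforms with a cheap hardware timer the reported overhead will be small; the argument in this thread is that nothing guarantees it stays small everywhere.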
There is a fairly simple thing you can do to recognize fast-running […]
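The sentence above is cut off, but one common way to recognize fast-running code without timing every iteration is to probe it with geometrically growing batches until a batch takes long enough to time reliably. A sketch under that assumption (`calibrateIterations` is an illustrative name, not part of the package's API):

```dart
// Sketch: detect a fast-running benchmark with short probes, doubling the
// batch size until a whole batch is long enough to time reliably. The two
// timer reads per batch are then amortized over many iterations.
int calibrateIterations(void Function() run, {int probeMicros = 100000}) {
  var iterations = 1;
  while (true) {
    final watch = Stopwatch()..start();
    for (var i = 0; i < iterations; i++) {
      run();
    }
    watch.stop();
    if (watch.elapsedMicroseconds >= probeMicros) {
      return iterations; // Slow enough to time as a single batch.
    }
    iterations *= 2; // Too fast to resolve; try a bigger batch.
  }
}
```

Once calibrated, the measured loop can run a fixed number of iterations between exactly two timer reads.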
@mraleph let me try this again, see if I can be clearer. If the benchmark is written the right way, the collected data is not bad and this change should not change the data. If the benchmark is not written the right way, then the collected data is bad, the change will change the data... and that's good. So I'm not seeing the argument for not fixing.
I don't really disagree that this needs to be fixed.
@leafpetersen – want to run […]?
Need to update changelog.
Is this big enough to be considered a breaking change? Move to v2?
Has this PR been abandoned?
I'd guess... yes.
I appreciate the problem and the solutions proposed in this pull request. It is a genuine problem for our benchmarks that the run() method is not appropriately tuned, which is difficult to do: it has to be expensive enough for fast developer machines and not too expensive for our slower embedded devices. Improperly tuned benchmarks are the root cause of jumpy and noisy benchmarks. The solution requires a data-driven approach.

Right now benchmarks warm up for 100 ms and run for 2000 ms, and the exercise method calls the run method 10 times to avoid timer overhead, which causes the benchmarks to report a number that's off by 10x. These parameters have not been tuned in many years and were not picked on the basis of real data. A couple of years ago, I did some experiments where I measured the time to run each iteration and saw many benchmarks that were not properly warmed up or had jumpy performance. Now that we've unified on […]. Once we have that data, we'll be able to make a proper solution to the root problem this PR aims to improve.

I'm not opposed to this PR, but I very much would prefer an experiment on the actual production benchmarking system that confirms whether it gives more accurate benchmark data. E.g. the proposed code has a cutoff where, if a benchmark is just slow enough, it might run for twice as many iterations, which could result in a very jumpy benchmark if it was close to that threshold. This is a real problem in practice already for benchmarks where the run method is massively too expensive on certain platforms and the exercise method only runs a couple of times or even once.
I don't have strong opinions about how this gets fixed. I do continue to be of the opinion (supported, I believe, by the data I presented above) that calling timers repeatedly inside of the benchmark loop is very bad practice, and should not be our recommended solution.
This PR fixes an obvious problem with the benchmark runner, which is that it includes timer calls in the thing being benchmarked. Fixing this should only improve things because (1) we no longer include unrelated computation in the results being reported and (2) it's unclear what kind of impact system calls like […] have on the measured code. So the problem of outliers in benchmarks or unstable benchmark results will potentially be improved by this PR. If we still want the noise that […]
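For concreteness, the two patterns being contrasted can be sketched like this. This is a hedged approximation of the harness's `measureFor`-style logic, not a quote of the actual diff:

```dart
// Before (sketch of the old harness): the stopwatch is read on every
// iteration of the measured loop, so the timer's cost, and any syscall
// behind it, is folded into the reported score.
double measureForOld(void Function() run, int minimumMicros) {
  final watch = Stopwatch()..start();
  var iterations = 0;
  while (watch.elapsedMicroseconds < minimumMicros) { // Timer read per iteration.
    run();
    iterations++;
  }
  return watch.elapsedMicroseconds / iterations;
}

// After (sketch of the fix): a fixed batch runs between exactly two timer
// reads, so timer cost is amortized over the whole batch.
double measureForNew(void Function() run, int iterations) {
  final watch = Stopwatch()..start();
  for (var i = 0; i < iterations; i++) {
    run();
  }
  watch.stop();
  return watch.elapsedMicroseconds / iterations;
}
```

In the "after" shape the timer contributes a constant two reads regardless of how many iterations run, which is why it cannot skew per-iteration scores.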
I think we should go ahead and land this and roll it into Golem. |
This CL in its current state looks like it can run the benchmark for almost twice as long as desired, which is an expensive increase in load. I'm generally not happy about making ad-hoc changes to the harness without doing a qualitative analysis with the huge data set we have available. We can prioritize doing this work; Bill and I have a lot of ideas, but other work has taken priority.
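The "almost twice as long" figure follows from geometric growth: with a doubling schedule, the batches before the final one sum to roughly the size of the final batch, so if that batch just crosses the minimum, total time approaches 2x the requested minimum. A toy model of this (the doubling schedule and constants are assumed for illustration, not taken from the CL):

```dart
// Toy model of a doubling measurement loop: each batch takes time
// proportional to its iteration count, and the loop stops once a single
// batch meets the minimum. Because 2 + 4 + ... + 2^k is about twice 2^k,
// the total time can approach 2x the requested minimum.
void main() {
  const microsPerIteration = 1.0;
  const minimumMicros = 1000000.0;
  var iterations = 2;
  var totalMicros = 0.0;
  while (true) {
    final batchMicros = iterations * microsPerIteration;
    totalMicros += batchMicros;
    if (batchMicros >= minimumMicros) break;
    iterations *= 2;
  }
  print('ran for ${(totalMicros / minimumMicros).toStringAsFixed(2)}x '
      'the requested minimum');
}
```

Growing by a smaller factor (say 1.5x) or estimating the final batch size from earlier batches would shrink this worst case at the cost of a few more timer reads.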
@sortie As a consumer of benchmark results I would prefer that I get accurate data even if I have to wait longer for results. I think we should be pragmatic here - land and roll this in Golem and observe the metrics. If we discover that benchmark result latency has become unacceptably high then we can easily roll the change back and figure out alternative approaches. |
lib/src/benchmark_base.dart
Outdated
}

// Measures the score for this benchmark by executing it repeatedly until
// time minimum has been reached.
Consider making these triple-slash (///) comments so they show up in the API docs (also for the other methods).
Improve convergence yet more
Not a breaking change. pubspec.yaml and CHANGELOG.md updated.
@mkustermann could you take another quick look? I am going to merge after that.
I have checked on a few simple benchmarks, and runtime is not significantly affected by the change (e.g. I have not seen anything like a 2x slowdown), so overall the change looks safe to land.
LGTM
…archive/benchmark_harness#38) Co-authored-by: Ömer Sinan Ağacan <omersa@google.com> Co-authored-by: Vyacheslav Egorov <vegorov@google.com>
The timer is probably fast enough not to matter on most platforms that you'll encounter, but why take chances with your health?
cc @mraleph @munificent