Fix harness to not call timer repeatedly in the measured loop. #38
Conversation
cc @lrhn
You should add a note to the changelog, bump the version, etc.
In general we assume that […]. Note that the change which is done here makes […]
I'll admit that the […]. You could use the warm-up phase to figure out how fast […]. (I would also never use a package just to write a benchmark, but probably because I'm not writing a package to begin with, so I'd have to create a package first.)
Then fix Golem!!!! Seriously, that's the worst reason I've ever heard. "The data set is bad, but if we fix it, it becomes inconsistent, so we're going to keep collecting bad data...." WAT?

This package is being advanced as best practice for benchmarking, and it's broken. Yes, it's true that on most platforms the underlying hardware provides fast (O(cycles)) access to timer data, and the compiler will inline the virtual instance call to the stopwatch methods. But why ship a package that relies on both of those if you don't have to? What if someone wants to do something crazy like benchmark on an interpreter? We're picking up new platforms left and right: are you 100% sure that none of those will ever require software assistance (maybe even a syscall) to get timing data? Why ship something that's just going to silently give bogus data for those cases when there are numerous straightforward ways to do better?
If the benchmark is written the right way, the collected data is not bad. As I mentioned before, one just needs to make sure that […]. I do agree that it is suboptimal, error-prone, and not documented anywhere, and yes, it would be nice to have this fixed. However, we also don't have the capacity to backfill the years of benchmarking data that were accumulated, so we need to make some sort of decision:
That's the perspective I was trying to communicate. Sorry if it came out unappreciative of your effort to fix the harness.
Totally agree, for handcrafted benchmarks written by someone who understands all the issues, and is aware of the limitations of the platforms they are running on. But that's not what this package is for, right?
Surely you must already have to deal with discontinuities in the data? What happens when we upgrade V8? Or buy new machines? Or install a new version of the OS, or...? I guess I don't really see how you can expect to both keep everything frozen so you get comparable data, and get good data.
Oh, no problem. I mean, I don't have to use the package, so... whatever :) And sorry if I came across strongly. But I do think this should be fixed.... :)
Ok, just to make this point concretely, I took the benchmark that @munificent wrote and added a second call to […]. I've uploaded my patch as a branch in case anyone wants to check my work (worth doing, this was a really quick hack). I hacked the output to include a % overhead, computed as the difference between the two rates as a percentage of the rate of the version with one call. Numbers are below:
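The patch itself isn't reproduced here, but the shape of the experiment can be sketched as follows. This is a hypothetical reconstruction, not the actual branch: the function names and the 2-second budget are illustrative. The idea is to time a loop that reads the stopwatch once per iteration, time the same loop with a second read added, and report the difference in rates as a percentage of the single-read rate.

```dart
// Hypothetical reconstruction of the experiment described above: time a
// trivial workload with one extra Stopwatch read per iteration, then with
// two, and report the cost of the second read as a percentage.
double rate(int extraTimerReads) {
  final watch = Stopwatch()..start();
  var iterations = 0;
  var sink = 0; // Keeps the loop body from being optimized away.
  while (watch.elapsedMicroseconds < 2000000) {
    sink ^= iterations; // Stand-in for the benchmark's run() method.
    for (var i = 0; i < extraTimerReads; i++) {
      sink ^= watch.elapsedMicroseconds;
    }
    iterations++;
  }
  return iterations / watch.elapsedMicroseconds; // Iterations per microsecond.
}

void main() {
  final oneCall = rate(1);
  final twoCalls = rate(2);
  final overheadPercent = (oneCall - twoCalls) / oneCall * 100;
  print('timer overhead: ${overheadPercent.toStringAsFixed(1)}%');
}
```

On platforms with a cheap hardware timer the reported overhead will be small; the argument in this thread is that nothing guarantees it stays small everywhere.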
There is a fairly simple thing you can do to recognize fast-running […]
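The sentence above is cut off, but one common way to recognize fast-running code without timing every iteration is to probe it with geometrically growing batches until a batch takes long enough to time reliably. A sketch under that assumption (`calibrateIterations` is an illustrative name, not part of the package's API):

```dart
// Sketch: detect a fast-running benchmark with short probes, doubling the
// batch size until a whole batch is long enough to time reliably. The two
// timer reads per batch are then amortized over many iterations.
int calibrateIterations(void Function() run, {int probeMicros = 100000}) {
  var iterations = 1;
  while (true) {
    final watch = Stopwatch()..start();
    for (var i = 0; i < iterations; i++) {
      run();
    }
    watch.stop();
    if (watch.elapsedMicroseconds >= probeMicros) {
      return iterations; // Slow enough to time as a single batch.
    }
    iterations *= 2; // Too fast to resolve; try a bigger batch.
  }
}
```

Once calibrated, the measured loop can run a fixed number of iterations between exactly two timer reads.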
@mraleph let me try this again, see if I can be clearer. If the benchmark is written the right way, the collected data is not bad and this change should not change the data. If the benchmark is not written the right way, then the collected data is bad, the change will change the data... and that's good. So I'm not seeing the argument for not fixing.
I don't really disagree that this needs to be fixed.
@leafpetersen – want to run […]?
Need to update changelog.
Is this big enough to be considered a breaking change? Move to v2?
Has this PR been abandoned?
I'd guess... yes.
I appreciate the problem and the solutions proposed in this pull request. It is a genuine problem for our benchmarks that the run() method is not appropriately tuned, which is difficult to do: it has to be expensive enough for fast developer machines and not too expensive for our slower embedded devices. Improperly tuned benchmarks are the root cause of jumpy and noisy benchmarks. The solution requires a data-driven approach.

Right now benchmarks warm up for 100 ms and run for 2000 ms, and the exercise method calls the run method 10 times to avoid timer overhead, which causes the benchmarks to report a number that's off by 10x. These parameters have not been tuned in many years and were not picked on the basis of real data. A couple of years ago, I did some experiments where I measured the time to run each iteration and saw many benchmarks that were not properly warmed up or had jumpy performance. Now that we've unified on […]. Once we have that data, we'll be able to make a proper solution to the root problem this PR aims to improve.

I'm not opposed to this PR, but I very much would prefer an experiment on the actual production benchmarking system that confirms whether it gives more accurate benchmark data. E.g. the proposed code has a cutoff where, if a benchmark is just slow enough, it might run for twice as many iterations, which could result in a very jumpy benchmark if it was close to that threshold. This is a real problem in practice already for benchmarks where the run method is massively too expensive on certain platforms and the exercise method only runs a couple of times or even once.
I don't have strong opinions about how this gets fixed. I do continue to be of the opinion (supported, I believe, by the data I presented above) that calling timers repeatedly inside of the benchmark loop is very bad practice, and should not be our recommended solution.
This PR fixes an obvious problem with the benchmark runner, which is that it includes timer calls in the thing being benchmarked. Fixing this should only improve things because (1) we no longer include unrelated computation in the results being reported and (2) it's unclear what kind of impact system calls like […] have on the measured code. So the problem of outliers in benchmarks or unstable benchmark results will potentially be improved by this PR. If we still want the noise that […]
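For concreteness, the two patterns being contrasted can be sketched like this. This is a hedged approximation of the harness's `measureFor`-style logic, not a quote of the actual diff:

```dart
// Before (sketch of the old harness): the stopwatch is read on every
// iteration of the measured loop, so the timer's cost, and any syscall
// behind it, is folded into the reported score.
double measureForOld(void Function() run, int minimumMicros) {
  final watch = Stopwatch()..start();
  var iterations = 0;
  while (watch.elapsedMicroseconds < minimumMicros) { // Timer read per iteration.
    run();
    iterations++;
  }
  return watch.elapsedMicroseconds / iterations;
}

// After (sketch of the fix): a fixed batch runs between exactly two timer
// reads, so timer cost is amortized over the whole batch.
double measureForNew(void Function() run, int iterations) {
  final watch = Stopwatch()..start();
  for (var i = 0; i < iterations; i++) {
    run();
  }
  watch.stop();
  return watch.elapsedMicroseconds / iterations;
}
```

In the "after" shape the timer contributes a constant two reads regardless of how many iterations run, which is why it cannot skew per-iteration scores.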
I think we should go ahead and land this and roll it into Golem. |
This CL in its current state looks like it can run the benchmark for almost twice as long as desired, which is an expensive increase in load. I'm generally not happy about making ad-hoc changes to the harness without doing a qualitative analysis with the huge data set we have available. We can prioritize doing this work; Bill and I have a lot of ideas, but other work has taken priority.
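The "almost twice as long" figure follows from geometric growth: with a doubling schedule, the batches before the final one sum to roughly the size of the final batch, so if that batch just crosses the minimum, total time approaches 2x the requested minimum. A toy model of this (the doubling schedule and constants are assumed for illustration, not taken from the CL):

```dart
// Toy model of a doubling measurement loop: each batch takes time
// proportional to its iteration count, and the loop stops once a single
// batch meets the minimum. Because 2 + 4 + ... + 2^k is about twice 2^k,
// the total time can approach 2x the requested minimum.
void main() {
  const microsPerIteration = 1.0;
  const minimumMicros = 1000000.0;
  var iterations = 2;
  var totalMicros = 0.0;
  while (true) {
    final batchMicros = iterations * microsPerIteration;
    totalMicros += batchMicros;
    if (batchMicros >= minimumMicros) break;
    iterations *= 2;
  }
  print('ran for ${(totalMicros / minimumMicros).toStringAsFixed(2)}x '
      'the requested minimum');
}
```

Growing by a smaller factor (say 1.5x) or estimating the final batch size from earlier batches would shrink this worst case at the cost of a few more timer reads.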
@sortie As a consumer of benchmark results I would prefer that I get accurate data even if I have to wait longer for results. I think we should be pragmatic here - land and roll this in Golem and observe the metrics. If we discover that benchmark result latency has become unacceptably high then we can easily roll the change back and figure out alternative approaches. |
lib/src/benchmark_base.dart
Outdated
}

// Measures the score for this benchmark by executing it repeatedly until
// time minimum has been reached.
Consider making these triple-slash (///) comments so they show up in the API docs (also for the other methods).
Improve convergence yet more
Not a breaking change. pubspec.yaml and CHANGELOG.md updated.
@mkustermann could you take another quick look? I am going to merge after that.
I have checked on a few simple benchmarks, and runtime is not significantly affected by the change (e.g. I have not seen anything like a 2x slowdown), so overall the change looks safe to land.
LGTM
…archive/benchmark_harness#38) Co-authored-by: Ömer Sinan Ağacan <omersa@google.com> Co-authored-by: Vyacheslav Egorov <vegorov@google.com>
The timer is probably fast enough not to matter on most platforms that you'll encounter, but why take chances with your health?
cc @mraleph @munificent