[benchmark] Groundwork for Robust Measurements #30158
Conversation
… using utils/python_format.py."
Adjusted how merged PerformanceTestResults track the number of underlying samples when using quantile subsampling.
In text reports, don’t justify the last columns with unnecessary spaces.
Support for invoking benchmark drivers with min-samples and gathering environmental metadata.
For dubious result comparisons, print out empirical sample distribution (ventiles) to enable humans to reach informed decisions about these performance changes.
After commit 331c0bf from a year ago, all samples from the same run have the same num-iters.
Store the number of iterations averaged in each sample on the PerformanceTestSamples.
Removed the Sample class, which previously held num_iters and the ordinal number of the sample.
Removed the ability to add individual samples to PerformanceTestSamples; they are technically not fully immutable, because of exclude_outliers.
Gracefully handle parsing of oversampled values in the critical configuration, when sampling error causes the omission of certain quantiles from the report.
Handle optional `--meta` data in `merge` of `PerformanceTestResults`: pick the minimum of the memory pages, and sum the numbers of involuntary context switches and yield counts (see the sketch below).
When merging `PerformanceTestResult`s, keep the original `PerformanceTestSample`s from all independent runs. These will be used to choose the most stable (least variable) location estimate for the `ResultComparison` down the road.
To save on memory used by merged `PerformanceTestResult`s, the rarely used `PerformanceTestSample.all_samples` can gather the samples on demand from the result’s `independent_runs` instead of keeping another copy.
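To make the shape of these changes concrete, here is a minimal sketch of the merging and reporting behaviour described above. All names, metadata keys, and signatures are hypothetical stand-ins; the real classes live in the benchmark suite's compare_perf_tests.py and differ in detail:

```python
class PerformanceTestResult:
    """Sketch of a result aggregated from independent benchmark runs."""

    def __init__(self, samples, num_iters, meta=None):
        self.samples = sorted(samples)   # runtimes from a single run
        self.num_iters = num_iters       # iterations averaged per sample
        self.meta = meta or {}           # optional environmental metadata
        self.independent_runs = [self]   # original runs are kept on merge

    def merge(self, other):
        """Fold in a result from another independent run of the same test."""
        self.independent_runs.extend(other.independent_runs)
        if self.meta and other.meta:
            # Hypothetical metadata keys: take the minimum of memory pages,
            # sum involuntary context switches and yield counts.
            self.meta["mem_pages"] = min(self.meta["mem_pages"],
                                         other.meta["mem_pages"])
            for key in ("involuntary_cs", "yield_count"):
                self.meta[key] += other.meta[key]

    @property
    def all_samples(self):
        # Gathered on demand from the independent runs, instead of keeping
        # another merged copy of the samples in memory.
        return sorted(s for run in self.independent_runs for s in run.samples)


def ventiles(samples):
    """Empirical 20-quantiles (nearest rank) of a sorted sample list."""
    n = len(samples)
    return [samples[min(n - 1, (i * n) // 20)] for i in range(1, 20)]
```

A dubious comparison can then print `ventiles(old.all_samples)` next to `ventiles(new.all_samples)`, which is roughly the ventile report described above.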
@broadwaylamb I've tried to incorporate your work from #30085 here, but I chose to refactor a few parts to use more idiomatic list comprehensions where possible. I didn't touch the additional .py files that aren't part of the main benchmarking infrastructure I'm familiar with. Could you please have a look to see if I missed something? See commit ec64535e00b1e16a122d186b2c2182b7cbe3adad
@eeckstein please review
@swift-ci please benchmark
@swift-ci please smoke test
@swift-ci please smoke test os x
@swift-ci clean smoke test os x
[swift-ci benchmark report: performance and code size tables for -O, -Osize, -Onone and -swiftlibs. How to read the data: the tables contain differences in performance larger than 8% and differences in code size larger than 1%; performance results (not code size) can contain noise. Hardware overview omitted.]
The diff under review (reverting the automated formatting and switching from "rU" to plain 'r'):

-    self.assertEqual(out.getvalue(), "Logging results to: " + log_file + "\n")
-    with open(log_file, "rU") as f:
+    self.assertEqual(out.getvalue(),
+                     'Logging results to: ' + log_file + '\n')
+    with open(log_file, 'r') as f:
The 'rU' mode's behavior is the default in Python 3, and explicitly setting this mode is deprecated (a warning is printed to stderr). However, I'm not sure this change won't affect Python 2 compatibility. In #30085 I've defensively added a version check:
import sys

if sys.version_info < (3, 0):
    openmode = "rU"
else:
    openmode = "r"
with open(log_file, openmode) as f:
    # ...
Are you sure this is okay to just use 'r' here?
All tests passed with python3 in the shebang, so I think the answer is yes.
Have you tested it on Windows, where newlines are not represented with '\n'?
I did not. @eeckstein @compnerd Do we support swift benchmarking on Windows already?
OK, I think it does not at the moment, because of https://github.com/apple/swift/blob/master/test/benchmark/benchmark-scripts.test-sh#L2
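For context on the open-mode question above: in Python 3 the default newline=None enables universal newlines, so plain 'r' already translates '\r\n' (and lone '\r') to '\n' on read, which is what 'rU' used to request explicitly. A quick self-contained check (temp file and contents invented for illustration):

```python
import os
import tempfile

# Write Windows-style line endings in binary mode, bypassing any translation.
fd, path = tempfile.mkstemp(suffix=".log")
with os.fdopen(fd, "wb") as f:
    f.write(b"Logging results to: bench.log\r\n")

# Python 3 text mode with the default newline=None performs universal
# newline translation, so plain 'r' behaves like the removed 'rU'.
with open(path, "r") as f:
    assert f.read() == "Logging results to: bench.log\n"

os.unlink(path)
```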
Display the `(?)` indicator for dubious results only for changes, never for unchanged results. Refactored `ResultComparison.init` with a simplified range check.
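A rough illustration of that rule, with invented names and thresholds rather than the actual `ResultComparison` code: the `(?)` marker is attached only when a result is classified as a change and the sample ranges of the old and new runs overlap.

```python
def compare(old_min, old_max, new_min, new_max, delta_threshold=0.05):
    """Classify a benchmark delta and flag dubious changes with '(?)'."""
    ratio = old_min / new_min
    if abs(ratio - 1.0) < delta_threshold:
        return "UNCHANGED", ""  # never marked dubious, even if ranges overlap
    verdict = "IMPROVED" if ratio > 1.0 else "REGRESSED"
    # Simplified range check: do the [min, max] ranges of the two runs overlap?
    dubious = new_min <= old_max and old_min <= new_max
    return verdict, "(?)" if dubious else ""


# Overlapping ranges and a >5% delta: reported as a dubious regression.
assert compare(100, 130, 110, 140) == ("REGRESSED", "(?)")
```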
@swift-ci please smoke test
@swift-ci please python lint
@swift-ci please smoke test os x
@shahmishal The Linux smoke test failed with a strange error, definitely unrelated to my changes. Could you please have a look? Do I need to run these with some clean modifier for @swift-ci?
@swift-ci please clean smoke test Linux
This is a huge PR. It seems to me that there are completely different things mixed up here (refactoring, functional changes). Though it consists of smaller commits, I'd prefer to have smaller PRs which can land over time. This makes it easier to test and revert if something breaks. In general, I'm kind of missing the motivation for many of those changes, e.g. which problem do the changes solve? It can be as simple as "this refactoring makes the code easier to read" up to "this fixes bug X or problem Y". It feels like putting a lot of effort into going from A to B. But why is B better than A?
@eeckstein This PR boils down to this LOC stat: +1,939 −2,452. Since this PR is spun off from #26462, except for a few final commits that are “flipping the switch” on the new measurement methodology (as @gottesmm put it), the goal remains the same. The objective of [WIP] #26462 is to find a more statistically rigorous method of detecting performance changes in our benchmarks, while minimizing type I and type II errors. Our current measurement process still suffers from occasional false positives (type I errors). The means to improve that situation is to collect many samples from multiple independent runs and to use more sophisticated statistical analysis.

The long-term opportunity, if the new method from Robust Measurements pans out, is to speed up the whole process by running benchmarks in parallel (this would be too risky with the status quo, because we work with aggregates of averages and do not monitor the environment properly). If it eases things for you, I could split this PR into several more, but IMHO it would only increase your review load for no reason.
Replaced the list comprehension that computes the minimum of runtimes corrected for setup overhead with a procedural style that is easier to understand.
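For illustration, the style change described in that commit looks roughly like this (variable names and values are invented; the real code lives in the benchmark driver scripts):

```python
runtimes = [1215, 1198, 1203]  # sample runtimes in microseconds (made up)
setup = 42                     # measured setup overhead to subtract

# Before: a one-line list comprehension
min_runtime = min(r - setup for r in runtimes)

# After: procedural style that is easier to read and to step through
min_runtime = None
for runtime in runtimes:
    corrected = runtime - setup
    if min_runtime is None or corrected < min_runtime:
        min_runtime = corrected

assert min_runtime == 1198 - 42
```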
@swift-ci python lint
@swift-ci please benchmark
@swift-ci please smoke test
@eeckstein One more thing! Maybe it wasn't clear, but this builds on the open PR #30149, which reverts the automated formatting from #29719. That is the only reason why it now looks like this PR is touching 17 files in benchmarks; that is all in the first commit. If we merge #30149, as I think we should, it would disappear from here and this will all look much simpler!
[swift-ci benchmark report for the re-run: performance (-O, -Osize, -Onone) and code size (-O, -Osize, -swiftlibs) tables, with the same reading guide as above.]
@shahmishal I don't get this… The OS X smoke test reports:
But I explicitly invoked the separate Python lint check from @swift-ci, which passed. I see no issues running
@swift-ci please clean smoke test OS X
@swift-ci please test Windows platform
@eeckstein I have replied above regarding the motivation for this PR and the perceived size of the change. Did that answer your questions? Do you have objections I should address to move this forward?
@swift-ci please test Windows platform
@swift-ci please clean smoke test OS X
@shahmishal Could you please have a look at the Windows machine and also at the Python lint issue above?
@compnerd Windows nodes are having issues, can you look at them?
@palimondo do you have the full output from the test failure?
@shahmishal Not anymore. Sorry, I got impatient and triggered a re-test a moment ago. Either it'll pass or there'll be a fresh log ready in about an hour? My apologies!
@swift-ci please test Windows platform
These improvements to the benchmarking infrastructure include bug fixes, various simplifications, some clean-up refactoring, and preliminary support for Python 3. It is paving the way for more robust measurements and has been split off from #26462.
Please review by individual commits, which are self-contained.