
[benchmark] Groundwork for Robust Measurements #30158


Closed
wants to merge 21 commits

Conversation

@palimondo (Contributor) commented Mar 2, 2020

These improvements to the benchmarking infrastructure include bug fixes, various simplifications, some clean-up refactoring, and preliminary support for Python 3. It paves the way for more robust measurements and has been split off from #26462.

Please review commit by commit; the individual commits are self-contained.

palimondo added 18 commits March 1, 2020 16:10
Adjusted how merged PerformanceTestResults track the number of underlying samples when using quantile subsampling.
In text reports, don’t justify the last columns with unnecessary spaces.
Support for invoking benchmark drivers with min-samples and gathering environmental metadata.
For dubious result comparisons, print out empirical sample distribution (ventiles) to enable humans to reach informed decisions about these performance changes.
After commit 331c0bf from a year ago, all samples from the same run have the same num-iters.
Store the number of iterations averaged in each sample on the PerformanceTestSamples.
Removed the Sample class, which previously held num_iters and the ordinal number of the sample.
Removed the ability to add individual samples to PTS.
PerformanceTestSamples are technically not really immutable, because of exclude_outliers.
Gracefully handle the parsing of oversampled values in critical configuration, when sampling error causes an omission of certain quantiles from the report.
Handle optional `--meta` data in `merge` of `PerformanceTestResults`. Pick the minimum of memory pages; sum the numbers of involuntary context switches and yield counts.
When merging `PerformanceTestResult`s, keep the original `PerformanceTestSample`s from all independent runs. These will be used to choose the most stable (least variable) location estimate for the `ResultComparison` down the road (see the sketch after this commit list).
To save on memory used by merged `PerformanceTestResult`s, the rarely used `PerformanceTestSample.all_samples` can gather the samples on demand from the result’s `independent_runs` instead of keeping another copy.
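
To make the merge semantics from the last few commits concrete, here is a minimal sketch of what such a merge might look like. It is only an illustration under assumed names: the attributes (samples, mem_pages, involuntary_cs, yield_count, independent_runs) are placeholders and need not match the actual fields in compare_perf_tests.py.

class PerformanceTestResult(object):
    """Illustrative sketch only; attribute names are assumed, not the real API."""

    def __init__(self, name, samples, mem_pages=None, involuntary_cs=0, yield_count=0):
        self.name = name
        self.samples = samples              # samples from this single run
        self.independent_runs = [samples]   # per-run sample lists are kept
        self.mem_pages = mem_pages          # optional `--meta` data
        self.involuntary_cs = involuntary_cs
        self.yield_count = yield_count

    def merge(self, other):
        """Merge a result of the same benchmark from another independent run."""
        assert self.name == other.name
        # Keep the original samples from all independent runs.
        self.independent_runs.append(other.samples)
        # Metadata: minimum of memory pages, sums of context switches and yields.
        if other.mem_pages is not None:
            self.mem_pages = (other.mem_pages if self.mem_pages is None
                              else min(self.mem_pages, other.mem_pages))
        self.involuntary_cs += other.involuntary_cs
        self.yield_count += other.yield_count

    @property
    def all_samples(self):
        # Gathered on demand from the independent runs instead of keeping a copy.
        return [s for run in self.independent_runs for s in run]

Merging two runs of the same benchmark would thus keep both sample lists, take the smaller mem_pages value, and sum the context-switch and yield counts.
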
@palimondo
Contributor Author

@broadwaylamb I’ve tried to incorporate your work from #30085 here, but I chose to refactor a few parts to use more idiomatic list comprehensions where possible. I didn’t touch the additional .py files that aren’t part of the main benchmarking infrastructure that I’m familiar with. Could you please have a look to see whether I missed something? See commit ec64535e00b1e16a122d186b2c2182b7cbe3adad

@palimondo
Contributor Author

@eeckstein please review

@palimondo
Contributor Author

@swift-ci please benchmark

@palimondo
Contributor Author

@swift-ci please smoke test

@palimondo
Contributor Author

@swift-ci please smoke test os x

@palimondo
Contributor Author

@swift-ci clean smoke test os x

@swift-ci
Contributor

swift-ci commented Mar 2, 2020

Performance: -O

Code size: -O

Performance: -Osize

Code size: -Osize

Performance: -Onone

Code size: -swiftlibs

How to read the data: The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac mini
  Model Identifier: Macmini8,1
  Processor Name: Intel Core i7
  Processor Speed: 3.2 GHz
  Number of Processors: 1
  Total Number of Cores: 6
  L2 Cache (per Core): 256 KB
  L3 Cache: 12 MB
  Memory: 64 GB

Comment on lines -414 to +389
- self.assertEqual(out.getvalue(), "Logging results to: " + log_file + "\n")
- with open(log_file, "rU") as f:
+ self.assertEqual(out.getvalue(),
+                  'Logging results to: ' + log_file + '\n')
+ with open(log_file, 'r') as f:
Contributor

The 'rU' mode's behavior is the default in Python 3, and explicitly setting this mode is deprecated (a warning is printed to stderr).

However, I'm not sure this change won't affect Python 2 compatibility. In #30085 I've defensively added a version check:

import sys  # needed for the version check

# Use the deprecated "rU" mode only on Python 2; plain "r" on Python 3.
if sys.version_info < (3, 0):
    openmode = "rU"
else:
    openmode = "r"
with open(log_file, openmode) as f:
    # ...

Are you sure this is okay to just use 'r' here?

Contributor Author

All tests passed with python3 in the shebang, so I think the answer is yes.

Contributor

Have you tested it on Windows, where the newlines are not represented with '\n'?

Contributor Author

I did not. @eeckstein @compnerd Do we support swift benchmarking on Windows already?

Contributor Author

@compnerd I tried triggering “@swift-ci please test Windows platform” in the hope that it also runs the Python unit tests for the benchmarks. Is that a correct assumption?


Display the `(?)` indicator for dubious results only for changes, never for unchanged results.
Refactored `ResultComparison.init` with a simplified range check (see the sketch below).
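
For illustration, a range check of the kind these commits describe might look roughly like the sketch below; the function names and the min/max-range formulation are assumptions, not the actual code of `ResultComparison.init`.

def ranges_overlap(old_min, old_max, new_min, new_max):
    # The old and new sample ranges overlap, so the comparison is dubious.
    return old_min <= new_max and new_min <= old_max


def change_marker(old_min, old_max, new_min, new_max, is_changed):
    """Return the ' (?)' indicator only for changed results with overlapping ranges."""
    if is_changed and ranges_overlap(old_min, old_max, new_min, new_max):
        return ' (?)'
    return ''


print(change_marker(116, 130, 101, 118, is_changed=True))  # ' (?)' -- overlapping ranges
print(change_marker(116, 130, 101, 110, is_changed=True))  # ''     -- clearly separated
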
@palimondo
Contributor Author

@swift-ci please smoke test

@palimondo
Contributor Author

@swift-ci please python lint

@swift-ci

This comment has been minimized.

@palimondo
Contributor Author

@swift-ci please smoke test os x

@palimondo
Contributor Author

@shahmishal The Linux smoke test failed with a strange error, definitely unrelated to my changes. Could you please have a look? Do I need to run these with some clean modifier for @swift-ci?

@shahmishal
Member

@swift-ci please clean smoke test Linux

@eeckstein
Contributor

eeckstein commented Mar 3, 2020

This is a huge PR. It seems to me that there are completely different things mixed up here (refactoring, functional changes). Though it consists of smaller commits, I'd prefer to have smaller PRs which can land over time. This makes it easier to test and revert if something breaks.

In general, I'm kind of missing the motivation for many of those changes, e.g. which problem do the changes solve? It can be as simple as "this refactoring makes the code easier to read" up to "this fixes bug X or problem Y".

It feels like putting a lot of effort into going from A to B. But why is B better than A?

@palimondo
Contributor Author

palimondo commented Mar 3, 2020

@eeckstein This PR boils down to this LOC stat: +1,939 −2,452.
It does more with less code. As usual, all is covered by unit tests and explained in commit comments.

Since this PR is spun off from #26462, except for a few final commits that are “flipping the switch” on the new measurement methodology (as @gottesmm put it), the goal remains the same.

The objective of [WIP] #26462 is to find a more statistically rigorous method of detecting performance changes in our benchmarks, while minimizing type I and type II errors. Our current measurement process still suffers from occasional false positives (type I errors).

The means of improving that situation is to collect many samples from multiple independent runs and to use more sophisticated statistical analysis. The long-term opportunity, if the new method from Robust Measurements pans out, is to speed up the whole process by running benchmarks in parallel (this would be too risky with the status quo, because we work with aggregates of averages and do not monitor the environment properly).
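
As an illustration of what choosing the most stable (least variable) location estimate across independent runs could mean in practice, here is a minimal sketch. It is not the actual methodology proposed in #26462; the median as the location estimate and the interquartile range as the variability measure are assumptions made only for this example.

import statistics

def most_stable_median(independent_runs):
    """Pick the median of the run with the smallest interquartile range."""
    def iqr(samples):
        quartiles = statistics.quantiles(samples, n=4)  # requires Python 3.8+
        return quartiles[2] - quartiles[0]
    least_variable_run = min(independent_runs, key=iqr)
    return statistics.median(least_variable_run)

# Three hypothetical independent runs of the same benchmark (runtimes in microseconds).
runs = [[103, 105, 104, 190, 102],
        [101, 102, 103, 102, 101],
        [99, 140, 101, 120, 100]]
print(most_stable_median(runs))  # -> 102, taken from the least variable run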

If it eases things for you, I could split this PR into several more, but it would IMHO only increase your review load for no reason.

Replaced the list comprehension that computes the minimum of runtimes corrected for setup overhead with a procedural style that is easier to understand.
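
For illustration only, the stylistic change described in the commit above might look roughly like this; the names `runtimes` and `setup` are assumed, not the actual identifiers in the benchmark scripts.

# Hypothetical sample runtimes (microseconds) and measured setup overhead.
runtimes = [10250, 10113, 10320]
setup = 100

# Before: list comprehension computing the corrected minimum in one expression.
min_rt_comprehension = min([runtime - setup for runtime in runtimes])

# After: equivalent procedural style that reads step by step.
min_rt = None
for runtime in runtimes:
    corrected = runtime - setup
    if min_rt is None or corrected < min_rt:
        min_rt = corrected

assert min_rt == min_rt_comprehension  # both yield 10013
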
@palimondo
Contributor Author

@swift-ci python lint

@palimondo
Contributor Author

@swift-ci please benchmark

@palimondo
Contributor Author

@swift-ci please smoke test

@palimondo
Contributor Author

palimondo commented Mar 3, 2020

@eeckstein One more thing! Maybe it wasn’t clear, but this builds on the open PR #30149, which reverts the automated formatting from #29719. That is the only reason it now looks like this PR touches 17 files in the benchmarks; it is all in the first commit. If we merge #30149, as I think we should, it would disappear from here and it’ll all look much simpler!

@swift-ci
Contributor

swift-ci commented Mar 3, 2020

Performance: -O

Improvement OLD NEW DELTA RATIO
String.data.LargeUnicode 116 101 -12.9% 1.15x (?)
ObjectiveCBridgeFromNSArrayAnyObjectForced 5760 5220 -9.4% 1.10x (?)

Code size: -O

Performance: -Osize

Improvement OLD NEW DELTA RATIO
FlattenListFlatMap 7651 6920 -9.6% 1.11x (?)

Code size: -Osize

Performance: -Onone

Regression OLD NEW DELTA RATIO
NSStringConversion.MutableCopy.Rebridge.UTF8 703 761 +8.3% 0.92x (?)
 
Improvement OLD NEW DELTA RATIO
ObjectiveCBridgeFromNSArrayAnyObjectForced 8740 8140 -6.9% 1.07x (?)

Code size: -swiftlibs

How to read the data: The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.
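
For example, judging from the first improvement row above, DELTA appears to be the relative change (NEW - OLD) / OLD, i.e. (101 - 116) / 116 ≈ -12.9%, and RATIO appears to be OLD / NEW, i.e. 116 / 101 ≈ 1.15x.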

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB

@palimondo
Contributor Author

palimondo commented Mar 3, 2020

@shahmishal I don’t get this… the OS X smoke test reports:

20:02:57 Failing Tests (1):
20:02:57     Swift(macosx-x86_64) :: Python/python_lint.swift

But I explicitly invoked the separate Python lint check from @swift-ci which passed. I see no issues running python_lint.py locally either. Any idea where the problem comes from?

@palimondo
Contributor Author

@swift-ci please clean smoke test OS X

@palimondo
Contributor Author

@swift-ci please test Windows platform

@palimondo
Contributor Author

palimondo commented Mar 5, 2020

@eeckstein I have replied above regarding the motivation for this PR and the perceived size of the change. Did that answer your questions? Do you have objections I should address to move this forward?

@palimondo
Contributor Author

@swift-ci please test Windows platform

@palimondo
Contributor Author

@swift-ci please clean smoke test OS X

@palimondo
Contributor Author

@shahmishal Could you please have a look at the Windows machine and also at the Python lint issue above?

@shahmishal
Member

@compnerd Windows nodes are having issues; can you look at them?

@shahmishal
Member

shahmishal commented Mar 5, 2020

@palimondo do you have the full output from the test failure?

Swift(macosx-x86_64) :: Python/python_lint.swift

@palimondo
Contributor Author

@shahmishal Not anymore. Sorry, I got impatient and triggered a re-test a moment ago. Either it’ll pass or there’ll be a fresh log ready in about an hour. My apologies!

@palimondo
Contributor Author

@swift-ci please test Windows platform

@shahmishal
Member

Please update the base branch to main by Oct 5th; otherwise the pull request will be closed automatically.

  • How to change the base branch: (Link)
  • More detail about the branch update: (Link)

@shahmishal closed this Oct 5, 2020