[benchmark] Groundwork for Robust Measurements #30158
Conversation
… using utils/python_format.py."
Adjusted how merged PerformanceTestResults track the number of underlying samples when using quantile subsampling.
In text reports, don’t justify the last columns with unnecessary spaces.
Support for invoking benchmark drivers with min-samples and gathering environmental metadata.
For dubious result comparisons, print out empirical sample distribution (ventiles) to enable humans to reach informed decisions about these performance changes.
After commit 331c0bf from a year ago, all samples from the same run have the same num-iters.
Store the number of iterations averaged in each sample on the PerformanceTestSamples.
Removed the Sample class, which previously held num_iters and the ordinal number of the sample.
Removed the ability to add individual samples to PerformanceTestSamples; they are technically not fully immutable, because of exclude_outliers.
Gracefully handle parsing of oversampled values in the critical configuration, when sampling error causes the omission of certain quantiles from the report.
Handle optional `--meta` data in `merge` of `PerformanceTestResults`: pick the minimum of the memory pages, and sum the numbers of involuntary context switches and yield counts (see the sketch below).
When merging `PerformanceTestResult`s, keep the original `PerformanceTestSample`s from all independent runs. These will be used to choose the most stable (least variable) location estimate for the `ResultComparison` down the road.
To save on memory used by merged `PerformanceTestResult`s, the rarely used `PerformanceTestSample.all_samples` can gather the samples on demand from the result’s `independent_runs` instead of keeping another copy.
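To make the shape of these changes concrete, here is a minimal sketch of the merging and reporting behaviour described above. All names, metadata keys, and signatures are hypothetical stand-ins; the real classes live in the benchmark suite's compare_perf_tests.py and differ in detail:

```python
class PerformanceTestResult:
    """Sketch of a result aggregated from independent benchmark runs."""

    def __init__(self, samples, num_iters, meta=None):
        self.samples = sorted(samples)   # runtimes from a single run
        self.num_iters = num_iters       # iterations averaged per sample
        self.meta = meta or {}           # optional environmental metadata
        self.independent_runs = [self]   # original runs are kept on merge

    def merge(self, other):
        """Fold in a result from another independent run of the same test."""
        self.independent_runs.extend(other.independent_runs)
        if self.meta and other.meta:
            # Hypothetical metadata keys: take the minimum of memory pages,
            # sum involuntary context switches and yield counts.
            self.meta["mem_pages"] = min(self.meta["mem_pages"],
                                         other.meta["mem_pages"])
            for key in ("involuntary_cs", "yield_count"):
                self.meta[key] += other.meta[key]

    @property
    def all_samples(self):
        # Gathered on demand from the independent runs, instead of keeping
        # another merged copy of the samples in memory.
        return sorted(s for run in self.independent_runs for s in run.samples)


def ventiles(samples):
    """Empirical 20-quantiles (nearest rank) of a sorted sample list."""
    n = len(samples)
    return [samples[min(n - 1, (i * n) // 20)] for i in range(1, 20)]
```

A dubious comparison can then print `ventiles(old.all_samples)` next to `ventiles(new.all_samples)`, which is roughly the ventile report described above.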
@broadwaylamb I've tried to incorporate your work from #30085 here, but I chose to refactor a few parts to use more idiomatic list comprehensions where possible. I didn't touch the additional .py files that aren't part of the main benchmarking infrastructure I'm familiar with. Could you please have a look to see if I missed something? See commit ec64535e00b1e16a122d186b2c2182b7cbe3adad
@eeckstein please review
@swift-ci please benchmark
@swift-ci please smoke test
@swift-ci please smoke test os x
@swift-ci clean smoke test os x
[swift-ci benchmark report: performance and code size tables for -O, -Osize, -Onone and -swiftlibs. How to read the data: the tables contain differences in performance larger than 8% and differences in code size larger than 1%; performance results (not code size) can contain noise. Hardware overview omitted.]
The diff under review (reverting the automated formatting and switching from "rU" to plain 'r'):

-    self.assertEqual(out.getvalue(), "Logging results to: " + log_file + "\n")
-    with open(log_file, "rU") as f:
+    self.assertEqual(out.getvalue(),
+                     'Logging results to: ' + log_file + '\n')
+    with open(log_file, 'r') as f:
The 'rU' mode's behavior is the default in Python 3, and explicitly setting this mode is deprecated (a warning is printed to stderr). However, I'm not sure this change won't affect Python 2 compatibility. In #30085 I've defensively added a version check:
import sys

if sys.version_info < (3, 0):
    openmode = "rU"
else:
    openmode = "r"
with open(log_file, openmode) as f:
    # ...
Are you sure this is okay to just use 'r' here?
All tests passed with python3 in the shebang, so I think the answer is yes.
Have you tested it on Windows, where newlines are not represented with '\n'?
I did not. @eeckstein @compnerd Do we support swift benchmarking on Windows already?
OK, I think it does not at the moment, because of https://github.com/apple/swift/blob/master/test/benchmark/benchmark-scripts.test-sh#L2
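For context on the open-mode question above: in Python 3 the default newline=None enables universal newlines, so plain 'r' already translates '\r\n' (and lone '\r') to '\n' on read, which is what 'rU' used to request explicitly. A quick self-contained check (temp file and contents invented for illustration):

```python
import os
import tempfile

# Write Windows-style line endings in binary mode, bypassing any translation.
fd, path = tempfile.mkstemp(suffix=".log")
with os.fdopen(fd, "wb") as f:
    f.write(b"Logging results to: bench.log\r\n")

# Python 3 text mode with the default newline=None performs universal
# newline translation, so plain 'r' behaves like the removed 'rU'.
with open(path, "r") as f:
    assert f.read() == "Logging results to: bench.log\n"

os.unlink(path)
```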
Display the `(?)` indicator for dubious results only for changes, never for unchanged results. Refactored `ResultComparison.init` with a simplified range check.
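A rough illustration of that rule, with invented names and thresholds rather than the actual `ResultComparison` code: the `(?)` marker is attached only when a result is classified as a change and the sample ranges of the old and new runs overlap.

```python
def compare(old_min, old_max, new_min, new_max, delta_threshold=0.05):
    """Classify a benchmark delta and flag dubious changes with '(?)'."""
    ratio = old_min / new_min
    if abs(ratio - 1.0) < delta_threshold:
        return "UNCHANGED", ""  # never marked dubious, even if ranges overlap
    verdict = "IMPROVED" if ratio > 1.0 else "REGRESSED"
    # Simplified range check: do the [min, max] ranges of the two runs overlap?
    dubious = new_min <= old_max and old_min <= new_max
    return verdict, "(?)" if dubious else ""


# Overlapping ranges and a >5% delta: reported as a dubious regression.
assert compare(100, 130, 110, 140) == ("REGRESSED", "(?)")
```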
@swift-ci please smoke test
@swift-ci please python lint
@swift-ci please smoke test os x
@shahmishal The Linux smoke test failed with a strange error, definitely unrelated to my changes. Could you please have a look? Do I need to run these with some clean modifier for @swift-ci?
@swift-ci please clean smoke test Linux
This is a huge PR. It seems to me that there are completely different things mixed up here (refactoring, functional changes). Though it consists of smaller commits, I'd prefer to have smaller PRs which can land over time. This makes it easier to test and revert if something breaks. In general, I'm kind of missing the motivation for many of those changes, e.g. which problem do the changes solve? It can be as simple as "this refactoring makes the code easier to read" up to "this fixes bug X or problem Y". It feels like putting a lot of effort into going from A to B. But why is B better than A?
@eeckstein This PR boils down to this LOC stat: +1,939 −2,452. Since this PR is spun off from #26462, except for a few final commits that are “flipping the switch” on the new measurement methodology (as @gottesmm put it), the goal remains the same. The objective of [WIP] #26462 is to find a more statistically rigorous method of detecting performance changes in our benchmarks, while minimizing type I and type II errors. Our current measurement process still suffers from occasional false positives (type I errors). The means to improve that situation is to collect many samples from multiple independent runs and to use more sophisticated statistical analysis.

The long-term opportunity, if the new method from Robust Measurements pans out, is to speed up the whole process by running benchmarks in parallel (this would be too risky with the status quo, because we work with aggregates of averages and do not monitor the environment properly). If it eases things for you, I could split this PR into several more, but IMHO it would only increase your review load for no reason.
Replaced the list comprehension that computes the minimum of runtimes corrected for setup overhead with a procedural style that is easier to understand.
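For illustration, the style change described in that commit looks roughly like this (variable names and values are invented; the real code lives in the benchmark driver scripts):

```python
runtimes = [1215, 1198, 1203]  # sample runtimes in microseconds (made up)
setup = 42                     # measured setup overhead to subtract

# Before: a one-line list comprehension
min_runtime = min(r - setup for r in runtimes)

# After: procedural style that is easier to read and to step through
min_runtime = None
for runtime in runtimes:
    corrected = runtime - setup
    if min_runtime is None or corrected < min_runtime:
        min_runtime = corrected

assert min_runtime == 1198 - 42
```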
@swift-ci python lint
@swift-ci please benchmark
@swift-ci please smoke test
@eeckstein One more thing! Maybe it wasn't clear, but this builds on the open PR #30149, which reverts the automated formatting from #29719. That is the only reason why it now looks like this PR is touching 17 files in benchmarks; that is all in the first commit. If we merge #30149, as I think we should, it would disappear from here and this will all look much simpler!
[swift-ci benchmark report for the re-run: performance (-O, -Osize, -Onone) and code size (-O, -Osize, -swiftlibs) tables, with the same reading guide as above.]
@shahmishal I don't get this… The OS X smoke test reports:
But I explicitly invoked the separate Python lint check from @swift-ci, which passed. I see no issues running
@swift-ci please clean smoke test OS X
@swift-ci please test Windows platform
@eeckstein I have replied above regarding the motivation for this PR and the perceived size of the change. Did that answer your questions? Do you have objections I should address to move this forward?
@swift-ci please test Windows platform
@swift-ci please clean smoke test OS X
@shahmishal Could you please have a look at the Windows machine and also at the Python lint issue above?
@compnerd Windows nodes are having issues, can you look at them?
@palimondo do you have the full output from the test failure?
@shahmishal Not anymore. Sorry, I got impatient and triggered a re-test a moment ago. Either it'll pass or there'll be a fresh log ready in about an hour? My apologies!
@swift-ci please test Windows platform
These improvements to the benchmarking infrastructure include bug fixes, various simplifications, some clean-up refactoring, and preliminary support for Python 3. It is paving the way for more robust measurements and has been split off from #26462.
Please review by individual commits, which are self-contained.