[WIP][benchmark] Flatten - Extended test family #20552
…to handle longer benchmark names, assuming maximum length of 40 characters.
Extend parser to support benchmark names that include `-`, as proposed in PR swiftlang#20334.
The extended `Flatten` test family is based on the recently added benchmarks `FlattenListLoop` and `FlattenListFlatMap`. They had unnecessarily large base workloads, which prevented more precise measurement, so their base workload was lowered by a factor of 20. See discussion on swiftlang#20116 (comment). Since these are recent additions to the Swift Benchmark Suite, I’m removing the originals and reintroducing them under the new names `Flatten.Array.Tuple4.flatMap` and `Flatten.Array.Tuple4.for-in.Reserve`, without going through the `legacyFactor`.

Based on these two templates, this commit introduces thorough performance test coverage of the related space, including:

* the method chain `map.joined`
* a naive for-in implementation without `reserveCapacity`
* a few Unsafe variants that should serve as aspirational targets for ideally optimized code
* lazy variants
* variants for different underlying types: 4-element Array, struct and class, in addition to the original 4-element tuple
* variants that flatten a Sequence instead of an Array

The tests follow the naming convention proposed in swiftlang#20334.
@gottesmm @atrick @eeckstein @airspeedswift Please run benchmark and review.

I'm guessing the whole … I was struggling to come up with a proper Unsafe implementation for … I was also surprised by the slowdown between … Additionally, I was thinking about adding an …
For future benchmark maintenance, it's important to understand the justification and intention behind the benchmarks. Please add comments to test cases explaining what is being tested and why we care about the performance of this particular test vs. more conventional ways to write it. I know nobody has really done this in the past, but I think it's important, especially when adding such a large number of strangely written tests.

Aside from that, I reviewed the Unsafe benchmarks and they look correct, although they are hard to read without comments.
```
//
//===----------------------------------------------------------------------===//

% # Ignore the following warning. This _is_ the correct file to edit.
```
Please don't add gyb files. Eventually we want to move the swift bm suite to swiftpm and then we cannot support gyb files anymore. Besides that, gyb files make it harder to understand and maintain the code.
I’m really surprised by this perspective. I find that to get proper performance test coverage, which tests the majority of legal combinations of stdlib types and methods, GYB is indispensable, especially for ease of maintenance. The only alternative is copy & paste of some template and manual modification, which would be much more error prone.

Generally speaking, Swift’s biggest performance weakness is currently optimizer fragility. Small variations in expressing functionally equivalent code often lead to unexpected pitfalls with orders of magnitude worse performance. My idea to harden the optimizer is to produce broad benchmark coverage and file bugs for all the gotchas. Is that a bad approach?

I’m currently working on reintroducing the Existential benchmark family, and the first step was to create a .gyb that lets me do sane refactoring without tearing my hair out in the 800 LOC file.
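To make the trade-off concrete, here is a minimal hypothetical sketch of such a template (all names invented, not the actual file): one GYB loop stamps out a benchmark function per variant, so a fix to the shared body propagates to every generated test.

```
% for Variant in ['flatMap', 'mapJoined', 'forIn']:
// Generated Swift; ${Variant} is substituted when the .swift file
// is regenerated by running generate_harness.py.
@inline(never)
public func run_Flatten${Variant}(_ N: Int) {
  for _ in 1...N {
    _ = flatten_${Variant}(testData) // hypothetical per-variant helper
  }
}
% end
```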
There are 2 things here:

- gyb: I'm strongly against adding more gyb files for the reasons I explained above. Eventually we will also eliminate the existing gyb files (and BTW, that's also what the stdlib team is doing).
- Adding a lot of benchmarks just to cover all combinations is a problematic approach and not workable in the long term, because it will result in long benchmark runtimes. In general I would prefer that we add fewer but more "relevant" benchmarks (whatever this means). For example, simple operations, which we expect to be optimized to a trivial code pattern, are often better tested with a lit test than a benchmark. On the other hand, complex operations often don't need many combinations to be benchmarked, e.g. stdlib's sort can be benchmarked with a few types/array sizes - there is not much value in exploding the whole problem space.

But: what can be done is to add a large set of benchmarks for a specific feature, which are disabled by default (with tags). Whenever someone works on that feature he/she can run this large set locally to fully verify the performance.
> Eventually we want to move the swift bm suite to swiftpm and then we cannot support gyb files anymore.

Building benchmarks with swiftpm does not technically conflict with the use of GYB, because we are not generating the swift files as a part of the cmake build. These are generated manually by running `generate_harness.py`. That is why we are also committing the generated `.swift` files to git, a design mandated by @gottesmm when we added the support for GYB, specifically to accommodate the future swiftpm builds.

> Eventually we will also eliminate the existing gyb files (and BTW, that's also what the stdlib team is doing).

The stdlib team is doing that because the improved expressivity of Swift has obviated the need to generate a lot of boilerplate. This argument does not apply to the maintenance of benchmark families built from a common template that needs to be parametrized for different variants.
> Adding a lot of benchmarks just to cover all combinations is a problematic approach and not workable in the long term, because it will result in long benchmark runtimes.

This is simply not true. See #20666:

A new `BenchmarkCategory.existential` was added to tag these tests. Running these 108 tests in a manner equivalent to `run_smoke_bench` (`time ./Benchmark_O --tags=existential --sample-time=0.0025 --num-samples=3`) takes 1.3 seconds on my 2008 MBP — properly sized benchmarks improve measurement precision and have negligible impact on the overall time it takes to run the benchmark suite.
> Adding a lot of benchmarks just to cover all combinations is a problematic approach and not workable in the long term, because it will result in long benchmark runtimes.

This whole new `Flatten` family (final version), timed for one iteration of `run_smoke_bench`:

```
$ time ./Benchmark_O --num-samples=3 --sample-time=0.0025 {319..383}
…
Total performance tests executed: 65

real	0m1.329s
user	0m0.859s
sys	0m0.048s
```

On a 2008 MBP.
Could somebody please run benchmarks, so that we don’t talk in the abstract?
@swift-ci benchmark.
🤔 The 2 removed benchmarks are in the report, but none of the newly added ones. I guess the new naming convention with dots somehow tripped up run_smoke_bench? I’ll investigate tomorrow…
* Documented the motivation behind the various test scenarios.
* More descriptive names for Unsafe variants.
* New groups for color swizzling using Sequence and Collection conformances.
* Reduced number of Array4 and Tuple4 variants.
For better comparability between eager and lazy results, fully materialize the lazy sequences into an array using a helper function. The `materializeSequence` helper still beats the performance of `Array.init<Sequence>` for lazy sequences in the ColorVal group by an order of magnitude.
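A minimal sketch of what such a materializing helper could look like (assumed shape, not necessarily the exact code in this PR):

```swift
// Fully consumes any sequence into an Array, so eager and lazy
// variants are measured doing comparable work.
@inline(never)
func materializeSequence<S: Sequence>(_ sequence: S) -> [S.Element] {
  var result: [S.Element] = []
  for element in sequence {
    result.append(element)
  }
  return result
}
```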
@eeckstein Can you please run the benchmark? The 🤖 still ignores me:
@swift-ci benchmark |
I understand that at first glance the above benchmark report looks just like a lot of numbers, so let's try to tell a story around this with proper tooling (a manual table prototype) that builds on #20334. I'm very happy with how … Each sub-group is further split into 4 more columns for the variants: the type/protocol used to perform the transformation:
**ColorVal** (times in μs; `lazy` vs. eager (`—`) variants; `—/l` is the eager/lazy ratio; `m.j/fM` is the `map.joined`/`flatMap` ratio for SwizCol, SwizSeq, Array, ContArray in that order)

| | lazy | —/l | — |
|---|---|---|---|
| **flatMap** | | | |
| SwizCol | 25 | 4.4 | 109 |
| SwizSeq | 118 | 2.1 | 252 |
| Array | 2092 | 0.1 | 208 |
| ContArray | 2118 | 0.1 | 198 |
| **m.j/fM** | 1, 1, 1, 1 | | 1, 0.7, 16, 16 |
| **map.joined** | | | |
| SwizCol | 25 | 4.2 | 104 |
| SwizSeq | 118 | 1.5 | 173 |
| Array | 2094 | 1.5 | 3240 |
| ContArray | 2118 | 1.5 | 3146 |
| **Unsafe** | | | |
| BytesReserve | | | 70 |
| Bytes | | | 77 |
| UInt32InitSeq | | | 99 |
| ColorValInitSeq | | | 157 |
| FlatMapArray | | | 189 |
| FlatMapColorVal | | | 198 |
| **—** | | | |
| for-in.Reserve | | | 220 |
| for-in | | | 223 |
I was motivated to find ever faster implementations, because on my 2008 MBP with 2.4 GHz Intel Core 2 Duo the performance of `flatMap.Array` was slightly worse than the imperative baseline `for-in` (491 vs. 444 μs). It looks like after 10 years of marginal improvements since hitting the performance wall, the Intel Xeon E5 2.7 GHz manages to roughly double the performance, to 208 vs. 220 μs for those two cases. This might mislead us into thinking that our `flatMap` is doing well… But that's just beefy hardware masking poor code!
From my previous explorations I knew that using `.lazy` implementations could lead to much faster results. But to my great surprise, `Array` as well as `ContiguousArray` present some kind of optimization barrier for the Swift compiler, because switching to `.lazy` resulted in a 10x degradation in performance! TODO: Bug Report.
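For concreteness, a sketch of the two shapes being contrasted here (assuming `pixels: [ColorVal]` from the Colors scenario); the only source difference is the `.lazy`:

```swift
// Eager flatMap: the fast shape (eager Array column above).
let eager: [UInt8] = pixels.flatMap { [$0.a, $0.r, $0.g, $0.b] }

// Same code with .lazy, fully materialized: the 10x slower shape.
let lazyResult: [UInt8] = Array(pixels.lazy.flatMap { [$0.a, $0.r, $0.g, $0.b] })
```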
After exploring several `Unsafe` approaches that yielded incremental improvements, I stumbled upon a promising strategy of conforming the underlying type to the `Sequence` protocol, and the absolute champion: `Collection` protocol conformance, which beats even the best `Unsafe.BytesReserve` variant by a wide margin.
I believe this demonstrates two things:

- Swift has great potential to leverage Protocol Oriented Programming into being a Sufficiently Smart Compiler and be as fast as C*,
- but its current implementation, even within the narrowest confines of the most optimized collection in the whole Swift Standard Library, can accidentally create a spectacular de-optimization, which shows that Swift's current performance is frustratingly unpredictable.

\* Not an actual comparison to C. These benchmarks just demonstrate gaping performance differences within the Swift Standard Library.
The `ColorRef` benchmarks show the impact of adding one level of indirection and introducing reference counting.
**ColorRef** (times in μs; same layout as above)

| | lazy | —/l | — |
|---|---|---|---|
| **flatMap** | | | |
| SwizCol | 289 | 1.8 | 515 |
| SwizSeq | 361 | 1.6 | 563 |
| Array | 2399 | 0.2 | 428 |
| ContArray | 2364 | 0.2 | 427 |
| **m.j/fM** | 1, 1, 1, 1 | | 1, 1.2, 8, 8 |
| **map.joined** | | | |
| SwizCol | 289 | 2.0 | 574 |
| SwizSeq | 361 | 1.8 | 654 |
| Array | 2390 | 1.8 | 3493 |
| ContArray | 2364 | 1.4 | 3419 |
| **—** | | | |
| for-in.Reserve | | | 374 |
| for-in | | | 381 |
Here's the relative comparison of value types to reference types in this benchmark family.
**ColorRef/ColorVal**

| | lazy | — |
|---|---|---|
| **flatMap** | | |
| SwizCol | 12 | 4.7 |
| SwizSeq | 3.1 | 2.2 |
| Array | 1.1 | 2.1 |
| ContArray | 1.1 | 2.2 |
| **map.joined** | | |
| SwizCol | 12 | 5.5 |
| SwizSeq | 3.1 | 3.8 |
| Array | 1.1 | 1.1 |
| ContArray | 1.1 | 1.1 |
| **—** | | |
| for-in.Reserve | | 1.7 |
| for-in | | 1.7 |
@swift-ci smoke test os x
@palimondo are you still interested in pursuing this change?
It's been a while. I'm going to close this out due to age and inactivity. @palimondo The tests here are still valuable to have. If you find the time, please either reopen this or shoot us another pull request.
I’ll get back to this. Sorry for the long hibernation, I somehow missed the previous ping here.
The `Flatten` test family is based on the recently added benchmarks `FlattenListLoop` and `FlattenListFlatMap` by @gottesmm. The original versions had unnecessarily large base workloads, which prevented more precise measurement, so their base workload was lowered by a factor of 20. See discussion on #20116 (comment).

Since these are recent additions to the Swift Benchmark Suite, I’m removing the originals and reintroducing them under the new names `Flatten.Array.Tuple4.flatMap` and `Flatten.Array.Tuple4.for-in.Reserve`, without going through the `legacyFactor`, along with extended performance test coverage inspired by these two benchmarks.

The implementation of these benchmarks has evolved substantially with @atrick's help (see conversation). Thank you, Andrew!
The `Flatten` benchmark family tests the performance of the `flatMap` function and the functionally equivalent `map.joined`, along with their lazy variants, relative to an imperative approach with a simple for-in loop, across a selection of representative types.

For transforming a fully materialized collection with contiguous memory layout, additional Unsafe versions were created as attempts at manual optimization that try to eliminate the abstraction overhead using assumptions about the internal memory layout of the underlying data structures.
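For orientation, the functionally equivalent shapes under test look roughly like this sketch (using the Tuple4 scenario described below; setup and variable names are illustrative):

```swift
typealias Tuple4 = (Int, Int, Int, Int)
let tuples = [Tuple4](repeating: (1, 2, 3, 4), count: 1_000)

// flatMap
let a: [Int] = tuples.flatMap { [$0.0, $0.1, $0.2, $0.3] }

// functionally equivalent map.joined
let b: [Int] = Array(tuples.map { [$0.0, $0.1, $0.2, $0.3] }.joined())

// lazy variant, materialized for comparability
let c: [Int] = Array(tuples.lazy.flatMap { [$0.0, $0.1, $0.2, $0.3] })

// imperative baseline: for-in with reserveCapacity
var d: [Int] = []
d.reserveCapacity(tuples.count * 4)
for (x, y, z, w) in tuples {
  d.append(x); d.append(y); d.append(z); d.append(w)
}
```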
Colors
The first hypothetical scenario is transforming an array of RGBA pixel values, represented as `struct ColorVal { let r, g, b, a: UInt8 }`, to a flat `[UInt8]` in ARGB format. In the case of `[ColorVal]`, this means that the real work being performed is copying of byte-swizzled raw memory, obfuscated by type casting and higher-level abstractions (structs, arrays). The alternative type `class ColorRef` demonstrates the impact of using a reference type.
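A sketch of the two pixel types and the transformation under test (`ColorVal` is quoted from the description above; the `ColorRef` spelling and the helper are illustrative):

```swift
struct ColorVal { let r, g, b, a: UInt8 }

// Reference-type counterpart: one level of indirection plus
// reference counting traffic.
class ColorRef {
  let r, g, b, a: UInt8
  init(r: UInt8, g: UInt8, b: UInt8, a: UInt8) {
    self.r = r; self.g = g; self.b = b; self.a = a
  }
}

// RGBA -> flat ARGB bytes: each color becomes 4 swizzled UInt8s.
func swizzle(_ pixels: [ColorVal]) -> [UInt8] {
  return pixels.flatMap { [$0.a, $0.r, $0.g, $0.b] }
}
```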
After experimenting with the Unsafe variants, conforming `ColorVal` to the `Sequence` protocol by creating a custom iterator, which performs the color component swizzling, showed promising performance (better than the imperative approach). That variant is called `SwizSeq`. It turns out that conforming the type to the `Collection` protocol (in the `SwizCol` variant) is even better, allowing the compiler to optimize the lazy variants for an almost 4x gain, beating even the best performing Unsafe variant that copies the colors byte-by-byte. See http://wiki.c2.com/?SufficientlySmartCompiler
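A minimal sketch of the SwizSeq idea (the PR's actual conformance may differ; assumes the `ColorVal` struct from above): the color itself becomes a sequence of its components, already in ARGB order, so flattening the pixel array reduces to `joined()`:

```swift
extension ColorVal: Sequence {
  func makeIterator() -> ARGBIterator { ARGBIterator(self) }
}

// The swizzling happens inside the iterator: a, r, g, b.
struct ARGBIterator: IteratorProtocol {
  let color: ColorVal
  var step = 0
  init(_ color: ColorVal) { self.color = color }
  mutating func next() -> UInt8? {
    defer { step += 1 }
    switch step {
    case 0: return color.a
    case 1: return color.r
    case 2: return color.g
    case 3: return color.b
    default: return nil
    }
  }
}

// Usage: let argb: [UInt8] = Array(pixels.joined())
```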
The Unsafe variants were originally meant as aspirational goals for the functional approach, but are now kept here as artifacts of partial improvements over the imperative approach, which demonstrate unexpected performance behavior.
Tuple4 and Array4
The second scenario tests the performance of flattening the compound type `(Int, Int, Int, Int)`, typealiased as `Tuple4`, into `[Int]`. This variant compensates for the larger data type by omitting the structural transformation. In the case of a fully materialized collection, `[Tuple4]`, the real work is simply a type cast. There's currently no Array API to perform this in O(1), pending SE-0223. Therefore the Unsafe variants perform a simple memory copy.
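A sketch of what such an Unsafe variant can look like (hypothetical helper; it leans on the assumption that `[Tuple4]` stores 4 Ints contiguously per element):

```swift
typealias Tuple4 = (Int, Int, Int, Int)

// Reinterpret the tuples' raw storage as Ints and copy it in one go.
func flattenUnsafe(_ tuples: [Tuple4]) -> [Int] {
  return tuples.withUnsafeBytes { rawBuffer in
    Array(rawBuffer.bindMemory(to: Int.self))
  }
}
```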
The next variant, `Array4`, uses a 4-element "static" array instead of the `Tuple4`, and is meant to demonstrate the relative cost of switching to this currency type. This is important because Array is naturally used, thanks to its syntactic sugar, as the flattened collection in all the functional-style scenarios, i.e. the closures in `flatMap` and `map.joined` are producing "static" arrays on the fly.
The `Tuple4` and `Array4` type groups are varied across 3 container types: `Flatten.Array` is a fully materialized collection, `Flatten.LazySeq` is a lazily generated sequence, and `Flatten.AnySeq.LazySeq` is a type-erased version of the latter.
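Sketch of the three container shapes (setup values illustrative; assumes the `Tuple4` alias from above):

```swift
// Flatten.Array: fully materialized collection
let array: [Tuple4] = (1...1_000).map { i in (i, i, i, i) }

// Flatten.LazySeq: lazily generated sequence
let lazySeq = (1...1_000).lazy.map { i in (i, i, i, i) }

// Flatten.AnySeq.LazySeq: type-erased version of the lazy sequence
let anySeq = AnySequence(lazySeq)
```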
After SE-0234, no standard library API returns `AnySequence` anymore, but the tests `Flatten.LazySeq.Tuple4.flatMap` and `Flatten.AnySeq.LazySeq.Tuple4.flatMap` hint at the hidden potential for improvement in the utter performance debacle that is `Flatten.LazySeq`. The `AnySeq` group could be removed in the future, when that deoptimization is properly addressed.

The tests follow the naming convention proposed in #20334.