[WIP][benchmark] Flatten - Extended test family #20552

Closed

@palimondo palimondo commented Nov 13, 2018

The Flatten test family is based on recently added benchmarks FlattenListLoop and FlattenListFlatMap by @gottesmm. The original versions had unnecessarily large base workloads, which prevented more precise measurement, so their base workload was lowered by a factor of 20. See discussion on #20116 (comment)

Since these are recent additions to the Swift Benchmark Suite, I’m removing the originals and reintroducing them under new names Flatten.Array.Tuple4.flatMap and Flatten.Array.Tuple4.for-in.Reserve without going through the legacyFactor, along with extended performance test coverage inspired by these two benchmarks.

The implementation of these benchmarks has evolved substantially with @atrick's help (see conversation). Thank you Andrew!


The Flatten benchmark family tests the performance of the flatMap function and the functionally equivalent map.joined, along with their lazy variants, relative to an imperative approach with a simple for-in loop, across a selection of representative types.
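As a rough illustration of what "functionally equivalent" means here, the following sketch flattens a small array of tuples in four ways. It is not the benchmark source: the input size and the variable names are made up for this example, and only the Tuple4 typealias is taken from the description.

```swift
// Illustrative sketch only, not the code from this PR.
typealias Tuple4 = (Int, Int, Int, Int)

let input: [Tuple4] = (0..<4).map { ($0, $0, $0, $0) }

// Eager flatMap: the closure builds a temporary 4-element array per element.
let viaFlatMap: [Int] = input.flatMap { [$0.0, $0.1, $0.2, $0.3] }

// Eager map followed by joined(): map materializes [[Int]] first.
let viaMapJoined: [Int] = Array(input.map { [$0.0, $0.1, $0.2, $0.3] }.joined())

// Lazy flatMap: nothing is computed until Array.init consumes the sequence.
let viaLazyFlatMap: [Int] = Array(input.lazy.flatMap { [$0.0, $0.1, $0.2, $0.3] })

// Imperative baseline with reserved capacity.
var viaForIn: [Int] = []
viaForIn.reserveCapacity(input.count * 4)
for (a, b, c, d) in input {
    viaForIn.append(a)
    viaForIn.append(b)
    viaForIn.append(c)
    viaForIn.append(d)
}
```

All four results hold the same elements; the benchmarks measure how differently the optimizer treats each spelling.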

For transforming a fully materialized collection with contiguous memory layout, additional Unsafe versions were created as attempts at manual optimization. They try to eliminate the abstraction overhead by relying on assumptions about the internal memory layout of the underlying data structures.

Colors

The first hypothetical scenario is transforming an array of RGBA pixel values, represented as struct ColorVal { let r, g, b, a: UInt8 }, into a flat [UInt8] in ARGB format. In the case of [ColorVal], this means that the real work being performed is copying byte-swizzled raw memory, obfuscated by type casting and higher-level abstractions (structs, arrays).

The alternative type, class ColorRef, demonstrates the impact of using a reference type.
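A minimal sketch of the two color types and the RGBA to ARGB swizzle follows. The ColorVal declaration is quoted from the description above; the ColorRef initializer and the sample pixel values are assumptions made for this example.

```swift
// ColorVal as declared in the description; ColorRef is an assumed
// class counterpart with the same fields.
struct ColorVal { let r, g, b, a: UInt8 }

final class ColorRef {
    let r, g, b, a: UInt8
    init(r: UInt8, g: UInt8, b: UInt8, a: UInt8) {
        self.r = r; self.g = g; self.b = b; self.a = a
    }
}

let vals = [ColorVal(r: 1, g: 2, b: 3, a: 4), ColorVal(r: 5, g: 6, b: 7, a: 8)]
let refs = vals.map { ColorRef(r: $0.r, g: $0.g, b: $0.b, a: $0.a) }

// Swizzle RGBA -> ARGB while flattening to bytes.
let argbFromVals: [UInt8] = vals.flatMap { [$0.a, $0.r, $0.g, $0.b] }
let argbFromRefs: [UInt8] = refs.flatMap { [$0.a, $0.r, $0.g, $0.b] }
// argbFromVals == [4, 1, 2, 3, 8, 5, 6, 7]
```

The closure is identical in both cases; with ColorRef each element access goes through a reference and its reference counting.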

After experimenting with Unsafe variants, conforming ColorVal to the Sequence protocol by creating a custom iterator that performs the color component swizzling showed promising performance (better than the imperative approach). That variant is called SwizSeq. It turns out that conforming the type to the Collection protocol (the SwizCol variant) is even better: it allows the compiler to optimize the lazy variants for an almost 4x gain, beating even the best performing Unsafe variant that copies the colors byte by byte.

See http://wiki.c2.com/?SufficientlySmartCompiler
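The SwizCol idea can be sketched roughly as below. This is a guess at the shape of such a conformance, not the PR's implementation: the choice of RandomAccessCollection with Int indices and the sample pixels are assumptions for illustration.

```swift
struct ColorVal { let r, g, b, a: UInt8 }

// Hypothetical SwizCol-style conformance: the color itself acts as a
// 4-element collection that yields its components in ARGB order.
extension ColorVal: RandomAccessCollection {
    var startIndex: Int { return 0 }
    var endIndex: Int { return 4 }

    subscript(position: Int) -> UInt8 {
        switch position {
        case 0: return a
        case 1: return r
        case 2: return g
        case 3: return b
        default: fatalError("ColorVal index out of range")
        }
    }
}

let pixels = [ColorVal(r: 1, g: 2, b: 3, a: 4), ColorVal(r: 5, g: 6, b: 7, a: 8)]

// Because each element is already a sequence of bytes, the closure
// no longer needs to build a temporary array.
let lazyARGB: [UInt8] = Array(pixels.lazy.flatMap { $0 })
let eagerARGB: [UInt8] = pixels.flatMap { $0 }
// lazyARGB == [4, 1, 2, 3, 8, 5, 6, 7]
```

The design point is that the swizzle moves out of the closure and into the type, so flatMap and joined have no per-element array allocation to see through.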

The Unsafe variants were originally meant as aspirational goals for the functional approach, but are now kept here as artifacts of partial improvements over the imperative approach, which demonstrate unexpected performance behavior.

Tuple4 and Array4

The second scenario tests the performance of flattening the compound type (Int, Int, Int, Int), typealiased as Tuple4, into [Int]. This variant compensates for the larger data type by omitting the structural transformation. In the case of a fully materialized collection, [Tuple4], the real work is simply a type cast. There's currently no Array API to perform this in O(1), pending SE-0223. Therefore the Unsafe variants perform a simple memory copy.
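One way such a memory copy could look is sketched below. It is not one of the Unsafe variants from this PR: the function name is hypothetical, and it uses Array.init(unsafeUninitializedCapacity:initializingWith:), which only became available in Swift 5.1, after this discussion took place. It assumes that a homogeneous tuple of four Ints is laid out as four contiguous Ints, which the precondition checks.

```swift
typealias Tuple4 = (Int, Int, Int, Int)

// Hypothetical helper: copies the raw bytes of [Tuple4] into a new [Int].
func flattenByMemcpy(_ input: [Tuple4]) -> [Int] {
    // Layout assumption: one Tuple4 occupies exactly 4 Int strides.
    precondition(MemoryLayout<Tuple4>.stride == 4 * MemoryLayout<Int>.stride)
    guard !input.isEmpty else { return [] }

    let count = input.count * 4
    return [Int](unsafeUninitializedCapacity: count) { buffer, initializedCount in
        input.withUnsafeBytes { source in
            // Byte-wise copy into the uninitialized Int storage.
            UnsafeMutableRawBufferPointer(buffer).copyMemory(from: source)
        }
        initializedCount = count
    }
}

let flat = flattenByMemcpy([(1, 2, 3, 4), (5, 6, 7, 8)])
// flat == [1, 2, 3, 4, 5, 6, 7, 8]
```

This is still O(n); it only removes the per-element abstraction overhead, which is why the description treats it as an aspirational target rather than a substitute for an O(1) cast.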

The next variant, Array4, uses a 4-element "static" array instead of the Tuple4 and is meant to demonstrate the relative cost of switching to this currency type. This is important because Array is naturally used, thanks to its syntactic sugar, as the flattened collection in all the functional-style scenarios, i.e. the closures in flatMap and map.joined are producing "static" arrays on the fly.

The Tuple4 and Array4 type groups are varied across 3 container types:

  • Flatten.Array is a fully materialized collection,
  • Flatten.LazySeq is lazily generated sequence, and
  • Flatten.AnySeq.LazySeq is a type erased version of the latter.
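The three container kinds can be pictured with the sketch below. The variable names and the generic flatten helper are invented for this example; how the benchmarks actually construct their inputs is not shown in this conversation.

```swift
typealias Tuple4 = (Int, Int, Int, Int)

// Fully materialized collection (the Flatten.Array case).
let array: [Tuple4] = (0..<3).map { ($0, $0, $0, $0) }

// Lazily generated sequence (the Flatten.LazySeq case).
let lazySeq = (0..<3).lazy.map { (i: Int) -> Tuple4 in (i, i, i, i) }

// Type-erased wrapper around the lazy sequence (the Flatten.AnySeq.LazySeq case).
let anySeq = AnySequence(lazySeq)

// Hypothetical helper that flattens any sequence of Tuple4.
func flatten<S: Sequence>(_ s: S) -> [Int] where S.Element == Tuple4 {
    return s.flatMap { [$0.0, $0.1, $0.2, $0.3] }
}

let fromArray = flatten(array)
let fromLazy = flatten(lazySeq)
let fromAny = flatten(anySeq)
// All three are [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```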

After SE-0234, no standard library API returns AnySequence anymore, but the tests Flatten.LazySeq.Tuple4.flatMap and Flatten.AnySeq.LazySeq.Tuple4.flatMap hint at the hidden potential for improvement in the utter performance debacle that is Flatten.LazySeq. The AnySeq group could be removed in the future, when that deoptimization is properly addressed.


The tests follow naming convention proposed in #20334

…to handle longer benchmark names, assuming maximum length of 40 characters.
Extend parser to support benchmark names that include `-` in names, as proposed in PR swiftlang#20334.
The extended `Flatten` test family is based on recently added benchmarks `FlattenListLoop` and `FlattenListFlatMap`. They had unnecessarily large base workloads, which prevented more precise measurement. Their base workload was lowered by a factor of 20. See discussion on swiftlang#20116 (comment)

Since these are recent additions to the Swift Benchmark Suite, I’m removing the originals and reintroducing them under new names `Flatten.Array.Tuple4.flatMap` and `Flatten.Array.Tuple4.for-in.Reserve` without going through the `legacyFactor`.

Based on these two templates, this commit introduces thorough performance test coverage of the related space including:

* method chain `map.joined`
* naive for-in implementation without `reserveCapacity`
* few Unsafe variants that should serve as aspirational targets for ideally optimized code
* lazy variants
* variants for different underlying types: 4 element Array, struct and class in addition to the original 4 element tuple
* variants that flatten Sequence instead of Array

The tests follow naming convention proposed in swiftlang#20334

palimondo commented Nov 13, 2018

@gottesmm @atrick @eeckstein @airspeedswift Please run benchmark and review.
I have added more variants than is necessary, for you to choose here. I'll remove the excess in a followup commit once we agree on what's valuable to keep.

I'm guessing the whole AnySeq group could be dropped, since it adds no actionable info and AnySequence will not be a concern once the SE-0234 passes. Only surprising thing is Flatten.Seq.Tuple4.flatMap vs. Flatten.AnySeq.Seq.Tuple4.flatMap where the type erased version actually improves from 8270 to 1855, hinting at the hidden potential in the utter performance debacle that is Flatten.Seq. I'm guessing there is some serious de-optimization going on somewhere, because when I've profiled it, the fast one is dominated by _platform_memmove$VARIANT$Base, while the slow one drowns in _swift_release_dealloc, swift_deallocClassInstance, objc_destructInstance, _object_remove_assocations. The same affliction affects the Flatten.Array.Tuple4.lazy.flatMap, while non-lazy version flies with memmove. Is this something that would be fixed by #19690?

I was struggling to come up with proper Unsafe implementation for the Flatten.Array.Color.Val that would be an equivalent to the aspirational speed-of-light goal Flatten.Array.Tuple4.Unsafe we've cooked-up with @atrick in #20116 (comment). I know these are all "wrong", because they are x86 specific. I need help there.

I was also surprised by the slowdown between Tuple4 and Array4… maybe Swift could use some compile-time static array optimizations?

Additionally I was thinking about adding an ArrayN8 variant, that would include the same number of flattened elements as Array4, but the content of the outer array would be composed of arrays with variable lengths: 2, 2, 4, 8 (maybe up to 16) - repeating… would that make sense? How about ContiguousArray variant?

Contributor

@atrick atrick left a comment

For future benchmark maintenance, it's important to understand the
justification and intention behind the benchmarks. Please add comments
to test cases explaining what is being tested and why we care about
performance of this particular test vs. more conventional ways to
write it. I know nobody has really done this in the past, but I think
it's important, especially when adding such a large number of
strangely written tests.

Aside from that, I reviewed the Unsafe benchmarks and they look correct, although they are hard to read without comments.

//
//===----------------------------------------------------------------------===//

% # Ignore the following warning. This _is_ the correct file to edit.
Contributor

Please don't add gyb files. Eventually we want to move the swift bm suite to swiftpm and then we cannot support gyb files anymore. Beside that, gyb files make it harder to understand and maintain the code.

Contributor Author

@palimondo palimondo Nov 14, 2018

I’m really surprised by this perspective. I find that to get proper performance test coverage which tests majority of legal combinations from stdlib types and methods, GYB is indispensable especially for the ease of maintenance. The only alternative is copy & paste of some template and manual modification, which would be much more error prone.

Generally speaking, currently Swift’s biggest performance weakness is optimizer fragility. Small variations in expressing functionally equivalent code often lead to unexpected pitfalls with orders of magnitude worse performance. My idea to harden the optimizer is to produce broad benchmark coverage and file bugs for all the gotchas. Is that a bad approach?

I’m currently working on reintroducing the Existential benchmark family and the first step was to create a .gyb that lets me do sane refactoring without tearing my hair out in the 800 LOC file.

Contributor

There are 2 things here:

  1. gyb: I'm strongly against adding more gyb files for the reasons I explained above. Eventually we will also eliminate the existing gyb files (and BTW, that's also what the stdlib team is doing).

  2. Adding a lot of benchmarks just to cover all combinations is a problematic approach and not workable in the long term, because it will result in long benchmark runtimes. In general I would prefer that we add fewer but more "relevant" benchmarks (whatever this means). For example, simple operations, which we expect to be optimized to a trivial code pattern, are often better tested with a lit test than a benchmark. On the other hand, complex operations often don't need many combinations to be benchmarked, e.g. stdlib's sort can be benchmarked with a few types/array sizes - there is not much value in exploding the whole problem space.

But: what can be done is to add a large set of benchmarks for a specific feature, which are disabled by default (with tags). Whenever someone works on that feature he/she can run this large set locally to fully verify the performance.

Contributor Author

@palimondo palimondo Nov 17, 2018

> Eventually we want to move the swift bm suite to swiftpm and then we cannot support gyb files anymore.

Building benchmarks with swiftpm does not technically conflict with use of GYB, because we are not generating the swift files as a part of cmake build. These are generated manually by running generate_harness.py. That is why we are also committing the generated .swift files to git, design mandated by @gottesmm when we added the support for GYB, specifically to accommodate the future swiftpm builds.

> Eventually we will also eliminate the existing gyb files (and BTW, that's also what the stdlib team is doing).

The stdlib team is doing that because the improved expressivity of Swift has obviated the need to generate a lot of boilerplate. This argument does not apply to the maintenance of benchmark families built from a common template that needs to be parametrized for different variants.

> Adding a lot of benchmarks just to cover all combinations is a problematic approach and not workable in the long term, because it will result in long benchmark runtimes.

This is simply not true. See #20666:

New BenchmarkCategory.existential was added to tag these tests. Running these 108 tests in a manner equivalent to run_smoke_bench (time ./Benchmark_O --tags=existential --sample-time=0.0025 --num-samples=3) takes 1.3 seconds on my 2008 MBP. Properly sized benchmarks improve measurement precision and have negligible impact on the overall time it takes to run the benchmark suite.

Contributor Author

@palimondo palimondo Nov 21, 2018

@eeckstein:

> Adding a lot of benchmarks just to cover all combinations is a problematic approach and not workable in the long term, because it will result in long benchmark runtimes.

This whole new Flatten family (final version) timed for one iteration of run_smoke_bench:

$ time ./Benchmark_O --num-samples=3 --sample-time=0.0025 {319..383}
…
Total performance tests executed: 65

real	0m1.329s
user	0m0.859s
sys	0m0.048s

On a 2008 MBP.

@palimondo
Contributor Author

Could somebody please run benchmarks, so that we don’t talk in the abstract?

@atrick
Contributor

atrick commented Nov 14, 2018

@swift-ci benchmark.

@palimondo
Contributor Author

🤔 2 removed benchmarks are in the report, but none of the newly added. I guess the new naming convention with dots somehow tripped the run_smoke_bench? I’ll investigate tomorrow…

* Documented the motivation behind the various test scenarios.
* More descriptive names for Unsafe variants.
* New groups for color swizzling using Sequence and Collection conformances.
* Reduced number of Array4 and Tuple4 variants.
@palimondo
Contributor Author

@atrick Please have a look at the latest version and re-run benchmarks with #20667.

@palimondo palimondo changed the title WIP [benchmark] Flatten - Extended test family [benchmark] Flatten - Extended test family Nov 22, 2018
For better comparability between eager and lazy results, fully materialize the lazy sequences into array using helper function.

The `materializeSequence` still beats the performance of Array.init<Sequence> for lazy sequences in ColorVal group by an order of magnitude.
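The implementation of `materializeSequence` is not shown in this conversation. A plain helper of that kind might look like the sketch below; the name `materializeSketch` and its body are assumptions for illustration, not the code from the commit.

```swift
// Hypothetical stand-in for a materializing helper: drains any sequence
// into an array with an explicit loop instead of calling Array.init.
func materializeSketch<S: Sequence>(_ sequence: S) -> [S.Element] {
    var result: [S.Element] = []
    result.reserveCapacity(sequence.underestimatedCount)
    for element in sequence {
        result.append(element)
    }
    return result
}

let doubled = materializeSketch((0..<5).lazy.map { $0 * 2 })
// doubled == [0, 2, 4, 6, 8]
```

Forcing the lazy sequence into an array inside the measured region makes the eager and lazy variants do the same amount of observable work.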
@palimondo
Contributor Author

@eeckstein Can you please run benchmark? 🤖 still ignores me:
@swift-ci please benchmark

@eeckstein
Contributor

@swift-ci benchmark

@swift-ci
Contributor

Build comment file:

Performance: -O

TEST OLD NEW DELTA RATIO
Regression
MapReduceAnyCollection 369 397 +7.6% 0.93x
Improvement
CharacterLiteralsLarge 108 97 -10.2% 1.11x
CharacterLiteralsSmall 348 325 -6.6% 1.07x
Added (columns: TEST MIN MAX MEAN)
Flatten.AnySeq.LazySeq.Tuple4.flatMap 479 1344 767
Flatten.AnySeq.LazySeq.Tuple4.for-in 479 479 479
Flatten.AnySeq.LazySeq.Tuple4.map.joined 4051 4559 4230
Flatten.Array.Array4.flatMap 505 505 505
Flatten.Array.Array4.for-in 486 487 487
Flatten.Array.Array4.joined 126 128 127
Flatten.Array.Array4.lazy.flatMap 383 383 383
Flatten.Array.Array4.lazy.joined 382 383 383
Flatten.Array.Tuple4.Unsafe.InitSeq 91 92 91
Flatten.Array.Tuple4.Unsafe.IntsReserve 87 88 87
Flatten.Array.Tuple4.flatMap 273 278 275
Flatten.Array.Tuple4.for-in 275 277 276
Flatten.Array.Tuple4.for-in.Reserve 218 221 219
Flatten.Array.Tuple4.lazy.flatMap 2188 2255 2210
Flatten.Array.Tuple4.lazy.map.joined 2186 2224 2199
Flatten.Array.Tuple4.map.joined 3643 3681 3665
Flatten.ColorRef.flatMap.Array 428 1516 791
Flatten.ColorRef.flatMap.ContArr 427 427 427
Flatten.ColorRef.flatMap.SwizCol 515 516 516
Flatten.ColorRef.flatMap.SwizSeq 563 570 565
Flatten.ColorRef.for-in 381 382 381
Flatten.ColorRef.for-in.Reserve 374 374 374
Flatten.ColorRef.lazy.flatMap.Array 2399 2504 2435
Flatten.ColorRef.lazy.flatMap.ContArr 2364 2414 2381
Flatten.ColorRef.lazy.flatMap.SwizCol 289 292 290
Flatten.ColorRef.lazy.flatMap.SwizSeq 361 361 361
Flatten.ColorRef.lazy.map.joined.Array 2390 2440 2407
Flatten.ColorRef.lazy.map.joined.ContArr 2364 2426 2385
Flatten.ColorRef.lazy.map.joined.SwizCol 289 291 290
Flatten.ColorRef.lazy.map.joined.SwizSeq 361 361 361
Flatten.ColorRef.map.joined.Array 3493 3737 3577
Flatten.ColorRef.map.joined.ContArr 3419 3481 3441
Flatten.ColorRef.map.joined.SwizCol 574 574 574
Flatten.ColorRef.map.joined.SwizSeq 654 655 654
Flatten.ColorVal.Unsafe.Bytes 77 79 78
Flatten.ColorVal.Unsafe.BytesReserve 70 72 71
Flatten.ColorVal.Unsafe.ColorValInitSeq 157 160 158
Flatten.ColorVal.Unsafe.FlatMapArray 189 193 190
Flatten.ColorVal.Unsafe.FlatMapColorVal 198 205 202
Flatten.ColorVal.Unsafe.UInt32InitSeq 99 101 100
Flatten.ColorVal.flatMap.Array 208 210 209
Flatten.ColorVal.flatMap.ContArr 198 202 199
Flatten.ColorVal.flatMap.SwizCol 109 111 110
Flatten.ColorVal.flatMap.SwizSeq 252 256 254
Flatten.ColorVal.for-in 223 226 224
Flatten.ColorVal.for-in.Reserve 220 220 220
Flatten.ColorVal.lazy.flatMap.Array 2092 2187 2144
Flatten.ColorVal.lazy.flatMap.ContArr 2118 2168 2135
Flatten.ColorVal.lazy.flatMap.SwizCol 25 26 25
Flatten.ColorVal.lazy.flatMap.SwizSeq 118 121 119
Flatten.ColorVal.lazy.map.joined.Array 2094 2169 2119
Flatten.ColorVal.lazy.map.joined.ContArr 2118 2182 2140
Flatten.ColorVal.lazy.map.joined.SwizCol 25 26 25
Flatten.ColorVal.lazy.map.joined.SwizSeq 118 120 119
Flatten.ColorVal.map.joined.Array 3240 3307 3263
Flatten.ColorVal.map.joined.ContArr 3146 3171 3154
Flatten.ColorVal.map.joined.SwizCol 104 109 106
Flatten.ColorVal.map.joined.SwizSeq 173 175 174
Flatten.LazySeq.Array4.flatMap 2234 2368 2296
Flatten.LazySeq.Array4.for-in 2691 2760 2718
Flatten.LazySeq.Array4.joined 2377 2611 2463
Flatten.LazySeq.Tuple4.flatMap 2375 2426 2392
Flatten.LazySeq.Tuple4.for-in 466 467 466
Flatten.LazySeq.Tuple4.map.joined 2347 2405 2366
Removed (columns: TEST MIN MAX MEAN)
FlattenListFlatMap 6398 7447 6748
FlattenListLoop 3969 5029 4323

Code size: -O

TEST OLD NEW DELTA RATIO
Improvement
main.o 59821 58845 -1.6% 1.02x

Performance: -Osize

TEST OLD NEW DELTA RATIO
Regression
DataCountSmall 34 37 +8.8% 0.92x (?)
DataCountMedium 37 40 +8.1% 0.93x (?)
Added (columns: TEST MIN MAX MEAN)
Flatten.AnySeq.LazySeq.Tuple4.flatMap 8332 9273 8658
Flatten.AnySeq.LazySeq.Tuple4.for-in 812 813 812
Flatten.AnySeq.LazySeq.Tuple4.map.joined 4181 4511 4294
Flatten.Array.Array4.flatMap 505 506 505
Flatten.Array.Array4.for-in 491 491 491
Flatten.Array.Array4.joined 149 152 150
Flatten.Array.Array4.lazy.flatMap 383 384 383
Flatten.Array.Array4.lazy.joined 383 384 384
Flatten.Array.Tuple4.Unsafe.InitSeq 97 98 97
Flatten.Array.Tuple4.Unsafe.IntsReserve 87 89 88
Flatten.Array.Tuple4.flatMap 2226 2261 2238
Flatten.Array.Tuple4.for-in 262 265 263
Flatten.Array.Tuple4.for-in.Reserve 205 208 206
Flatten.Array.Tuple4.lazy.flatMap 2202 2258 2222
Flatten.Array.Tuple4.lazy.map.joined 2206 2265 2226
Flatten.Array.Tuple4.map.joined 3707 4180 3891
Flatten.ColorRef.flatMap.Array 2496 3635 2876
Flatten.ColorRef.flatMap.ContArr 2473 2594 2513
Flatten.ColorRef.flatMap.SwizCol 516 516 516
Flatten.ColorRef.flatMap.SwizSeq 597 598 597
Flatten.ColorRef.for-in 384 384 384
Flatten.ColorRef.for-in.Reserve 376 376 376
Flatten.ColorRef.lazy.flatMap.Array 2409 2477 2432
Flatten.ColorRef.lazy.flatMap.ContArr 2407 2494 2436
Flatten.ColorRef.lazy.flatMap.SwizCol 342 349 345
Flatten.ColorRef.lazy.flatMap.SwizSeq 351 351 351
Flatten.ColorRef.lazy.map.joined.Array 2409 2455 2425
Flatten.ColorRef.lazy.map.joined.ContArr 2408 2469 2428
Flatten.ColorRef.lazy.map.joined.SwizCol 342 363 349
Flatten.ColorRef.lazy.map.joined.SwizSeq 351 351 351
Flatten.ColorRef.map.joined.Array 3503 3748 3589
Flatten.ColorRef.map.joined.ContArr 3419 3548 3463
Flatten.ColorRef.map.joined.SwizCol 623 624 623
Flatten.ColorRef.map.joined.SwizSeq 658 659 659
Flatten.ColorVal.Unsafe.Bytes 83 85 84
Flatten.ColorVal.Unsafe.BytesReserve 77 79 78
Flatten.ColorVal.Unsafe.ColorValInitSeq 180 185 182
Flatten.ColorVal.Unsafe.FlatMapArray 2161 2245 2189
Flatten.ColorVal.Unsafe.FlatMapColorVal 2171 2264 2202
Flatten.ColorVal.Unsafe.UInt32InitSeq 152 156 153
Flatten.ColorVal.flatMap.Array 2165 2233 2188
Flatten.ColorVal.flatMap.ContArr 2164 2293 2207
Flatten.ColorVal.flatMap.SwizCol 95 98 96
Flatten.ColorVal.flatMap.SwizSeq 306 308 307
Flatten.ColorVal.for-in 204 208 205
Flatten.ColorVal.for-in.Reserve 196 201 198
Flatten.ColorVal.lazy.flatMap.Array 2199 2317 2238
Flatten.ColorVal.lazy.flatMap.ContArr 2178 2262 2206
Flatten.ColorVal.lazy.flatMap.SwizCol 172 175 174
Flatten.ColorVal.lazy.flatMap.SwizSeq 184 191 188
Flatten.ColorVal.lazy.map.joined.Array 2198 2265 2221
Flatten.ColorVal.lazy.map.joined.ContArr 2178 2238 2198
Flatten.ColorVal.lazy.map.joined.SwizCol 170 175 172
Flatten.ColorVal.lazy.map.joined.SwizSeq 187 193 189
Flatten.ColorVal.map.joined.Array 3278 3340 3299
Flatten.ColorVal.map.joined.ContArr 3202 3265 3223
Flatten.ColorVal.map.joined.SwizCol 211 219 214
Flatten.ColorVal.map.joined.SwizSeq 235 236 235
Flatten.LazySeq.Array4.flatMap 2628 2702 2653
Flatten.LazySeq.Array4.for-in 2589 2726 2667
Flatten.LazySeq.Array4.joined 2301 2390 2359
Flatten.LazySeq.Tuple4.flatMap 2445 2505 2465
Flatten.LazySeq.Tuple4.for-in 454 454 454
Flatten.LazySeq.Tuple4.map.joined 2445 2492 2461
Removed (columns: TEST MIN MAX MEAN)
FlattenListFlatMap 44616 46834 45405
FlattenListLoop 4057 5100 4406

Code size: -Osize

TEST OLD NEW DELTA RATIO
Improvement
main.o 56785 55873 -1.6% 1.02x

Performance: -Onone

TEST MIN MAX MEAN MAX_RSS
Added
Flatten.AnySeq.LazySeq.Tuple4.flatMap 23350 24495 23746
Flatten.AnySeq.LazySeq.Tuple4.for-in 14202 14307 14238
Flatten.AnySeq.LazySeq.Tuple4.map.joined 42166 42835 42592
Flatten.Array.Array4.flatMap 7659 7694 7674
Flatten.Array.Array4.for-in 10622 10767 10704
Flatten.Array.Array4.joined 18558 18772 18630
Flatten.Array.Array4.lazy.flatMap 33491 33758 33610
Flatten.Array.Array4.lazy.joined 31674 31900 31780
Flatten.Array.Tuple4.Unsafe.InitSeq 7476 7834 7603
Flatten.Array.Tuple4.Unsafe.IntsReserve 720 720 720
Flatten.Array.Tuple4.flatMap 10953 11230 11071
Flatten.Array.Tuple4.for-in 2103 2159 2122
Flatten.Array.Tuple4.for-in.Reserve 2030 2092 2051
Flatten.Array.Tuple4.lazy.flatMap 36776 37102 36886
Flatten.Array.Tuple4.lazy.map.joined 36938 37179 37050
Flatten.Array.Tuple4.map.joined 23534 24205 23811
Flatten.ColorRef.flatMap.Array 10682 18426 13264
Flatten.ColorRef.flatMap.ContArr 10743 10897 10837
Flatten.ColorRef.flatMap.SwizCol 14048 14268 14122
Flatten.ColorRef.flatMap.SwizSeq 15594 15904 15718
Flatten.ColorRef.for-in 1990 2120 2034
Flatten.ColorRef.for-in.Reserve 1963 2056 1995
Flatten.ColorRef.lazy.flatMap.Array 37954 38227 38064
Flatten.ColorRef.lazy.flatMap.ContArr 36824 37067 36957
Flatten.ColorRef.lazy.flatMap.SwizCol 36332 36457 36398
Flatten.ColorRef.lazy.flatMap.SwizSeq 25410 25551 25463
Flatten.ColorRef.lazy.map.joined.Array 38274 38476 38379
Flatten.ColorRef.lazy.map.joined.ContArr 37118 37185 37150
Flatten.ColorRef.lazy.map.joined.SwizCol 36466 36666 36537
Flatten.ColorRef.lazy.map.joined.SwizSeq 25870 26110 25951
Flatten.ColorRef.map.joined.Array 38595 39346 38857
Flatten.ColorRef.map.joined.ContArr 35955 36103 36014
Flatten.ColorRef.map.joined.SwizCol 35510 35673 35566
Flatten.ColorRef.map.joined.SwizSeq 23858 23960 23919
Flatten.ColorVal.Unsafe.Bytes 2090 2183 2132
Flatten.ColorVal.Unsafe.BytesReserve 2054 2196 2105
Flatten.ColorVal.Unsafe.ColorValInitSeq 9107 9241 9182
Flatten.ColorVal.Unsafe.FlatMapArray 14583 15033 14822
Flatten.ColorVal.Unsafe.FlatMapColorVal 9413 9536 9463
Flatten.ColorVal.Unsafe.UInt32InitSeq 8358 8440 8399
Flatten.ColorVal.flatMap.Array 11039 11148 11088
Flatten.ColorVal.flatMap.ContArr 11122 11189 11146
Flatten.ColorVal.flatMap.SwizCol 14124 14342 14200
Flatten.ColorVal.flatMap.SwizSeq 15363 15653 15522
Flatten.ColorVal.for-in 1970 2051 1999
Flatten.ColorVal.for-in.Reserve 1922 1986 1944
Flatten.ColorVal.lazy.flatMap.Array 39159 39228 39203
Flatten.ColorVal.lazy.flatMap.ContArr 37380 37627 37495
Flatten.ColorVal.lazy.flatMap.SwizCol 33169 33336 33251
Flatten.ColorVal.lazy.flatMap.SwizSeq 25935 26508 26151
Flatten.ColorVal.lazy.map.joined.Array 39218 39314 39273
Flatten.ColorVal.lazy.map.joined.ContArr 38093 38326 38192
Flatten.ColorVal.lazy.map.joined.SwizCol 33425 33470 33445
Flatten.ColorVal.lazy.map.joined.SwizSeq 26187 26544 26323
Flatten.ColorVal.map.joined.Array 38381 38491 38443
Flatten.ColorVal.map.joined.ContArr 35656 35812 35760
Flatten.ColorVal.map.joined.SwizCol 30739 30877 30798
Flatten.ColorVal.map.joined.SwizSeq 23345 23592 23477
Flatten.LazySeq.Array4.flatMap 42559 43023 42716
Flatten.LazySeq.Array4.for-in 27009 27283 27127
Flatten.LazySeq.Array4.joined 39610 39630 39619
Flatten.LazySeq.Tuple4.flatMap 42499 42687 42571
Flatten.LazySeq.Tuple4.for-in 14119 14218 14160
Flatten.LazySeq.Tuple4.map.joined 42441 42557 42496
Removed
FlattenListFlatMap 204295 206708 205235
FlattenListLoop 41668 43017 42121
How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB
--------------

@palimondo palimondo changed the title [benchmark] Flatten - Extended test family [WIP][benchmark] Flatten - Extended test family Dec 6, 2018
@palimondo
Contributor Author

palimondo commented Dec 6, 2018

I understand that at a first glance the above benchmark report looks just like a lot of numbers, so let's try to tell some story around -O build performance with a proper tooling (manual table prototype) that builds on #20334. I'm very happy with how Colors turned out here. I'll return to Tuple4 and Array4 later.

The ColorVal table is divided into 3 parts by the implementation of the RGBA -> ARGB transformation: functional at the top, Unsafe in the middle and imperative at the bottom. They are sorted in ascending order of the measured time in the first numeric column. The functional part is further divided into flatMap and map.joined with their relative comparison in between (m.j/fM), and into 2 column sub-groups, lazy and eager, again separated by their relative comparison (—/l). When the ratio of compared results is in the 0.7–1.3 range, it has no special format. Slowdowns above 1.3 are in italics and speedups below 0.7 are in bold.
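The formatting rule for ratios can be written down as a small function. This is only a restatement of the thresholds above; the enum and function names are invented for this sketch, and the sample values are taken from the -O report.

```swift
// Hypothetical names; thresholds are the ones stated in the text.
enum RatioStyle { case plain, italic, bold }

func style(forRatio ratio: Double) -> RatioStyle {
    if ratio > 1.3 { return .italic }  // slowdown
    if ratio < 0.7 { return .bold }    // speedup
    return .plain                      // within the 0.7 to 1.3 band
}

// Eager over lazy for ColorVal flatMap, from the -O results:
let swizCol = 109.0 / 25.0     // SwizCol: eager is about 4.4x slower than lazy
let array = 208.0 / 2092.0     // Array: eager takes about 0.1x the lazy time

let swizColStyle = style(forRatio: swizCol)  // .italic
let arrayStyle = style(forRatio: array)      // .bold
```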

Each sub-group is further split into 4 more columns for the variants, i.e. the type or protocol used to perform the transformation:

  • Collection (SwizCol)
  • Sequence (SwizSeq)
  • Array
  • ContiguousArray (ContArr)

These are again sorted in ascending order by the fastest variant in the fastest sub-group.

ColorVal

                              lazy    —/l   eager
flatMap     SwizCol             25    4.4     109
            SwizSeq            118    2.1     252
            Array             2092    0.1     208
            ContArray         2118    0.1     198
m.j/fM      SwizCol              1              1
            SwizSeq              1            0.7
            Array                1             16
            ContArray            1             16
map.joined  SwizCol             25    4.2     104
            SwizSeq            118    1.5     173
            Array             2094    1.5    3240
            ContArray         2118    1.5    3146
Unsafe      BytesReserve        70
            Bytes               77
            UInt32InitSeq       99
            ColorValInitSeq    157
            FlatMapArray       189
            FlatMapColorVal    198
imperative  for-in.Reserve     220
            for-in             223

I was motivated to find ever faster implementations, because on my 2008 MBP with a 2.4 GHz Intel Core 2 Duo the performance of flatMap.Array was slightly worse than the imperative baseline for-in (491 vs. 444 μs). It looks like after 10 years of marginal improvements since hitting the performance wall, the 2.7 GHz Intel Xeon E5 manages to roughly double the performance, to 208 vs. 220 μs for those two cases. This might mislead us into thinking that our flatMap is doing well… But that's just beefy hardware masking poor code!

From my previous explorations I knew that using .lazy implementations could lead to much faster results. But to my great surprise, Array as well as ContiguousArray present some kind of optimization barrier for the Swift compiler, because switching to .lazy resulted in a 10x degradation in performance! TODO: Bug Report.

After exploring several Unsafe approaches that yielded incremental improvements, I stumbled upon the promising strategy of conforming the underlying type to the Sequence protocol, and then upon the absolute champion: Collection protocol conformance, which beats even the best Unsafe.BytesReserve variant by a wide margin.

I believe this demonstrates two things:

  1. Swift has a great potential to leverage Protocol Oriented Programming into being a Sufficiently Smart Compiler and be as fast as C*,
  2. but its current implementation, even within the narrowest confines of the most optimized collection in the whole Swift Standard Library can accidentally create a spectacular de-optimization which shows that Swift's current performance is frustratingly unpredictable.

* Not an actual comparison to C. This benchmark just demonstrates gaping performance differences within the Swift Standard Library.

The ColorRef benchmarks show the impact of adding one level of indirection and introduction of reference counting.

ColorRef

                              lazy    —/l   eager
flatMap     SwizCol            289    1.8     515
            SwizSeq            361    1.6     563
            Array             2399    0.2     428
            ContArray         2364    0.2     427
m.j/fM      SwizCol              1              1
            SwizSeq              1            1.2
            Array                1              8
            ContArray            1              8
map.joined  SwizCol            289    2.0     574
            SwizSeq            361    1.8     654
            Array             2390    1.8    3493
            ContArray         2364    1.4    3419
imperative  for-in.Reserve     374
            for-in             381

Here's the relative comparison of value types to reference types in this benchmark family.

ColorRef/ColorVal

                              lazy  eager
flatMap     SwizCol             12    4.7
            SwizSeq            3.1    2.2
            Array              1.1    2.1
            ContArray          1.1    2.2
map.joined  SwizCol             12    5.5
            SwizSeq            3.1    3.8
            Array              1.1    1.1
            ContArray          1.1    1.1
imperative  for-in.Reserve     1.7
            for-in             1.7

@swiftlang swiftlang deleted a comment from eeckstein Jan 5, 2019
@palimondo
Contributor Author

@swift-ci smoke test os x

@CodaFi
Contributor

CodaFi commented Nov 18, 2019

@palimondo are you still interested in pursuing this change?

@CodaFi
Contributor

CodaFi commented Dec 2, 2019

It's been a while. I'm going to close this out due to age and inactivity. @palimondo The tests here are still valuable to have. If you find the time, please either reopen this or shoot us another pull request.

@CodaFi CodaFi closed this Dec 2, 2019
@palimondo
Contributor Author

I’ll get back to this. Sorry for the long hibernation, I have somehow missed the previous ping here.
