[WIP][benchmark] Flatten - Extended test family #20552
…to handle longer benchmark names, assuming maximum length of 40 characters.
Extend parser to support benchmark names that include `-`, as proposed in PR swiftlang#20334.
The extended `Flatten` test family is based on the recently added benchmarks `FlattenListLoop` and `FlattenListFlatMap`. They had unnecessarily large base workloads, which prevented more precise measurement, so their base workload was lowered by a factor of 20. See discussion on swiftlang#20116 (comment). Since these are recent additions to the Swift Benchmark Suite, I’m removing the originals and reintroducing them under the new names `Flatten.Array.Tuple4.flatMap` and `Flatten.Array.Tuple4.for-in.Reserve`, without going through the `legacyFactor`.

Based on these two templates, this commit introduces thorough performance test coverage of the related space, including:

* the method chain `map.joined`
* a naive for-in implementation without `reserveCapacity`
* a few Unsafe variants that should serve as aspirational targets for ideally optimized code
* lazy variants
* variants for different underlying types: 4-element Array, struct and class, in addition to the original 4-element tuple
* variants that flatten a Sequence instead of an Array

The tests follow the naming convention proposed in swiftlang#20334.
@gottesmm @atrick @eeckstein @airspeedswift Please run benchmark and review.

I'm guessing the whole … I was struggling to come up with a proper Unsafe implementation for … I was also surprised by the slowdown between … Additionally, I was thinking about adding an …
For future benchmark maintenance, it's important to understand the justification and intention behind the benchmarks. Please add comments to test cases explaining what is being tested and why we care about the performance of this particular test vs. more conventional ways to write it. I know nobody has really done this in the past, but I think it's important, especially when adding such a large number of strangely written tests.

Aside from that, I reviewed the Unsafe benchmarks and they look correct, although they are hard to read without comments.
```
//
//===----------------------------------------------------------------------===//

% # Ignore the following warning. This _is_ the correct file to edit.
```
Please don't add gyb files. Eventually we want to move the swift bm suite to swiftpm and then we cannot support gyb files anymore. Besides that, gyb files make it harder to understand and maintain the code.
I’m really surprised by this perspective. I find that to get proper performance test coverage, which tests the majority of legal combinations of stdlib types and methods, GYB is indispensable, especially for ease of maintenance. The only alternative is copy & paste of some template and manual modification, which would be much more error prone.

Generally speaking, Swift’s biggest performance weakness is currently optimizer fragility. Small variations in expressing functionally equivalent code often lead to unexpected pitfalls with orders of magnitude worse performance. My idea to harden the optimizer is to produce broad benchmark coverage and file bugs for all the gotchas. Is that a bad approach?

I’m currently working on reintroducing the Existential benchmark family, and the first step was to create a .gyb that lets me do sane refactoring without tearing my hair out in the 800 LOC file.
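To make the trade-off concrete, here is a minimal hypothetical sketch of such a template (all names invented, not the actual file): one GYB loop stamps out a benchmark function per variant, so a fix to the shared body propagates to every generated test.

```
% for Variant in ['flatMap', 'mapJoined', 'forIn']:
// Generated Swift; ${Variant} is substituted when the .swift file
// is regenerated by running generate_harness.py.
@inline(never)
public func run_Flatten${Variant}(_ N: Int) {
  for _ in 1...N {
    _ = flatten_${Variant}(testData) // hypothetical per-variant helper
  }
}
% end
```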
There are 2 things here:

- gyb: I'm strongly against adding more gyb files for the reasons I explained above. Eventually we will also eliminate the existing gyb files (and BTW, that's also what the stdlib team is doing).
- Adding a lot of benchmarks just to cover all combinations is a problematic approach and not workable in the long term, because it will result in long benchmark runtimes. In general I would prefer that we add fewer but more "relevant" benchmarks (whatever this means). For example, simple operations, which we expect to be optimized to a trivial code pattern, are often better tested with a lit test than a benchmark. On the other hand, complex operations often don't need many combinations to be benchmarked, e.g. stdlib's sort can be benchmarked with a few types/array sizes - there is not much value in exploding the whole problem space.

But: what can be done is to add a large set of benchmarks for a specific feature, which are disabled by default (with tags). Whenever someone works on that feature he/she can run this large set locally to fully verify the performance.
> Eventually we want to move the swift bm suite to swiftpm and then we cannot support gyb files anymore.

Building benchmarks with swiftpm does not technically conflict with the use of GYB, because we are not generating the swift files as a part of the cmake build. These are generated manually by running `generate_harness.py`. That is why we are also committing the generated `.swift` files to git, a design mandated by @gottesmm when we added the support for GYB, specifically to accommodate the future swiftpm builds.

> Eventually we will also eliminate the existing gyb files (and BTW, that's also what the stdlib team is doing).

The stdlib team is doing that because the improved expressivity of Swift has obviated the need to generate a lot of boilerplate. This argument does not apply to the maintenance of benchmark families built from a common template that needs to be parametrized for different variants.
> Adding a lot of benchmarks just to cover all combinations is a problematic approach and not workable in the long term, because it will result in long benchmark runtimes.

This is simply not true. See #20666:

A new `BenchmarkCategory.existential` was added to tag these tests. Running these 108 tests in a manner equivalent to `run_smoke_bench` (`time ./Benchmark_O --tags=existential --sample-time=0.0025 --num-samples=3`) takes 1.3 seconds on my 2008 MBP — properly sized benchmarks improve measurement precision and have negligible impact on the overall time it takes to run the benchmark suite.
> Adding a lot of benchmarks just to cover all combinations is a problematic approach and not workable in the long term, because it will result in long benchmark runtimes.

This whole new `Flatten` family (final version), timed for one iteration of `run_smoke_bench`:

```
$ time ./Benchmark_O --num-samples=3 --sample-time=0.0025 {319..383}
…
Total performance tests executed: 65

real	0m1.329s
user	0m0.859s
sys	0m0.048s
```

On a 2008 MBP.
Could somebody please run benchmarks, so that we don’t talk in the abstract?
@swift-ci benchmark.
🤔 The 2 removed benchmarks are in the report, but none of the newly added ones. I guess the new naming convention with dots somehow tripped up run_smoke_bench? I’ll investigate tomorrow…
* Documented the motivation behind the various test scenarios.
* More descriptive names for Unsafe variants.
* New groups for color swizzling using Sequence and Collection conformances.
* Reduced number of Array4 and Tuple4 variants.
For better comparability between eager and lazy results, fully materialize the lazy sequences into an array using a helper function. The `materializeSequence` helper still beats the performance of `Array.init<Sequence>` for lazy sequences in the ColorVal group by an order of magnitude.
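A minimal sketch of what such a materializing helper could look like (assumed shape, not necessarily the exact code in this PR):

```swift
// Fully consumes any sequence into an Array, so eager and lazy
// variants are measured doing comparable work.
@inline(never)
func materializeSequence<S: Sequence>(_ sequence: S) -> [S.Element] {
  var result: [S.Element] = []
  for element in sequence {
    result.append(element)
  }
  return result
}
```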
@eeckstein Can you please run the benchmark? The 🤖 still ignores me:
@swift-ci benchmark |
I understand that at first glance the above benchmark report looks just like a lot of numbers, so let's try to tell a story around this with proper tooling (a manual table prototype) that builds on #20334. I'm very happy with how … Each sub-group is further split into 4 more columns for the variants: the type/protocol used to perform the transformation:
**ColorVal** (times in μs; `lazy` vs. eager (`—`) variants; `—/l` is the eager/lazy ratio; `m.j/fM` is the `map.joined`/`flatMap` ratio for SwizCol, SwizSeq, Array, ContArray in that order)

| | lazy | —/l | — |
|---|---|---|---|
| **flatMap** | | | |
| SwizCol | 25 | 4.4 | 109 |
| SwizSeq | 118 | 2.1 | 252 |
| Array | 2092 | 0.1 | 208 |
| ContArray | 2118 | 0.1 | 198 |
| **m.j/fM** | 1, 1, 1, 1 | | 1, 0.7, 16, 16 |
| **map.joined** | | | |
| SwizCol | 25 | 4.2 | 104 |
| SwizSeq | 118 | 1.5 | 173 |
| Array | 2094 | 1.5 | 3240 |
| ContArray | 2118 | 1.5 | 3146 |
| **Unsafe** | | | |
| BytesReserve | | | 70 |
| Bytes | | | 77 |
| UInt32InitSeq | | | 99 |
| ColorValInitSeq | | | 157 |
| FlatMapArray | | | 189 |
| FlatMapColorVal | | | 198 |
| **—** | | | |
| for-in.Reserve | | | 220 |
| for-in | | | 223 |
I was motivated to find ever faster implementations, because on my 2008 MBP with 2.4 GHz Intel Core 2 Duo the performance of `flatMap.Array` was slightly worse than the imperative baseline `for-in` (491 vs. 444 μs). It looks like after 10 years of marginal improvements since hitting the performance wall, the Intel Xeon E5 2.7 GHz manages to roughly double the performance, to 208 vs. 220 μs for those two cases. This might mislead us into thinking that our `flatMap` is doing well… But that's just beefy hardware masking poor code!
From my previous explorations I knew that using `.lazy` implementations could lead to much faster results. But to my great surprise, `Array` as well as `ContiguousArray` present some kind of optimization barrier for the Swift compiler, because switching to `.lazy` resulted in a 10x degradation in performance! TODO: Bug Report.
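For concreteness, a sketch of the two shapes being contrasted here (assuming `pixels: [ColorVal]` from the Colors scenario); the only source difference is the `.lazy`:

```swift
// Eager flatMap: the fast shape (eager Array column above).
let eager: [UInt8] = pixels.flatMap { [$0.a, $0.r, $0.g, $0.b] }

// Same code with .lazy, fully materialized: the 10x slower shape.
let lazyResult: [UInt8] = Array(pixels.lazy.flatMap { [$0.a, $0.r, $0.g, $0.b] })
```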
After exploring several `Unsafe` approaches that yielded incremental improvements, I stumbled upon a promising strategy of conforming the underlying type to the `Sequence` protocol, and the absolute champion: `Collection` protocol conformance, which beats even the best `Unsafe.BytesReserve` variant by a wide margin.
I believe this demonstrates two things:

- Swift has great potential to leverage Protocol Oriented Programming into being a Sufficiently Smart Compiler and be as fast as C*,
- but its current implementation, even within the narrowest confines of the most optimized collection in the whole Swift Standard Library, can accidentally create a spectacular de-optimization, which shows that Swift's current performance is frustratingly unpredictable.

\* Not an actual comparison to C. These benchmarks just demonstrate gaping performance differences within the Swift Standard Library.
The `ColorRef` benchmarks show the impact of adding one level of indirection and introducing reference counting.
**ColorRef** (times in μs; same layout as above)

| | lazy | —/l | — |
|---|---|---|---|
| **flatMap** | | | |
| SwizCol | 289 | 1.8 | 515 |
| SwizSeq | 361 | 1.6 | 563 |
| Array | 2399 | 0.2 | 428 |
| ContArray | 2364 | 0.2 | 427 |
| **m.j/fM** | 1, 1, 1, 1 | | 1, 1.2, 8, 8 |
| **map.joined** | | | |
| SwizCol | 289 | 2.0 | 574 |
| SwizSeq | 361 | 1.8 | 654 |
| Array | 2390 | 1.8 | 3493 |
| ContArray | 2364 | 1.4 | 3419 |
| **—** | | | |
| for-in.Reserve | | | 374 |
| for-in | | | 381 |
Here's the relative comparison of value types to reference types in this benchmark family.
**ColorRef/ColorVal**

| | lazy | — |
|---|---|---|
| **flatMap** | | |
| SwizCol | 12 | 4.7 |
| SwizSeq | 3.1 | 2.2 |
| Array | 1.1 | 2.1 |
| ContArray | 1.1 | 2.2 |
| **map.joined** | | |
| SwizCol | 12 | 5.5 |
| SwizSeq | 3.1 | 3.8 |
| Array | 1.1 | 1.1 |
| ContArray | 1.1 | 1.1 |
| **—** | | |
| for-in.Reserve | | 1.7 |
| for-in | | 1.7 |
@swift-ci smoke test os x
@palimondo are you still interested in pursuing this change?
It's been a while. I'm going to close this out due to age and inactivity. @palimondo The tests here are still valuable to have. If you find the time, please either reopen this or shoot us another pull request.
I’ll get back to this. Sorry for the long hibernation, I somehow missed the previous ping here.
The `Flatten` test family is based on the recently added benchmarks `FlattenListLoop` and `FlattenListFlatMap` by @gottesmm. The original versions had unnecessarily large base workloads, which prevented more precise measurement, so their base workload was lowered by a factor of 20. See discussion on #20116 (comment).

Since these are recent additions to the Swift Benchmark Suite, I’m removing the originals and reintroducing them under the new names `Flatten.Array.Tuple4.flatMap` and `Flatten.Array.Tuple4.for-in.Reserve`, without going through the `legacyFactor`, along with extended performance test coverage inspired by these two benchmarks.

The implementation of these benchmarks has evolved substantially with @atrick's help (see conversation). Thank you, Andrew!
The `Flatten` benchmark family tests the performance of the `flatMap` function and the functionally equivalent `map.joined`, along with their lazy variants, relative to an imperative approach with a simple for-in loop, across a selection of representative types.

For transforming a fully materialized collection with contiguous memory layout, additional Unsafe versions were created as attempts at manual optimization that try to eliminate the abstraction overhead using assumptions about the internal memory layout of the underlying data structures.
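For orientation, the functionally equivalent shapes under test look roughly like this sketch (using the Tuple4 scenario described below; setup and variable names are illustrative):

```swift
typealias Tuple4 = (Int, Int, Int, Int)
let tuples = [Tuple4](repeating: (1, 2, 3, 4), count: 1_000)

// flatMap
let a: [Int] = tuples.flatMap { [$0.0, $0.1, $0.2, $0.3] }

// functionally equivalent map.joined
let b: [Int] = Array(tuples.map { [$0.0, $0.1, $0.2, $0.3] }.joined())

// lazy variant, materialized for comparability
let c: [Int] = Array(tuples.lazy.flatMap { [$0.0, $0.1, $0.2, $0.3] })

// imperative baseline: for-in with reserveCapacity
var d: [Int] = []
d.reserveCapacity(tuples.count * 4)
for (x, y, z, w) in tuples {
  d.append(x); d.append(y); d.append(z); d.append(w)
}
```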
Colors
The first hypothetical scenario is transforming an array of RGBA pixel values, represented as `struct ColorVal { let r, g, b, a: UInt8 }`, to a flat `[UInt8]` in ARGB format. In the case of `[ColorVal]`, this means that the real work being performed is copying of byte-swizzled raw memory, obfuscated by type casting and higher-level abstractions (structs, arrays). The alternative type `class ColorRef` demonstrates the impact of using a reference type.
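A sketch of the two pixel types and the transformation under test (`ColorVal` is quoted from the description above; the `ColorRef` spelling and the helper are illustrative):

```swift
struct ColorVal { let r, g, b, a: UInt8 }

// Reference-type counterpart: one level of indirection plus
// reference counting traffic.
class ColorRef {
  let r, g, b, a: UInt8
  init(r: UInt8, g: UInt8, b: UInt8, a: UInt8) {
    self.r = r; self.g = g; self.b = b; self.a = a
  }
}

// RGBA -> flat ARGB bytes: each color becomes 4 swizzled UInt8s.
func swizzle(_ pixels: [ColorVal]) -> [UInt8] {
  return pixels.flatMap { [$0.a, $0.r, $0.g, $0.b] }
}
```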
After experimenting with the Unsafe variants, conforming `ColorVal` to the `Sequence` protocol by creating a custom iterator, which performs the color component swizzling, showed promising performance (better than the imperative approach). That variant is called `SwizSeq`. It turns out that conforming the type to the `Collection` protocol (in the `SwizCol` variant) is even better, allowing the compiler to optimize the lazy variants for an almost 4x gain, beating even the best performing Unsafe variant that copies the colors byte-by-byte. See http://wiki.c2.com/?SufficientlySmartCompiler
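A minimal sketch of the SwizSeq idea (the PR's actual conformance may differ; assumes the `ColorVal` struct from above): the color itself becomes a sequence of its components, already in ARGB order, so flattening the pixel array reduces to `joined()`:

```swift
extension ColorVal: Sequence {
  func makeIterator() -> ARGBIterator { ARGBIterator(self) }
}

// The swizzling happens inside the iterator: a, r, g, b.
struct ARGBIterator: IteratorProtocol {
  let color: ColorVal
  var step = 0
  init(_ color: ColorVal) { self.color = color }
  mutating func next() -> UInt8? {
    defer { step += 1 }
    switch step {
    case 0: return color.a
    case 1: return color.r
    case 2: return color.g
    case 3: return color.b
    default: return nil
    }
  }
}

// Usage: let argb: [UInt8] = Array(pixels.joined())
```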
The Unsafe variants were originally meant as aspirational goals for the functional approach, but are now kept here as artifacts of partial improvements over the imperative approach, which demonstrate unexpected performance behavior.
Tuple4 and Array4
The second scenario tests the performance of flattening the compound type `(Int, Int, Int, Int)`, typealiased as `Tuple4`, into `[Int]`. This variant compensates for the larger data type by omitting the structural transformation. In the case of a fully materialized collection, `[Tuple4]`, the real work is simply a type cast. There's currently no Array API to perform this in O(1), pending SE-0223. Therefore the Unsafe variants perform a simple memory copy.
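A sketch of what such an Unsafe variant can look like (hypothetical helper; it leans on the assumption that `[Tuple4]` stores 4 Ints contiguously per element):

```swift
typealias Tuple4 = (Int, Int, Int, Int)

// Reinterpret the tuples' raw storage as Ints and copy it in one go.
func flattenUnsafe(_ tuples: [Tuple4]) -> [Int] {
  return tuples.withUnsafeBytes { rawBuffer in
    Array(rawBuffer.bindMemory(to: Int.self))
  }
}
```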
The next variant, `Array4`, uses a 4-element "static" array instead of the `Tuple4`, and is meant to demonstrate the relative cost of switching to this currency type. This is important because Array is naturally used, thanks to its syntactic sugar, as the flattened collection in all the functional-style scenarios, i.e. the closures in `flatMap` and `map.joined` are producing "static" arrays on the fly.
The `Tuple4` and `Array4` type groups are varied across 3 container types: `Flatten.Array` is a fully materialized collection, `Flatten.LazySeq` is a lazily generated sequence, and `Flatten.AnySeq.LazySeq` is a type-erased version of the latter.
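Sketch of the three container shapes (setup values illustrative; assumes the `Tuple4` alias from above):

```swift
// Flatten.Array: fully materialized collection
let array: [Tuple4] = (1...1_000).map { i in (i, i, i, i) }

// Flatten.LazySeq: lazily generated sequence
let lazySeq = (1...1_000).lazy.map { i in (i, i, i, i) }

// Flatten.AnySeq.LazySeq: type-erased version of the lazy sequence
let anySeq = AnySequence(lazySeq)
```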
After SE-0234, no standard library API returns `AnySequence` anymore, but the tests `Flatten.LazySeq.Tuple4.flatMap` and `Flatten.AnySeq.LazySeq.Tuple4.flatMap` hint at the hidden potential for improvement in the utter performance debacle that is `Flatten.LazySeq`. The `AnySeq` group could be removed in the future, when that deoptimization is properly addressed.

The tests follow the naming convention proposed in #20334.