[benchmark] Add two benchmarks that show performance of flattening an… #20116

gottesmm · 2018-10-28T22:05:43Z

… array.

The first is a naive imperative approach using appends in a loop. The second
uses flatMap. We would like both of these to have equivalent performance.

… array. The first is a naive imperative approach using appends in a loop. The second uses flatMap. We would like both of these to have equivalent performance.

gottesmm · 2018-10-28T22:57:49Z

@swift-ci smoke benchmark

gottesmm · 2018-10-28T22:57:54Z

@swift-ci smoke test

swift-ci · 2018-10-29T02:10:28Z

Build comment file:

Performance: -O

TEST	OLD	NEW	DELTA	RATIO
Improvement
ObjectiveCBridgeStubToNSDate2	1591	1474	-7.4%	1.08x (?)
Added
FlattenListFlatMap	6402	7484	6777	—
FlattenListLoop	3980	5056	4350	—

Code size: -O

TEST	OLD	NEW	DELTA	RATIO
Regression
main.o	46402	47234	+1.8%	0.98x

Performance: -Osize

TEST	MIN	MAX	MEAN	MAX_RSS
Added
FlattenListFlatMap	44982	46004	45324	—
FlattenListLoop	4059	5165	4429	—

Code size: -Osize

TEST	OLD	NEW	DELTA	RATIO
Regression
main.o	43641	44441	+1.8%	0.98x

Performance: -Onone

TEST	MIN	MAX	MEAN	MAX_RSS
Added
FlattenListFlatMap	239757	241062	240269	—
FlattenListLoop	55197	56498	55673	—

How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false alarms. Unexpected regressions which are marked with '(?)' are probably noise. If you see regressions which you cannot explain you can try to run the benchmarks again. If regressions still show up, please consult with the performance team (@eeckstein).

Hardware Overview

  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB

gottesmm · 2018-10-29T05:44:13Z

@swift-ci smoke test os x platform

palimondo · 2018-10-31T21:57:01Z

@gottesmm I would to like lower the base workload of these two tests by a factor of 20:

remove multiplier from the loop (5)
lower the array size to (1<<14)
… so that we can get runtimes in ballpark of 1ms for more precise measurements.

The FlattenListLoop reserves capacity , which seems like an unfair advantage given the goal you state above:

The first is a naive imperative approach using appends in a loop. The second
uses flatMap. We would like both of these to have equivalent performance.

I don't see how flatMap could derive this because it requires multiplication by tuple size, so I'd suggest we remove the pre-reservation. (Or are you planning on implementing this as some kind of special optimization?)

Would you be OK with these modifications?

gottesmm · 2018-11-01T00:49:28Z

Do what you need.

I think you misunderstood me. When I say "We would like both of these to have equivalent performance" what I mean is that we would like FlattenListFlatMap to be as close to FlattenListLoop in performance as possible. In a perfect world they would be the same. If the wording is confusing, feel free to change it.

The reason why the reserve capacity is in there is as an attempt to measure the speed of light, i.e. how fast ideally could this be imperatively for comparison purposes. That being said, I do understand the larger point. My suggestion would be to add a separate version of the loop without the reserve capacity. Then we can see all 3.

palimondo · 2018-11-02T16:54:07Z

One more question… you called the loop one a "naive imperative approach". What would be a smarter way to do this? First I thought there is some kind of SIMD optimization lurking behind the fact that we're mapping over array of 4 Int tuple, but then thought the 4 Int tuple array is probably stored as packed 4 Ints in a consecutive memory region, so this could probably be done in O(1) with some kind of type info redeclaration. Am I thinking along the correct lines?

Which compiler/stdlib optimization opportunities did you have in mind, adding these benchmarks?

gottesmm · 2018-11-02T20:09:18Z

It is a naive imperative approach since it is the obvious way to do it. I think you are thinking too much about this.

gottesmm · 2018-11-02T23:54:24Z

Sorry, let me be a bit clearer.

It is a naive implementation to me since it is what I would write quickly.

In terms of what compiler/stdlib opportunities are available I am not sure. But this (and the additional test that I hope will get added) I think may provide interesting guidance around swift's performance when executing functional programming. I think this is a class of benchmarks that we could use more of, so I saw this as an opportunity to add coverage.

palimondo · 2018-11-08T19:48:55Z

So, when I really abuse the knowledge of internal structure of the [(Int, Int, Int, Int)], this is (I guess) the true speed of light (16x faster than FlattenLoop with reserveCapacity)…

func flattenUnsafe(_ input: [(Int, Int, Int, Int)]) -> [Int] {
  return UnsafeBufferPointer(start: input, count: input.count)
    .withMemoryRebound(to: Int.self, Array.init)
}

Would you mind if I also added this one, or is this a bridge too far?

gottesmm · 2018-11-08T20:42:35Z

I think that is not necessarily defined behavior (or at least it makes me a bit nervous). @atrick your thoughts?

One thing that I think we could use no matter what is a copy of the benchmark without reserve capacity.

palimondo · 2018-11-08T20:48:42Z

Sure. I’m on it. Just wanted to do all the changes together, so I’m asking about this extreme performance case, too. The idea is to serve as an inspiration for potential future optimization.

atrick · 2018-11-08T21:24:55Z

I'm just going to be honest. I don't know what that code does, and it makes me sad that you can write it.

I guess it relies on an implicit conversion from Array<Int> to UnsafePointer<Int>?. That's lifetime-unsafe. Never do that in a pointer constructor. Use Array.withUnsafeBufferPointer instead.

Then it initializes an Array from a buffer pointer using type-based dispatch on an inferred type, which always makes it impossible for me to decipher the code. Which Array initializer is being called? 🤷‍♂️

There is an Array API designed for this, _unsafeUninitializedCapacity. But it hasn't made it through swift-evolution yet:

https://forums.swift.org/t/se-0223-accessing-an-arrays-uninitialized-buffer/15194/41

gottesmm · 2018-11-09T00:41:36Z

@palimondo Was thinking about this a bit. I am unsure if the unsafe version is interesting now that I am thinking about it. No /real/ work is being done (unless I am missing something).

That being said, I think adding a benchmark without the reserve capacity and an additional lazy flat map would be interesting.

palimondo · 2018-11-09T02:37:30Z

@atrick Heh, you give me too much credit. This is my first time using Unsafe API and I wasn’t able to write that until I tried googling “swift array unsafe pointer copy”, this gist came up. Then I had to play with it in REPL, until it “worked”.

The idea comes from vague remembernce that the fastest Swift version of the Mandelbrot benchmark in language shootout used tuples as SIMD types. So I assumed the tuple in [(Int, Int)] is just bunch of Ints in consecutive memory.
The Array constructor used here is the one that takes Sequence (I guess from errors I was getting until it all type-checked).

@gottesmm This isn’t O(1), as I thought initially, this is O(n) - it does depends on the size of input. I think it just copies those Ints really fast?

palimondo · 2018-11-09T03:07:12Z

All in all, I think if I rewrote it with input.withUnsafeBufferPointer as @atrick suggested, this is perfectly legal code and could serve as a benchmark of what an ideally optimized flatMap could achieve. Beating the naive for-in to the ground. Am I wrong? (Maybe I shouldn’t be typing this, when I woke up at 3:50am…)

atrick · 2018-11-09T05:54:51Z

@palimondo The gist has a bit less type inference, so it's easier for me to see what's happening. I didn't realize earlier that, if I'm guessing correctly, you're actually using Array's generic Sequence initializer, which has a customization point that ends up calling into the UnsafeBufferPointer API. If you just wanted a fast array copy, this would be a silly, round about way to do it. It's only needed because you want to make some dangerous layout assumption.

That said, I expect this code to crash, at least in debug builds, because you can't rebind a buffer's memory to a different element type unless they have the same stride. There's been some debate about whether to allow that. I think it creates more problems than it's worth.

The right way to do it is:

public func flatten(_ tupleArray: [(Int, Int)])  -> [Int] {
  let numInts = tupleArray.count * 2
  return tupleArray.withUnsafeBufferPointer {
    return $0.baseAddress!.withMemoryRebound(to: Int.self, capacity: numInts) {
      Array(UnsafeBufferPointer(start: $0, count: numInts))
    }
  }
}

Note: Swift does not provide any formal layout rule that says an array of homogeneous tuples is laid out the same as an array of those tuple elements. But with builtin integer types I feel that you're safe making this assumption.

palimondo · 2018-11-09T13:52:58Z

@atrick Thanks for showing me the right way. I'm still waiting for the debug build to finish and verify that my Unsafe benchmark was illegal code (I'll file a bug if that's the case), but I also had to fix it (input.count * 4), because it was copying just 1/4 of the input array 🤦‍♂️.

That's the danger of using just blackHole instead of some sanity CheckResults... I'll fix that, too.

So, the correct speedup vis-a-vis FlatenLoop with reserveCapacity is 3.4x. @gottesmm Would adding this make sense as aspirational target for ideally optimized version of flatMap? I'll add lazy versions and I'll also do few more, where the mapped over sequence is not an array. I ❤️ exploring these combinations!

palimondo · 2018-11-09T17:37:59Z

Uhm... the debug build failed to compile after 6 hours, so I couldn't verify... Was I doing it right?
./swift/utils/build-script -B -R --debug-swift --debug-swift-stdlib --no-assertions --swift-assertions --swift-stdlib-assertions

The extended `Flatten` test family is based on recently added benchmarks `FlattenLostLoop` and `FlattenListFlatMap`. They had unnecessarily large base workloads, which prevented more precise measurement. Their base workload was lowered by a factor of 20. See discussion on swiftlang#20116 (comment) Since these are recent additions to the Swift Benchmark Suite, I’m removing the originals and reintroducing them under new names `Flatten.Array.Tuple4.flatMap` and `Flatten.Array.Tuple4.for-in.Reserve`without going through the `legacyFactor`. Based on these two templates, this commit introduces thorough performance test coverage of the related space including: * method chain `map.joined` * naive for-in implementation without `reserveCapacity` * few Unsafe variants that should serve as aspirational targets for ideally optimized code * lazy variants * variants for different underlying types: 4 element Array, struct and class in addition to the original 4 element tuple * variants that flatten Sequence instead of Array The tests follow naming convention proposed in swiftlang#20334

gottesmm force-pushed the pr-899da50e441c8ce022a355432dcc2e01334cbe1a branch 2 times, most recently from 86d51e9 to 990f925 Compare October 28, 2018 22:55

[benchmark] Add two benchmarks that show performance of flattening an…

4d76ff9

… array. The first is a naive imperative approach using appends in a loop. The second uses flatMap. We would like both of these to have equivalent performance.

gottesmm force-pushed the pr-899da50e441c8ce022a355432dcc2e01334cbe1a branch from 990f925 to 4d76ff9 Compare October 28, 2018 22:55

gottesmm merged commit b80991d into swiftlang:master Oct 29, 2018

gottesmm deleted the pr-899da50e441c8ce022a355432dcc2e01334cbe1a branch October 29, 2018 06:42

palimondo mentioned this pull request Nov 13, 2018

[WIP][benchmark] Flatten - Extended test family #20552

Closed

[benchmark] Add two benchmarks that show performance of flattening an… #20116

[benchmark] Add two benchmarks that show performance of flattening an… #20116

Uh oh!

Conversation

gottesmm commented Oct 28, 2018

Uh oh!

gottesmm commented Oct 28, 2018

Uh oh!

gottesmm commented Oct 28, 2018

Uh oh!

swift-ci commented Oct 29, 2018

Build comment file:

Performance: -O

Code size: -O

Performance: -Osize

Code size: -Osize

Performance: -Onone

Uh oh!

gottesmm commented Oct 29, 2018

Uh oh!

palimondo commented Oct 31, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gottesmm commented Nov 1, 2018

Uh oh!

palimondo commented Nov 2, 2018

Uh oh!

gottesmm commented Nov 2, 2018

Uh oh!

gottesmm commented Nov 2, 2018

Uh oh!

palimondo commented Nov 8, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gottesmm commented Nov 8, 2018

Uh oh!

palimondo commented Nov 8, 2018

Uh oh!

atrick commented Nov 8, 2018

Uh oh!

gottesmm commented Nov 9, 2018

Uh oh!

palimondo commented Nov 9, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

palimondo commented Nov 9, 2018

Uh oh!

atrick commented Nov 9, 2018

Uh oh!

palimondo commented Nov 9, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

palimondo commented Nov 9, 2018

Uh oh!

Uh oh!

palimondo commented Oct 31, 2018 •

edited

Loading

palimondo commented Nov 8, 2018 •

edited

Loading

palimondo commented Nov 9, 2018 •

edited

Loading

palimondo commented Nov 9, 2018 •

edited

Loading