Skip to content

[benchmark] Add two benchmarks that show performance of flattening an… #20116

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

gottesmm
Copy link
Contributor

… array.

The first is a naive imperative approach using appends in a loop. The second
uses flatMap. We would like both of these to have equivalent performance.

@gottesmm gottesmm force-pushed the pr-899da50e441c8ce022a355432dcc2e01334cbe1a branch 2 times, most recently from 86d51e9 to 990f925 Compare October 28, 2018 22:55
… array.

The first is a naive imperative approach using appends in a loop. The second
uses flatMap. We would like both of these to have equivalent performance.
@gottesmm gottesmm force-pushed the pr-899da50e441c8ce022a355432dcc2e01334cbe1a branch from 990f925 to 4d76ff9 Compare October 28, 2018 22:55
@gottesmm
Copy link
Contributor Author

@swift-ci smoke benchmark

@gottesmm
Copy link
Contributor Author

@swift-ci smoke test

@swift-ci
Copy link
Contributor

Build comment file:

Performance: -O

TEST OLD NEW DELTA RATIO
Improvement
ObjectiveCBridgeStubToNSDate2 1591 1474 -7.4% 1.08x (?)
Added
FlattenListFlatMap 6402 7484 6777
FlattenListLoop 3980 5056 4350

Code size: -O

TEST OLD NEW DELTA RATIO
Regression
main.o 46402 47234 +1.8% 0.98x

Performance: -Osize

TEST MIN MAX MEAN MAX_RSS
Added
FlattenListFlatMap 44982 46004 45324
FlattenListLoop 4059 5165 4429

Code size: -Osize

TEST OLD NEW DELTA RATIO
Regression
main.o 43641 44441 +1.8% 0.98x

Performance: -Onone

TEST MIN MAX MEAN MAX_RSS
Added
FlattenListFlatMap 239757 241062 240269
FlattenListLoop 55197 56498 55673
How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false alarms. Unexpected regressions which are marked with '(?)' are probably noise. If you see regressions which you cannot explain you can try to run the benchmarks again. If regressions still show up, please consult with the performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB

@gottesmm
Copy link
Contributor Author

@swift-ci smoke test os x platform

@gottesmm gottesmm merged commit b80991d into swiftlang:master Oct 29, 2018
@gottesmm gottesmm deleted the pr-899da50e441c8ce022a355432dcc2e01334cbe1a branch October 29, 2018 06:42
@palimondo
Copy link
Contributor

palimondo commented Oct 31, 2018

@gottesmm I would to like lower the base workload of these two tests by a factor of 20:

The FlattenListLoop reserves capacity , which seems like an unfair advantage given the goal you state above:

The first is a naive imperative approach using appends in a loop. The second
uses flatMap. We would like both of these to have equivalent performance.

I don't see how flatMap could derive this because it requires multiplication by tuple size, so I'd suggest we remove the pre-reservation. (Or are you planning on implementing this as some kind of special optimization?)

Would you be OK with these modifications?

@gottesmm
Copy link
Contributor Author

gottesmm commented Nov 1, 2018

Do what you need.

I think you misunderstood me. When I say "We would like both of these to have equivalent performance" what I mean is that we would like FlattenListFlatMap to be as close to FlattenListLoop in performance as possible. In a perfect world they would be the same. If the wording is confusing, feel free to change it.

The reason why the reserve capacity is in there is as an attempt to measure the speed of light, i.e. how fast ideally could this be imperatively for comparison purposes. That being said, I do understand the larger point. My suggestion would be to add a separate version of the loop without the reserve capacity. Then we can see all 3.

@palimondo
Copy link
Contributor

One more question… you called the loop one a "naive imperative approach". What would be a smarter way to do this? First I thought there is some kind of SIMD optimization lurking behind the fact that we're mapping over array of 4 Int tuple, but then thought the 4 Int tuple array is probably stored as packed 4 Ints in a consecutive memory region, so this could probably be done in O(1) with some kind of type info redeclaration. Am I thinking along the correct lines?

Which compiler/stdlib optimization opportunities did you have in mind, adding these benchmarks?

@gottesmm
Copy link
Contributor Author

gottesmm commented Nov 2, 2018

It is a naive imperative approach since it is the obvious way to do it. I think you are thinking too much about this.

@gottesmm
Copy link
Contributor Author

gottesmm commented Nov 2, 2018

Sorry, let me be a bit clearer.

It is a naive implementation to me since it is what I would write quickly.

In terms of what compiler/stdlib opportunities are available I am not sure. But this (and the additional test that I hope will get added) I think may provide interesting guidance around swift's performance when executing functional programming. I think this is a class of benchmarks that we could use more of, so I saw this as an opportunity to add coverage.

@palimondo
Copy link
Contributor

palimondo commented Nov 8, 2018

So, when I really abuse the knowledge of internal structure of the [(Int, Int, Int, Int)], this is (I guess) the true speed of light (16x faster than FlattenLoop with reserveCapacity)…

func flattenUnsafe(_ input: [(Int, Int, Int, Int)]) -> [Int] {
  return UnsafeBufferPointer(start: input, count: input.count)
    .withMemoryRebound(to: Int.self, Array.init)
}

Would you mind if I also added this one, or is this a bridge too far?

@gottesmm
Copy link
Contributor Author

gottesmm commented Nov 8, 2018

I think that is not necessarily defined behavior (or at least it makes me a bit nervous). @atrick your thoughts?

One thing that I think we could use no matter what is a copy of the benchmark without reserve capacity.

@palimondo
Copy link
Contributor

Sure. I’m on it. Just wanted to do all the changes together, so I’m asking about this extreme performance case, too. The idea is to serve as an inspiration for potential future optimization.

@atrick
Copy link
Contributor

atrick commented Nov 8, 2018

I'm just going to be honest. I don't know what that code does, and it makes me sad that you can write it.

I guess it relies on an implicit conversion from Array<Int> to UnsafePointer<Int>?. That's lifetime-unsafe. Never do that in a pointer constructor. Use Array.withUnsafeBufferPointer instead.

Then it initializes an Array from a buffer pointer using type-based dispatch on an inferred type, which always makes it impossible for me to decipher the code. Which Array initializer is being called? 🤷‍♂️

There is an Array API designed for this, _unsafeUninitializedCapacity. But it hasn't made it through swift-evolution yet:

https://forums.swift.org/t/se-0223-accessing-an-arrays-uninitialized-buffer/15194/41

@gottesmm
Copy link
Contributor Author

gottesmm commented Nov 9, 2018

@palimondo Was thinking about this a bit. I am unsure if the unsafe version is interesting now that I am thinking about it. No /real/ work is being done (unless I am missing something).

That being said, I think adding a benchmark without the reserve capacity and an additional lazy flat map would be interesting.

@palimondo
Copy link
Contributor

palimondo commented Nov 9, 2018

@atrick Heh, you give me too much credit. This is my first time using Unsafe API and I wasn’t able to write that until I tried googling “swift array unsafe pointer copy”, this gist came up. Then I had to play with it in REPL, until it “worked”.

The idea comes from vague remembernce that the fastest Swift version of the Mandelbrot benchmark in language shootout used tuples as SIMD types. So I assumed the tuple in [(Int, Int)] is just bunch of Ints in consecutive memory.
The Array constructor used here is the one that takes Sequence (I guess from errors I was getting until it all type-checked).

@gottesmm This isn’t O(1), as I thought initially, this is O(n) - it does depends on the size of input. I think it just copies those Ints really fast?

@palimondo
Copy link
Contributor

All in all, I think if I rewrote it with input.withUnsafeBufferPointer as @atrick suggested, this is perfectly legal code and could serve as a benchmark of what an ideally optimized flatMap could achieve. Beating the naive for-in to the ground. Am I wrong? (Maybe I shouldn’t be typing this, when I woke up at 3:50am…)

@atrick
Copy link
Contributor

atrick commented Nov 9, 2018

@palimondo The gist has a bit less type inference, so it's easier for me to see what's happening. I didn't realize earlier that, if I'm guessing correctly, you're actually using Array's generic Sequence initializer, which has a customization point that ends up calling into the UnsafeBufferPointer API. If you just wanted a fast array copy, this would be a silly, round about way to do it. It's only needed because you want to make some dangerous layout assumption.

That said, I expect this code to crash, at least in debug builds, because you can't rebind a buffer's memory to a different element type unless they have the same stride. There's been some debate about whether to allow that. I think it creates more problems than it's worth.

The right way to do it is:

public func flatten(_ tupleArray: [(Int, Int)])  -> [Int] {
  let numInts = tupleArray.count * 2
  return tupleArray.withUnsafeBufferPointer {
    return $0.baseAddress!.withMemoryRebound(to: Int.self, capacity: numInts) {
      Array(UnsafeBufferPointer(start: $0, count: numInts))
    }
  }
}

Note: Swift does not provide any formal layout rule that says an array of homogeneous tuples is laid out the same as an array of those tuple elements. But with builtin integer types I feel that you're safe making this assumption.

@palimondo
Copy link
Contributor

palimondo commented Nov 9, 2018

@atrick Thanks for showing me the right way. I'm still waiting for the debug build to finish and verify that my Unsafe benchmark was illegal code (I'll file a bug if that's the case), but I also had to fix it (input.count * 4), because it was copying just 1/4 of the input array 🤦‍♂️.

That's the danger of using just blackHole instead of some sanity CheckResults... I'll fix that, too.

So, the correct speedup vis-a-vis FlatenLoop with reserveCapacity is 3.4x. @gottesmm Would adding this make sense as aspirational target for ideally optimized version of flatMap? I'll add lazy versions and I'll also do few more, where the mapped over sequence is not an array. I ❤️ exploring these combinations!

@palimondo
Copy link
Contributor

Uhm... the debug build failed to compile after 6 hours, so I couldn't verify... Was I doing it right?
./swift/utils/build-script -B -R --debug-swift --debug-swift-stdlib --no-assertions --swift-assertions --swift-stdlib-assertions

palimondo added a commit to palimondo/swift that referenced this pull request Nov 13, 2018
The extended `Flatten` test family is based on recently added benchmarks `FlattenLostLoop` and `FlattenListFlatMap`. They had unnecessarily large base workloads, which prevented more precise measurement. Their base workload was lowered by a factor of 20. See discussion on swiftlang#20116 (comment)

Since these are recent additions to the Swift Benchmark Suite, I’m removing the originals and reintroducing them under new names `Flatten.Array.Tuple4.flatMap` and `Flatten.Array.Tuple4.for-in.Reserve`without going through the `legacyFactor`.

Based on these two templates, this commit introduces thorough performance test coverage of the related space including:

* method chain `map.joined`
* naive for-in implementation without `reserveCapacity`
* few Unsafe variants that should serve as aspirational targets for ideally optimized code
* lazy variants
* variants for different underlying types: 4 element Array, struct and class in addition to the original 4 element tuple
* variants that flatten Sequence instead of Array

The tests follow naming convention proposed in swiftlang#20334
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants