
[benchmark] Janitor Duty: Datalore Legacy #21516


Merged
merged 26 commits into swiftlang:master on Jan 8, 2019

Conversation

palimondo
Contributor

@palimondo palimondo commented Dec 22, 2018

This PR follows up on #20861 and #21413. It's the next batch of benchmark clean-up to enable robust performance measurements, adjusting workloads to run in reasonable time (< 1000 μs) and minimizing the accumulated error. To maintain long-term performance tracking, it applies a legacy factor where necessary.

DataBenchmarks were recently extended with a lot of variants in #20396, while some of the tests regressed with regard to the elimination of setup overhead (previously fixed in ae9f5f1). Many Large variants (problematic ones as well as some perfectly fine ones) were disabled in #20411.

Before applying legacyFactor, it was necessary to clean up the file to ease future maintenance. I applied the technique of shared test functions and inlined run functions pioneered by @lorentey in SetTests.
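The pattern looks roughly like this (a minimal, self-contained sketch; `run_count`, the sample sizes, and the `blackHole` stub are illustrative stand-ins, not the actual DataBenchmarks code):

```swift
import Foundation

// Stand-in for the benchmark suite's blackHole (illustrative only).
@inline(never) func blackHole<T>(_ x: T) {}

// One shared, never-inlined function holds the measured workload...
@inline(never)
public func run_count(_ n: Int, data: Data) {
  for _ in 1...n {
    blackHole(data.count)
  }
}

// ...and the benchmark entries differ only in their inlined closures,
// which supply the setup data and the loop multiplier.
let small = Data(repeating: 7, count: 11)
let medium = Data(repeating: 7, count: 1033)

let benchmarks: [(name: String, run: (Int) -> Void)] = [
  ("DataCountSmall",  { n in run_count(n * 100, data: small) }),
  ("DataCountMedium", { n in run_count(n * 100, data: medium) }),
]
```

Keeping the loop body in one `@inline(never)` function means all variants measure the same compiled code, while the setup cost stays outside the measured workload.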

The disabled Large variants that were using .veryLarge sampleData were removed. We might re-introduce proper benchmarks for the LargeSlice implementation later, but for that the capacity has to be at least Int32.max (i.e. over 2GB). The sampleData(.veryLarge) was "only" 1GB.

The disabled Large tests that were using .large sampleData were re-enabled to maintain the continuity of long-term performance tracking, since these benchmarks predated the addition of Large variants from #20396 (disabled in #20411).

DataCreate[Small,Medium] benchmarks were dominated by arc4random_buf. We don't need truly random data, so an increasing sequence of bytes is used to fill the buffer instead. This changes their runtime, but I didn't rename them, as they were introduced very recently, so I believe this doesn't disrupt any long-term performance tracking yet.
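The replacement fill can be sketched like this (assuming a simple loop; `sampleData(size:)` here is an illustrative stand-in, not the benchmark's exact helper):

```swift
import Foundation

// Fill a Data buffer with a deterministic increasing byte sequence
// instead of random bytes: this still writes every byte (so the kernel
// must back the allocation with real pages), but avoids the
// arc4random_buf cost dominating the measurement.
func sampleData(size: Int) -> Data {
  var data = Data(count: size)
  data.withUnsafeMutableBytes { (bytes: UnsafeMutablePointer<UInt8>) in
    for i in 0..<size {
      bytes[i] = UInt8(truncatingIfNeeded: i)
    }
  }
  return data
}
```

In optimized builds this loop vectorizes nicely; as noted below, it is much slower in -Onone.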

@palimondo
Contributor Author

@phausler @eeckstein Could you give this a casual look, and flag any deal-breakers?
I'm not yet finished here…

(It's best to review it by individual commits, as it is once again a step-by-step refactoring.)


@phausler
Contributor

Those regressions seem like they are hitting the wrong code paths.

The other ones that improved are worrying too, since they are really fast items, and I'd wager the reduced runtime will cause more noise: the sample size isn't large enough to be statistically significant.

Particularly these:

TEST OLD NEW DELTA RATIO
StringToDataSmall 700 50 -92.9% 14.00x
StringToDataEmpty 600 50 -91.7% 12.00x
StringToDataMedium 2900 250 -91.4% 11.60x
DataToStringEmpty 2300 200 -91.3% 11.50x
DataToStringSmall 4200 400 -90.5% 10.50x
DataToStringMedium 10800 1100 -89.8% 9.82x

@palimondo palimondo force-pushed the a-tall-white-fountain-played branch from af998a1 to 6702381 Compare December 22, 2018 15:56

For extracting setup overhead.
Since even the `large` sample Data is only 40 KB, we can afford to make it a static constant.
Removed the disabled Large variants (tagged `.skip`).

We might re-introduce proper benchmarks for the `LargeSlice` implementation later, but for **that** the capacity has to be at least Int32.max (i.e. over 2GB). The `sampleData(.veryLarge)` was only 1GB.
Refactored to use shared test method and inlined runFunctions.
Extracted setup overhead.

`blackHole` is ubiquitous in SBS, no need to state the obvious.
Refactored to use inlined runFunctions.
Extracted setup overhead.
Refactored to use shared test method and inlined runFunctions.
Extracted setup overhead.

Removed `Large` variant, as it was testing the same underlying implementation as `Medium`.
Refactored to use shared test method and inlined runFunctions.
Extracted setup overhead.

`blackHole` is ubiquitous in SBS, no need to state the obvious.
Refactored to use shared test method and inlined runFunctions.
Extracted setup overhead.
Refactored to use shared test method and inlined runFunctions.
Extracted setup overhead.

Removed `Large` variant, as it was testing the same underlying implementation as `Medium`.
Refactored to use shared test method and inlined runFunctions.
Extracted setup overhead.
Refactored to use inlined runFunctions.
Refactored to use inlined runFunctions.
Refactored to use shared test method and inlined runFunctions.
Extracted setup overhead.

Re-enabled `Large` tests erroneously disabled in swiftlang#20411.
Refactored to use shareable test method and inlined runFunction.
Refactored to use shareable test method and inlined runFunction.
Refactored to use shareable test method and inlined runFunction.
Refactored to use shared test method and inlined runFunctions.
Refactored to use shared test method and inlined runFunctions.

Re-enabled `Large` test erroneously disabled in swiftlang#20411. Removed `skip` tags, as this was the last use.
Refactored to use shared test method and inlined runFunctions.
@palimondo
Contributor Author

@swift-ci please benchmark

@swift-ci
Contributor

swift-ci commented Jan 4, 2019

Build comment file:

Performance: -O

TEST OLD NEW DELTA RATIO
Regression
DataCountMedium 28 31 +10.7% 0.90x (?)
Improvement
DataCreateMedium 16965 7000 -58.7% 2.42x
DataCreateSmall 65930 36000 -45.4% 1.83x
DataSubscriptSmall 31 28 -9.7% 1.11x
Added
DataAppendDataLargeToLarge 262 265 264
DataAppendDataMediumToLarge 180 188 183
DataReplaceLargeBuffer 44 45 44

Code size: -O

TEST OLD NEW DELTA RATIO
Improvement
DataBenchmarks.o 87599 63359 -27.7% 1.38x

Performance: -Osize

TEST OLD NEW DELTA RATIO
Regression
DataCountMedium 28 31 +10.7% 0.90x
DataAccessBytesSmall 100 108 +8.0% 0.93x
Improvement
DataCreateSmall 61439 36000 -41.4% 1.71x
DataCreateMedium 15748 9600 -39.0% 1.64x
DataSubscriptSmall 28 25 -10.7% 1.12x
Added
DataAppendDataLargeToLarge 260 262 261
DataAppendDataMediumToLarge 180 188 183
DataReplaceLargeBuffer 43 46 44

Code size: -Osize

TEST OLD NEW DELTA RATIO
Improvement
DataBenchmarks.o 36387 33339 -8.4% 1.09x

Performance: -Onone

TEST OLD NEW DELTA RATIO
Regression
DataCreateMedium 16730 616300 +3583.8% 0.03x
DataCreateSmall 64197 113000 +76.0% 0.57x
DataAppendArray 4595 5500 +19.7% 0.84x
Added
DataAppendDataLargeToLarge 259 2846 1123
DataAppendDataMediumToLarge 180 188 183
DataReplaceLargeBuffer 45 48 46
Benchmark Check Report
⚠️🔤 DataAppendDataMediumToLarge name is composed of 6 words.
Split DataAppendDataMediumToLarge name into dot-separated groups and variants. See http://bit.ly/BenchmarkNaming
⚠️🔤 DataAppendDataLargeToLarge name is composed of 6 words.
Split DataAppendDataLargeToLarge name into dot-separated groups and variants. See http://bit.ly/BenchmarkNaming
⚠️Ⓜ️ DataAppendDataLargeToLarge has very wide range of memory used between independent, repeated measurements.
DataAppendDataLargeToLarge mem_pages [i1, i2]: min=[44, 44] 𝚫=0 R=[37, 37]
How to read the data: The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB
--------------

@palimondo
Contributor Author

palimondo commented Jan 4, 2019

@phausler The String benchmarks were just an error on my part, I've lost a zero during loop multiplier extraction… 🤷‍♂️

I've finally realized what you meant by:

Those regressions seem like they are hitting the wrong code paths.

Initially, I was puzzling over Data.swift in swift-corelibs-foundation, not knowing there's a newer implementation right here in stdlib/public/Darwin/Foundation/Data.swift, with a slew of inlinable implementation variants. Now I see that #20396 was adding benchmarks needed to validate improvements from #20225. It would have helped a lot if this link had been mentioned in those PRs.

You've mentioned before that:

Large variants are useful in the regards that it takes a completely different code path and has different performance characteristics

If I understand correctly, these were meant to exercise LargeSlice _Representation, right? But they did not — these were all InlineSlices! Therefore I have removed the disabled Large benchmark variants that were using .veryLarge sampleData. We might re-introduce proper benchmarks for the LargeSlice implementation later, but for that the capacity has to be at least Int32.max (i.e. over 2GB). The sampleData(.veryLarge) was "only" 1GB.
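To make the size boundary concrete (a hedged sketch; the actual representation selection in stdlib/public/Darwin/Foundation/Data.swift is more involved than a single predicate, and `needsLargeSlice` is an illustrative name):

```swift
// Per the discussion above, a Data whose byte count fits in Int32
// (including the 1 GB .veryLarge sample) can stay in InlineSlice;
// only payloads over Int32.max (i.e. over 2 GB) require LargeSlice.
func needsLargeSlice(byteCount: Int) -> Bool {
  return byteCount > Int(Int32.max)
}
```

This is why the `.veryLarge` (1 GB) sample never exercised the `LargeSlice` representation at all.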

IMHO there are still multiple issues with benchmarking the initialization of Data, even after replacing arc4random, which was dominating their runtime. My replacement works OK in optimized builds (it generates vectorized code) but fails quite badly in -Onone. You said that:

The arc4random is being used to ensure the kernel does not just hand out a non faulting page of memory. Just being an allocation is not enough to properly identify used cases. It must be written to: and patterns like repeated are a completely different code path in the sequence initialization.

I have played with many different variants: Data.init(capacity:) fully filled with a byte sequence, or Data.init(count:) with just 1 byte written per page. They seem to take about the same time. But this is all utterly impractical for testing LargeSlice. I think we should instead rely on the page-faulted, zero-initialized Data.init(count:), which is still OK when we want to compare relative costs between the various Data _Representations.
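The one-byte-per-page variant can be sketched as follows (assuming a 4 KB page size; `pageTouchedData` is an illustrative name, not code from this PR):

```swift
import Foundation

let pageSize = 4096  // assumed page size, for illustration

// Zero-filled Data(count:) is cheap because the kernel can hand out
// non-faulting pages; touching one byte per page forces real backing
// store without paying for a full buffer fill.
@inline(never)
public func pageTouchedData(count: Int) -> Data {
  var data = Data(count: count)
  data.withUnsafeMutableBytes { (bytes: UnsafeMutablePointer<UInt8>) in
    for i in stride(from: 0, to: count, by: pageSize) {
      bytes[i] = 1
    }
  }
  return data
}
```

Since both this and the full fill take about the same time, the write traffic (not the allocation) is what's being measured either way.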

I suggest we ignore the runtime changes in DataCreateSmall and DataCreateMedium caused by my modifications here for now, and return to properly benchmarking the initialization in a separate PR. What do you think?

It also looks like the "inline everything" strategy is backfiring on DataSubscript benchmarks, as @airspeedswift warned @itaiferber during review for swift-5.0-branch. Compare the results from 1df944c: before and after.

@palimondo palimondo changed the title [WIP][benchmark] Janitor Duty: Datalore Legacy [benchmark] Janitor Duty: Datalore Legacy Jan 4, 2019
@palimondo
Contributor Author

@swift-ci please smoke test

@palimondo
Contributor Author

@eeckstein @phausler Please review 🙏.
The step-by-step refactoring is best reviewed by individual commits.

Contributor

@eeckstein eeckstein left a comment

lgtm, but please wait for @phausler to review.

@palimondo palimondo requested review from phausler and removed request for phausler January 7, 2019 22:40
Contributor

@itaiferber itaiferber left a comment

Changes here look good to me too, though I'd want @phausler to get a chance to review. Tiny nits inline; otherwise, do you mind commenting on how you decided on these legacy factors?

I'm also separately working on pulling some of the inlining out where it was excessive, so I'll try to test with the changes here so we're looking at a consistent view of the tests.

Contributor

@phausler phausler left a comment

The scalars could use a bit of renaming, but functionally this looks reasonable to me.

@palimondo
Contributor Author

do you mind commenting on how you decided on these legacy factors?

When you run Benchmark_Driver check -f Data* locally, it suggests power-of-2 and power-of-10 factors that would get the runtimes under 1000 μs. After that, it's a judgment call to make them all fit and be as similar as possible across the whole benchmark family.
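A sketch of what the factor does (using a stand-in `BenchmarkInfo` and a hypothetical `reported` helper; as I understand the harness, `legacyFactor` multiplies the measured time back up before reporting, so results line up with the pre-refactoring history):

```swift
// Minimal stand-in for the benchmark harness's BenchmarkInfo.
struct BenchmarkInfo {
  let name: String
  let runFunction: (Int) -> Void
  let legacyFactor: Int?
}

// Reported time = measured time x legacyFactor: a workload reduced
// 10x to run under 1000 μs still reports on the historical scale.
func reported(measured: Int, info: BenchmarkInfo) -> Int {
  return measured * (info.legacyFactor ?? 1)
}

let info = BenchmarkInfo(name: "DataAppendBytes",
                         runFunction: { _ in }, legacyFactor: 10)
```

With legacyFactor: 10, a measured 100 μs would be reported as 1000 μs, preserving continuity of the long-term tracking.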

@palimondo
Contributor Author

@itaiferber

I'm also separately working on pulling some of the inlining out where it was excessive, so I'll try to test with the changes here so we're looking at a consistent view of the tests.

I have a few extra benchmark variants stashed away… would you be interested?

@palimondo
Contributor Author

@phausler, @itaiferber Could you also respond to my suggestion about testing LargeSlice and init from above?

@palimondo
Contributor Author

palimondo commented Jan 8, 2019

Here are a few init benchmark variations I've been playing with. These use the new naming convention from #20334. Note that the loop multipliers for Empty and Small differ between the groups, but Medium is always at 100.

I don't think it's very important how fast the inlined Empty data is, as it's just an empty enum creation and the 10_000 multiplier is ridiculous. But IIRC, the Medium variants demonstrate various de-optimizations with inlining between Data.init.Medium and [count,capacity].[Filled,Inlined].

Data.init.[count,capacity] variants
 
  BenchmarkInfo(name: "Data.init.Empty",
    runFunction: { init_($0*10_000, size: .empty) }, tags: d),
  BenchmarkInfo(name: "Data.init.Small",
    runFunction: { init_($0*100, size: .small) }, tags: d),
  BenchmarkInfo(name: "Data.init.Medium",
    runFunction: { init_($0*100, size: .medium) }, tags: d),

  BenchmarkInfo(name: "Data.init.count.Filled.Empty",
    runFunction: { init_($0*100, count: 0) }, tags: d),
  BenchmarkInfo(name: "Data.init.count.Filled.Small",
    runFunction: { init_($0*100, count: 11) }, tags: d),
  BenchmarkInfo(name: "Data.init.count.Filled.Medium",
    runFunction: { init_($0*100, count: 1033) }, tags: d),

  BenchmarkInfo(name: "Data.init.capacity.Filled.Empty",
    runFunction: { init_($0*100, capacity: 0) }, tags: d),
  BenchmarkInfo(name: "Data.init.capacity.Filled.Small",
    runFunction: { init_($0*100, capacity: 11) }, tags: d),
  BenchmarkInfo(name: "Data.init.capacity.Filled.Medium",
    runFunction: { init_($0*100, capacity: 1033) }, tags: d),

  // init(count: Int)
  // Creates a new data buffer with the specified count of zeroed bytes.
  BenchmarkInfo(name: "Data.init.count.Inlined.Empty",
    runFunction: { for _ in 0..<$0*10_000 { blackHole(Data(count: 0)) } },
    tags: d),
  BenchmarkInfo(name: "Data.init.count.Inlined.Small",
    runFunction: { for _ in 0..<$0*10_000 { blackHole(Data(count: 11)) } },
    tags: d),
  BenchmarkInfo(name: "Data.init.count.Inlined.Medium",
    runFunction: { for _ in 0..<$0*100 { blackHole(Data(count: 1033)) } },
    tags: d),

  // init(capacity: Int)
  // Creates an empty data buffer of a specified size.
  BenchmarkInfo(name: "Data.init.capacity.Inlined.Small",
    runFunction: { for _ in 0..<$0*100 {
      let size = 11
      var data = Data(capacity: size)
      data.withUnsafeMutableBytes { (bytes: UnsafeMutablePointer<UInt8>) -> () in
        for i in 0..<size {
          bytes[i] = UInt8(truncatingIfNeeded: i)
        }
      }
      blackHole(data)
    } },
    tags: d),
  BenchmarkInfo(name: "Data.init.capacity.Inlined.Medium",
    runFunction: { for _ in 0..<$0*100 {
      let size = 1033
      var data = Data(capacity: size)
      data.withUnsafeMutableBytes { (bytes: UnsafeMutablePointer<UInt8>) -> () in
        for i in 0..<size {
          bytes[i] = UInt8(truncatingIfNeeded: i)
        }
      }
      blackHole(data)
    } },
    tags: d),



@inline(never)
public func init_(_ N: Int, count: Int) {
  for _ in 1...N {
    var data = Data(count: count)
    data.withUnsafeMutableBytes { (bytes: UnsafeMutablePointer<UInt8>) -> () in
      for i in 0..<count {
        bytes[i] = UInt8(truncatingIfNeeded: i)
      }
    }
    blackHole(data)
  }
}

@inline(never)
public func init_(_ N: Int, capacity: Int) {
  for _ in 1...N {
    var data = Data(capacity: capacity)
    data.withUnsafeMutableBytes { (bytes: UnsafeMutablePointer<UInt8>) -> () in
      for i in 0..<capacity {
        bytes[i] = UInt8(truncatingIfNeeded: i)
      }
    }
    blackHole(data)
  }
}

@inline(never)
public func init_(_ N: Int, size: SampleKind) {
  for _ in 1...N {
    blackHole(sampleData(size))
  }
}

Also added 2 forgotten legacy factors.
@palimondo palimondo force-pushed the a-tall-white-fountain-played branch from a55ce72 to 7db46b3 Compare January 8, 2019 01:07
@swiftlang swiftlang deleted a comment from swift-ci Jan 8, 2019
@palimondo
Contributor Author

@swift-ci please benchmark

@palimondo
Contributor Author

@swift-ci please smoke test

@swift-ci
Contributor

swift-ci commented Jan 8, 2019

Build comment file:

Performance: -O

TEST OLD NEW DELTA RATIO
Regression
DataCountSmall 22 26 +18.2% 0.85x
DataCountMedium 28 32 +14.3% 0.88x
Improvement
DataCreateMedium 15671 6800 -56.6% 2.30x
DataCreateSmall 61920 41000 -33.8% 1.51x
DataCreateEmpty 200 170 -15.0% 1.18x
DataSubscriptSmall 31 28 -9.7% 1.11x (?)
Added
DataAppendDataLargeToLarge 51600 54400 52733
DataAppendDataMediumToLarge 36000 37200 36400
DataReplaceLargeBuffer 44 49 46

Code size: -O

TEST OLD NEW DELTA RATIO
Improvement
DataBenchmarks.o 87599 62543 -28.6% 1.40x

Performance: -Osize

TEST OLD NEW DELTA RATIO
Regression
DataCountMedium 28 34 +21.4% 0.82x
DataCountSmall 25 28 +12.0% 0.89x
DataSubscriptSmall 28 31 +10.7% 0.90x
Improvement
DataCreateMedium 16568 10300 -37.8% 1.61x
DataCreateSmall 61515 41000 -33.3% 1.50x
Added
DataAppendDataLargeToLarge 58400 59000 58800
DataAppendDataMediumToLarge 32800 33600 33067
DataReplaceLargeBuffer 42 44 43

Code size: -Osize

TEST OLD NEW DELTA RATIO
Improvement
DataBenchmarks.o 36387 32955 -9.4% 1.10x

Performance: -Onone

TEST OLD NEW DELTA RATIO
Regression
DataCreateMedium 17546 607600 +3362.9% 0.03x
DataCreateSmall 65384 112000 +71.3% 0.58x
DataAppendArray 4648 5600 +20.5% 0.83x
Added
DataAppendDataLargeToLarge 52600 591000 232600
DataAppendDataMediumToLarge 35800 36600 36067
DataReplaceLargeBuffer 45 48 46
Benchmark Check Report
⚠️🔤 DataAppendDataMediumToLarge name is composed of 6 words.
Split DataAppendDataMediumToLarge name into dot-separated groups and variants. See http://bit.ly/BenchmarkNaming
⚠️🔤 DataAppendDataLargeToLarge name is composed of 6 words.
Split DataAppendDataLargeToLarge name into dot-separated groups and variants. See http://bit.ly/BenchmarkNaming
⚠️Ⓜ️ DataAppendDataLargeToLarge has very wide range of memory used between independent, repeated measurements.
DataAppendDataLargeToLarge mem_pages [i1, i2]: min=[43, 43] 𝚫=0 R=[0, 37]
How to read the data: The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB
--------------

@palimondo palimondo merged commit bd16513 into swiftlang:master Jan 8, 2019
@itaiferber
Contributor

@palimondo Regarding the large slice cases — I personally believe that both large slices and large inline slices are worth testing, but perhaps not at the cost of destabilizing other tests, or causing unnecessary delays in benchmarking for everyone. With more extensive, separate benchmarking infrastructure (to make it more reasonable to test this case more sparingly), it might be worth investigating. I think we're okay to drop the tests for now though.

@palimondo
Contributor Author

Perhaps I wasn't being clear enough. My point was that I believe the concern @phausler was raising about the .veryLarge test wasn't really necessary:

The arc4random is being used to ensure the kernel does not just hand out a non faulting page of memory. Just being an allocation is not enough to properly identify used cases.

When I played with Data.init(count:), I saw how the kernel cheats and gives you a lot of memory in almost no time. The huge delay comes only when writing to it. So I think we can use this to our advantage: quickly get the LargeSlice representation of Data backed with faulting pages. But that doesn't really matter for the purposes of our tests, because we don't want to be benchmarking how quickly the kernel copies memory. We are interested in the relative performance of the underlying _Representations, and even though they might be backed by the virtual memory subsystem's trickery, as long as our workloads are the same (expanding the count by the same amount, replacing the same number of bytes, etc.), we are correctly comparing the relative costs of InlineSlice vs. LargeSlice. Am I mistaken in my assumption here?

@phausler
Contributor

phausler commented Jan 8, 2019

The perf differential between InlineSlice and LargeSlice has to do with the indirection to the additional ref type storing the range. Ideally we want the performance of that indirection to be similar between the two, and at the very least to be dwarfed by normal access. Unfortunately the benchmarks as they are don't give us that kind of granularity; this is likely better done via an instrumentation mechanism. Most of these benchmarks are suitable to be stuck in an infinite (or practically infinite) loop and profiled via tools like Instruments.
