
GC drastically reducing benchmark performance #185

@patrickdoc

Description


Type: performance bug

Demonstration

module Main where

import Criterion.Main
import qualified Data.Map as Map

main :: IO ()
main = defaultMain longGc

longGc :: [Benchmark]
longGc =
    let small = Map.fromList [ (k, v) | k <- [1..100], v <- ["data"] ]
        large = Map.fromList [ (k, v) | k <- [1..1000000], v <- ["data"] ]
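        -- whnf id forces the map only to WHNF, so both benchmarks do the
        -- same tiny amount of work; the maps just stay live on the heap.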
    in [ bench "small-fast" $ whnf id small
       , bench "large-slow" $ whnf id large
       ]

The only difference between these two benchmarks is the size of the data, so they should report similar times, and the output confirms that they do. However, the large benchmark takes about 13x longer to run.

benchmarking small-fast
bootstrapping with 42 of 312 samples (13%)
time                 7.105 ns   (6.882 ns .. 7.314 ns)

benchmarking large-slow
measurement took 69.15 s
bootstrapping with 16 of 289 samples (5%)
time                 6.506 ns   (6.286 ns .. 6.771 ns)
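
The GHC RTS statistics for the run (presumably collected with +RTS -s) show where that time goes: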

     354,932,168 bytes allocated in the heap
  74,880,374,424 bytes copied during GC
      64,230,936 bytes maximum residency (2430 sample(s))
         224,952 bytes maximum slop
             134 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       323 colls,     0 par    0.408s   0.431s     0.0013s    0.0813s
  Gen  1      2430 colls,     0 par   68.064s  68.127s     0.0280s    0.0717s

  MUT     time    6.576s  (  6.612s elapsed)
  GC      time   68.472s  ( 68.558s elapsed)
  Total   time   75.052s  ( 75.176s elapsed)

  %GC     time      91.2%  (91.2% elapsed) 
  Productivity   8.8% of total user, 8.8% of total elapsed

The small benchmark finishes in under 5 seconds (the default time limit for a single benchmark), while the large one takes over a minute.

The Issue

The main culprit is the ~75 GB copied during garbage collection. performGC is called 4 times for every sample, and the output above shows roughly 600 samples (312 + 289), which explains the ~2430 Gen 1 collections. At an average pause of 0.028 s, those collections alone account for about 68 s of the 75 s total, i.e. over 90% of the runtime.

Here are the offenders:

https://github.com/bos/criterion/blob/46dabb3bb9d1857aec851e21dc1df7b79f3490ec/Criterion/Measurement.hs#L174-L190

https://github.com/bos/criterion/blob/46dabb3bb9d1857aec851e21dc1df7b79f3490ec/Criterion/Measurement.hs#L253-L254

The f count run call in the second code block invokes the body of the first code block.
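
Condensed, the shape is roughly this (a paraphrase of the linked code, not the verbatim source):

-- Criterion/Measurement.hs, heavily condensed paraphrase:

measure bm iters = runBenchmarkable bm iters combine $ \n act -> do
  performGC    -- (1) settle the heap before the start readings
  -- ... record start time and GC statistics ...
  act          -- run the benchmarked action n times
  -- ... record end time and GC statistics ...
  performGC    -- (2) make the final GC statistics reflect this sample

-- and inside runBenchmarkable, around the continuation it was handed:
  performGC    -- (3)
  f count run  -- this is what invokes measure's lambda above
  performGC    -- (4)

Four major collections per sample; with ~64 MB of live data, each one copies the entire large Map.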

So at least 2 of the performGC calls are redundant when measure is called. The ones in runBenchmarkable are farther from the timed code and are therefore the better candidates for removal. The only code that calls runBenchmarkable but not measure is runFixedIters, which takes no measurements. The one sticking point is that runBenchmarkable is exported, so removing its collections could affect anyone using it directly.
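
For reference, the call structure as I read the module:

measure        --> runBenchmarkable    (takes measurements)
runFixedIters  --> runBenchmarkable    (no measurements)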

That covers half of the time. The two options I see for the other half are:

  • Use performMinorGC (not sure if valid)
  • Make the cost opt-in (a combined sketch follows this list)
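
A rough sketch of how these two options could combine; GCMode and gcBetweenSamples are hypothetical names, not criterion API, while System.Mem.performMinorGC is real and collects only the young generation:

import System.Mem (performGC, performMinorGC)

-- Hypothetical opt-in switch: by default skip copying the old
-- generation between samples; callers who want today's behaviour
-- ask for a full collection explicitly.
data GCMode = MajorGC | MinorGC

gcBetweenSamples :: GCMode -> IO ()
gcBetweenSamples MajorGC = performGC       -- copies all live data, as now
gcBetweenSamples MinorGC = performMinorGC  -- young generation only

With MinorGC between samples, most of the 2430 Gen 1 collections above would become cheap Gen 0 collections; whether the timings stay accurate is exactly what needs experimenting.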

I'll need to experiment a bit to see how important those two garbage collections are to the actual timing of the benchmarks. It could be that the opt-in approach won't work.

I have a solution to reduce the total number of samples too, which will slightly soften the blow if we can't get the second half down.
