
GC drastically reducing benchmark performance #185

@patrickdoc

Description


Type: performance bug

Demonstration

module Main where

import Criterion.Main
import qualified Data.Map as Map

main :: IO ()
main = defaultMain longGc

longGc :: [Benchmark]
longGc =
    let small = Map.fromList [ (k, v) | k <- [1..100], v <- ["data"] ]
        large = Map.fromList [ (k, v) | k <- [1..1000000], v <- ["data"] ]
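        -- whnf id forces the map only to WHNF, so both benchmarks do the
        -- same tiny amount of work; the maps just stay live on the heap.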
    in [ bench "small-fast" $ whnf id small
       , bench "large-slow" $ whnf id large
       ]

The only difference between these two benchmarks is the size of the data, so they should report similar times, and the output confirms that they do. However, the large benchmark takes about 13x longer to run.

benchmarking small-fast
bootstrapping with 42 of 312 samples (13%)
time                 7.105 ns   (6.882 ns .. 7.314 ns)

benchmarking large-slow
measurement took 69.15 s
bootstrapping with 16 of 289 samples (5%)
time                 6.506 ns   (6.286 ns .. 6.771 ns)
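
The GHC RTS statistics for the run (presumably collected with +RTS -s) show where that time goes: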

     354,932,168 bytes allocated in the heap
  74,880,374,424 bytes copied during GC
      64,230,936 bytes maximum residency (2430 sample(s))
         224,952 bytes maximum slop
             134 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       323 colls,     0 par    0.408s   0.431s     0.0013s    0.0813s
  Gen  1      2430 colls,     0 par   68.064s  68.127s     0.0280s    0.0717s

  MUT     time    6.576s  (  6.612s elapsed)
  GC      time   68.472s  ( 68.558s elapsed)
  Total   time   75.052s  ( 75.176s elapsed)

  %GC     time      91.2%  (91.2% elapsed) 
  Productivity   8.8% of total user, 8.8% of total elapsed

The small benchmark finishes in under 5 seconds (the default time limit for a single benchmark), while the large one takes over a minute.

The Issue

The main culprit is the ~75 GB copied during garbage collection. performGC is called 4 times for every sample, and the output above shows roughly 600 samples (312 + 289), which explains the ~2430 Gen 1 collections. At an average pause of 0.028 s, those collections alone account for about 68 s of the 75 s total, i.e. over 90% of the runtime.

Here are the offenders:

https://github.com/bos/criterion/blob/46dabb3bb9d1857aec851e21dc1df7b79f3490ec/Criterion/Measurement.hs#L174-L190

https://github.com/bos/criterion/blob/46dabb3bb9d1857aec851e21dc1df7b79f3490ec/Criterion/Measurement.hs#L253-L254

The f count run call in the second code block invokes the body of the first code block.
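
Condensed, the shape is roughly this (a paraphrase of the linked code, not the verbatim source):

-- Criterion/Measurement.hs, heavily condensed paraphrase:

measure bm iters = runBenchmarkable bm iters combine $ \n act -> do
  performGC    -- (1) settle the heap before the start readings
  -- ... record start time and GC statistics ...
  act          -- run the benchmarked action n times
  -- ... record end time and GC statistics ...
  performGC    -- (2) make the final GC statistics reflect this sample

-- and inside runBenchmarkable, around the continuation it was handed:
  performGC    -- (3)
  f count run  -- this is what invokes measure's lambda above
  performGC    -- (4)

Four major collections per sample; with ~64 MB of live data, each one copies the entire large Map.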

So at least 2 of the performGC calls are redundant when measure is called. The ones in runBenchmarkable are farther from the timed code and are therefore the better candidates for removal. The only code that calls runBenchmarkable but not measure is runFixedIters, which takes no measurements. The one sticking point is that runBenchmarkable is exported, so removing its collections could affect anyone using it directly.
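
For reference, the call structure as I read the module:

measure        --> runBenchmarkable    (takes measurements)
runFixedIters  --> runBenchmarkable    (no measurements)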

That covers half of the time. The two options I see for the other half are:

  • Use performMinorGC (not sure if valid)
  • Make the cost opt-in (a combined sketch follows this list)
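
A rough sketch of how these two options could combine; GCMode and gcBetweenSamples are hypothetical names, not criterion API, while System.Mem.performMinorGC is real and collects only the young generation:

import System.Mem (performGC, performMinorGC)

-- Hypothetical opt-in switch: by default skip copying the old
-- generation between samples; callers who want today's behaviour
-- ask for a full collection explicitly.
data GCMode = MajorGC | MinorGC

gcBetweenSamples :: GCMode -> IO ()
gcBetweenSamples MajorGC = performGC       -- copies all live data, as now
gcBetweenSamples MinorGC = performMinorGC  -- young generation only

With MinorGC between samples, most of the 2430 Gen 1 collections above would become cheap Gen 0 collections; whether the timings stay accurate is exactly what needs experimenting.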

I'll need to experiment a bit to see how important those two garbage collections are to the actual timing of the benchmarks. It could be that the opt-in approach won't work.

I have a solution to reduce the total number of samples too, which will slightly soften the blow if we can't get the second half down.
