Decrease impact of gc on benchmark performance (fixes #185) #187

patrickdoc · 2018-02-25T04:48:29Z

Using the example from issue #185...

Before:

benchmarking small-fast
bootstrapping with 39 of 312 samples (12%)
time                 6.877 ns   (6.724 ns .. 7.062 ns)
                     0.996 R²   (0.994 R² .. 0.997 R²)
mean                 6.898 ns   (6.758 ns .. 7.034 ns)
std dev              448.3 ps   (391.7 ps .. 540.9 ps)
variance introduced by outliers: 83% (severely inflated)

benchmarking large-slow
measurement took 69.20 s
bootstrapping with 16 of 289 samples (5%)
time                 6.380 ns   (6.241 ns .. 6.542 ns)
                     0.996 R²   (0.994 R² .. 0.998 R²)
mean                 6.453 ns   (6.267 ns .. 6.668 ns)
std dev              414.9 ps   (320.0 ps .. 548.2 ps)
variance introduced by outliers: 82% (severely inflated)

     354,607,016 bytes allocated in the heap
  74,880,263,968 bytes copied during GC
      64,231,016 bytes maximum residency (2430 sample(s))
         224,872 bytes maximum slop
             134 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       322 colls,     0 par    0.368s   0.433s     0.0013s    0.0824s
  Gen  1      2430 colls,     0 par   68.188s  68.226s     0.0281s    0.0709s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    6.440s  (  6.450s elapsed)
  GC      time   68.556s  ( 68.659s elapsed)
  EXIT    time    0.004s  (  0.007s elapsed)
  Total   time   75.000s  ( 75.117s elapsed)

  %GC     time      91.4%  (91.4% elapsed)

  Alloc rate    55,063,201 bytes per MUT second

  Productivity   8.6% of total user, 8.6% of total elapsed

After:

benchmarking small-fast
bootstrapping with 42 of 316 samples (13%)
time                 5.902 ns   (5.888 ns .. 5.924 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 6.015 ns   (5.944 ns .. 6.119 ns)
std dev              281.0 ps   (210.5 ps .. 338.7 ps)
found 9 outliers among 42 samples (21.4%)
  1 (2.4%) high mild
  8 (19.0%) high severe
variance introduced by outliers: 72% (severely inflated)

benchmarking large-slow
bootstrapping with 43 of 316 samples (13%)
time                 5.890 ns   (5.884 ns .. 5.898 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.913 ns   (5.893 ns .. 5.990 ns)
std dev              119.0 ps   (16.88 ps .. 251.8 ps)
found 2 outliers among 43 samples (4.7%)
  1 (2.3%) high mild
  1 (2.3%) high severe
variance introduced by outliers: 32% (moderately inflated)

     362,379,952 bytes allocated in the heap
     399,909,824 bytes copied during GC
      64,196,880 bytes maximum residency (14 sample(s))
         236,992 bytes maximum slop
             135 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1594 colls,     0 par    0.448s   0.468s     0.0003s    0.0897s
  Gen  1        14 colls,     0 par    0.008s   0.010s     0.0007s    0.0012s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   10.312s  ( 10.330s elapsed)
  GC      time    0.456s  (  0.478s elapsed)
  EXIT    time    0.000s  (  0.006s elapsed)
  Total   time   10.768s  ( 10.814s elapsed)

  %GC     time       4.2%  (4.4% elapsed)

  Alloc rate    35,141,577 bytes per MUT second

  Productivity  95.8% of total user, 95.6% of total elapsed

Total time down from 75s -> 10s, 74.5GB -> 0.3GB copied in GC. But most other stats remained quite similar.

patrickdoc · 2018-02-25T05:26:53Z

Ah, tragically, performMinorGc wasn't implemented/exported until base 4.7.0. I've added the CPP to let it build, but it's not the most elegant. Let me know if you would like something different.

RyanGlScott

Ah, tragically, performMinorGc wasn't implemented/exported until base 4.7.0. I've added the CPP to let it build, but it's not the most elegant. Let me know if you would like something different.

I think we could hack around this without too much trouble. Even though performMinorGC wasn't exported on versions of base older than 4.7.0, it was still defined in the RTS, so we could be able to use it simply by using a foreign import. Something like this should suffice, I think:

#if !(MIN_VERSION_base(4,7,0))
foreign import ccall "performGC" performMinorGC :: IO ()
#endif

(Confusingly, performGC from base refers to performMajorGC in the RTS, and performMinorGC in base refers to performGC in the RTS. It took me a while to realize this when searching for it!)

RyanGlScott · 2018-02-25T14:17:53Z

Criterion/Measurement.hs

  performGC
+#endif


I'm not intimately familiar with the differences between performGc and performMinorGc. Is there some way we can be confident that it will update all of the appropriate statistics? For instance, these changes were made in #177, which were specifically aimed at fixing the measurement of the regression stats.

Very roughly, the difference between performMajorGC and performMinorGC is just a boolean flag in the rts. They both end up calling stat_endGC which updates all of the stats.

https://github.com/ghc/ghc/blob/bf3f0a6808b89504a993dc9f2839daff9eaf6f2a/rts/Stats.c#L281-L294

It's a rather long function, so I didn't link the whole thing here.

We can also eyeball the measured results. Allocation is set by endStatsPostGc, so if that is accurate, then we should be good.

Here's a sample of the benchmarks I have been running to check if things are working:

[ bench "Bool" $ whnf (toEnum :: Int -> Bool) 0 -- ^ Should be 0, True and False are shared in memory , bench "Char" $ whnf chr 0 -- ^ Should be 2 words, a single Char ]

Regressing on allocation with --regress allocated:iters, we can see that both give the same results. Strangely, iters indicates allocation per iteration, so that field should match the estimates in the comments above. y is the overhead, which doesn't affect the estimate. When 0 bytes are allocated, the results don't look amazing because of some noise in the stats, but 3e-6 is pretty close to 0.

Using performMajorGC:

benchmarking primitives/Bool time 4.378 ns (4.262 ns .. 4.518 ns) 0.995 R² (0.993 R² .. 0.998 R²) mean 4.445 ns (4.354 ns .. 4.583 ns) std dev 370.4 ps (283.0 ps .. 478.8 ps) allocated: 0.000 R² (0.000 R² .. 0.016 R²) iters 1.165e-7 (-1.934e-5 .. 2.305e-5) y 2403.036 (2222.900 .. 2578.186) benchmarking primitives/Char time 6.635 ns (6.613 ns .. 6.660 ns) 1.000 R² (1.000 R² .. 1.000 R²) mean 6.619 ns (6.595 ns .. 6.645 ns) std dev 81.02 ps (70.75 ps .. 93.33 ps) allocated: 1.000 R² (1.000 R² .. 1.000 R²) iters 16.000 (16.000 .. 16.000) y 1616.000 (1616.000 .. 1616.000)

Using performMinorGC:

benchmarking primitives/Bool time 5.259 ns (5.183 ns .. 5.344 ns) 0.999 R² (0.998 R² .. 1.000 R²) mean 5.228 ns (5.198 ns .. 5.270 ns) std dev 121.3 ps (85.08 ps .. 171.5 ps) allocated: 0.000 R² (0.000 R² .. 0.013 R²) iters -3.303e-6 (-2.181e-5 .. 1.898e-5) y 2408.118 (2225.315 .. 2593.435) benchmarking primitives/Char time 6.620 ns (6.591 ns .. 6.647 ns) 1.000 R² (1.000 R² .. 1.000 R²) mean 6.606 ns (6.580 ns .. 6.641 ns) std dev 100.2 ps (79.13 ps .. 128.8 ps) allocated: 1.000 R² (1.000 R² .. 1.000 R²) iters 16.000 (16.000 .. 16.000) y 2411.067 (2216.509 .. 2616.250)

I've checked with some larger structures too, like lazy and strict lists.

Looking at the implementation of stat_endGC, this part has me worried:

if (gen == RtsFlags.GcFlags.generations-1) { // major GC? stats.major_gcs++; if (stats.gc.live_bytes > stats.max_live_bytes) { stats.max_live_bytes = stats.gc.live_bytes; } if (stats.gc.large_objects_bytes > stats.max_large_objects_bytes) { stats.max_large_objects_bytes = stats.gc.large_objects_bytes; } if (stats.gc.compact_bytes > stats.max_compact_bytes) { stats.max_compact_bytes = stats.gc.compact_bytes; } if (stats.gc.slop_bytes > stats.max_slop_bytes) { stats.max_slop_bytes = stats.gc.slop_bytes; } stats.cumulative_live_bytes += stats.gc.live_bytes; }

To my (admittedly untrained) eye, this appears as if statistics like max_live_bytes (which criterion does record, in the form of gcStatsMaxBytesUsed) are only updated during a GC if that GC happens to be major. Is my understanding of this correct?

You are correct that these stats are only updated in a major GC. (you learn something new every 30s in the rts...) However, the measure function only uses a portion of the GCStatistics structure. The only data that gets used from GCStatistics is outlined in applyGCStatistics, here.

We don't care about the endPreGC because those cannot rely on a GC for updating. That leaves the following (with RTSStats names after each):

gcStatsBytesAllocated -> allocated_bytes

gcStatsBytesCopied -> copied_bytes

gcStatsMutatorWallSeconds -> mutator_elapsed_ns

gcStatsMutatorCpuSeconds -> mutator_cpu_ns

Looking to stat_endGC now, these are updated on lines:

allocated_bytes -> 337

copied_bytes -> 346

The mutator stats are actually updated in getRTSStats itself:

mutator_elapsed_ns -> 1015

mutator_cpu_ns -> 1016, same as ^

The fact that they are updated there makes me wonder how getGCStats used to do all this. I will investigate and make another comment double checking those.

First, conversions from GCStatistics to internal names:

gcStatsBytesAllocated -> bytes_allocated

gcStatsBytesCopied -> bytes_copied

gcStatsMutatorWallSeconds -> mutator_wall_seconds

gcStatsMutatorCpuSeconds -> mutator_cpu_seconds

Next, update locations. getGCStats mostly just reads from static global variables. These are the locations where those variables are updated:

bytes_allocated -> stat_endGC:414

bytes_copied -> stat_endGC:443

mutator_wall_seconds -> getGCStats:981

mutator_cpu_seconds -> getGCStats:982

Once again, the mutator time could (should?) probably just be read from the pre-gc stats, but nothing in the code base uses those stats. I'm not even sure they can be accessed, so I won't get sidetracked.

To put a tidy bow on all this: all data extracted from endStatsPostGC by measure is up to date regardless of using performMajorGC or performMinorGC.

Thank you for the extremely thorough rundown! That definitely assuaged my fears about performMinorGC.

In that case, I request that you make the following two changes:

Conditionally define performMinorGC using CPP, as suggested in Decrease impact of gc on benchmark performance (fixes #185) #187 (review) .

Add a brief comment (somewhere around measure) which states that performMinorGC updates all of the RTS stats we care about, so we can get away with using that instead of performGC.

After that, I'll merge.

RyanGlScott · 2018-02-26T18:01:14Z

It looks like the CI is failing due snoyberg/conduit#353. Let me see if that's easy to fix...

patrickdoc · 2018-02-26T18:06:14Z

Ah, I rebased in case I had missed something, but it seems I just sent the CI for a wasted build.

RyanGlScott · 2018-02-26T18:32:46Z

Might you be willing to rebase on top of master again? I believe bd5898b should fix the previous build failures.

patrickdoc · 2018-02-26T18:55:24Z

Well you now know that I don't know how to use ffi correctly... One last fix should do it.

RyanGlScott · 2018-02-26T19:28:18Z

Great! Thanks, @patrickdoc!

RyanGlScott reviewed Feb 25, 2018

View reviewed changes

patrickdoc added 5 commits February 26, 2018 12:34

Remove extra performGC's, change the others to performMinorGC

2624b0c

Document performMinorGc in changelog

7b26864

Add CPP guards to fix build on GHC 7.6 and 7.4

8788536

Fix formatting

c952a8e

Cleanup CPP and comment on performMinorGC

fa7f588

Fix performMinorGC foreign import

14dbeae

RyanGlScott merged commit 16c42bd into haskell:master Feb 26, 2018

patrickdoc deleted the less-gc branch February 26, 2018 19:35

patrickdoc mentioned this pull request Feb 27, 2018

Final measurement is dropped #179

Closed

RyanGlScott mentioned this pull request Jan 23, 2022

use performMajorGC #257

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Decrease impact of gc on benchmark performance (fixes #185) #187

Decrease impact of gc on benchmark performance (fixes #185) #187

Uh oh!

patrickdoc commented Feb 25, 2018

Uh oh!

patrickdoc commented Feb 25, 2018

Uh oh!

RyanGlScott left a comment

Uh oh!

RyanGlScott Feb 25, 2018

Uh oh!

patrickdoc Feb 25, 2018

Uh oh!

RyanGlScott Feb 25, 2018

Uh oh!

patrickdoc Feb 26, 2018 •

edited

Loading

Uh oh!

patrickdoc Feb 26, 2018

Uh oh!

RyanGlScott Feb 26, 2018

Uh oh!

RyanGlScott commented Feb 26, 2018

Uh oh!

patrickdoc commented Feb 26, 2018

Uh oh!

RyanGlScott commented Feb 26, 2018

Uh oh!

patrickdoc commented Feb 26, 2018

Uh oh!

RyanGlScott commented Feb 26, 2018

Uh oh!

Uh oh!

Decrease impact of gc on benchmark performance (fixes #185) #187

Decrease impact of gc on benchmark performance (fixes #185) #187

Uh oh!

Conversation

patrickdoc commented Feb 25, 2018

Uh oh!

patrickdoc commented Feb 25, 2018

Uh oh!

RyanGlScott left a comment

Choose a reason for hiding this comment

Uh oh!

RyanGlScott Feb 25, 2018

Choose a reason for hiding this comment

Uh oh!

patrickdoc Feb 25, 2018

Choose a reason for hiding this comment

Uh oh!

RyanGlScott Feb 25, 2018

Choose a reason for hiding this comment

Uh oh!

patrickdoc Feb 26, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

patrickdoc Feb 26, 2018

Choose a reason for hiding this comment

Uh oh!

RyanGlScott Feb 26, 2018

Choose a reason for hiding this comment

Uh oh!

RyanGlScott commented Feb 26, 2018

Uh oh!

patrickdoc commented Feb 26, 2018

Uh oh!

RyanGlScott commented Feb 26, 2018

Uh oh!

patrickdoc commented Feb 26, 2018

Uh oh!

RyanGlScott commented Feb 26, 2018

Uh oh!

Uh oh!

patrickdoc Feb 26, 2018 •

edited

Loading