Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decrease impact of gc on benchmark performance (fixes #185) #187

Merged
merged 6 commits into from
Feb 26, 2018
Merged

Decrease impact of gc on benchmark performance (fixes #185) #187

merged 6 commits into from
Feb 26, 2018

Conversation

patrickdoc
Copy link
Contributor

Using the example from issue #185...

Before:

benchmarking small-fast
bootstrapping with 39 of 312 samples (12%)
time                 6.877 ns   (6.724 ns .. 7.062 ns)
                     0.996 R²   (0.994 R² .. 0.997 R²)
mean                 6.898 ns   (6.758 ns .. 7.034 ns)
std dev              448.3 ps   (391.7 ps .. 540.9 ps)
variance introduced by outliers: 83% (severely inflated)

benchmarking large-slow
measurement took 69.20 s
bootstrapping with 16 of 289 samples (5%)
time                 6.380 ns   (6.241 ns .. 6.542 ns)
                     0.996 R²   (0.994 R² .. 0.998 R²)
mean                 6.453 ns   (6.267 ns .. 6.668 ns)
std dev              414.9 ps   (320.0 ps .. 548.2 ps)
variance introduced by outliers: 82% (severely inflated)

     354,607,016 bytes allocated in the heap
  74,880,263,968 bytes copied during GC
      64,231,016 bytes maximum residency (2430 sample(s))
         224,872 bytes maximum slop
             134 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       322 colls,     0 par    0.368s   0.433s     0.0013s    0.0824s
  Gen  1      2430 colls,     0 par   68.188s  68.226s     0.0281s    0.0709s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    6.440s  (  6.450s elapsed)
  GC      time   68.556s  ( 68.659s elapsed)
  EXIT    time    0.004s  (  0.007s elapsed)
  Total   time   75.000s  ( 75.117s elapsed)

  %GC     time      91.4%  (91.4% elapsed)

  Alloc rate    55,063,201 bytes per MUT second

  Productivity   8.6% of total user, 8.6% of total elapsed

After:

benchmarking small-fast
bootstrapping with 42 of 316 samples (13%)
time                 5.902 ns   (5.888 ns .. 5.924 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 6.015 ns   (5.944 ns .. 6.119 ns)
std dev              281.0 ps   (210.5 ps .. 338.7 ps)
found 9 outliers among 42 samples (21.4%)
  1 (2.4%) high mild
  8 (19.0%) high severe
variance introduced by outliers: 72% (severely inflated)

benchmarking large-slow
bootstrapping with 43 of 316 samples (13%)
time                 5.890 ns   (5.884 ns .. 5.898 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.913 ns   (5.893 ns .. 5.990 ns)
std dev              119.0 ps   (16.88 ps .. 251.8 ps)
found 2 outliers among 43 samples (4.7%)
  1 (2.3%) high mild
  1 (2.3%) high severe
variance introduced by outliers: 32% (moderately inflated)

     362,379,952 bytes allocated in the heap
     399,909,824 bytes copied during GC
      64,196,880 bytes maximum residency (14 sample(s))
         236,992 bytes maximum slop
             135 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1594 colls,     0 par    0.448s   0.468s     0.0003s    0.0897s
  Gen  1        14 colls,     0 par    0.008s   0.010s     0.0007s    0.0012s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   10.312s  ( 10.330s elapsed)
  GC      time    0.456s  (  0.478s elapsed)
  EXIT    time    0.000s  (  0.006s elapsed)
  Total   time   10.768s  ( 10.814s elapsed)

  %GC     time       4.2%  (4.4% elapsed)

  Alloc rate    35,141,577 bytes per MUT second

  Productivity  95.8% of total user, 95.6% of total elapsed

Total time down from 75s -> 10s, 74.5GB -> 0.3GB copied in GC. But most other stats remained quite similar.

@patrickdoc
Copy link
Contributor Author

Ah, tragically, performMinorGc wasn't implemented/exported until base 4.7.0. I've added the CPP to let it build, but it's not the most elegant. Let me know if you would like something different.

Copy link
Member

@RyanGlScott RyanGlScott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, tragically, performMinorGc wasn't implemented/exported until base 4.7.0. I've added the CPP to let it build, but it's not the most elegant. Let me know if you would like something different.

I think we could hack around this without too much trouble. Even though performMinorGC wasn't exported on versions of base older than 4.7.0, it was still defined in the RTS, so we could be able to use it simply by using a foreign import. Something like this should suffice, I think:

#if !(MIN_VERSION_base(4,7,0))
foreign import ccall "performGC" performMinorGC :: IO ()
#endif

(Confusingly, performGC from base refers to performMajorGC in the RTS, and performMinorGC in base refers to performGC in the RTS. It took me a while to realize this when searching for it!)

performGC
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not intimately familiar with the differences between performGc and performMinorGc. Is there some way we can be confident that it will update all of the appropriate statistics? For instance, these changes were made in #177, which were specifically aimed at fixing the measurement of the regression stats.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very roughly, the difference between performMajorGC and performMinorGC is just a boolean flag in the rts. They both end up calling stat_endGC which updates all of the stats.

https://github.com/ghc/ghc/blob/bf3f0a6808b89504a993dc9f2839daff9eaf6f2a/rts/Stats.c#L281-L294

It's a rather long function, so I didn't link the whole thing here.

We can also eyeball the measured results. Allocation is set by endStatsPostGc, so if that is accurate, then we should be good.

Here's a sample of the benchmarks I have been running to check if things are working:

[ bench "Bool" $ whnf (toEnum :: Int -> Bool) 0
  -- ^ Should be 0, True and False are shared in memory
, bench "Char" $ whnf chr 0
  -- ^ Should be 2 words, a single Char
]

Regressing on allocation with --regress allocated:iters, we can see that both give the same results. Strangely, iters indicates allocation per iteration, so that field should match the estimates in the comments above. y is the overhead, which doesn't affect the estimate. When 0 bytes are allocated, the results don't look amazing because of some noise in the stats, but 3e-6 is pretty close to 0.

Using performMajorGC:

benchmarking primitives/Bool
time                 4.378 ns   (4.262 ns .. 4.518 ns)
                     0.995 R²   (0.993 R² .. 0.998 R²)
mean                 4.445 ns   (4.354 ns .. 4.583 ns)
std dev              370.4 ps   (283.0 ps .. 478.8 ps)
allocated:           0.000 R²   (0.000 R² .. 0.016 R²)
  iters              1.165e-7   (-1.934e-5 .. 2.305e-5)
  y                  2403.036   (2222.900 .. 2578.186)

benchmarking primitives/Char
time                 6.635 ns   (6.613 ns .. 6.660 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 6.619 ns   (6.595 ns .. 6.645 ns)
std dev              81.02 ps   (70.75 ps .. 93.33 ps)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              16.000     (16.000 .. 16.000)
  y                  1616.000   (1616.000 .. 1616.000)

Using performMinorGC:

benchmarking primitives/Bool
time                 5.259 ns   (5.183 ns .. 5.344 ns)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 5.228 ns   (5.198 ns .. 5.270 ns)
std dev              121.3 ps   (85.08 ps .. 171.5 ps)
allocated:           0.000 R²   (0.000 R² .. 0.013 R²)
  iters              -3.303e-6  (-2.181e-5 .. 1.898e-5)
  y                  2408.118   (2225.315 .. 2593.435)


benchmarking primitives/Char
time                 6.620 ns   (6.591 ns .. 6.647 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 6.606 ns   (6.580 ns .. 6.641 ns)
std dev              100.2 ps   (79.13 ps .. 128.8 ps)
allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
  iters              16.000     (16.000 .. 16.000)
  y                  2411.067   (2216.509 .. 2616.250)

I've checked with some larger structures too, like lazy and strict lists.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the implementation of stat_endGC, this part has me worried:

    if (gen == RtsFlags.GcFlags.generations-1) { // major GC?
        stats.major_gcs++;
        if (stats.gc.live_bytes > stats.max_live_bytes) {
            stats.max_live_bytes = stats.gc.live_bytes;
        }
        if (stats.gc.large_objects_bytes > stats.max_large_objects_bytes) {
            stats.max_large_objects_bytes = stats.gc.large_objects_bytes;
        }
        if (stats.gc.compact_bytes > stats.max_compact_bytes) {
            stats.max_compact_bytes = stats.gc.compact_bytes;
        }
        if (stats.gc.slop_bytes > stats.max_slop_bytes) {
            stats.max_slop_bytes = stats.gc.slop_bytes;
        }
        stats.cumulative_live_bytes += stats.gc.live_bytes;
    }

To my (admittedly untrained) eye, this appears as if statistics like max_live_bytes (which criterion does record, in the form of gcStatsMaxBytesUsed) are only updated during a GC if that GC happens to be major. Is my understanding of this correct?

Copy link
Contributor Author

@patrickdoc patrickdoc Feb 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct that these stats are only updated in a major GC. (you learn something new every 30s in the rts...) However, the measure function only uses a portion of the GCStatistics structure. The only data that gets used from GCStatistics is outlined in applyGCStatistics, here.

We don't care about the endPreGC because those cannot rely on a GC for updating. That leaves the following (with RTSStats names after each):

  • gcStatsBytesAllocated -> allocated_bytes
  • gcStatsBytesCopied -> copied_bytes
  • gcStatsMutatorWallSeconds -> mutator_elapsed_ns
  • gcStatsMutatorCpuSeconds -> mutator_cpu_ns

Looking to stat_endGC now, these are updated on lines:

  • allocated_bytes -> 337
  • copied_bytes -> 346

The mutator stats are actually updated in getRTSStats itself:

  • mutator_elapsed_ns -> 1015
  • mutator_cpu_ns -> 1016, same as ^

The fact that they are updated there makes me wonder how getGCStats used to do all this. I will investigate and make another comment double checking those.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, conversions from GCStatistics to internal names:

  • gcStatsBytesAllocated -> bytes_allocated
  • gcStatsBytesCopied -> bytes_copied
  • gcStatsMutatorWallSeconds -> mutator_wall_seconds
  • gcStatsMutatorCpuSeconds -> mutator_cpu_seconds

Next, update locations. getGCStats mostly just reads from static global variables. These are the locations where those variables are updated:

Once again, the mutator time could (should?) probably just be read from the pre-gc stats, but nothing in the code base uses those stats. I'm not even sure they can be accessed, so I won't get sidetracked.

To put a tidy bow on all this: all data extracted from endStatsPostGC by measure is up to date regardless of using performMajorGC or performMinorGC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the extremely thorough rundown! That definitely assuaged my fears about performMinorGC.

In that case, I request that you make the following two changes:

After that, I'll merge.

@RyanGlScott
Copy link
Member

It looks like the CI is failing due snoyberg/conduit#353. Let me see if that's easy to fix...

@patrickdoc
Copy link
Contributor Author

Ah, I rebased in case I had missed something, but it seems I just sent the CI for a wasted build.

@RyanGlScott
Copy link
Member

Might you be willing to rebase on top of master again? I believe bd5898b should fix the previous build failures.

@patrickdoc
Copy link
Contributor Author

Well you now know that I don't know how to use ffi correctly... One last fix should do it.

@RyanGlScott
Copy link
Member

Great! Thanks, @patrickdoc!

@RyanGlScott RyanGlScott merged commit 16c42bd into haskell:master Feb 26, 2018
@patrickdoc patrickdoc deleted the less-gc branch February 26, 2018 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants