-
Notifications
You must be signed in to change notification settings - Fork 88
Decrease impact of gc on benchmark performance (fixes #185) #187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Ah, tragically, |
RyanGlScott
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, tragically,
performMinorGcwasn't implemented/exported until base 4.7.0. I've added the CPP to let it build, but it's not the most elegant. Let me know if you would like something different.
I think we could hack around this without too much trouble. Even though performMinorGC wasn't exported on versions of base older than 4.7.0, it was still defined in the RTS, so we could be able to use it simply by using a foreign import. Something like this should suffice, I think:
#if !(MIN_VERSION_base(4,7,0))
foreign import ccall "performGC" performMinorGC :: IO ()
#endif(Confusingly, performGC from base refers to performMajorGC in the RTS, and performMinorGC in base refers to performGC in the RTS. It took me a while to realize this when searching for it!)
Criterion/Measurement.hs
Outdated
| performMinorGC | ||
| #else | ||
| performGC | ||
| #endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not intimately familiar with the differences between performGc and performMinorGc. Is there some way we can be confident that it will update all of the appropriate statistics? For instance, these changes were made in #177, which were specifically aimed at fixing the measurement of the regression stats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very roughly, the difference between performMajorGC and performMinorGC is just a boolean flag in the rts. They both end up calling stat_endGC which updates all of the stats.
It's a rather long function, so I didn't link the whole thing here.
We can also eyeball the measured results. Allocation is set by endStatsPostGc, so if that is accurate, then we should be good.
Here's a sample of the benchmarks I have been running to check if things are working:
[ bench "Bool" $ whnf (toEnum :: Int -> Bool) 0
-- ^ Should be 0, True and False are shared in memory
, bench "Char" $ whnf chr 0
-- ^ Should be 2 words, a single Char
]Regressing on allocation with --regress allocated:iters, we can see that both give the same results. Strangely, iters indicates allocation per iteration, so that field should match the estimates in the comments above. y is the overhead, which doesn't affect the estimate. When 0 bytes are allocated, the results don't look amazing because of some noise in the stats, but 3e-6 is pretty close to 0.
Using performMajorGC:
benchmarking primitives/Bool
time 4.378 ns (4.262 ns .. 4.518 ns)
0.995 R² (0.993 R² .. 0.998 R²)
mean 4.445 ns (4.354 ns .. 4.583 ns)
std dev 370.4 ps (283.0 ps .. 478.8 ps)
allocated: 0.000 R² (0.000 R² .. 0.016 R²)
iters 1.165e-7 (-1.934e-5 .. 2.305e-5)
y 2403.036 (2222.900 .. 2578.186)
benchmarking primitives/Char
time 6.635 ns (6.613 ns .. 6.660 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 6.619 ns (6.595 ns .. 6.645 ns)
std dev 81.02 ps (70.75 ps .. 93.33 ps)
allocated: 1.000 R² (1.000 R² .. 1.000 R²)
iters 16.000 (16.000 .. 16.000)
y 1616.000 (1616.000 .. 1616.000)
Using performMinorGC:
benchmarking primitives/Bool
time 5.259 ns (5.183 ns .. 5.344 ns)
0.999 R² (0.998 R² .. 1.000 R²)
mean 5.228 ns (5.198 ns .. 5.270 ns)
std dev 121.3 ps (85.08 ps .. 171.5 ps)
allocated: 0.000 R² (0.000 R² .. 0.013 R²)
iters -3.303e-6 (-2.181e-5 .. 1.898e-5)
y 2408.118 (2225.315 .. 2593.435)
benchmarking primitives/Char
time 6.620 ns (6.591 ns .. 6.647 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 6.606 ns (6.580 ns .. 6.641 ns)
std dev 100.2 ps (79.13 ps .. 128.8 ps)
allocated: 1.000 R² (1.000 R² .. 1.000 R²)
iters 16.000 (16.000 .. 16.000)
y 2411.067 (2216.509 .. 2616.250)
I've checked with some larger structures too, like lazy and strict lists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the implementation of stat_endGC, this part has me worried:
if (gen == RtsFlags.GcFlags.generations-1) { // major GC?
stats.major_gcs++;
if (stats.gc.live_bytes > stats.max_live_bytes) {
stats.max_live_bytes = stats.gc.live_bytes;
}
if (stats.gc.large_objects_bytes > stats.max_large_objects_bytes) {
stats.max_large_objects_bytes = stats.gc.large_objects_bytes;
}
if (stats.gc.compact_bytes > stats.max_compact_bytes) {
stats.max_compact_bytes = stats.gc.compact_bytes;
}
if (stats.gc.slop_bytes > stats.max_slop_bytes) {
stats.max_slop_bytes = stats.gc.slop_bytes;
}
stats.cumulative_live_bytes += stats.gc.live_bytes;
}To my (admittedly untrained) eye, this appears as if statistics like max_live_bytes (which criterion does record, in the form of gcStatsMaxBytesUsed) are only updated during a GC if that GC happens to be major. Is my understanding of this correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct that these stats are only updated in a major GC. (you learn something new every 30s in the rts...) However, the measure function only uses a portion of the GCStatistics structure. The only data that gets used from GCStatistics is outlined in applyGCStatistics, here.
We don't care about the endPreGC because those cannot rely on a GC for updating. That leaves the following (with RTSStats names after each):
gcStatsBytesAllocated->allocated_bytesgcStatsBytesCopied->copied_bytesgcStatsMutatorWallSeconds->mutator_elapsed_nsgcStatsMutatorCpuSeconds->mutator_cpu_ns
Looking to stat_endGC now, these are updated on lines:
The mutator stats are actually updated in getRTSStats itself:
mutator_elapsed_ns-> 1015mutator_cpu_ns-> 1016, same as ^
The fact that they are updated there makes me wonder how getGCStats used to do all this. I will investigate and make another comment double checking those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, conversions from GCStatistics to internal names:
gcStatsBytesAllocated->bytes_allocatedgcStatsBytesCopied->bytes_copiedgcStatsMutatorWallSeconds->mutator_wall_secondsgcStatsMutatorCpuSeconds->mutator_cpu_seconds
Next, update locations. getGCStats mostly just reads from static global variables. These are the locations where those variables are updated:
bytes_allocated-> stat_endGC:414bytes_copied-> stat_endGC:443mutator_wall_seconds-> getGCStats:981mutator_cpu_seconds-> getGCStats:982
Once again, the mutator time could (should?) probably just be read from the pre-gc stats, but nothing in the code base uses those stats. I'm not even sure they can be accessed, so I won't get sidetracked.
To put a tidy bow on all this: all data extracted from endStatsPostGC by measure is up to date regardless of using performMajorGC or performMinorGC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the extremely thorough rundown! That definitely assuaged my fears about performMinorGC.
In that case, I request that you make the following two changes:
- Conditionally define
performMinorGCusing CPP, as suggested in Decrease impact of gc on benchmark performance (fixes #185) #187 (review) . - Add a brief comment (somewhere around
measure) which states thatperformMinorGCupdates all of the RTS stats we care about, so we can get away with using that instead ofperformGC.
After that, I'll merge.
|
It looks like the CI is failing due snoyberg/conduit#353. Let me see if that's easy to fix... |
|
Ah, I rebased in case I had missed something, but it seems I just sent the CI for a wasted build. |
|
Might you be willing to rebase on top of |
|
Well you now know that I don't know how to use ffi correctly... One last fix should do it. |
|
Great! Thanks, @patrickdoc! |
Using the example from issue #185...
Before:
After:
Total time down from 75s -> 10s, 74.5GB -> 0.3GB copied in GC. But most other stats remained quite similar.