Decrease impact of gc on benchmark performance (fixes #185) #187
Conversation
Ah, tragically, performMinorGC wasn't implemented/exported until base 4.7.0. I've added the CPP to let it build, but it's not the most elegant. Let me know if you would like something different.
I think we could hack around this without too much trouble. Even though performMinorGC wasn't exported in versions of base older than 4.7.0, it was still defined in the RTS, so we should be able to use it via a foreign import. Something like this should suffice, I think:
#if !(MIN_VERSION_base(4,7,0))
foreign import ccall "performGC" performMinorGC :: IO ()
#endif
(Confusingly, performGC from base refers to performMajorGC in the RTS, and performMinorGC in base refers to performGC in the RTS. It took me a while to realize this when searching for it!)
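To make the suggestion concrete, here is a minimal sketch of how it could be packaged. The module name GcCompat and the explicit two-branch split are my own framing rather than code from this PR; System.Mem exports performMinorGC from base 4.7.0 onward, and the MIN_VERSION_base macro is supplied by Cabal.

{-# LANGUAGE CPP #-}
{-# LANGUAGE ForeignFunctionInterface #-}
-- Hypothetical compatibility shim; not criterion's actual module layout.
module GcCompat (performMinorGC) where

#if MIN_VERSION_base(4,7,0)
-- Newer base exports the minor-collection entry point directly.
import System.Mem (performMinorGC)
#else
-- Older base does not export it, but the RTS symbol has always existed.
-- Note the naming flip described above: the RTS's "performGC" is the
-- minor collection, so we bind it to the Haskell name performMinorGC.
foreign import ccall "performGC" performMinorGC :: IO ()
#endif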
Criterion/Measurement.hs (outdated)
performGC
#endif
I'm not intimately familiar with the differences between performGC and performMinorGC. Is there some way we can be confident that it will update all of the appropriate statistics? For instance, these changes were made in #177, which was specifically aimed at fixing the measurement of the regression stats.
Very roughly, the difference between performMajorGC and performMinorGC is just a boolean flag in the RTS. They both end up calling stat_endGC, which updates all of the stats. It's a rather long function, so I didn't link the whole thing here.
We can also eyeball the measured results. Allocation is set by endStatsPostGC, so if that is accurate, then we should be good.
Here's a sample of the benchmarks I have been running to check if things are working:
[ bench "Bool" $ whnf (toEnum :: Int -> Bool) 0
-- ^ Should be 0, True and False are shared in memory
, bench "Char" $ whnf chr 0
-- ^ Should be 2 words, a single Char
]
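For reference, here is a self-contained program built around that snippet. The surrounding main, the imports, and the use of defaultMain are mine; the "primitives" group name is taken from the benchmark labels in the output below. Running it with --regress allocated:iters, plus +RTS -T so the RTS actually collects GC statistics, produces output of the kind shown next.

import Criterion.Main (bench, bgroup, defaultMain, whnf)
import Data.Char (chr)

main :: IO ()
main = defaultMain
  [ bgroup "primitives"
      [ bench "Bool" $ whnf (toEnum :: Int -> Bool) 0
        -- Should allocate 0 bytes: True and False are shared in memory.
      , bench "Char" $ whnf chr 0
        -- Should allocate 2 words: a single boxed Char.
      ]
  ]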
Regressing on allocation with --regress allocated:iters, we can see that both give the same results. Slightly confusingly, the iters coefficient indicates allocation per iteration, so that field should match the estimates in the comments above. y is the overhead, which doesn't affect the estimate. When 0 bytes are allocated, the results don't look amazing because of some noise in the stats, but 3e-6 is pretty close to 0.
Using performMajorGC:
benchmarking primitives/Bool
time 4.378 ns (4.262 ns .. 4.518 ns)
0.995 R² (0.993 R² .. 0.998 R²)
mean 4.445 ns (4.354 ns .. 4.583 ns)
std dev 370.4 ps (283.0 ps .. 478.8 ps)
allocated: 0.000 R² (0.000 R² .. 0.016 R²)
iters 1.165e-7 (-1.934e-5 .. 2.305e-5)
y 2403.036 (2222.900 .. 2578.186)
benchmarking primitives/Char
time 6.635 ns (6.613 ns .. 6.660 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 6.619 ns (6.595 ns .. 6.645 ns)
std dev 81.02 ps (70.75 ps .. 93.33 ps)
allocated: 1.000 R² (1.000 R² .. 1.000 R²)
iters 16.000 (16.000 .. 16.000)
y 1616.000 (1616.000 .. 1616.000)
Using performMinorGC:
benchmarking primitives/Bool
time 5.259 ns (5.183 ns .. 5.344 ns)
0.999 R² (0.998 R² .. 1.000 R²)
mean 5.228 ns (5.198 ns .. 5.270 ns)
std dev 121.3 ps (85.08 ps .. 171.5 ps)
allocated: 0.000 R² (0.000 R² .. 0.013 R²)
iters -3.303e-6 (-2.181e-5 .. 1.898e-5)
y 2408.118 (2225.315 .. 2593.435)
benchmarking primitives/Char
time 6.620 ns (6.591 ns .. 6.647 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 6.606 ns (6.580 ns .. 6.641 ns)
std dev 100.2 ps (79.13 ps .. 128.8 ps)
allocated: 1.000 R² (1.000 R² .. 1.000 R²)
iters 16.000 (16.000 .. 16.000)
y 2411.067 (2216.509 .. 2616.250)
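Reading those coefficients against the estimates in the bench comments (assuming the usual 8-byte words on a 64-bit machine): the Char benchmark reports iters = 16.000 under both GC calls, i.e. 16 bytes = 2 words allocated per iteration, exactly as predicted, while the Bool benchmark's per-iteration estimates (1.165e-7 and -3.303e-6 bytes) are just noise around zero allocation.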
I've checked with some larger structures too, like lazy and strict lists.
Looking at the implementation of stat_endGC, this part has me worried:
if (gen == RtsFlags.GcFlags.generations-1) { // major GC?
    stats.major_gcs++;
    if (stats.gc.live_bytes > stats.max_live_bytes) {
        stats.max_live_bytes = stats.gc.live_bytes;
    }
    if (stats.gc.large_objects_bytes > stats.max_large_objects_bytes) {
        stats.max_large_objects_bytes = stats.gc.large_objects_bytes;
    }
    if (stats.gc.compact_bytes > stats.max_compact_bytes) {
        stats.max_compact_bytes = stats.gc.compact_bytes;
    }
    if (stats.gc.slop_bytes > stats.max_slop_bytes) {
        stats.max_slop_bytes = stats.gc.slop_bytes;
    }
    stats.cumulative_live_bytes += stats.gc.live_bytes;
}
To my (admittedly untrained) eye, this appears as if statistics like max_live_bytes (which criterion does record, in the form of gcStatsMaxBytesUsed) are only updated during a GC if that GC happens to be major. Is my understanding of this correct?
You are correct that these stats are only updated in a major GC. (You learn something new every 30 seconds in the RTS...) However, the measure function only uses a portion of the GCStatistics structure. The only data that gets used from GCStatistics is outlined in applyGCStatistics, here.
We don't care about endPreGC, because those values cannot rely on a GC for updating. That leaves the following (with the RTSStats name after each):
gcStatsBytesAllocated -> allocated_bytes
gcStatsBytesCopied -> copied_bytes
gcStatsMutatorWallSeconds -> mutator_elapsed_ns
gcStatsMutatorCpuSeconds -> mutator_cpu_ns
Looking at stat_endGC now: the allocation and copy counts are updated inside that function, while the mutator stats are actually updated in getRTSStats itself:
mutator_elapsed_ns -> line 1015
mutator_cpu_ns -> line 1016, same as above
The fact that they are updated there makes me wonder how getGCStats used to do all this. I will investigate and make another comment double-checking those.
First, conversions from GCStatistics to internal names:
gcStatsBytesAllocated -> bytes_allocated
gcStatsBytesCopied -> bytes_copied
gcStatsMutatorWallSeconds -> mutator_wall_seconds
gcStatsMutatorCpuSeconds -> mutator_cpu_seconds
Next, update locations. getGCStats mostly just reads from static global variables. These are the locations where those variables are updated:
bytes_allocated -> stat_endGC:414
bytes_copied -> stat_endGC:443
mutator_wall_seconds -> getGCStats:981
mutator_cpu_seconds -> getGCStats:982
Once again, the mutator time could (should?) probably just be read from the pre-GC stats, but nothing in the codebase uses those stats. I'm not even sure they can be accessed, so I won't get sidetracked.
To put a tidy bow on all this: all data extracted from endStatsPostGC by measure is up to date regardless of whether we use performMajorGC or performMinorGC.
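To make that concrete, here is a small standalone sketch, entirely my own code rather than criterion's measure, that reads exactly those four RTSStats fields around an action, doing a minor GC before each read so the counters are current. It assumes the GHC.Stats API from GHC 8.2 onwards and a program run with +RTS -T.

import Data.Int (Int64)
import Data.Word (Word64)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)
import System.Mem (performMinorGC)

-- Deltas of the four statistics the measurement loop consumes.
data GCDelta = GCDelta
  { dAllocatedBytes :: Word64  -- allocated_bytes
  , dCopiedBytes    :: Word64  -- copied_bytes
  , dMutatorWallNs  :: Int64   -- mutator_elapsed_ns
  , dMutatorCpuNs   :: Int64   -- mutator_cpu_ns
  } deriving Show

-- Run an action and report how those four statistics moved across it.
-- The performMinorGC before each read is what keeps them up to date.
withGCDelta :: IO a -> IO (a, Maybe GCDelta)
withGCDelta act = do
  enabled <- getRTSStatsEnabled        -- False unless run with +RTS -T
  if not enabled
    then do
      r <- act
      pure (r, Nothing)
    else do
      performMinorGC
      before <- getRTSStats
      r <- act
      performMinorGC
      after <- getRTSStats
      pure
        ( r
        , Just GCDelta
            { dAllocatedBytes = allocated_bytes after - allocated_bytes before
            , dCopiedBytes    = copied_bytes after - copied_bytes before
            , dMutatorWallNs  = mutator_elapsed_ns after - mutator_elapsed_ns before
            , dMutatorCpuNs   = mutator_cpu_ns after - mutator_cpu_ns before
            }
        )

main :: IO ()
main = do
  (_, delta) <- withGCDelta (pure $! length [1 .. 100000 :: Int])
  print delta

The point is the same as the summary above: a minor collection is enough to make all four of these numbers trustworthy.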
Thank you for the extremely thorough rundown! That definitely assuaged my fears about performMinorGC.
In that case, I request that you make the following two changes:
- Conditionally define performMinorGC using CPP, as suggested in #187 (review).
- Add a brief comment (somewhere around measure) which states that performMinorGC updates all of the RTS stats we care about, so we can get away with using that instead of performGC.
After that, I'll merge.
It looks like the CI is failing due to snoyberg/conduit#353. Let me see if that's easy to fix...
Ah, I rebased in case I had missed something, but it seems I just sent the CI for a wasted build.
Might you be willing to rebase on top of
Well, you now know that I don't know how to use the FFI correctly... One last fix should do it.
Great! Thanks, @patrickdoc!
Using the example from issue #185...
Before:
After:
Total time is down from 75s to 10s, and bytes copied during GC from 74.5GB to 0.3GB, but most other stats remained quite similar.