Description
To whom it may concern...
Short story
An attempt to migrate our production code to LTS-20.26 (GHC 9.2.8) revealed a significant performance regression.
Without the fix for #127:

```
time                 9.187 ms   (9.138 ms .. 9.250 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 9.177 ms   (9.138 ms .. 9.229 ms)
std dev              127.5 μs   (81.71 μs .. 204.6 μs)
```
...and with it:

```
time                 1.052 s    (1.018 s .. 1.089 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 1.052 s    (1.044 s .. 1.061 s)
std dev              9.869 ms   (4.780 ms .. 13.74 ms)
```
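For context, the numbers above are `criterion` output. A minimal sketch of the kind of harness that produces them; `parseTemplate` and `input` are hypothetical stand-ins, since the real parser and input are under NDA:

```haskell
import Criterion.Main (bench, defaultMain, nf)

-- Hypothetical stand-in for the real (NDA-covered) parsing entry point.
parseTemplate :: String -> Either String [String]
parseTemplate = Right . lines

-- Hypothetical stand-in for the real template input.
input :: String
input = unlines (replicate 1000 "some template line")

main :: IO ()
main = defaultMain
  [ bench "parseTemplate" (nf parseTemplate input) ]
```

`nf` forces the result to normal form on every iteration, so the timings reflect the full parse rather than a lazily deferred thunk.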
Based on these observations (and many more, shared below), I tend to conclude that the code affected by #127 runs roughly 100 times slower.
Long story and more detailed evidence
Once the slowdown was spotted, a drill-down analysis was carried out. Below I describe every step taken and its outcome.
Step 1: `perf` profiling

```
perf record -F 33 -g -e cycles:u -- $EXE +RTS -g2 -RTS
```

was used to profile the executable (`$EXE`) built with the `-O2` flag to ensure maximum optimization. The two profiles were then compared against each other. The following was observed for the slowed-down version of the code:
```
+   98,53%  98,53%  benchmark  benchmark  [.] ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc
```
Clearly, `ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc` is the z-encoded name of an `Ord` instance implementation, and `98,53%` means that 98.53% of all userspace cycles were spent executing it. That directed the investigation towards `RTS`-flag-based instrumentation, to get deeper insight into where these comparisons come from.
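For completeness, the recorded data was drilled into roughly as follows (a sketch; per-instruction attribution depends on debug info being present in the build):

```shell
# Hottest symbols with call-graph attribution, from the perf.data
# produced by the `perf record` invocation above.
perf report --stdio --sort symbol | head -n 20

# Per-instruction cycle attribution for the hot Ord method.
perf annotate --stdio ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc
```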
Step 2: `RTS` flags
```
$EXE +RTS -P -RTS
```

produced a Haskell-native runtime profile, which was then visualized as a flame graph with the `ghc-prof-flamegraph` command-line utility.
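The flame-graph step, roughly (a sketch assuming a cabal-based build; the actual build invocation is under NDA):

```shell
# Build with profiling enabled so that +RTS -P emits a cost-centre profile.
cabal build --enable-profiling benchmark

# Run the benchmark; this writes benchmark.prof in the working directory.
$EXE +RTS -P -RTS

# Render the cost-centre profile as a flame graph
# (exact invocation may vary with the ghc-prof-flamegraph version).
ghc-prof-flamegraph benchmark.prof
```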
Comparing the flame graphs clearly highlighted that `parseLines :: HamletSettings -> String -> Result (Maybe NewlineStyle, HamletSettings, [(Int, Line)])` started to eat up significantly more CPU time.[^1]
That pointed here (to parsec itself) and to its changelog, where #127 popped up as a potential root cause. Reverting #128 alone (the change that fixed #127) indeed rolls performance back to the expected level. To figure out what was happening, the next step was to dump GHC's intermediate representation.
Step 3: GHC intermediate representation analysis
The problematic piece of code was built with `-ddump-simpl -dsuppress-all -dno-suppress-type-signatures`, and the dumps for the two versions were compared. It turned out that the misperforming version generates more code of the following pattern:
```
(case s2_aeCm of { State ds7_aeCD ds8_aeCE ds9_aeCF ->
   case ds8_aeCE of ww10_s1l0j
   { SourcePos ww11_s1l0k ww12_s1l0l ww13_s1l0m ->
     case $fOrd[]_$s$ccompare1 ww4_s1l09 ww11_s1l0k of {
       LT -> ...;
       EQ -> ...;
       GT -> ...;
     }
   }
 })
```
with the `$fOrd[]_$s$ccompare1` method called about 8 times more often. Unfortunately, I lack knowledge of Haskell internals and thus am unable to analyze the intermediate representation further.
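For intuition, `$fOrd[]_$s$ccompare1` is a specialised `compare` for lists, which in this Core appears to be applied to the `String` source-name field of `SourcePos`. A minimal model, assuming the real `SourcePos` derives `Ord` in the usual field-by-field way:

```haskell
-- Minimal stand-in for parsec's Text.Parsec.Pos.SourcePos (assumption:
-- the real type derives Ord, so compare works left-to-right over fields).
data SourcePos = SourcePos String Int Int   -- source name, line, column
  deriving (Eq, Ord, Show)

main :: IO ()
main =
  -- The derived compare walks the source-name String character by
  -- character before it ever inspects the line or column, so every
  -- position comparison pays for a String comparison first.
  print (compare (SourcePos "template.hamlet" 1 1)
                 (SourcePos "template.hamlet" 2 1))
  -- prints LT (names are equal, so line 1 < line 2 decides)
```

If that assumption holds, a hot loop that compares positions (rather than, say, just line numbers) would naturally show up as time spent in a list-`compare` specialisation.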
The conclusion
To my understanding, these three steps point to #127 as the root cause. Yet I fail to explain how that could happen. As of now, a hacky hotfix allows us to migrate to GHC 9.2.8 while avoiding the regression; however, I would like to open a discussion on how one could properly fix it. The open questions are:
- How much confidence do these three steps deliver? Is that enough to consider "(>>=) leaks memory" (#127) the root cause, or should more steps be taken?
- What would be a better way to study the output of the `-ddump-simpl -dsuppress-all -dno-suppress-type-signatures` flags, and can it help make the issue even more concrete?
P.S.
At the moment, the code involved in the instrumentation is under an NDA, so I intentionally do not share it due to legal restrictions. I hope the observations above will be sufficient for this discussion to progress.
[^1]: `shakespeare-2.1.0` is what runs in production.