
The recent fix of the (>>=) memory leak seems to cause an enormous performance degradation.  #171

Closed
@silencespeakstruth

Description

To whom it may concern...

Short story

An attempt to migrate our production to LTS-20.26 (GHC 9.2.8) accidentally revealed a significant performance regression.

Without the #127 fix:

time                 9.187 ms   (9.138 ms .. 9.250 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 9.177 ms   (9.138 ms .. 9.229 ms)
std dev              127.5 μs   (81.71 μs .. 204.6 μs)

...and with it:

time                 1.052 s    (1.018 s .. 1.089 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 1.052 s    (1.044 s .. 1.061 s)
std dev              9.869 ms   (4.780 ms .. 13.74 ms)

Based on these observations (and many others, shared down below), I tend to conclude that the #127 fix makes this code path roughly 100 times slower.
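
For anyone wanting to sanity-check such an A/B comparison without a full criterion setup, here is a stdlib-only sketch; `timeIt` and the toy workload are hypothetical names of mine, not taken from the benchmark (which is under NDA):

```haskell
import System.CPUTime (getCPUTime)
import Text.Printf (printf)

-- Time the evaluation of a value to weak head normal form.
-- getCPUTime reports picoseconds, hence the 1e9 divisor for milliseconds.
timeIt :: a -> IO (a, Double)
timeIt x = do
  start <- getCPUTime
  end   <- x `seq` getCPUTime  -- seq forces x before the second clock read
  return (x, fromIntegral (end - start) / 1e9)

main :: IO ()
main = do
  -- Stand-in workload; in the real comparison this would be the parse.
  (r, ms) <- timeIt (sum [1 .. 1000000 :: Int])
  printf "result = %d in %.3f ms\n" r ms
```

Unlike criterion this does no repeated sampling or outlier analysis, so it is only good for spotting order-of-magnitude differences like the ~100x one above.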

Long story, with more detailed evidence

Once the unwanted slowness was spotted, a drill-down analysis was carried out. Below I describe every step taken and its outcome.

Step 1: perf profiling

perf record -F 33 -g -e cycles:u -- $EXE +RTS -g2 -RTS was used to profile the executable ($EXE), built with the -O2 flag to ensure maximum optimization. The two profiles were then compared against each other. The following was observed in the slowed-down version of the code:

+   98,53%    98,53%  benchmark  benchmark               [.] ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc

Clearly, ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc stands for an instance Ord implementation, and the 98,53% means that 98,53% of all userspace cycles were spent executing it. That directed the investigation towards RTS-based flag instrumentation, to get a deeper insight into where these comparisons come from.
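
As an aside, the mangled name can be read mechanically: GHC z-encodes special characters in linker symbols (zm for '-', zi for '.', zd for '$', zu for '_', ZM/ZN for '[' and ']'). A rough decoder, covering only the substitutions this particular symbol needs (the full encoding table is larger):

```haskell
-- Minimal decoder for GHC's z-encoding; the tables below cover only
-- the pairs that occur in the perf symbol above, not the full scheme.
zDecode :: String -> String
zDecode ('z':c:rest) | Just d <- lookup c lowerTable = d : zDecode rest
zDecode ('Z':c:rest) | Just d <- lookup c upperTable = d : zDecode rest
zDecode (c:rest)     = c : zDecode rest
zDecode []           = []

lowerTable, upperTable :: [(Char, Char)]
lowerTable = [('z','z'), ('m','-'), ('i','.'), ('d','$'), ('u','_')]
upperTable = [('Z','Z'), ('M','['), ('N',']')]

main :: IO ()
main = putStrLn (zDecode "ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc")
-- prints: ghc-prim_GHC.Classes_$fOrd[]_$s$c
```

That is, a specialised compare from GHC.Classes in ghc-prim, consistent with the $fOrd[]_$s$ccompare1 that shows up in the Core dump later on.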

Step 2: RTS flags

$EXE +RTS -P -RTS produced a Haskell-native runtime profile, which was then visualized as a flame graph with the ghc-prof-flamegraph command-line utility.

Comparing the flame graphs clearly highlighted that parseLines :: HamletSettings -> String -> Result (Maybe NewlineStyle, HamletSettings, [(Int, Line)]) had started to eat up significantly more CPU time 1.

That pointed here (to Parsec itself) and to its changelog, where #127 popped up as a potential root cause. Reverting only #128 (which fixed #127) indeed rolls performance back to the expected level. To figure out what was happening, I decided to dump GHC's intermediate representation.

Step 3: GHC intermediate analysis

The problematic piece of code was built with -ddump-simpl -dsuppress-all -dno-suppress-type-signatures, and the resulting dumps were compared. It turned out that the misperforming version generates more code of the following pattern:

(case s2_aeCm of { State ds7_aeCD ds8_aeCE ds9_aeCF ->
                  case ds8_aeCE of ww10_s1l0j
                  { SourcePos ww11_s1l0k ww12_s1l0l ww13_s1l0m ->
                  case $fOrd[]_$s$ccompare1 ww4_s1l09 ww11_s1l0k of {
                    LT -> ...;
                    EQ -> ...;
                    GT -> ...;
                  }
                  }
                  })

and the $fOrd[]_$s$ccompare1 method is called about 8 times more often. Unfortunately, I lack knowledge of GHC internals and thus am unable to analyze the intermediate representation further.
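
One detail that may help interpret the symbol: $fOrd[]_$s$ccompare1 looks like a specialisation of compare for lists, and Parsec's SourcePos (Text.Parsec.Pos) keeps the file name, a plain String, as the first field of a type with a derived Ord, so every position comparison begins by comparing two lists of Char. A cut-down model of that instance:

```haskell
-- A stripped-down analogue of Parsec's SourcePos (Text.Parsec.Pos):
-- the derived Ord compares fields left to right, so the String name
-- is walked in full before line/column ever decide the ordering.
data SourcePos = SourcePos String Int Int  -- name, line, column
  deriving (Eq, Ord, Show)

main :: IO ()
main = do
  let a = SourcePos "templates/page.hamlet" 3 1
      b = SourcePos "templates/page.hamlet" 3 9
  -- Same file: the whole (equal) name is traversed, then columns decide.
  print (compare a b)  -- prints: LT
```

If the fix causes positions to be compared more eagerly or more often, this String-first ordering would make each extra comparison disproportionately expensive; that mechanism is speculation on my part, though, not something the Core dump alone proves.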

The conclusion

To my understanding, these 3 steps point to #127 as the root cause. Yet I fail to explain how that could happen. As of now, a hacky hotfix allows us to migrate to GHC 9.2.8 while avoiding the regression; however, I would like to open a discussion on how one could possibly fix it properly. The open questions are:

  1. How much confidence do these three steps deliver? Is that enough to consider (>>=) leaks memory #127 the root cause, or should further steps be taken?
  2. What would be a better way to study the output of the -ddump-simpl -dsuppress-all -dno-suppress-type-signatures flags, and can it help make the issue even more concrete?

P.S.

At the moment, the code involved in this instrumentation is under an NDA, so I intentionally do not share it due to legal restrictions. I hope the observations above are sufficient for this discussion to progress.

Footnotes

  1. shakespeare-2.1.0 is running in production.
