Description
To whom it may concern...
Short story
An attempt to migrate our production code to LTS-20.26 (GHC 9.2.8) revealed a significant performance regression.
Without the fix for #127:

```
time                 9.187 ms   (9.138 ms .. 9.250 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 9.177 ms   (9.138 ms .. 9.229 ms)
std dev              127.5 μs   (81.71 μs .. 204.6 μs)
```
...and with it:

```
time                 1.052 s    (1.018 s .. 1.089 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 1.052 s    (1.044 s .. 1.061 s)
std dev              9.869 ms   (4.780 ms .. 13.74 ms)
```
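For context, the numbers above are `criterion` output. A minimal sketch of the kind of harness that produces them; `parseTemplate` and `input` are hypothetical stand-ins, since the real parser and input are under NDA:

```haskell
import Criterion.Main (bench, defaultMain, nf)

-- Hypothetical stand-in for the real (NDA-covered) parsing entry point.
parseTemplate :: String -> Either String [String]
parseTemplate = Right . lines

-- Hypothetical stand-in for the real template input.
input :: String
input = unlines (replicate 1000 "some template line")

main :: IO ()
main = defaultMain
  [ bench "parseTemplate" (nf parseTemplate input) ]
```

`nf` forces the result to normal form on every iteration, so the timings reflect the full parse rather than a lazily deferred thunk.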
Based on these observations (and many more, shared below), I tend to conclude that the code affected by #127 runs roughly 100 times slower.
Long story and more detailed evidence
Once the slowdown was spotted, a drill-down analysis was carried out. Below I describe every step taken and its outcome.
Step 1: `perf` profiling

```
perf record -F 33 -g -e cycles:u -- $EXE +RTS -g2 -RTS
```

was used to profile the executable (`$EXE`) built with the `-O2` flag to ensure maximum optimization. The two profiles were then compared against each other. The following was observed for the slowed-down version of the code:
```
+   98,53%  98,53%  benchmark  benchmark  [.] ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc
```
Clearly, `ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc` is the z-encoded name of an `Ord` instance implementation, and `98,53%` means that 98.53% of all userspace cycles were spent executing it. That directed the investigation towards `RTS`-flag-based instrumentation, to get deeper insight into where these comparisons come from.
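For completeness, the recorded data was drilled into roughly as follows (a sketch; per-instruction attribution depends on debug info being present in the build):

```shell
# Hottest symbols with call-graph attribution, from the perf.data
# produced by the `perf record` invocation above.
perf report --stdio --sort symbol | head -n 20

# Per-instruction cycle attribution for the hot Ord method.
perf annotate --stdio ghczmprim_GHCziClasses_zdfOrdZMZNzuzdszdc
```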
Step 2: `RTS` flags
```
$EXE +RTS -P -RTS
```

produced a Haskell-native runtime profile, which was then visualized as a flame graph with the `ghc-prof-flamegraph` command-line utility.
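The flame-graph step, roughly (a sketch assuming a cabal-based build; the actual build invocation is under NDA):

```shell
# Build with profiling enabled so that +RTS -P emits a cost-centre profile.
cabal build --enable-profiling benchmark

# Run the benchmark; this writes benchmark.prof in the working directory.
$EXE +RTS -P -RTS

# Render the cost-centre profile as a flame graph
# (exact invocation may vary with the ghc-prof-flamegraph version).
ghc-prof-flamegraph benchmark.prof
```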
Comparing the flame graphs clearly highlighted that `parseLines :: HamletSettings -> String -> Result (Maybe NewlineStyle, HamletSettings, [(Int, Line)])` started to eat up significantly more CPU time.[^1]
That pointed here (to parsec itself) and to its changelog, where #127 popped up as a potential root cause. Reverting #128 alone (the change that fixed #127) indeed rolls performance back to the expected level. To figure out what was happening, the next step was to dump GHC's intermediate representation.
Step 3: GHC intermediate representation analysis
The problematic piece of code was built with `-ddump-simpl -dsuppress-all -dno-suppress-type-signatures`, and the dumps for the two versions were compared. It turned out that the misperforming version generates more code of the following pattern:
```
(case s2_aeCm of { State ds7_aeCD ds8_aeCE ds9_aeCF ->
   case ds8_aeCE of ww10_s1l0j
   { SourcePos ww11_s1l0k ww12_s1l0l ww13_s1l0m ->
     case $fOrd[]_$s$ccompare1 ww4_s1l09 ww11_s1l0k of {
       LT -> ...;
       EQ -> ...;
       GT -> ...;
     }
   }
 })
```
with the `$fOrd[]_$s$ccompare1` method called about 8 times more often. Unfortunately, I lack knowledge of Haskell internals and thus am unable to analyze the intermediate representation further.
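For intuition, `$fOrd[]_$s$ccompare1` is a specialised `compare` for lists, which in this Core appears to be applied to the `String` source-name field of `SourcePos`. A minimal model, assuming the real `SourcePos` derives `Ord` in the usual field-by-field way:

```haskell
-- Minimal stand-in for parsec's Text.Parsec.Pos.SourcePos (assumption:
-- the real type derives Ord, so compare works left-to-right over fields).
data SourcePos = SourcePos String Int Int   -- source name, line, column
  deriving (Eq, Ord, Show)

main :: IO ()
main =
  -- The derived compare walks the source-name String character by
  -- character before it ever inspects the line or column, so every
  -- position comparison pays for a String comparison first.
  print (compare (SourcePos "template.hamlet" 1 1)
                 (SourcePos "template.hamlet" 2 1))
  -- prints LT (names are equal, so line 1 < line 2 decides)
```

If that assumption holds, a hot loop that compares positions (rather than, say, just line numbers) would naturally show up as time spent in a list-`compare` specialisation.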
The conclusion
To my understanding, these three steps point to #127 as the root cause. Yet I fail to explain how that could happen. As of now, a hacky hotfix allows us to migrate to GHC 9.2.8 while avoiding the regression; however, I would like to open a discussion on how one could properly fix it. The open questions are:
- How much confidence do these three steps deliver? Is that enough to consider "(>>=) leaks memory" (#127) the root cause, or should more steps be taken?
- What would be a better way to study the output of the `-ddump-simpl -dsuppress-all -dno-suppress-type-signatures` flags, and can it help make the issue even more concrete?
P.S.
At the moment, the code involved in the instrumentation is under an NDA, so I intentionally do not share it due to legal restrictions. I hope the observations above will be sufficient for this discussion to progress.
[^1]: `shakespeare-2.1.0` is what runs in production.