Fix laziness while streaming blocks #2532
Conversation
Great find and nice graphs to show it! Strange that we hadn't noticed that before 🤔

I actually think the cause of this leak is not that obvious, and that adding that bang is not the ideal solution, because to me it doesn't make sense that that bang fixes it. Let me explain: we replay blocks using […]. Note the […], which calls […]. Given the implementation of […], I am convinced that there is no reason for that bang to be there. Instead, I believe (Edsko is not entirely convinced of this) that this is due to a GHC bug! We first described it in IntersectMBO/ouroboros-consensus#741. See the following two workarounds we have in place for it: […]. And in this instance, we can apply the same workaround, i.e., remove the extra bang from this PR and add an INLINE pragma and a comment.

This might also explain why we hadn't noticed it before: I believe it's a bug in the strictness analyser that can be triggered by changes in the related code. So when we made some changes to that module, which shouldn't have affected the evaluation of the computation, we might have accidentally triggered the bug from that moment on. Note that this is a very similar piece of code to the other two examples: we're updating a field of a record!

I'm hoping that this bug is fixed in newer GHC versions 🤞 (if I want to report it, I should also reproduce it first with the most recent release). We're slowly working towards switching to a newer GHC version (e.g., IntersectMBO/cardano-prelude#121, IntersectMBO/cardano-base#157).
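A minimal sketch of the code shape under discussion (the record and field names here are illustrative, not the real consensus types): a strict field updated with record-update syntax. Because the field is strict, forcing the record already forces the new field value, which is why an extra bang on the update ought to be redundant.

```haskell
-- Illustrative sketch only: a record with strict fields, updated via
-- record-update syntax, mirroring the code shape discussed above.
data LedgerDB = LedgerDB
  { ledgerState :: !Int  -- strict field: forcing the record forces this
  , tipSlot     :: !Int
  }

-- Updating a strict field. Forcing the result of this function forces
-- the new ledgerState, so an extra bang on the new value should not be
-- needed for strictness.
applyBlock :: Int -> LedgerDB -> LedgerDB
applyBlock blk db = db { ledgerState = ledgerState db + blk }
```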
I first added this bang pattern to the argument […]. I found a very similar issue here: https://www.reddit.com/r/haskell/comments/3hvrm9/a_haskell_space_leak/cubp2on
Yeah, that's fighting the symptom. As you say, it's weird since it's a strict field.
It shouldn't matter that […] is lazy. For example:

```haskell
-- | Lazy in its argument
f :: Int -> Int
f x = x + 1
```

```
-- Strict in the result: it doesn't matter that we're calling a lazy function,
-- as long as forcing the result means that we evaluate all thunks in it
> let !y = f undefined in True
*** Exception: Prelude.undefined
```
Bingo, this confirms that the bang is not needed and fits with my theory that it's a GHC bug.
I'm not sure that's the same thing: they refer to […].
Actually, I compiled a new clone without any profiling flags and I don't see any big memory usage with top. This makes me think that profiling somehow affects laziness in this case. If I'm not mistaken, this shouldn't happen and is a GHC bug. I was wondering if we can make some changes to ensure we don't depend on any GHC optimizations/bugs here, since the result can be pretty ugly.
Ideally, yes, but I don't know what those changes would be. It would be great if a newer GHC version just fixed it; then we wouldn't need to do anything. In the meantime, I suggest we do what I have done for the other two instances of this issue: add an INLINE pragma and a comment.
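The workaround described here, in the style of the other two instances, might look roughly like this (a sketch with illustrative names, not the actual consensus code):

```haskell
-- Illustrative sketch of the workaround: no extra bang, but an INLINE
-- pragma plus a comment pointing at the suspected GHC bug.
data LedgerDB = LedgerDB
  { ledgerState :: !Int  -- strict field
  }

-- NOTE: no bang on the new field value; the strict field already
-- guarantees it is forced when the record is forced. The INLINE pragma
-- works around the suspected GHC strictness-analyser bug
-- (IntersectMBO/ouroboros-consensus#741).
applyBlock :: Int -> LedgerDB -> LedgerDB
applyBlock blk db = db { ledgerState = ledgerState db + blk }
{-# INLINE applyBlock #-}
```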
Force-pushed from 81d4107 to 6aba584.
bors merge
Yikes :( It has unfortunately been true for many years that compiling with profiling can interfere with some optimizations. (My general assumption is that the special nodes that profiling adds to the AST spoil some tree patterns that would otherwise trigger various optimizations.) Edit: SPJ said "Profiling disables all sorts of optimisations. Think of it as compiling without -O." on a ticket 9 years ago. I didn't easily find an obvious "profiling disables optimizations" GHC ticket :( -- in particular, I haven't found a list of optimizations that usually suffer. The description of http://hackage.haskell.org/package/ghc-events-analyze alludes to this.
So if you find a similar enormous space leak, it may be worthwhile to first attempt to reproduce similar stats without profiling enabled.
To make matters worse, my initial graphs from the .hp format file were somewhat distorted towards the left compared to my first graph above (https://user-images.githubusercontent.com/11467473/90975548-d951bd80-e53d-11ea-96c5-4809c0cee53e.png), which is from the GHC eventlog. This misleadingly made the peak appear within a couple of seconds of the epoch boundary. I'm not sure why the .hp file was wrong; my wild guess is that some time dedicated to profiling is subtracted from the real time, in an attempt to make it more realistic.
2548: Port #2525 and #2531 to master r=mrBliss a=mrBliss

#2525 and #2531 were merged into the https://github.com/input-output-hk/ouroboros-network/tree/coot/connection-manager branch. However, that branch won't be merged for another few weeks. In the meantime, the merge conflicts with the changes made in those two PRs will accumulate. For example, I know that #2546 will cause a merge conflict. Now that @coot has kindly merged #2541 (which those PRs depended on) into master, we can port #2525 and #2531 to master and avoid the upcoming merge conflicts. After this PR is merged, the `coot/connection-manager` branch should be rebased onto master.

Co-authored-by: Thomas Winant <thomas@well-typed.com>
Co-authored-by: Nicolas Frisby <nick.frisby@iohk.io>
2532: Fix laziness while streaming blocks r=mrBliss a=kderme

Recently we noticed a big space leak, where GHC profiling indicated a peak of >2GB PINNED memory used and ps/top reported >4GB memory used.

![img1](https://user-images.githubusercontent.com/11467473/90975548-d951bd80-e53d-11ea-96c5-4809c0cee53e.png)

Checking with noUnexpectedThunks in the `LedgerDB`, this was reported:

```
"ExtLedgerState (HardForkBlock (': * ByronBlock (': * (ShelleyBlock TPraosStandardCrypto) ('[] *))))","LedgerDB"], unexpectedThunkCallStack = [("$dmnoUnexpectedThunks",SrcLoc {srcLocPackage = "ouroboros-consensus-0.1.0.0-inplace", srcLocModule = "Ouroboros.Consensus.Ledger.Extended", srcLocFile = "src/Ouroboros/Consensus/Ledger/Extended.hs", srcLocStartLine = 76, srcLocStartCol = 10, srcLocEndLine = 76, srcLocEndCol = 79}),("noUnexpectedThunks",SrcLoc {srcLocPackage = "cardano-prelude-0.1.0.0-inplace", srcLocModule = "Cardano.Prelude.GHC.Heap.NormalForm.Classy", srcLocFile = "src/Cardano/Prelude/GHC/Heap/NormalForm/Classy.hs", srcLocStartLine = 424, srcLocStartCol = 44, srcLocEndLine = 424, srcLocEndCol = 70})], unexpectedThunkClosure = Just "ThunkClosure {info = StgInfoTable {entry = Nothing, ptrs = 4, nptrs = 0, tipe = THUNK, srtlen = 81781936, code = Nothing}, ptrArgs = [0x000000420daeea98,0x000000420daeeb10,0x00000042119f8c68,0x000000420250cb38], dataArgs = []}"}
```

The leak appears when we open the ChainDB and stream blocks from the Immutable DB to the Ledger DB for validation. I noticed that the older the ledger snapshot we start from, the bigger the memory peak. What actually happens is that, because of laziness, the validation is basically split into two parts:

- read every block from the DB (leading to a big memory peak) and do all the other actions, like logging
- apply each block to the ledger DB, which is a pure procedure

Only after the ledger DB application starts is the GC free to collect the blocks, and the PINNED memory starts to decrease after the peak. This is more apparent in this graph:

![img2](https://user-images.githubusercontent.com/11467473/90975697-4d409580-e53f-11ea-9952-7ebc06fb1595.png)

`compute`, which does all the ledger transitions, fires up only after all the PINNED memory has been allocated. The fix forces the ledger transitions to happen strictly. The new memory result:

![img3](https://user-images.githubusercontent.com/11467473/90975761-c5a75680-e53f-11ea-9c41-4844e8b2f4b1.png)

The PINNED memory is quickly garbage collected and never goes above 12MB:

![img4](https://user-images.githubusercontent.com/11467473/90975812-2fbffb80-e540-11ea-9e0c-e7aa99ef93c1.png)

I'll come back with some more details on this.

Co-authored-by: kderme <k.dermenz@gmail.com>
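The two-phase behaviour described above can be sketched with a toy fold, assuming a much-simplified "apply block to ledger" step: a lazy left fold retains every block in a chain of thunks before any application runs, while a strict fold (which is what forcing the transitions amounts to) lets each block be collected as soon as it has been applied.

```haskell
import Data.List (foldl')

-- Much-simplified stand-in for applying a block to the ledger state.
applyBlock :: Int -> Int -> Int
applyBlock ledger blk = ledger + blk

-- Lazy replay: builds ((0 `applyBlock` b1) `applyBlock` b2) ... as a
-- chain of unevaluated thunks, retaining every block until the final
-- result is forced -- the "read everything first" phase.
replayLazy :: [Int] -> Int
replayLazy = foldl applyBlock 0

-- Strict replay: forces the ledger after every block, mirroring the
-- fix; each block can be garbage collected as soon as it is applied.
replayStrict :: [Int] -> Int
replayStrict = foldl' applyBlock 0
```

Both folds compute the same result; only the residency profile differs, which matches the graphs above.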