Polygon: OOM Rejected Headers #11387
Comments
This was our issue before with Erigon in version 2.59.3: #10873
similar to #10734
@MrFreezeDZ hey, to help us troubleshoot this can you please:
This branch will try to capture the heap profile when near OOM, when we face the stage headers "Rejected header marked as bad" situation, and will save it to a file. When this issue re-occurs, please send us that file, or you can send us a png generated from it.
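For context, a minimal sketch of how a heap profile can be captured and written to a file in Go with `runtime/pprof`; the trigger threshold, file name, and limit here are illustrative assumptions, not the actual code in the branch:

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// maybeSaveHeapProfile writes a heap profile to path once heap allocations
// exceed the given fraction of limitBytes. The threshold and path handling
// are illustrative; the branch referenced above may differ.
func maybeSaveHeapProfile(path string, limitBytes uint64, fraction float64) error {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	if float64(m.HeapAlloc) < float64(limitBytes)*fraction {
		return nil // not close enough to the limit yet
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	runtime.GC() // refresh allocation statistics before dumping
	return pprof.WriteHeapProfile(f)
}

func main() {
	// Example: dump once heap alloc reaches 45% of a 208 GiB limit.
	const limit = 208 << 30
	if err := maybeSaveHeapProfile("erigon-heap.prof", limit, 0.45); err != nil {
		log.Fatal(err)
	}
}
```

The resulting file can then be inspected with `go tool pprof`, for example `go tool pprof -png erigon-heap.prof > heap.png` (requires graphviz) to produce a png like the one requested above.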
@MrFreezeDZ this change is now in the branch above. Were you able to run Erigon with it?
Hi, I just came back from holidays :)
Recently I saw:
@taratorio so, to reproduce - you just need something like this: #11799
Hi @taratorio, today we had the issue again. In the logs we saw the following:
We saw a lot of CPU usage; I think this is due to trying to write the heap dump.
@MrFreezeDZ hey, thank you for trying it out. I've added the ability for you to specify the file path via another env var. Can you please run that branch?
@taratorio thank you for the new environment variable. Right now we use the images from https://hub.docker.com/r/thorax/erigon/tags . It would be easier for me if the variable were available in a released image version. Until then I will try to find a way to use the branch in our CD.
@MrFreezeDZ separate topic but fyi, the official Erigon docker images are no longer at thorax but at https://hub.docker.com/r/erigontech/erigon. I think I may have been able to reproduce an OOM with Alex's help. I'm investigating a fix. |
relates to #11387 (comment) port of #12400 to E2
@MrFreezeDZ I fixed an OOM. Would appreciate your help to run the branch if possible; otherwise this issue may be prolonged if it turns out you are suffering from a different OOM.
…12404) relates to #11387, #11473, #10734

Tried to simulate the OOM using #11799. What I found was an infinitely growing allocation of headers when receiving new header messages in sentry's `blockHeaders66` handler (check the screenshot below). It looks like this is happening because, in the case of a bad child header, we delete it from the `links` map; however, its parent link still holds a reference to it, so the deleted link & header never get gc-ed. Furthermore, if new similar bad hashes arrive after deletion, they get appended to their parent header's link, and the children of that link can grow indefinitely ([here](https://github.com/erigontech/erigon/blob/main/turbo/stages/headerdownload/header_algos.go#L1085-L1086)). Confirmed with debug logs (note the link at 13450415 has 140124 children):

```
DBUG[10-21|18:18:05.003] [downloader] InsertHeader: Rejected header parent info hash=0xb488d67deaf4103880fa972fd72a7a9be552e3bc653f124f1ad9cb45f36bcd07 height=13450415 children=140124
```

The solution is to remove the bad link from its parent's child list ([here](https://github.com/erigontech/erigon/blob/main/turbo/stages/headerdownload/header_algos.go#L544)) so that 1) it gets gc-ed and 2) the children list does not grow indefinitely.

![oom-heap-profile2](https://github.com/user-attachments/assets/518fa658-c199-48b6-aa2d-110673264144)
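To make the leak described above concrete, here is a simplified, self-contained sketch; the real types live in `turbo/stages/headerdownload`, and the struct and function names below are trimmed placeholders, not the actual Erigon code:

```go
package main

import "fmt"

// Link is a trimmed stand-in for the header-download link type: each link
// keeps a pointer to its parent and a list of child links.
type Link struct {
	hash     string
	parent   *Link
	children []*Link
}

// removeBadLink deletes a link from the links map AND detaches it from its
// parent's children slice. Without the second step the parent keeps a live
// reference, so the bad link is never garbage collected, and the children
// slice can grow without bound as similar bad hashes keep arriving.
func removeBadLink(links map[string]*Link, bad *Link) {
	delete(links, bad.hash)
	if p := bad.parent; p != nil {
		for i, c := range p.children {
			if c == bad {
				p.children = append(p.children[:i], p.children[i+1:]...)
				break
			}
		}
	}
	bad.parent = nil
}

func main() {
	parent := &Link{hash: "parent"}
	bad := &Link{hash: "bad", parent: parent}
	parent.children = append(parent.children, bad)
	links := map[string]*Link{"parent": parent, "bad": bad}

	removeBadLink(links, bad)
	fmt.Println(len(links), len(parent.children)) // 1 0
}
```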
…(#11551) cherry-pick 2a98f6aa53ccd558543bc95ffe9bf0fad4ef278f for E2

relates to: erigontech/erigon#10734, erigontech/erigon#11387

Restart Erigon with the `SAVE_HEAP_PROFILE = true` env variable and wait until we reach 45% or more alloc in stage_headers when "noProgressCounter >= 5" or "Rejected header marked as bad".
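For reference, a minimal sketch of how such env-var switches could gate the profile capture. `SAVE_HEAP_PROFILE` comes from the commit message above; `HEAP_PROFILE_FILE_PATH` and the default path are placeholders I am assuming for illustration, not necessarily the names used in the branch:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// heapProfileConfig reads the env-var switches discussed in this issue.
// HEAP_PROFILE_FILE_PATH is a placeholder name for the second variable
// that lets the operator choose where the dump is written.
func heapProfileConfig() (enabled bool, path string) {
	enabled = os.Getenv("SAVE_HEAP_PROFILE") == "true"
	path = os.Getenv("HEAP_PROFILE_FILE_PATH")
	if path == "" {
		path = filepath.Join(os.TempDir(), "erigon-heap.prof")
	}
	return enabled, path
}

func main() {
	enabled, path := heapProfileConfig()
	fmt.Printf("save heap profile: %v, path: %s\n", enabled, path)
}
```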
System information
Erigon version from the logs:
OS & Version: Kubernetes, Image from here: https://hub.docker.com/r/thorax/erigon
The Erigon container has resource limits of 48 CPUs and 208GiB Memory.
Erigon Command (with flags/config):
Consensus Layer:
Heimdall 1.0.7
Consensus Layer Command (with flags/config):
/usr/bin/heimdalld start --home=/heimdall-home
Chain/Network:
bor-mainnet
Expected behaviour
When Erigon realizes that a header is rejected and the same header keeps appearing again and again, it should unwind itself by some blocks, I guess.
Actual behaviour
Erigon starts spamming log messages and uses more memory than the container's memory limit. Kubernetes then OOMKills the container and the container is restarted. After Erigon's restart the same messages occur until the next OOMKill.
Erigon logs lines similar to this, with only a few milliseconds between each line, until it is OOMKilled:
Steps to reproduce the behaviour
I do not know how to actively reproduce this behavior. It occurred on the last three weekends.
Backtrace
These are just the logs copied from Google's Logs Explorer.
Here one can see the restart and the beginning of the "rejected header" logs.