Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Polygon: OOM Rejected Headers #11387

Closed
MrFreezeDZ opened this issue Jul 29, 2024 · 12 comments
Closed

Polygon: OOM Rejected Headers #11387

MrFreezeDZ opened this issue Jul 29, 2024 · 12 comments
Assignees
Labels
imp1 High importance polygon

Comments

@MrFreezeDZ
Copy link

System information

Erigon version from the logs:

{
git_branch: "heads/v2.60.1"
git_commit: "9471c44e04f43a735dc9f40b74f552d706d05f78"
git_tag: "v2.60.1-dirty"
lvl: "info"
msg: "Build info"
t: "2024-07-29T02:30:51.285312853Z"
}

OS & Version: Kubernetes, Image from here: https://hub.docker.com/r/thorax/erigon
The Erigon container has resource limits of 48 CPUs and 208GiB Memory.

Erigon Command (with flags/config):

erigon --chain=bor-mainnet
--datadir=/data/
--log.json=true
--http.addr=0.0.0.0
--http.vhosts=*
--http
--http.api=eth,admin,debug,net,trace,web3,erigon,txpool
--ws
--bor.heimdall=http://heimdallrest
--authrpc.vhosts=*
--authrpc.jwtsecret=/secret/jwt.hex
--authrpc.addr=0.0.0.0
--db.size.limit=7TB
--db.pagesize=16k
--metrics
--metrics.addr=0.0.0.0
--prune.h.before=48896162
--prune.r.before=48896162
--prune.t.before=48896162
--prune.c.before=48896162
--maxpeers=500
--torrent.download.rate=300mb
--bootnodes=enode://bdcd4786a616a853b8a041f53496d853c68d99d54ff305615cd91c03cd56895e0a7f6e9f35dbf89131044e2114a9a782b792b5661e3aff07faf125a98606a071@43.200.206.40:30303,enode://209aaf7ed549cf4a5700fd833da25413f80a1248bd3aa7fe2a87203e3f7b236dd729579e5c8df61c97bf508281bae4969d6de76a7393bcbd04a0af70270333b3@54.216.248.9:30303

Consensus Layer:
Heimdall 1.0.7

Consensus Layer Command (with flags/config):
/usr/bin/heimdalld start --home=/heimdall-home

Chain/Network:
bor-mainnet

Expected behaviour

When Erigon realizes that a header is rejected an the same header will be present again and again, it should unwind itself some blocks, I guess.

Actual behaviour

Erigon starts spamming log messages and will use more memory then the container has as memory limit. Then Kubernetes will OOMKill the container and the Container is restarted. After Erigon's restart the same messages will occur until the next OOMKill.
Erigon logs lines similar to this with only a few milliseconds between each line until it is OOMKilled:

{hash: 0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e, height: 59891025, lvl: warn, msg: [downloader] Rejected header marked as bad, t: 2024-07-29T07:57:58.985590688Z}
{hash: 0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e, height: 59891025, lvl: warn, msg: [downloader] Rejected header marked as bad, t: 2024-07-29T07:57:58.99246926Z}
{hash: 0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e, height: 59891025, lvl: warn, msg: [downloader] Rejected header marked as bad, t: 2024-07-29T07:57:58.994019486Z}
{hash: 0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e, height: 59891025, lvl: warn, msg: [downloader] Rejected header marked as bad, t: 2024-07-29T07:57:58.99991692Z}
{hash: 0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e, height: 59891025, lvl: warn, msg: [downloader] Rejected header marked as bad, t: 2024-07-29T07:57:59.001144471Z}
{hash: 0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e, height: 59891025, lvl: warn, msg: [downloader] Rejected header marked as bad, t: 2024-07-29T07:57:59.002410234Z}
{hash: 0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e, height: 59891025, lvl: warn, msg: [downloader] Rejected header marked as bad, t: 2024-07-29T07:57:59.003713997Z}
{hash: 0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e, height: 59891025, lvl: warn, msg: [downloader] Rejected header marked as bad, t: 2024-07-29T07:57:59.004939701Z}
{hash: 0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e, height: 59891025, lvl: warn, msg: [downloader] Rejected header marked as bad, t: 2024-07-29T07:57:59.007611193Z}
{hash: 0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e, height: 59891025, lvl: warn, msg: [downloader] Rejected header marked as bad, t: 2024-07-29T07:57:59.025658567Z}
{hash: 0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e, height: 59891025, lvl: warn, msg: [downloader] Rejected header marked as bad, t: 2024-07-29T07:57:59.033892633Z}
{hash: 0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e, height: 59891025, lvl: warn, msg: [downloader] Rejected header marked as bad, t: 2024-07-29T07:57:59.034955129Z}

Steps to reproduce the behaviour

I do not know how to actively reproduce this behavior. It occurred on the last three weekends.

Backtrace

These are just the logs copied from Googles LogExplorer.
Here one can see the restart and the beginning of the "rejected header" logs.

ERROR 2024-07-29T02:30:51.285192185Z {"file prefix":"erigon", "json":false, "log dir":"/data/logs", "log level":"info", "lvl":"info", "msg":"logging to file system", "t":"2024-07-29T02:30:51.284924089Z"}
ERROR 2024-07-29T02:30:51.285360559Z {"lvl":"info", "msg":"Enabling metrics export to prometheus", "path":"http://0.0.0.0:6060/debug/metrics/prometheus", "t":"2024-07-29T02:30:51.285281497Z"}
ERROR 2024-07-29T02:30:51.285380361Z {"git_branch":"heads/v2.60.1", "git_commit":"9471c44e04f43a735dc9f40b74f552d706d05f78", "git_tag":"v2.60.1-dirty", "lvl":"info", "msg":"Build info", "t":"2024-07-29T02:30:51.285312853Z"}
ERROR 2024-07-29T02:30:51.285388465Z {"lvl":"info", "msg":"Starting Erigon on Bor Mainnet...", "t":"2024-07-29T02:30:51.285341217Z"}
ERROR 2024-07-29T02:30:51.294152120Z {"ETH":500, "lvl":"info", "msg":"Maximum peer count", "t":"2024-07-29T02:30:51.294044065Z", "total":500}
ERROR 2024-07-29T02:30:51.294231859Z {"APIs":"eth,admin,debug,net,trace,web3,erigon,txpool", "lvl":"info", "msg":"starting HTTP APIs", "port":8545, "t":"2024-07-29T02:30:51.294176693Z"}
DEFAULT 2024-07-29T02:30:51.294631252Z {"level":"WRN", "lvl":"info", "msg":"torrent verbosity", "t":"2024-07-29T02:30:51.294577647Z"}
ERROR 2024-07-29T02:30:53.398830626Z {"cap":5.0E7, "lvl":"info", "msg":"Set global gas cap", "t":"2024-07-29T02:30:53.398566491Z"}
ERROR 2024-07-29T02:30:53.400168956Z {"download.rate":"300mb", "ipv4-enabled":true, "ipv6-enabled":true, "lvl":"info", "msg":"[Downloader] Running with", "t":"2024-07-29T02:30:53.40006929Z", "upload.rate":"4mb"}
ERROR 2024-07-29T02:30:53.402730340Z {"label":"chaindata", "lvl":"info", "msg":"Opening Database", "path":"/data/chaindata", "t":"2024-07-29T02:30:53.40262083Z"}
ERROR 2024-07-29T02:30:53.413283853Z {"label":"chaindata", "lvl":"info", "msg":"[db] open", "pageSize":16384, "sizeLimit":"7TB", "t":"2024-07-29T02:30:53.413165095Z"}
ERROR 2024-07-29T02:30:53.445312770Z {"lvl":"warn", "msg":"Sanitizing invalid bor miner gas price", "provided":"1000000000", "t":"2024-07-29T02:30:53.445187632Z", "updated":"30000000000"}
ERROR 2024-07-29T02:30:53.445325789Z {"lvl":"warn", "msg":"Sanitizing invalid bor min fee cap", "provided":1, "t":"2024-07-29T02:30:53.445247681Z", "updated":3.0E10}
ERROR 2024-07-29T02:30:53.445390269Z {"config":"{ChainID: 137, Homestead: 0, DAO: <nil>, Tangerine Whistle: 0, Spurious Dragon: 0, Byzantium: 0, Constantinople: 0, Petersburg: 0, Istanbul: 3395000, Muir Glacier: 3395000, Berlin: 14750000, London: 23850000, Arrow Glacier: <nil>, Gray Glacier: <nil>, Terminal Total Difficulty: <nil>, Merge Netsplit: <nil>, Shanghai: <nil>, Cancun: <nil>, Prague: <nil>, Osaka: <nil>, Engine: bor, NoPruneContracts: map[]}", "genesis":"0xa9c28ce2141b56c474f1dc504bee9b01eb1bd7d1a507580d5519d4437a97de1b", "lvl":"info", "msg":"Initialised chain configuration", "t":"2024-07-29T02:30:53.445300789Z"}
ERROR 2024-07-29T02:31:47.820764791Z {"lvl":"info", "msg":"Initialising Ethereum protocol", "network":137, "t":"2024-07-29T02:31:47.820580651Z"}
ERROR 2024-07-29T02:31:47.820787601Z {"label":"bor", "lvl":"info", "msg":"Opening Database", "path":"/data/bor", "t":"2024-07-29T02:31:47.820691593Z"}
ERROR 2024-07-29T02:31:55.694311894Z mdbx_setup_dxb:16114 opening after an unclean shutdown, but boot-id(847d4bd58b8c010e-ac661d66623646f0) is MATCH: rollback NOT needed, steady-sync NEEDED
ERROR 2024-07-29T02:31:55.697926161Z {"lvl":"info", "msg":"Starting private RPC server", "on":"127.0.0.1:9090", "t":"2024-07-29T02:31:55.6978114Z"}
ERROR 2024-07-29T02:31:55.698055581Z {"lvl":"info", "msg":"new subscription to logs established", "t":"2024-07-29T02:31:55.69792753Z"}
ERROR 2024-07-29T02:31:55.698176089Z {"lvl":"info", "msg":"rpc filters: subscribing to Erigon events", "t":"2024-07-29T02:31:55.698102688Z"}
ERROR 2024-07-29T02:31:55.698284592Z {"lvl":"info", "msg":"New txs subscriber joined", "t":"2024-07-29T02:31:55.698193201Z"}
ERROR 2024-07-29T02:31:55.698316791Z {"lvl":"info", "msg":"new subscription to newHeaders established", "t":"2024-07-29T02:31:55.698262819Z"}
ERROR 2024-07-29T02:31:55.699142531Z {"grpc":false, "http.url":"[::]:8545", "lvl":"info", "msg":"JsonRpc endpoint opened", "t":"2024-07-29T02:31:55.699027753Z", "websocket.url":"[::]:8546", "ws":true, "ws.compression":true}
ERROR 2024-07-29T02:31:55.704554504Z {"lvl":"info", "msg":"[txpool] Started", "t":"2024-07-29T02:31:55.704457876Z"}
ERROR 2024-07-29T02:31:55.707848100Z mdbx_setup_dxb:16114 opening after an unclean shutdown, but boot-id(847d4bd58b8c010e-ac661d66623646f0) is MATCH: rollback NOT needed, steady-sync NEEDED
ERROR 2024-07-29T02:31:55.713259023Z {"err":"no UPnP or NAT-PMP router discovered", "lvl":"warn", "msg":"NAT ExternalIP resolution has failed, try to pass a different --nat option", "t":"2024-07-29T02:31:55.713128377Z"}
ERROR 2024-07-29T02:31:55.714388499Z {"lvl":"info", "msg":"Started P2P networking", "name":"erigon/v2.60.1-9471c44e/linux-amd64/go1.21.5", "self":"enode://6b7469e82e1699591029807cdce3cd372098340f122de38c912cfeb8ff2a35839cb22d99f6593998fadf4fa733e4c12b104367894506bcf55bf9bb8cf815c469@127.0.0.1:30304", "t":"2024-07-29T02:31:55.714288741Z", "version":67}
ERROR 2024-07-29T02:31:55.715816455Z mdbx_setup_dxb:16114 opening after an unclean shutdown, but boot-id(847d4bd58b8c010e-ac661d66623646f0) is MATCH: rollback NOT needed, steady-sync NEEDED
ERROR 2024-07-29T02:31:55.721069959Z {"err":"no UPnP or NAT-PMP router discovered", "lvl":"warn", "msg":"NAT ExternalIP resolution has failed, try to pass a different --nat option", "t":"2024-07-29T02:31:55.720950266Z"}
ERROR 2024-07-29T02:31:55.722180434Z {"lvl":"info", "msg":"Started P2P networking", "name":"erigon/v2.60.1-9471c44e/linux-amd64/go1.21.5", "self":"enode://6b7469e82e1699591029807cdce3cd372098340f122de38c912cfeb8ff2a35839cb22d99f6593998fadf4fa733e4c12b104367894506bcf55bf9bb8cf815c469@127.0.0.1:30303", "t":"2024-07-29T02:31:55.722074372Z", "version":68}
ERROR 2024-07-29T02:31:55.760600630Z {"lvl":"info", "msg":"[1/15 Snapshots] Requesting downloads", "t":"2024-07-29T02:31:55.760444836Z"}
ERROR 2024-07-29T02:31:59.071241522Z {"alloc":"4.3GB", "blocks":"59800k", "indices":"59800k", "lvl":"info", "msg":"[snapshots:download] Stat", "sys":"6.9GB", "t":"2024-07-29T02:31:59.071057632Z"}
ERROR 2024-07-29T02:31:59.071829784Z {"from":5.9891024E7, "lvl":"info", "msg":"[2/15 Headers] Waiting for headers...", "t":"2024-07-29T02:31:59.071719647Z"}
ERROR 2024-07-29T02:32:15.544290838Z {"age":"8s", "blk/sec":2516, "headers":41460, "highest":5.9932484E7, "in":16.472392297, "lvl":"info", "msg":"[2/15 Headers] Processed", "t":"2024-07-29T02:32:15.544109115Z"}
ERROR 2024-07-29T02:32:15.545458573Z {"from":9359, "lvl":"info", "msg":"[3/15 BorHeimdall] Processing spans...", "t":"2024-07-29T02:32:15.54535721Z", "to":9365}
ERROR 2024-07-29T02:32:15.578701931Z {"from":5.9891025E7, "lvl":"info", "msg":"[3/15 BorHeimdall] Processing sync events...", "t":"2024-07-29T02:32:15.578537126Z", "to":5.9932484E7}
ERROR 2024-07-29T02:32:19.369843815Z {"fetch time":"1.475624082s", "lastCheckpointId":0, "lastMilestoneId":0, "lastSpanID":9365, "lastStateSyncEventID":2940431, "lvl":"info", "msg":"[3/15 BorHeimdall] Sync events processed", "process time":"3.825342255s", "progress":5.9932484E7, "snap time":"3.189122541s", "sync event time":"1.70059327s", "t":"2024-07-29T02:32:19.369585203Z", "total records":820, "waypoint time":"0s"}
ERROR 2024-07-29T02:32:49.414018597Z {"current_prefix":"937c8bf5", "into":"HeaderNumber", "lvl":"info", "msg":"[4/15 BlockHashes] ETL [2/2] Loading", "t":"2024-07-29T02:32:49.413873177Z"}
ERROR 2024-07-29T02:33:11.836453400Z {"from":5.9891024E7, "lvl":"info", "msg":"[5/15 Bodies] Processing bodies...", "t":"2024-07-29T02:33:11.836264264Z", "to":5.9932484E7}
ERROR 2024-07-29T02:33:35.732227721Z {"end":5.9930789E7, "lvl":"info", "msg":"[bor.heimdall] Got new checkpoint", "rootHash":"0x6a8f8c2e5c99586cbfc7d182d37ba49e16b47f3ef209301ffe1061f3eb7ec8a3", "start":5.9930022E7, "t":"2024-07-29T02:33:35.732089634Z"}
ERROR 2024-07-29T02:34:47.801928583Z {"eth67":"3", "eth68":"15", "lvl":"info", "msg":"[p2p] GoodPeers", "t":"2024-07-29T02:34:47.801694009Z"}
ERROR 2024-07-29T02:34:48.139835525Z {"Anonymous":"8.3GB", "PrivateClean":"4.0GB", "PrivateDirty":"8.3GB", "Pss":"12.3GB", "Referenced":"12.2GB", "Rss":"12.3GB", "SharedClean":"4.0KB", "SharedDirty":"0B", "Size":"0B", "Swap":"0B", "alloc":"4.9GB", "lvl":"info", "msg":"[mem] memory stats", "sys":"8.2GB", "t":"2024-07-29T02:34:48.139641879Z"}
ERROR 2024-07-29T02:34:55.699178871Z {"alloc":"5.4GB", "baseFee":0, "lvl":"info", "msg":"[txpool] stat", "pending":0, "queued":7670, "sys":"8.2GB", "t":"2024-07-29T02:34:55.698978804Z"}
ERROR 2024-07-29T02:35:15.720206133Z {"end":5.9930789E7, "lvl":"info", "msg":"[bor.heimdall] Got new checkpoint", "rootHash":"0x6a8f8c2e5c99586cbfc7d182d37ba49e16b47f3ef209301ffe1061f3eb7ec8a3", "start":5.9930022E7, "t":"2024-07-29T02:35:15.720051854Z"}
ERROR 2024-07-29T02:35:57.280200680Z {"alloc":"5.5GB", "block_num":5.9891024E7, "lvl":"info", "msg":"[5/15 Bodies] Writing block bodies", "remaining":41460, "sys":"8.3GB", "t":"2024-07-29T02:35:57.27997419Z"}
ERROR 2024-07-29T02:36:12.883640945Z {"alloc":"4.6GB", "block_num":5.9901335E7, "lvl":"info", "msg":"[5/15 Bodies] Writing block bodies", "remaining":31149, "sys":"8.6GB", "t":"2024-07-29T02:36:12.883419322Z"}
ERROR 2024-07-29T02:36:41.772258159Z {"blk/sec":197, "blocks":41460, "highest":5.9932484E7, "in":209.935968206, "lvl":"info", "msg":"[5/15 Bodies] Processed", "t":"2024-07-29T02:36:41.772054037Z"}
ERROR 2024-07-29T02:36:41.772295196Z {"in":"3m29.936104521s", "lvl":"info", "msg":"[5/15 Bodies] DONE", "t":"2024-07-29T02:36:41.772142299Z"}
ERROR 2024-07-29T02:36:41.772379103Z {"from":5.9891024E7, "lvl":"info", "msg":"[6/15 Senders] Started", "t":"2024-07-29T02:36:41.772295035Z", "to":5.9932484E7}
ERROR 2024-07-29T02:36:42.074922778Z {"from":5.9891024E7, "lvl":"info", "msg":"[7/15 Execution] Blocks execution", "t":"2024-07-29T02:36:42.074742392Z", "to":5.9932484E7}
ERROR 2024-07-29T02:36:44.079172106Z {"block":5.9891025E7, "err":"invalid block: mismatched receipt headers for block 59891025 (0xf273f745cdbb0d48e208d3864d838130531770319fb0160996a0159dbd17da81 != 0x42a0be310674e0f5dbdf6afa6726b8eda1a19bb97bb335e38089d01c2168d246)", "hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "lvl":"warn", "msg":"[7/15 Execution] Execution failed", "t":"2024-07-29T02:36:44.078968458Z"}
ERROR 2024-07-29T02:36:44.084391968Z {"block":5.9891024E7, "lvl":"info", "msg":"[7/15 Execution] Completed on", "t":"2024-07-29T02:36:44.084265281Z"}
ERROR 2024-07-29T02:36:49.174644419Z {"amount":1024, "from":5.9889967E7, "hash":"0xfc1ea0f5adbc25d9962e454091007cd845cf4c76a349b1479391acbb2624fec0", "header sending":"7.418631ms", "log sending":"287ns", "lvl":"info", "msg":"RPC Daemon notified of new headers", "t":"2024-07-29T02:36:49.174440316Z", "to":5.9890991E7}
ERROR 2024-07-29T02:36:49.175663131Z {"BlockHashes":"52.466s", "Bodies":"3m29.936s", "BorHeimdall":"3.825s", "Execution":"2.009s", "Headers":"16.472s", "Senders":"302ms", "Snapshots":"3.361s", "Unwind Headers":"5.061s", "lvl":"info", "msg":"Timings (slower than 50ms)", "t":"2024-07-29T02:36:49.175550806Z"}
ERROR 2024-07-29T02:36:49.198690629Z {"from":5.9891024E7, "lvl":"info", "msg":"[2/15 Headers] Waiting for headers...", "t":"2024-07-29T02:36:49.198541947Z"}
ERROR 2024-07-29T02:36:49.261542730Z {"hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "height":5.9891025E7, "lvl":"warn", "msg":"[downloader] Rejected header marked as bad", "t":"2024-07-29T02:36:49.261358825Z"}
ERROR 2024-07-29T02:36:49.261966448Z {"hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "height":5.9891025E7, "lvl":"warn", "msg":"[downloader] Rejected header marked as bad", "t":"2024-07-29T02:36:49.261890351Z"}
ERROR 2024-07-29T02:36:49.328662150Z {"hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "height":5.9891025E7, "lvl":"warn", "msg":"[downloader] Rejected header marked as bad", "t":"2024-07-29T02:36:49.32846549Z"}
ERROR 2024-07-29T02:36:49.386449225Z {"hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "height":5.9891025E7, "lvl":"warn", "msg":"[downloader] Rejected header marked as bad", "t":"2024-07-29T02:36:49.3862906Z"}
ERROR 2024-07-29T02:36:49.790972966Z {"hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "height":5.9891025E7, "lvl":"warn", "msg":"[downloader] Rejected header marked as bad", "t":"2024-07-29T02:36:49.790800634Z"}
ERROR 2024-07-29T02:36:49.911756960Z {"hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "height":5.9891025E7, "lvl":"warn", "msg":"[downloader] Rejected header marked as bad", "t":"2024-07-29T02:36:49.911611888Z"}
ERROR 2024-07-29T02:36:50.206767534Z {"hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "height":5.9891025E7, "lvl":"warn", "msg":"[downloader] Rejected header marked as bad", "t":"2024-07-29T02:36:50.206576862Z"}
ERROR 2024-07-29T02:36:50.206821975Z {"hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "height":5.9891025E7, "lvl":"warn", "msg":"[downloader] Rejected header marked as bad", "t":"2024-07-29T02:36:50.206764102Z"}
ERROR 2024-07-29T02:36:50.219485885Z {"hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "height":5.9891025E7, "lvl":"warn", "msg":"[downloader] Rejected header marked as bad", "t":"2024-07-29T02:36:50.219392892Z"}
ERROR 2024-07-29T02:36:50.230286687Z {"hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "height":5.9891025E7, "lvl":"warn", "msg":"[downloader] Rejected header marked as bad", "t":"2024-07-29T02:36:50.2301451Z"}
ERROR 2024-07-29T02:36:50.234570253Z {"hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "height":5.9891025E7, "lvl":"warn", "msg":"[downloader] Rejected header marked as bad", "t":"2024-07-29T02:36:50.234439624Z"}
ERROR 2024-07-29T02:36:50.563872297Z {"hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "height":5.9891025E7, "lvl":"warn", "msg":"[downloader] Rejected header marked as bad", "t":"2024-07-29T02:36:50.56371213Z"}
ERROR 2024-07-29T02:36:51.207357544Z {"hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "height":5.9891025E7, "lvl":"warn", "msg":"[downloader] Rejected header marked as bad", "t":"2024-07-29T02:36:51.207194009Z"}
ERROR 2024-07-29T02:36:51.207440633Z {"hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "height":5.9891025E7, "lvl":"warn", "msg":"[downloader] Rejected header marked as bad", "t":"2024-07-29T02:36:51.207378046Z"}
ERROR 2024-07-29T02:36:51.207823654Z {"hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "height":5.9891025E7, "lvl":"warn", "msg":"[downloader] Rejected header marked as bad", "t":"2024-07-29T02:36:51.207755826Z"}
ERROR 2024-07-29T02:36:51.209211246Z {"hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "height":5.9891025E7, "lvl":"warn", "msg":"[downloader] Rejected header marked as bad", "t":"2024-07-29T02:36:51.209125422Z"}
ERROR 2024-07-29T02:36:51.209293379Z {"hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "height":5.9891025E7, "lvl":"warn", "msg":"[downloader] Rejected header marked as bad", "t":"2024-07-29T02:36:51.209239268Z"}
ERROR 2024-07-29T02:36:51.303280502Z {"hash":"0x6035a4866ff16fd8079640701ea82230dcce59ccd7259f69981c26eb85ebe62e", "height":5.9891025E7, "lvl":"warn", "msg":"[downloader] Rejected header marked as bad", "t":"2024-07-29T02:36:51.30313245Z"}

@MrFreezeDZ
Copy link
Author

This was our issue before with Erigon in version 2.59.3: #10873

@AskAlexSharov
Copy link
Collaborator

similar to #10734

taratorio added a commit that referenced this issue Aug 9, 2024
relates to:
#10734
#11387

restart Erigon with `SAVE_HEAP_PROFILE = true` env variable
wait until we reach 45% or more alloc in stage_headers when
"noProgressCounter >= 5" or "Rejected header marked as bad"
taratorio added a commit that referenced this issue Aug 9, 2024
relates to:
#10734
#11387

restart Erigon with `SAVE_HEAP_PROFILE = true` env variable
wait until we reach 45% or more alloc in stage_headers when
"noProgressCounter >= 5" or "Rejected header marked as bad"
@taratorio
Copy link
Member

taratorio commented Aug 9, 2024

@MrFreezeDZ hey, to help us troubleshoot this can you please:

  • stop Erigon
  • git checkout e2-polgyon-headers-stage-oom
  • make erigon
  • set env variable SAVE_HEAP_PROFILE=true and start Erigon: export SAVE_HEAP_PROFILE=true && ./build/bin/erigon ...

This branch will try to capture the heap profile when near OOM when we face the stage headers "Rejected header marked as bad" situation and will save it in $TMPDIR/erigon-mem.prof. You can do echo $TMPDIR to find your OS tmpdir.

When this issue re-occurs please send us that file. Or can send us a png after doing go tool pprof -png $TMPDIR/erigon-mem.prof > erigon-mem-prof.png

taratorio added a commit that referenced this issue Aug 12, 2024
…11551)

cherry-pick 2a98f6a for E2

relates to:
#10734
#11387

restart Erigon with `SAVE_HEAP_PROFILE = true` env variable wait until
we reach 45% or more alloc in stage_headers when "noProgressCounter >=
5" or "Rejected header marked as bad"
@AskAlexSharov AskAlexSharov added imp2 Medium importance imp1 High importance and removed imp2 Medium importance labels Aug 16, 2024
@taratorio
Copy link
Member

@MrFreezeDZ this change is now in v2.60.6 so you can update to that and won't need to use the branch I mentioned in my previous message.

Were you able to run Erigon with export SAVE_HEAP_PROFILE=true && ./build/bin/ergion and reproduce the OOM again?

@MrFreezeDZ
Copy link
Author

Hi, I just came back from holidays :)
Right now we were not able to reproduce the behavior with the enabled flag. Just to be clear, we are not able to actively reproduce the behavior. We just have the problem from time to time. We had it four weeks in a row on the weekends. Then we had two weeks without any problem and on the last weekend it happened again.
I will use the environment variable SAVE_HEAP_PROFILE=true in our next deployment with version 2.60.6, but this will take some time.

@AskAlexSharov
Copy link
Collaborator

Recently i saw:

[WARN] [08-30|02:08:46.608] [7/9 Execution] Execution failed         block=61101954 txNum=4495312603 hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 err="invalid block, txnIdx=59, gas used by execution: 10771434, in header: 18160412, headerNum=61101954, 24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413"
[INFO] [08-30|02:08:46.608] [7/9 Execution] Done                     blk=61101953 blks=1 blk/s=7.0 txs=104 tx/s=729 gas/s=123.37M buf=114.8KB/2.0GB stepsInDB=0.00 step=2877.0 alloc=22.3GB sys=34.5GB
[EROR] [08-30|02:08:46.640] Staged Sync                              err="bad block unwinding"
[INFO] [08-30|02:08:47.141] [2/9 Headers] Waiting for headers...     from=61101953 hash=0xab87f6a3a5a1c8c400a22c9fef0f0c6c54e99764f496b32768af78c7661f00bb
[WARN] [08-30|02:08:47.382] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.383] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.383] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.407] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.423] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.425] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.451] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.465] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.517] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.551] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.571] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.571] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954
[WARN] [08-30|02:08:47.572] [downloader] InsertHeader: Rejected header marked as bad hash=0x24079d49bc63a8dfed171ec5d26f381ad5e7d08849e0eed51eabcf6caa5d8413 height=61101954

@taratorio so, to reproduce - you just need something like this: #11799

@MrFreezeDZ
Copy link
Author

Hi @taratorio today we had the issue again. In the logs we saw the following:

ERROR 2024-09-30T06:54:35.598384885Z {"alloc":"168.6GB", "lvl":"info", "msg":"[Experiment] heap profile threshold check", "t":"2024-09-30T06:54:35.598275175Z", "total":"208.0GB"}
ERROR 2024-09-30T06:54:35.598412555Z {"filePath":"/tmp/erigon-mem.prof", "lvl":"info", "msg":"[Experiment] saving heap profile as near OOM", "t":"2024-09-30T06:54:35.598324615Z"}
ERROR 2024-09-30T06:54:47.487982999Z {"alloc":"168.8GB", "lvl":"info", "msg":"[Experiment] heap profile threshold check", "t":"2024-09-30T06:54:47.487621929Z", "total":"208.0GB"}
ERROR 2024-09-30T06:54:47.488023249Z {"filePath":"/tmp/erigon-mem.prof", "lvl":"info", "msg":"[Experiment] saving heap profile as near OOM", "t":"2024-09-30T06:54:47.487768339Z"}
ERROR 2024-09-30T06:54:55.891238310Z {"eth67":"6", "eth68":"4", "lvl":"info", "msg":"[p2p] GoodPeers", "t":"2024-09-30T06:54:55.89091132Z"}
ERROR 2024-09-30T06:54:59.291905173Z {"hash":"0xac3a9d14df1296c422fc0ad1f7fbf3e9ae4b57ccc439a53ca9481844c266b297", "height":6.2401617E7, "lvl":"warn", "msg":"[downloader] InsertHeader: Rejected header marked as bad", "t":"2024-09-30T06:54:59.291645543Z"}
ERROR 2024-09-30T06:54:59.292586403Z {"alloc":"168.6GB", "blk/second":0, "invalidHeaders":0, "lvl":"info", "msg":"[2/15 Headers] No block headers to write in this log period", "number":6.2401616E7, "rejectedBadHeaders":1, "sys":"184.3GB", "t":"2024-09-30T06:54:59.292431073Z"}
ERROR 2024-09-30T06:54:59.292617883Z {"dups":25423, "lvl":"info", "msg":"Req/resp stats", "req":69, "reqMax":6.2451115E7, "reqMin":6.240166E7, "resp":286, "respMax":6.2451117E7, "respMin":6.2401615E7, "skel":1, "skelMax":6.2401807E7, "skelMin":6.2401615E7, "t":"2024-09-30T06:54:59.292504263Z"}
ERROR 2024-09-30T06:54:59.294352983Z {"alloc":"168.6GB", "lvl":"info", "msg":"[Experiment] heap profile threshold check", "t":"2024-09-30T06:54:59.294202243Z", "total":"208.0GB"}
ERROR 2024-09-30T06:54:59.294379763Z {"filePath":"/tmp/erigon-mem.prof", "lvl":"info", "msg":"[Experiment] saving heap profile as near OOM", "t":"2024-09-30T06:54:59.294253833Z"}
ERROR 2024-09-30T06:55:00.062290850Z {"Anonymous":"180.5GB", "PrivateClean":"8.9GB", "PrivateDirty":"180.5GB", "Pss":"189.4GB", "Referenced":"189.1GB", "Rss":"189.4GB", "SharedClean":"640.0KB", "SharedDirty":"0B", "Size":"0B", "Swap":"0B", "alloc":"168.6GB", "lvl":"info", "msg":"[mem] memory stats", "sys":"184.3GB", "t":"2024-09-30T06:55:00.061878521Z"}
ERROR 2024-09-30T06:55:09.673261692Z {"alloc":"169.0GB", "baseFee":0, "lvl":"info", "msg":"[txpool] stat", "pending":0, "queued":30000, "sys":"184.3GB", "t":"2024-09-30T06:55:09.672924482Z"}
ERROR 2024-09-30T06:55:11.751003677Z {"alloc":"169.2GB", "lvl":"info", "msg":"[Experiment] heap profile threshold check", "t":"2024-09-30T06:55:11.750810677Z", "total":"208.0GB"}
ERROR 2024-09-30T06:55:11.751041728Z {"filePath":"/tmp/erigon-mem.prof", "lvl":"info", "msg":"[Experiment] saving heap profile as near OOM", "t":"2024-09-30T06:55:11.750890437Z"}
ERROR 2024-09-30T06:55:23.502408482Z {"hash":"0xac3a9d14df1296c422fc0ad1f7fbf3e9ae4b57ccc439a53ca9481844c266b297", "height":6.2401617E7, "lvl":"warn", "msg":"[downloader] InsertHeader: Rejected header marked as bad", "t":"2024-09-30T06:55:23.502196133Z"}
ERROR 2024-09-30T06:55:23.513039673Z {"alloc":"168.7GB", "lvl":"info", "msg":"[Experiment] heap profile threshold check", "t":"2024-09-30T06:55:23.512861493Z", "total":"208.0GB"}
ERROR 2024-09-30T06:55:23.513079242Z {"filePath":"/tmp/erigon-mem.prof", "lvl":"info", "msg":"[Experiment] saving heap profile as near OOM", "t":"2024-09-30T06:55:23.512993223Z"}
ERROR 2024-09-30T06:55:35.727652341Z {"hash":"0xac3a9d14df1296c422fc0ad1f7fbf3e9ae4b57ccc439a53ca9481844c266b297", "height":6.2401617E7, "lvl":"warn", "msg":"[downloader] InsertHeader: Rejected header marked as bad", "t":"2024-09-30T06:55:35.727421741Z"}
ERROR 2024-09-30T06:55:35.728600471Z {"alloc":"168.7GB", "blk/second":0, "invalidHeaders":0, "lvl":"info", "msg":"[2/15 Headers] No block headers to write in this log period", "number":6.2401616E7, "rejectedBadHeaders":2, "sys":"184.3GB", "t":"2024-09-30T06:55:35.728457141Z"}
ERROR 2024-09-30T06:55:35.728686081Z {"dups":73550, "lvl":"info", "msg":"Req/resp stats", "req":246, "reqMax":6.2451133E7, "reqMin":6.2428055E7, "resp":755, "respMax":6.2451134E7, "respMin":6.2401615E7, "skel":2, "skelMax":6.2401807E7, "skelMin":6.2401615E7, "t":"2024-09-30T06:55:35.728543541Z"}
ERROR 2024-09-30T06:55:35.730583621Z {"alloc":"168.7GB", "lvl":"info", "msg":"[Experiment] heap profile threshold check", "t":"2024-09-30T06:55:35.730407511Z", "total":"208.0GB"}
ERROR 2024-09-30T06:55:35.730638401Z {"filePath":"/tmp/erigon-mem.prof", "lvl":"info", "msg":"[Experiment] saving heap profile as near OOM", "t":"2024-09-30T06:55:35.730467001Z"}
ERROR 2024-09-30T06:55:48.583071989Z {"alloc":"169.7GB", "lvl":"info", "msg":"[Experiment] heap profile threshold check", "t":"2024-09-30T06:55:48.582827469Z", "total":"208.0GB"}
ERROR 2024-09-30T06:55:48.583159469Z {"filePath":"/tmp/erigon-mem.prof", "lvl":"info", "msg":"[Experiment] saving heap profile as near OOM", "t":"2024-09-30T06:55:48.582956519Z"}
ERROR 2024-09-30T06:56:00.830055237Z {"hash":"0xac3a9d14df1296c422fc0ad1f7fbf3e9ae4b57ccc439a53ca9481844c266b297", "height":6.2401617E7, "lvl":"warn", "msg":"[downloader] InsertHeader: Rejected header marked as bad", "t":"2024-09-30T06:56:00.829717057Z"}
ERROR 2024-09-30T06:56:00.831284317Z {"alloc":"168.7GB", "blk/second":0, "invalidHeaders":0, "lvl":"info", "msg":"[2/15 Headers] No block headers to write in this log period", "number":6.2401616E7, "rejectedBadHeaders":1, "sys":"184.5GB", "t":"2024-09-30T06:56:00.830906697Z"}
ERROR 2024-09-30T06:56:00.831366906Z {"dups":168335, "lvl":"info", "msg":"Req/resp stats", "req":491, "reqMax":6.2451141E7, "reqMin":6.2404583E7, "resp":1299, "respMax":6.2451146E7, "respMin":6.2401615E7, "skel":2, "skelMax":6.2401807E7, "skelMin":6.2401615E7, "t":"2024-09-30T06:56:00.831040877Z"}

We saw a lot of CPU usage, I think this is due to tryingn to write the heap dump.
The heap file had only 0 KB. We think there is also the problem here, that our tmp folder can only take up to 45 GB but erigons Ram usage is over 180GB. So I am not able to deliver a heap dump file to you this way.
We unwinded again, and the error seems to be disappeared.
If the integration command logs are useful for you, I can provide them here.
Is there another environment variable which would give us the option to specify the folder to where the heap dump would be written or do we need the temp folder to be bigger?

@taratorio
Copy link
Member

@MrFreezeDZ hey, thank you for trying it out. I've added the ability for you to specify the file path via another env var. Can you please run branch e2-save-oom-file and use export SAVE_HEAP_PROFILE=true && HEAP_PROFILE_FILE_PATH=/Users/Example/erigon-mem.prof && ./build/bin/erigon ...

@MrFreezeDZ
Copy link
Author

@taratorio thank you for the new environment variable. Right now we use the images from https://hub.docker.com/r/thorax/erigon/tags . It would be easier for me, if the variable would be in another image version. Until then I will try to find a way to use the branch in our CD.

@taratorio
Copy link
Member

@MrFreezeDZ separate topic but fyi, the official Erigon docker images are no longer at thorax but at https://hub.docker.com/r/erigontech/erigon.

I think I may have been able to reproduce an OOM with Alex's help. I'm investigating a fix.

@taratorio
Copy link
Member

taratorio commented Oct 21, 2024

@MrFreezeDZ I fixed a OOM - the branch is e2-stage-headers-fix-oom. However, it would be best if you can run this branch to verify that it is indeed the same OOM as you are seeing as I'm not 100% certain. That branch also has the new HEAP_PROFILE_FILE_PATH env var so you could run Erigon using export SAVE_HEAP_PROFILE=true && HEAP_PROFILE_FILE_PATH=/Users/Example/erigon-mem.pprof && ./build/bin/erigon ...

Would appreciate your help to run this if possible, otherwise this issue may be prolonged if it turns out you are suffering from a different OOM.

AskAlexSharov pushed a commit that referenced this issue Oct 22, 2024
…12404)

relates to #11387,
#11473,
#10734

tried to simulate the OOM using
#11799

What I found was infinitely growing alloc of headers when receiving new
header messages in sentry's `blockHeaders66` handler (check screenshot
below).

It looks like this is happening because in the case of a bad child
header: we delete it from the `links` map, however its parent link still
holds a reference to it so the deleted link & header never get gc-ed.
Furthermore if new similar bad hashes arrive after deletion they get
appended to their parent header's link and the children of that link can
grow indefinitely
([here](https://github.com/erigontech/erigon/blob/main/turbo/stages/headerdownload/header_algos.go#L1085-L1086)).
Ie confirmed with debug logs (note link at 13450415 has 140124
children):
```
DBUG[10-21|18:18:05.003] [downloader] InsertHeader: Rejected header parent info hash=0xb488d67deaf4103880fa972fd72a7a9be552e3bc653f124f1ad9cb45f36bcd07 height=13450415 children=140124
```

<br/>

The solution for this is to remove the bad link from its parent child
list
[here](https://github.com/erigontech/erigon/blob/main/turbo/stages/headerdownload/header_algos.go#L544)
so that 1) it gets gc-ed and 2) the children list does not grow
indefinitely.

<br/>


![oom-heap-profile2](https://github.com/user-attachments/assets/518fa658-c199-48b6-aa2d-110673264144)
yperbasis pushed a commit that referenced this issue Oct 24, 2024
@taratorio
Copy link
Member

taratorio commented Oct 25, 2024

this should be fixed in v2.60.9 (by #12405) and v3.0.0-alpha5 (by #12404), please re-open if you see it again

Tangui-Bitfly pushed a commit to Tangui-Bitfly/erigon-lib that referenced this issue Oct 30, 2024
…(#11551)

cherry-pick 2a98f6aa53ccd558543bc95ffe9bf0fad4ef278f for E2

relates to:
erigontech/erigon#10734
erigontech/erigon#11387

restart Erigon with `SAVE_HEAP_PROFILE = true` env variable wait until
we reach 45% or more alloc in stage_headers when "noProgressCounter >=
5" or "Rejected header marked as bad"
Tangui-Bitfly pushed a commit to Tangui-Bitfly/erigon-lib that referenced this issue Oct 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
imp1 High importance polygon
Projects
None yet
Development

No branches or pull requests

4 participants