Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Losing Sync Since March 13 #9767

Open
dylancaponi opened this issue Mar 20, 2024 · 16 comments
Open

Losing Sync Since March 13 #9767

dylancaponi opened this issue Mar 20, 2024 · 16 comments
Labels
imp2 Medium importance performance

Comments

@dylancaponi
Copy link

dylancaponi commented Mar 20, 2024

My node on ETH Mainnet has been getting out of sync repeatedly since March 13.
The main unique thing about my setup is that snapshots are on an external SSD (symlinked) to save space on the NVMe.

System information

Erigon version: ./erigon --version

erigon version 2.58.2-125509e4

Updated to 2.59.0 and still losing sync

OS & Version: Windows/Linux/OSX

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.5 LTS
Release:	20.04
Codename:	focal

Commit hash:

Erigon Command (with flags/config):

[Unit]
Description=Erigon Execution Layer Client service
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=dylan
Restart=on-failure
RestartSec=3
KillSignal=SIGINT
TimeoutStopSec=300
ExecStart=/home/dylan/erigon/build/bin/erigon \
  --datadir /var/lib/erigon \
  --chain mainnet \
  --db.size.limit=8TB \
  --metrics \
  --pprof \
  --prune htc \
  --authrpc.jwtsecret=/secrets/jwtsecret \
  --http \
  --http.addr 192.168.0.21 \
  --http.api eth,net,web3

[Install]
WantedBy=multi-user.target

Consensus Layer:

Lighthouse v5.1.1-2a3c709
BLS library: blst-modern
SHA256 hardware acceleration: false
Allocator: jemalloc
Profile: maxperf
Specs: mainnet (true), minimal (false), gnosis (true)

Consensus Layer Command (with flags/config):

[Unit]
Description=eth beacon chain service
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=dylan
Restart=on-failure
ExecStart=/usr/bin/lighthouse bn \
  --network mainnet \
  --staking \
  --validator-monitor-auto \
  --metrics \
  --suggested-fee-recipient <redacted> \
  --checkpoint-sync-url=https://beaconstate.info \
  --execution-endpoint http://127.0.0.1:8551 \
  --execution-jwt /secrets/jwtsecret \
  --monitoring-endpoint https://beaconcha.in/api/v1/client/metrics?apikey=<redacted>&machine=Nuc

[Install]
WantedBy=multi-user.target

Chain/Network: ETH Mainnet

Expected behaviour

Stay synced.

Actual behaviour

Losing sync often

Steps to reproduce the behaviour

Backtrace

[backtrace]
@dylancaponi
Copy link
Author

Let me know what else I could grep the logs for to help diagnose this issue. I didn't see warning or error.

@dylancaponi
Copy link
Author

One thing I've noticed is that the execution phase seems very slow:

Mar 19 21:39:02 nuc erigon[986]: [INFO] [03-19|21:39:02.819] [4/12 Execution] Executed blocks         number=19470961 blk/s=0.1 tx/s=27.7 Mgas/s=2.0 gasState=0.00 batch=861.6KB alloc=3.6GB sys=8.0GB
Mar 19 21:39:41 nuc erigon[986]: [INFO] [03-19|21:39:41.942] [4/12 Execution] Executed blocks         number=19470965 blk/s=0.1 tx/s=19.4 Mgas/s=2.1 gasState=0.00 batch=1012.5KB alloc=2.9GB sys=8.0GB
Mar 19 21:40:05 nuc erigon[986]: [INFO] [03-19|21:40:05.355] [4/12 Execution] Executed blocks         number=19470969 blk/s=0.2 tx/s=28.5 Mgas/s=2.2 gasState=0.00 batch=1.1MB alloc=3.1GB sys=8.0GB

@dylancaponi
Copy link
Author

Ah, here are some errors:

Mar 19 15:19:28 nuc erigon[986]: [WARN] [03-19|15:19:28.865] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 15:25:57 nuc erigon[986]: [WARN] [03-19|15:25:57.855] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 15:41:04 nuc erigon[986]: [WARN] [03-19|15:41:04.040] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 16:00:14 nuc erigon[986]: [WARN] [03-19|16:00:11.640] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 16:14:39 nuc erigon[986]: [WARN] [03-19|16:14:39.917] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 16:30:22 nuc erigon[986]: [WARN] [03-19|16:30:22.702] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 16:39:22 nuc erigon[986]: [WARN] [03-19|16:39:22.187] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 17:03:45 nuc erigon[986]: [WARN] [03-19|17:03:45.462] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 17:57:16 nuc erigon[986]: [WARN] [03-19|17:57:16.289] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 18:45:31 nuc erigon[986]: [WARN] [03-19|18:45:31.304] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 18:56:29 nuc erigon[986]: [WARN] [03-19|18:56:29.358] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 19:17:38 nuc erigon[986]: [WARN] [03-19|19:17:38.316] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 19 20:50:50 nuc erigon[986]: [WARN] [03-19|20:50:50.687] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 20 01:06:51 nuc erigon[986]: [WARN] [03-20|01:06:51.827] [fetch] onNewBlock                       err="failed MdbxKV cursor.Seek(): mdbx_cursor_get: MDBX_BAD_VALSIZE: Invalid size or alignment of key or data for target database, either invalid subDB name, bucket: PoolTransaction,  key: ed80609b55bbb9a63618c9c6da1ed8d31d0a49962a98f27d69226dc9a528bcfa"
Mar 20 01:09:54 nuc erigon[986]: [WARN] [03-20|01:09:54.820] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"

@AskAlexSharov
Copy link
Collaborator

  1. Remove datadir/txpool
  2. Add —snap.stop until next erigon’s upgrade

@dylancaponi
Copy link
Author

  1. Remove datadir/txpool
  2. Add —snap.stop until next erigon’s upgrade

Thanks ended up upgrading to 2.59 and all good so far. Will reopen if not.

@dylancaponi
Copy link
Author

Reopening because I’m still losing sync on 2.59.0. Will post logs tonight.

@dylancaponi dylancaponi reopened this Mar 21, 2024
@dylancaponi
Copy link
Author

Mar 21 19:01:53 nuc erigon[441457]: [WARN] [03-21|19:01:23.623] [engine] GetHeaderByHash                 err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 21 19:01:53 nuc erigon[441457]: [WARN] [03-21|19:01:23.623] [engine] GetHeaderByHash                 err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 21 19:01:53 nuc erigon[441457]: [WARN] [03-21|19:01:23.623] [engine] GetTd                           err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 21 19:01:53 nuc erigon[441457]: [WARN] [03-21|19:01:23.623] [engine] GetTd                           err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 21 19:01:53 nuc erigon[441457]: [WARN] [03-21|19:01:23.623] [engine] CurrentHeader                   err="ethereumExecutionModule.CurrentHeader: could not open database: context canceled"
Mar 21 19:01:53 nuc erigon[441457]: [WARN] [03-21|19:01:23.623] [engine] CurrentHeader                   err="ethereumExecutionModule.CurrentHeader: could not open database: context canceled"
Mar 21 19:32:56 nuc erigon[441457]: [WARN] [03-21|19:32:37.412] [engine] GetHeaderByHash                 err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 21 19:32:56 nuc erigon[441457]: [WARN] [03-21|19:32:37.412] [engine] GetTd                           err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 21 19:32:56 nuc erigon[441457]: [WARN] [03-21|19:32:37.412] [engine] CurrentHeader                   err="ethereumExecutionModule.CurrentHeader: could not open database: context canceled"
Mar 21 19:35:21 nuc erigon[441457]: [WARN] [03-21|19:34:58.287] [engine] GetHeaderByHash                 err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 21 19:35:21 nuc erigon[441457]: [WARN] [03-21|19:35:21.201] [engine] GetTd                           err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 21 19:35:21 nuc erigon[441457]: [WARN] [03-21|19:35:21.201] [engine] CurrentHeader                   err="ethereumExecutionModule.CurrentHeader: could not open database: context canceled"

@Giulio2002
Copy link
Contributor

hey could try to use --internalcl, just to see if it is an engine API specific issue? CC @somnathb1

@dylancaponi
Copy link
Author

@somnathb1 Suggested on discord:

Could you also (in this order)
check the disk speed/performance in terms of access times - for all symlinks and datadir
check the above again after running erigon
try removing any competing process accessing the datadir (such as another erigon process, or an integration/mdbx command line)
try removing $datadir/chaindata/mdbx.lck and restart the erigon process/container

@dylancaponi
Copy link
Author

Here are fio results testing snapshots (symlinked) and datadir while Erigon is stopped and a few minutes after starting it.

Snapshots before starting Erigon:

fio --name=Test --rw=read --size=500M --directory=/var/lib/erigon/snapshots
Test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
Starting 1 process
Test: Laying out IO file (1 file / 500MiB)

Test: (groupid=0, jobs=1): err= 0: pid=584985: Fri Mar 22 17:38:03 2024
  read: IOPS=151k, BW=590MiB/s (618MB/s)(500MiB/848msec)
    clat (nsec): min=570, max=838099, avg=6410.46, stdev=45685.29
     lat (nsec): min=595, max=838122, avg=6437.37, stdev=45685.45
    clat percentiles (nsec):
     |  1.00th=[   612],  5.00th=[   636], 10.00th=[   644], 20.00th=[   660],
     | 30.00th=[   676], 40.00th=[   684], 50.00th=[   700], 60.00th=[   716],
     | 70.00th=[   740], 80.00th=[   788], 90.00th=[   892], 95.00th=[  1020],
     | 99.00th=[329728], 99.50th=[391168], 99.90th=[473088], 99.95th=[497664],
     | 99.99th=[602112]
   bw (  KiB/s): min=601568, max=601568, per=99.63%, avg=601568.00, stdev= 0.00, samples=1
   iops        : min=150392, max=150392, avg=150392.00, stdev= 0.00, samples=1
  lat (nsec)   : 750=71.79%, 1000=22.81%
  lat (usec)   : 2=2.29%, 4=1.52%, 10=0.03%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.04%, 500=1.48%, 750=0.05%, 1000=0.01%
  cpu          : usr=5.08%, sys=23.61%, ctx=1999, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=128000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=590MiB/s (618MB/s), 590MiB/s-590MiB/s (618MB/s-618MB/s), io=500MiB (524MB), run=848-848msec

Disk stats (read/write):
  sda: ios=1707/0, merge=0/0, ticks=1260/0, in_queue=0, util=88.46%

Datadir before starting Erigon:

fio --name=Test --rw=read --size=500M --directory=/var/lib/erigon
Test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
Starting 1 process
Test: Laying out IO file (1 file / 500MiB)

Test: (groupid=0, jobs=1): err= 0: pid=585044: Fri Mar 22 17:41:01 2024
  read: IOPS=204k, BW=796MiB/s (835MB/s)(500MiB/628msec)
    clat (nsec): min=563, max=951461, avg=4532.92, stdev=31085.75
     lat (nsec): min=587, max=951488, avg=4580.54, stdev=31086.00
    clat percentiles (nsec):
     |  1.00th=[   604],  5.00th=[   636], 10.00th=[   668], 20.00th=[   724],
     | 30.00th=[   764], 40.00th=[   804], 50.00th=[   868], 60.00th=[  1020],
     | 70.00th=[  1160], 80.00th=[  1448], 90.00th=[  2008], 95.00th=[  2192],
     | 99.00th=[171008], 99.50th=[261120], 99.90th=[444416], 99.95th=[473088],
     | 99.99th=[514048]
   bw (  KiB/s): min=808432, max=808432, per=99.16%, avg=808432.00, stdev= 0.00, samples=1
   iops        : min=202108, max=202108, avg=202108.00, stdev= 0.00, samples=1
  lat (nsec)   : 750=25.47%, 1000=33.82%
  lat (usec)   : 2=30.21%, 4=8.71%, 10=0.16%, 20=0.04%, 50=0.08%
  lat (usec)   : 100=0.22%, 250=0.73%, 500=0.55%, 750=0.02%, 1000=0.01%
  cpu          : usr=11.00%, sys=47.53%, ctx=1531, majf=0, minf=13
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=128000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=796MiB/s (835MB/s), 796MiB/s-796MiB/s (835MB/s-835MB/s), io=500MiB (524MB), run=628-628msec

Disk stats (read/write):
    dm-0: ios=1982/72, merge=0/0, ticks=852/0, in_queue=852, util=86.59%, aggrios=2009/68, aggrmerge=0/4, aggrticks=871/35, aggrin_queue=0, aggrutil=84.61%
  nvme0n1: ios=2009/68, merge=0/4, ticks=871/35, in_queue=0, util=84.61%

Snapshots after starting Erigon:

fio --name=Test --rw=read --size=500M --directory=/var/lib/erigon/snapshots
Test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=69.0MiB/s][r=17.7k IOPS][eta 00m:00s]
Test: (groupid=0, jobs=1): err= 0: pid=585143: Fri Mar 22 17:44:06 2024
  read: IOPS=16.9k, BW=66.1MiB/s (69.3MB/s)(500MiB/7564msec)
    clat (nsec): min=462, max=191793k, avg=58581.45, stdev=773479.15
     lat (nsec): min=482, max=191793k, avg=58607.67, stdev=773479.12
    clat percentiles (nsec):
     |  1.00th=[     548],  5.00th=[     588], 10.00th=[     604],
     | 20.00th=[     628], 30.00th=[     644], 40.00th=[     668],
     | 50.00th=[     692], 60.00th=[     724], 70.00th=[     756],
     | 80.00th=[     804], 90.00th=[     900], 95.00th=[    1080],
     | 99.00th=[ 1810432], 99.50th=[ 5406720], 99.90th=[ 7438336],
     | 99.95th=[ 8159232], 99.99th=[11075584]
   bw (  KiB/s): min=42464, max=73216, per=98.63%, avg=66762.67, stdev=10164.32, samples=15
   iops        : min=10616, max=18304, avg=16690.67, stdev=2541.08, samples=15
  lat (nsec)   : 500=0.09%, 750=67.39%, 1000=26.13%
  lat (usec)   : 2=4.14%, 4=0.58%, 10=0.09%, 20=0.01%, 50=0.02%
  lat (usec)   : 100=0.05%, 250=0.06%, 500=0.10%, 750=0.06%, 1000=0.08%
  lat (msec)   : 2=0.23%, 4=0.23%, 10=0.72%, 20=0.01%, 250=0.01%
  cpu          : usr=0.70%, sys=5.66%, ctx=1918, majf=0, minf=12
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=128000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=66.1MiB/s (69.3MB/s), 66.1MiB/s-66.1MiB/s (69.3MB/s-69.3MB/s), io=500MiB (524MB), run=7564-7564msec

Disk stats (read/write):
  sda: ios=23086/8, merge=0/24, ticks=166931/35, in_queue=114592, util=97.68%

Datadir after starting Erigon:

fio --name=Test --rw=read --size=500M --directory=/var/lib/erigon
Test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1)
Test: (groupid=0, jobs=1): err= 0: pid=585208: Fri Mar 22 17:46:07 2024
  read: IOPS=116k, BW=453MiB/s (475MB/s)(500MiB/1104msec)
    clat (nsec): min=453, max=12097k, avg=7512.28, stdev=99586.78
     lat (nsec): min=475, max=12097k, avg=7611.07, stdev=99587.34
    clat percentiles (nsec):
     |  1.00th=[    532],  5.00th=[    572], 10.00th=[    636],
     | 20.00th=[    948], 30.00th=[   1144], 40.00th=[   1672],
     | 50.00th=[   1992], 60.00th=[   2192], 70.00th=[   2288],
     | 80.00th=[   2384], 90.00th=[   2576], 95.00th=[   2768],
     | 99.00th=[ 214016], 99.50th=[ 257024], 99.90th=[ 888832],
     | 99.95th=[1597440], 99.99th=[3751936]
   bw (  KiB/s): min=406496, max=467400, per=94.22%, avg=436948.00, stdev=43065.63, samples=2
   iops        : min=101624, max=116850, avg=109237.00, stdev=10766.41, samples=2
  lat (nsec)   : 500=0.14%, 750=13.74%, 1000=7.99%
  lat (usec)   : 2=28.14%, 4=48.05%, 10=0.26%, 20=0.08%, 50=0.05%
  lat (usec)   : 100=0.11%, 250=0.88%, 500=0.41%, 750=0.03%, 1000=0.03%
  lat (msec)   : 2=0.05%, 4=0.03%, 10=0.01%, 20=0.01%
  cpu          : usr=12.06%, sys=52.49%, ctx=1192, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=128000,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=453MiB/s (475MB/s), 453MiB/s-453MiB/s (475MB/s-475MB/s), io=500MiB (524MB), run=1104-1104msec

Disk stats (read/write):
    dm-0: ios=1406/1474, merge=0/0, ticks=1012/4020, in_queue=5032, util=99.29%, aggrios=2075/2099, aggrmerge=0/67, aggrticks=1310/9904, aggrin_queue=8056, aggrutil=99.59%
  nvme0n1: ios=2075/2099, merge=0/67, ticks=1310/9904, in_queue=8056, util=99.59%

@dylancaponi
Copy link
Author

dylancaponi commented Mar 22, 2024

I went back to the time that validation actually started failing to look at logs and the errors right before failure were DumpBlocks: DumpBodies: header missed in db:

Mar 20 14:34:25 nuc erigon[986]: [INFO] [03-20|14:34:25.952] head updated                             hash=0xda49c626629e22ea4181b0990f0d4fc31e21f8cc574dd4d3f08467b466997d98 number=19476536
Mar 20 14:34:26 nuc erigon[986]: [INFO] [03-20|14:34:26.018] [snapshots] Prune Blocks                 to=18500000 limit=100
Mar 20 14:34:26 nuc erigon[986]: [WARN] [03-20|14:34:26.019] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 20 14:34:37 nuc erigon[986]: [INFO] [03-20|14:34:37.741] [NewPayload] Handling new payload        height=19476537 hash=0x37a1c155eaa2d740f2103c42a2465f20fb8be10f11f831ee46885b6378f96fbb
Mar 20 14:34:38 nuc erigon[986]: [INFO] [03-20|14:34:38.363] [updateForkchoice] Fork choice update: flushing in-memory state (built by previous newPayload)
Mar 20 14:34:38 nuc erigon[986]: [INFO] [03-20|14:34:38.658] RPC Daemon notified of new headers       from=19476536 to=19476537 hash=0x37a1c155eaa2d740f2103c42a2465f20fb8be10f11f831ee46885b6378f96fbb header sending=13.673µs log sending=201ns
Mar 20 14:34:38 nuc erigon[986]: [INFO] [03-20|14:34:38.658] head updated                             hash=0x37a1c155eaa2d740f2103c42a2465f20fb8be10f11f831ee46885b6378f96fbb number=19476537
Mar 20 14:34:38 nuc erigon[986]: [INFO] [03-20|14:34:38.748] [snapshots] Prune Blocks                 to=18500000 limit=100
Mar 20 14:34:38 nuc erigon[986]: [WARN] [03-20|14:34:38.748] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"
Mar 20 14:34:48 nuc erigon[986]: [INFO] [03-20|14:34:48.675] [NewPayload] Handling new payload        height=19476538 hash=0xb5c6395dd860ec4b8ef314dae67cee46b322bb94e50dbd77a84aad17944de777
Mar 20 14:34:49 nuc erigon[986]: [INFO] [03-20|14:34:49.355] [updateForkchoice] Fork choice update: flushing in-memory state (built by previous newPayload)
Mar 20 14:34:49 nuc erigon[986]: [INFO] [03-20|14:34:49.643] RPC Daemon notified of new headers       from=19476537 to=19476538 hash=0xb5c6395dd860ec4b8ef314dae67cee46b322bb94e50dbd77a84aad17944de777 header sending=14.43µs log sending=216ns
Mar 20 14:34:49 nuc erigon[986]: [INFO] [03-20|14:34:49.643] head updated                             hash=0xb5c6395dd860ec4b8ef314dae67cee46b322bb94e50dbd77a84aad17944de777 number=19476538
Mar 20 14:34:49 nuc erigon[986]: [INFO] [03-20|14:34:49.727] [snapshots] Prune Blocks                 to=18500000 limit=100
Mar 20 14:34:49 nuc erigon[986]: [WARN] [03-20|14:34:49.728] [snapshots] retire blocks                err="DumpBlocks: DumpBodies: header missed in db: block_num=18500000,  hash=80ce6bb0e244fbdf66cf0a1108273fe1ca58788efb7fb8d3a0d783d2b06d433d"

So there might be multiple issues.

@dylancaponi
Copy link
Author

Getting this again several hours after restart. Never finished syncing:

Mar 23 00:43:52 nuc erigon[585070]: [INFO] [03-23|00:42:47.572] [6/12 IntermediateHashes] Calculating Merkle root current key=d65e0322
Mar 23 00:43:52 nuc erigon[585070]: [INFO] [03-23|00:43:17.588] [6/12 IntermediateHashes] Calculating Merkle root current key=d66d67ab
Mar 23 00:43:52 nuc erigon[585070]: [INFO] [03-23|00:43:47.604] [6/12 IntermediateHashes] Calculating Merkle root current key=d67f8020
Mar 23 00:43:54 nuc erigon[585070]: [INFO] [03-23|00:43:54.232] [p2p] GoodPeers                          eth68=45 eth67=1
Mar 23 00:43:56 nuc erigon[585070]: [INFO] [03-23|00:43:56.359] [mem] memory stats                       Rss=30.6MB Size=0B Pss=30.6MB SharedClean=14.8KB SharedDirty=0B PrivateClean=25.4MB PrivateDirty=5.2MB Refer>
Mar 23 00:44:02 nuc erigon[585070]: [INFO] [03-23|00:44:02.013] [txpool] stat                            pending=0 baseFee=0 queued=30000 alloc=3.7GB sys=8.3GB
Mar 23 00:44:29 nuc erigon[585070]: [INFO] [03-23|00:44:18.321] [6/12 IntermediateHashes] Calculating Merkle root current key=d68f3ce9
Mar 23 00:44:47 nuc erigon[585070]: [INFO] [03-23|00:44:47.536] [6/12 IntermediateHashes] Calculating Merkle root current key=d697de19
Mar 23 00:45:18 nuc erigon[585070]: [INFO] [03-23|00:45:18.052] [6/12 IntermediateHashes] Calculating Merkle root current key=d6a597f2
Mar 23 00:46:02 nuc erigon[585070]: [INFO] [03-23|00:45:47.567] [6/12 IntermediateHashes] Calculating Merkle root current key=d6b5b4dd
Mar 23 00:46:12 nuc erigon[585070]: [WARN] [03-23|00:46:11.982] [engine] GetHeaderByHash                 err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 23 00:46:12 nuc erigon[585070]: [WARN] [03-23|00:46:12.281] [engine] GetTd                           err="ethereumExecutionModule.GetHeader: could not open database: context canceled"
Mar 23 00:46:12 nuc erigon[585070]: [WARN] [03-23|00:46:12.281] [engine] CurrentHeader                   err="ethereumExecutionModule.CurrentHeader: could not open database: context canceled"

@dylancaponi
Copy link
Author

dylancaponi commented Mar 29, 2024

Upgraded to:

  • Erigon v2.59.3
  • Lighthouse v5.1.3-3058b96

Lighthouse hotfix had a note about forkchoice issues and I feel like I saw this in my logs so hopefully that will fix at least part of this issue.

🤞 🤞 🤞

@yperbasis yperbasis added the imp2 Medium importance label Apr 5, 2024
@dylancaponi
Copy link
Author

@yperbasis happy to see the label. Let me know if I can provide more logs or do any debugging. I've had about 50% attestations vs missed in the last month 😢

@dylancaponi
Copy link
Author

I'm still unable to stay in sync. Here are stage times from my most recent attempt:

Jun 04 22:34:51 nuc erigon[87450]: [INFO] [06-04|22:34:51.048] [1/12 Snapshots] DONE                    in=6m10.611197956s
Jun 04 22:43:36 nuc erigon[87450]: [INFO] [06-04|22:43:36.394] [3/12 Senders] DONE                      in=3m32.91207169s
Jun 05 02:53:39 nuc erigon[87450]: [INFO] [06-05|02:53:39.319] [4/12 Execution] DONE                    in=4h10m2.924591551s
Jun 05 07:25:06 nuc erigon[87450]: [INFO] [06-05|07:25:06.130] [5/12 HashState] DONE                    in=4h31m26.810970751s
Jun 05 08:12:15 nuc erigon[87450]: [INFO] [06-05|08:12:15.014] [6/12 IntermediateHashes] DONE           in=47m8.883974433s
Jun 05 09:28:53 nuc erigon[87450]: [INFO] [06-05|09:28:53.808] [7/12 CallTraces] DONE                   in=1h16m38.79209474s
Jun 05 10:23:44 nuc erigon[87450]: [INFO] [06-05|10:23:44.496] [8/12 AccountHistoryIndex] DONE          in=54m50.68689461s
Jun 05 11:23:57 nuc erigon[87450]: [INFO] [06-05|11:23:57.046] [9/12 StorageHistoryIndex] DONE          in=1h0m12.549762117s
Jun 05 11:59:11 nuc erigon[87450]: [INFO] [06-05|11:59:11.868] [10/12 LogIndex] DONE                    in=35m14.821925573s
Jun 05 12:26:11 nuc erigon[87450]: [INFO] [06-05|12:26:11.759] [11/12 TxLookup] DONE                    in=26m59.890847703s
Jun 05 12:28:13 nuc erigon[87450]: [INFO] [06-05|12:28:13.157] [9/12 StorageHistoryIndex] Prune done    in=1m59.998599987s
Jun 05 12:50:25 nuc erigon[87450]: [INFO] [06-05|12:50:25.244] [8/12 AccountHistoryIndex] Prune done    in=22m12.087567395s
Jun 05 13:17:39 nuc erigon[87450]: [INFO] [06-05|13:17:39.817] [4/12 Execution] DONE                    in=18m25.532472159s
Jun 05 13:46:02 nuc erigon[87450]: [INFO] [06-05|13:46:02.584] [5/12 HashState] DONE                    in=28m22.767034598s
Jun 05 14:04:09 nuc erigon[87450]: [INFO] [06-05|14:04:09.089] [6/12 IntermediateHashes] DONE           in=18m6.50480355s
Jun 05 14:36:26 nuc erigon[87450]: [INFO] [06-05|14:36:26.094] [7/12 CallTraces] DONE                   in=32m17.004903009s
Jun 05 14:54:17 nuc erigon[87450]: [INFO] [06-05|14:54:17.458] [8/12 AccountHistoryIndex] DONE          in=17m51.364516027s
Jun 05 15:49:33 nuc erigon[87450]: [INFO] [06-05|15:49:33.311] [9/12 StorageHistoryIndex] DONE          in=55m15.853033727s
Jun 05 16:29:01 nuc erigon[87450]: [INFO] [06-05|16:29:01.357] [10/12 LogIndex] DONE                    in=39m28.045902965s
Jun 05 16:34:33 nuc erigon[87450]: [INFO] [06-05|16:34:33.958] [11/12 TxLookup] DONE                    in=5m32.600785363s
Jun 05 16:35:09 nuc erigon[87450]: [INFO] [06-05|16:35:09.270] [9/12 StorageHistoryIndex] Prune done    in=34.612729467s
Jun 05 16:59:06 nuc erigon[87450]: [INFO] [06-05|16:59:06.241] [8/12 AccountHistoryIndex] Prune done    in=23m56.970341548s
Jun 05 17:27:44 nuc erigon[87450]: [INFO] [06-05|17:27:44.016] [4/12 Execution] DONE                    in=21m49.530837755s
Jun 05 17:47:36 nuc erigon[87450]: [INFO] [06-05|17:47:36.035] [5/12 HashState] DONE                    in=19m52.018822307s
Jun 05 19:40:44 nuc erigon[87450]: [INFO] [06-05|19:40:44.396] [6/12 IntermediateHashes] DONE           in=1h53m8.360629727s

@AskAlexSharov
Copy link
Collaborator

AskAlexSharov commented Jun 6, 2024

@dylancaponi show:

  • curl http://127.0.0.1:6060/debug/metrics/prometheus
  • du -hsc datadir/*
  • mdbx_stat -ef datadir/chaindata

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
imp2 Medium importance performance
Projects
None yet
Development

No branches or pull requests

4 participants