Unable to keep up with pending state heal #26687

Closed

kumavis opened this issue Feb 14, 2023 · 8 comments
kumavis (Member) commented Feb 14, 2023

System information

Geth version: instance=Geth/v1.10.26-stable-e5eb32ac/linux-amd64/go1.18.8
CL client & version: prysm:stable
OS & Version: Linux name 5.19.0-31-generic #32-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 20 15:20:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Expected behaviour

"State heal in progress" eventually completes

Actual behaviour

"State heal in progress" continues indefinitely

Steps to reproduce the behaviour

  • bringing the machine back online after a two-week downtime <-- important
  • running on a DigitalOcean Storage-Optimized droplet (NVMe)

Here is a graph of the pending field from the "State heal in progress" log line over time:

eth1              | INFO [02-14|11:40:13.492] State heal in progress                   accounts=1,609,774@75.38MiB slots=4,938,028@372.80MiB codes=13301@88.86MiB nodes=46,528,742@8.34GiB pending=38937

[screenshot: graph of the pending count over time]
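
For reference, the pending series plotted above can be pulled straight out of the log output; a minimal sketch, assuming the standard geth log format shown above and a compose service named eth1:

    # Extract the timestamp and pending count from "State heal in progress" lines.
    docker compose logs eth1 2>&1 \
      | grep 'State heal in progress' \
      | sed -E 's/.*\[([0-9.:|-]+)\].*pending=([0-9,]+).*/\1 \2/'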

Lots of memory is available for the system disk buffer: [screenshot]

Some disk statistics (iowait is ~50%): [screenshots]
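
(Not from the original issue: iowait and per-disk utilization figures like the ones in these screenshots can be reproduced on the host with standard tools from the sysstat package, e.g.:)

    # Extended per-device statistics every 5 seconds; during a state heal the
    # interesting columns are r/s, w/s, await and %util on the state database disk.
    iostat -x 5
    # Overall CPU iowait is the "wa" column here.
    vmstat 5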

Geth syncing dashboard snapshot: https://snapshots.raintank.io/dashboard/snapshot/j57U07jPZBxmA5wxR2bM7PkBcfNCKIpx
System dashboard snapshot: https://snapshots.raintank.io/dashboard/snapshot/duM8SNGtvhRkU3e9UO9jDsF4BT3J2n77

Here is a small section of logs: https://gist.github.com/kumavis/889eb03156fa7cc54935917b2539f10f

Let me know what additional data would help.

kumavis (Member, Author) commented Feb 14, 2023

Errata: the ancient DB is on network-attached storage, but this doesn't seem to cause any issues because ancient reads/writes are almost zero.

  eth1:
    image: ethereum/client-go:stable
    restart: always
    # give geth lots of time to restart gracefully
    stop_grace_period: 5m
    hostname: eth1
    container_name: eth1
    # https://geth.ethereum.org/docs/interface/command-line-options
    command:
      - --datadir=/primary/
      - --datadir.ancient=/secondary/
      - --maxpeers=100
      - --http
      - --http.addr=0.0.0.0
      - --http.port=8545
      - --http.vhosts=eth1
      - --http.api=eth,net,web3
      - --authrpc.jwtsecret=/secrets/jwtsecret
      - --authrpc.addr=0.0.0.0
      - --authrpc.vhosts=*
      - --metrics
      - --metrics.addr=0.0.0.0
      # - --verbosity=5
      # - --pprof
      # - --pprof.addr=0.0.0.0
      # network
      - ${ETH1_NETWORK_FLAG}
    healthcheck:
      test: [ "CMD-SHELL", "geth attach --exec eth.blockNumber" ]
      interval: 10s
      timeout: 5s
      retries: 5
    ports:
      # eth auth rpc
      - 8551
      # metrics for prometheus
      - 6060
      # public p2p
      - 30303:30303/tcp
      - 30303:30303/udp
    volumes:
      - /var/data/geth:/primary
      - /mnt/extra-data/geth/chaindata/ancient:/secondary
      - /var/secrets:/secrets/:ro
    networks:
      - default
      - outside
    environment:
      LETSENCRYPT_HOST: $ETH1_DOMAIN
      VIRTUAL_HOST: $ETH1_DOMAIN
      VIRTUAL_PORT: 8551
    <<: *logging
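
(Not part of the original compose file: a hedged example of checking sync progress by hand with the same geth attach mechanism the healthcheck uses, assuming the IPC socket lives under the /primary datadir:)

    # Attach to the running node over IPC and print the sync status;
    # this returns false once the node is fully synced.
    docker exec eth1 geth attach --exec 'eth.syncing' /primary/geth.ipc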

holiman (Contributor) commented Feb 14, 2023

Before the two-week downtime, I'm assuming the node had not finished syncing? Which means that, by aborting the sync while it was performing the state heal and then continuing two weeks later, all the snap data will have bitrotted.

And the impact is that you'll be forced to basically do a fast-sync over the snap protocol, and that's going to be painful.

karalabe (Member) commented:
Was the node synced before you turned the machine off?

If you were halfway through a sync (or all the way through really, but not yet fully synced) and stopped in between, all the old data will bitrot like crazy in 2 weeks. In that case, just resync from zero (keep the ancients to avoid redownloading the chain part).
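
(A hedged sketch, not from the original thread, of what "resync from zero, keep the ancients" could look like with the compose setup above; geth removedb prompts interactively for each database, so the ancient store on /secondary can be kept by answering "no" for it:)

    # Stop the node, then drop the state database but keep the ancient store.
    docker compose stop eth1
    docker run --rm -it \
      -v /var/data/geth:/primary \
      -v /mnt/extra-data/geth/chaindata/ancient:/secondary \
      ethereum/client-go:stable removedb --datadir /primary --datadir.ancient /secondary
    # Answer "y" when asked to remove the state database and "n" for the ancient
    # database, then restart; snap sync begins from scratch but reuses the ancients.
    docker compose start eth1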

kumavis (Member, Author) commented Feb 14, 2023

Yes, the node was synced before. It was likely in a boot loop due to insufficient disk space; there was an unclean shutdown two weeks ago at the start of the downtime.

Two comments from @Francesreid were marked as off-topic.

kumavis (Member, Author) commented Feb 15, 2023

The state heal eventually completed. I think my machine can only barely keep up with state heals, and the "random walk"-like behavior of the pending graph eventually managed to wander to zero and complete.

kumavis closed this as completed Feb 15, 2023
defeedme commented Apr 7, 2023

The state heal eventually completed. I think my machine can only barely keep up with state heals, and the "random walk"-like behavior of the pending graph eventually managed to wander to zero and complete.

Hi, I'm running Windows with Prysm and I'm in the same boat. I finally got past "State download in progress", and now it alternates between "Chain download in progress" and "State healing in progress", with the ETA jumping around in a range from 10 to 20 minutes. I have a fast 2 TB SSD, 20 GB of RAM, and an i5, and I've been validating since genesis with no issues at all; this is the longest I've been down after a power failure. I updated everything to the latest versions and cleared out chaindata to start from scratch. One thing I did notice is that the geth chaindata folder now has 121,000 files, even after starting from scratch; before I cleared it, it only had 55,000 files. I'm wondering if that many files is slowing down the SSD. Any help is appreciated. Thanks in advance.
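
(Not from the original comment: a quick way to reproduce that file count, assuming the default LevelDB backend whose data files end in .ldb; the path below is an example, and on Windows the default is %APPDATA%\Ethereum\geth\chaindata:)

    # Count LevelDB data files in the chaindata directory.
    find /var/data/geth/geth/chaindata -name '*.ldb' | wc -l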
