Unable to keep up with pending state heal #26687

Closed

kumavis opened this issue Feb 14, 2023 · 8 comments
kumavis (Member) commented Feb 14, 2023

System information

Geth version: instance=Geth/v1.10.26-stable-e5eb32ac/linux-amd64/go1.18.8
CL client & version: prysm:stable
OS & Version: Linux name 5.19.0-31-generic #32-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 20 15:20:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Expected behaviour

"State heal in progress" eventually completes

Actual behaviour

"State heal in progress" continues indefinitely

Steps to reproduce the behaviour

  • bringing the machine back online after a two-week downtime <-- important
  • running on a DigitalOcean Storage-Optimized droplet (NVMe)

Here is a graph of the pending field from the "State heal in progress" log line over time:

eth1              | INFO [02-14|11:40:13.492] State heal in progress                   accounts=1,609,774@75.38MiB slots=4,938,028@372.80MiB codes=13301@88.86MiB nodes=46,528,742@8.34GiB pending=38937

[screenshot: graph of the pending count over time]
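
For reference, the pending series plotted above can be pulled straight out of the log output; a minimal sketch, assuming the standard geth log format shown above and a compose service named eth1:

    # Extract the timestamp and pending count from "State heal in progress" lines.
    docker compose logs eth1 2>&1 \
      | grep 'State heal in progress' \
      | sed -E 's/.*\[([0-9.:|-]+)\].*pending=([0-9,]+).*/\1 \2/'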

Lots of memory is available for the system disk buffer: [screenshot]

Some disk statistics (iowait is ~50%): [screenshots]
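
(Not from the original issue: iowait and per-disk utilization figures like the ones in these screenshots can be reproduced on the host with standard tools from the sysstat package, e.g.:)

    # Extended per-device statistics every 5 seconds; during a state heal the
    # interesting columns are r/s, w/s, await and %util on the state database disk.
    iostat -x 5
    # Overall CPU iowait is the "wa" column here.
    vmstat 5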

Geth syncing dashboard snapshot: https://snapshots.raintank.io/dashboard/snapshot/j57U07jPZBxmA5wxR2bM7PkBcfNCKIpx
System dashboard snapshot: https://snapshots.raintank.io/dashboard/snapshot/duM8SNGtvhRkU3e9UO9jDsF4BT3J2n77

Here is a small section of logs: https://gist.github.com/kumavis/889eb03156fa7cc54935917b2539f10f

Let me know what additional data would help.

kumavis (Member, Author) commented Feb 14, 2023

Errata: the ancient DB is on network-attached storage, but this doesn't seem to cause any issues because ancient reads/writes are almost zero.

  eth1:
    image: ethereum/client-go:stable
    restart: always
    # give geth lots of time to restart gracefully
    stop_grace_period: 5m
    hostname: eth1
    container_name: eth1
    # https://geth.ethereum.org/docs/interface/command-line-options
    command:
      - --datadir=/primary/
      - --datadir.ancient=/secondary/
      - --maxpeers=100
      - --http
      - --http.addr=0.0.0.0
      - --http.port=8545
      - --http.vhosts=eth1
      - --http.api=eth,net,web3
      - --authrpc.jwtsecret=/secrets/jwtsecret
      - --authrpc.addr=0.0.0.0
      - --authrpc.vhosts=*
      - --metrics
      - --metrics.addr=0.0.0.0
      # - --verbosity=5
      # - --pprof
      # - --pprof.addr=0.0.0.0
      # network
      - ${ETH1_NETWORK_FLAG}
    healthcheck:
      test: [ "CMD-SHELL", "geth attach --exec eth.blockNumber" ]
      interval: 10s
      timeout: 5s
      retries: 5
    ports:
      # eth auth rpc
      - 8551
      # metrics for prometheus
      - 6060
      # public p2p
      - 30303:30303/tcp
      - 30303:30303/udp
    volumes:
      - /var/data/geth:/primary
      - /mnt/extra-data/geth/chaindata/ancient:/secondary
      - /var/secrets:/secrets/:ro
    networks:
      - default
      - outside
    environment:
      LETSENCRYPT_HOST: $ETH1_DOMAIN
      VIRTUAL_HOST: $ETH1_DOMAIN
      VIRTUAL_PORT: 8551
    <<: *logging
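
(Not part of the original compose file: a hedged example of checking sync progress by hand with the same geth attach mechanism the healthcheck uses, assuming the IPC socket lives under the /primary datadir:)

    # Attach to the running node over IPC and print the sync status;
    # this returns false once the node is fully synced.
    docker exec eth1 geth attach --exec 'eth.syncing' /primary/geth.ipc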

holiman (Contributor) commented Feb 14, 2023

Before the two-week downtime, I'm assuming the node had not finished syncing? Which means that, by aborting the sync while it was performing the state heal and then continuing two weeks later, all the snap data will have bitrotted.

And the impact is that you'll be forced to basically do a fast-sync over the snap protocol, and that's going to be painful.

karalabe (Member) commented:
Was the node synced before you turned the machine off?

If you were halfway through a sync (or all the way through really, but not yet fully synced) and stopped in between, all the old data will bitrot like crazy in 2 weeks. In that case, just resync from zero (keep the ancients to avoid redownloading the chain part).
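
(A hedged sketch, not from the original thread, of what "resync from zero, keep the ancients" could look like with the compose setup above; geth removedb prompts interactively for each database, so the ancient store on /secondary can be kept by answering "no" for it:)

    # Stop the node, then drop the state database but keep the ancient store.
    docker compose stop eth1
    docker run --rm -it \
      -v /var/data/geth:/primary \
      -v /mnt/extra-data/geth/chaindata/ancient:/secondary \
      ethereum/client-go:stable removedb --datadir /primary --datadir.ancient /secondary
    # Answer "y" when asked to remove the state database and "n" for the ancient
    # database, then restart; snap sync begins from scratch but reuses the ancients.
    docker compose start eth1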

kumavis (Member, Author) commented Feb 14, 2023

Yes, the node was synced before. It was likely in a boot loop due to insufficient disk space; there was an unclean shutdown two weeks ago at the start of the downtime.

Two comments from @Francesreid were marked as off-topic.

kumavis (Member, Author) commented Feb 15, 2023

The state heal eventually completed. I think my machine can only barely keep up with state heals, and the "random walk"-like behavior of the pending graph eventually managed to wander to zero and complete.

kumavis closed this as completed Feb 15, 2023
defeedme commented Apr 7, 2023

The state heal eventually completed. I think my machine can only barely keep up with state heals, and the "random walk"-like behavior of the pending graph eventually managed to wander to zero and complete.

Hi, I'm running Windows with Prysm and I'm in the same boat. I finally got past "State download in progress", and now it alternates between "Chain download in progress" and "State healing in progress", with the ETA jumping around in a range from 10 to 20 minutes. I have a fast 2 TB SSD, 20 GB of RAM, and an i5, and I've been validating since genesis with no issues at all; this is the longest I've been down after a power failure. I updated everything to the latest versions and cleared out chaindata to start from scratch. One thing I did notice is that the geth chaindata folder now has 121,000 files, even after starting from scratch; before I cleared it, it only had 55,000 files. I'm wondering if that many files is slowing down the SSD. Any help is appreciated. Thanks in advance.
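
(Not from the original comment: a quick way to reproduce that file count, assuming the default LevelDB backend whose data files end in .ldb; the path below is an example, and on Windows the default is %APPDATA%\Ethereum\geth\chaindata:)

    # Count LevelDB data files in the chaindata directory.
    find /var/data/geth/geth/chaindata -name '*.ldb' | wc -l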
