Clean sync Goerli broken(?) post merge on 1.10.21 and 1.10.23 #25693

ulope · 2022-09-06T09:37:58Z

System information

Geth version: v1.10.21 / v1.10.23
OS & Version: Linux

Expected behaviour

A fresh sync from scratch of Goerli to work

Actual behaviour

Does not work.

Geth 1.10.21:
- Snap sync completes, afterwards State heal seems to continue indefinitely (aborted after 4 days)
- Last state heal log before abort:
  INFO [09-05|14:37:35.138] State heal in progress accounts=9@647.00B slots=0@0.00B codes=0@0.00B nodes=16,319,436@8.09GiB pending=4369
- Many, many Unexpected trienode heal packet messages (~70% of all log lines)
- Pivot is only changed 12 times. Last one:
  WARN [09-02|16:42:58.589] Pivot seemingly stale, moving old=7,516,913 new=7,516,977
- Suspiciously, new blocks are no longer imported:
  INFO [09-05|14:37:34.872] Imported new block headers count=0 elapsed=15.422ms number=7,382,818 hash=aa32c4..48c7cc age=3w4d12h ignored=178
  - Probably because of:
    Local chain is post-merge, waiting for beacon client sync switch-over...
  - However, Prysm claims geth is not synced:
    level=error msg="Unable to process past deposit contract logs, perhaps your execution client is not fully synced" error="no contract code at given address" prefix=powchain
Geth 1.10.23:
- Syncing never even starts
  No sync progress is reported, apparently every peer is dropped because of:
  WARN [09-06|09:25:04.945] Snapshot extension registration failed peer=6de3885d err="peer connected on snap without compatible eth support"

Steps to reproduce the behaviour

Compose file used:

version: "3"

services:
  geth:
    image: ethereum/client-go:v1.10.23
    restart: always
    network_mode: host
    stop_grace_period: 1m
    volumes:
      - /data/geth:/data
      - jwt:/jwt
    command: >
      --goerli
      --datadir=/data
      --http
      --http.api eth,net,web3,txpool
      --http.addr 0.0.0.0
      --http.corsdomain '*'
      --ws
      --ws.api eth,web3,net
      --authrpc.jwtsecret /jwt/jwtsecret
      --authrpc.vhosts '*'
      --metrics
      --metrics.addr=0.0.0.0
      --metrics.port=9191

  prysm:
    image: gcr.io/prysmaticlabs/prysm/beacon-chain:stable
    restart: always
    network_mode: host
    volumes:
      - /data/prysm:/data
      - jwt:/jwt
    command: >
      --goerli
      --datadir=/data
      --rpc-host=0.0.0.0
      --http-web3provider=http://localhost:8551
      --jwt-secret=/jwt/jwtsecret
      --accept-terms-of-use

volumes:
  jwt:

fjl · 2022-09-06T12:24:50Z

Syncing never even starts
No sync progress is reported

Can you provide more information that leads you to this conclusion? It can take a while to start sometimes, just leave it running for a bit.

rjl493456442 · 2022-09-06T12:27:43Z

We have a couple of PRs for fixing/improving snap sync on master recently. Maybe you can try to use master once we merge this PR #25694

ulope · 2022-09-06T12:49:27Z

Syncing never even starts
No sync progress is reported

Can you provide more information that leads you to this conclusion? It can take a while to start sometimes, just leave it running for a bit.

Are 20 hours enough? ;)

These are the top non-unique log messages by count:

1776 Snapshot extension registration failed
 236 Beacon client online, but never received consensus updates. Please ensure your beacon client is operational to follow the chain!
  73 Dropping unsynced node during sync
  28 Looking for peers
  20 Regenerated local transaction journal
  19 Writing clean trie cache to disk
  19 Persisted the clean trie cache
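
A tally like the one above could be produced with a short script. This is only a sketch: it assumes geth's console log layout as it appears in this thread (level, bracketed timestamp, message, then `key=value` fields), and the names `message_of` and `top_messages` are hypothetical helpers, not part of geth.

```python
import re
from collections import Counter

# Assumed layout: LEVEL [timestamp] message key=value key=value ...
LINE_RE = re.compile(r"^(?P<level>[A-Z]+)\s*\[(?P<ts>[^\]]+)\]\s+(?P<rest>.*)$")
# The first " key=" token marks the end of the human-readable message.
KV_RE = re.compile(r"\s+[A-Za-z_][\w.]*=")


def message_of(line):
    """Return the message part of a log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return KV_RE.split(m.group("rest"), maxsplit=1)[0].strip()


def top_messages(lines, n=10):
    """Count identical messages and return the n most frequent ones."""
    counts = Counter(msg for msg in map(message_of, lines) if msg)
    return counts.most_common(n)
```

Calling `top_messages(open("geth.log"))` (the file name is an assumption) would return `(message, count)` pairs, most frequent first.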

I also re-tried this locally before submitting the issue (the above setup is running on a droplet).

1.10.21 starts syncing within minutes.
1.10.23 has been running for 15+ minutes and shows the same behaviour as on the droplet (i.e. no sync and lots of Snapshot extension registration failed)

ulope · 2022-09-07T07:57:09Z

Update: It's now been almost 40h and still no sync. Going to stop it now and try with latest master.

holiman · 2022-09-07T08:17:44Z

I'm wondering if your node actually manages to find any peers -- is your firewall sufficiently open, so bidirectional communication can occur over the relevant ports ?

ulope · 2022-09-07T08:37:41Z

Next update:
Using commit 5ddedd2 (which includes #25694, @rjl493456442) I'm seeing the same behaviour - No sync, Snapshot extension registration failed.
Left it running for 30min.

@holiman As I wrote in the initial report 1.10.21 was able to sync (but then got stuck in the heal stage).
Just to double-check, I switched back to the 1.10.21 docker image and sync started within ~30 seconds (but I'll just assume it will run into the healing issue again, as it did in the last two attempts).
Edit: Also it definitely finds peers. Here's the complete log of the run: https://gist.github.com/ulope/776e87c5bb1e3fcd893d1a512a2f6f48

Also as I wrote yesterday, I can replicate the non-syncing behaviour locally by just running geth --goerli (>=1.10.23) with an empty datadir

holiman · 2022-09-07T09:10:24Z

@ulope that log says:

Beacon client online, but never received consensus updates. Please ensure your beacon client is operational to follow the chain!

Goerli is post-merge, it needs the beacon client to tell it what the head is, and then it will sync to that.

ulope · 2022-09-07T09:21:22Z

@holiman Unless I'm very much mistaken (and the release notes are wrong) 1.10.21 is just as aware of the goerli merge as later versions. However it does start snap syncing immediately as mentioned.

Also with both versions Prysm seems to wait for geth to be synced to some degree because it continually logs:

level=error msg="Unable to process past deposit contract logs, perhaps your execution client is not fully synced" error="no contract code at given address" prefix=powchain

(I also tried other beacon clients before, but didn't record their output unfortunately. Can check again if that's helpful.)

So either I'm missing something else or this looks like a chicken-and-egg problem.

holiman · 2022-09-07T09:29:28Z

They are both aware, but not quite "as aware" :) The more recent version will spit out something like

INFO [09-07|11:25:00.972] Merge configured: 
INFO [09-07|11:25:00.972]  - Hard-fork specification:    https://github.com/ethereum/execution-specs/blob/master/network-upgrades/mainnet-upgrades/paris.md 
INFO [09-07|11:25:00.972]  - Network known to be merged: true

The difference being that this flag is set for goerli:

	// TerminalTotalDifficultyPassed is a flag specifying that the network already
	// passed the terminal total difficulty. Its purpose is to disable legacy sync
	// even without having seen the TTD locally (safer long term).
	TerminalTotalDifficultyPassed bool `json:"terminalTotalDifficultyPassed,omitempty"`

See #24538 for more info:

The rationale is that once a network transitions into PoS mode, sync is directed by the beacon client. If a new Geth instance is started many years down the line however, it will not know of the transition event, so will still attempt to do a PoW based legacy sync. The legacy sync will need to "fail" when TTD is reached and sync swapped from legacy algo to beacon algo.
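
The effect of that flag on a fresh node can be illustrated with a toy model. This is a simplified sketch, not geth's actual code: `ChainConfig` and `initial_sync_mode` are hypothetical names, and the only inputs assumed are the flag, the locally known total difficulty, and the TTD (10790000 for Goerli, as shown in the Prysm log in this thread).

```python
from dataclasses import dataclass


@dataclass
class ChainConfig:
    # Mirrors the idea of TerminalTotalDifficultyPassed; hypothetical model only.
    terminal_total_difficulty_passed: bool = False


def initial_sync_mode(cfg, local_td, ttd):
    """Pick the sync algorithm a freshly started node would begin with.

    "beacon" means: wait for the consensus client to announce a head.
    "legacy" means: start a PoW-style sync from peers and switch over at TTD.
    """
    if cfg.terminal_total_difficulty_passed or local_td >= ttd:
        return "beacon"
    return "legacy"
```

Under this model, an empty datadir (`local_td=0`) without the flag starts in `"legacy"` mode, which matches the observation that 1.10.21 begins syncing on its own, while the same datadir with the flag set starts in `"beacon"` mode and waits for the consensus client.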

Snehapati11 · 2022-09-07T09:54:56Z

We are closely following this issue as we are also receiving the same message “Beacon client online, but never received consensus updates. Please ensure your beacon client is operational to follow the chain!”. Our current setup is GETH 1.10.23 and prysm alpine image. We are connecting to network goerli prater.

fjl · 2022-09-07T13:09:39Z

Maybe something is broken in Prysm <-> Geth in TTD-passed mode? AFAIK geth needs a signal from the CL to start syncing. Could be that this signal doesn't come?

fjl · 2022-09-07T13:56:57Z

What should happen is: geth should print something like

INFO [09-07|15:54:25.313] Forkchoice requested sync to new head    number=7,547,668 hash=2218c5..1c5aa2

If it doesn't print that, but a CL is attached, it means the CL is not sending FcU requests, probably because it is waiting for geth to sync.
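
That check can be automated against a captured log. The sketch below assumes only the two message strings quoted in this thread; `diagnose` and its return labels are hypothetical.

```python
# Message strings as they appear in the geth logs quoted in this thread.
FCU_MARKER = "Forkchoice requested sync to new head"
NO_CL_MARKER = "Beacon client online, but never received consensus updates"


def diagnose(lines):
    """Classify a geth log by whether the CL has asked geth to sync.

    Returns "cl-driving-sync" if a forkchoice-triggered sync was logged,
    "waiting-for-cl" if only the missing-updates warning was seen,
    and "unknown" if neither message appears.
    """
    saw_warning = False
    for line in lines:
        if FCU_MARKER in line:
            return "cl-driving-sync"
        if NO_CL_MARKER in line:
            saw_warning = True
    return "waiting-for-cl" if saw_warning else "unknown"
```

A result of `"waiting-for-cl"` suggests looking at the consensus client (for example, whether it is syncing optimistically) rather than at geth.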

fjl · 2022-09-07T14:04:15Z

Can confirm that geth from master branch (at commit d30e39b) did start syncing after a couple seconds. I ran geth like this:

./build/bin/geth --goerli --http

And lighthouse like this:

lighthouse beacon_node --network goerli --execution-endpoint http://127.0.0.1:8551 --execution-jwt ~/.ethereum/goerli/geth/jwtsecret --checkpoint-sync-url http://...

ulope · 2022-09-07T15:37:02Z

@Snehapati11 It seems that Prysm doesn't even start syncing the beacon chain in our setup. Is that the case for you too?

I've tried now with both lighthouse and nimbus. They both at least start syncing the beacon chain (very very slowly though, current ETA 6d+)

@fjl Hm I assume that start syncing signal will only come once the beacon chain is synced. So this might after all be not a geth problem.

I'll investigate further.

ulope · 2022-09-07T15:38:23Z

@fjl Was your lighthouse node already synced?

fjl · 2022-09-07T17:17:35Z

I used checkpoint sync and it's a bit faster. See this guide for more info: https://lighthouse-book.sigmaprime.io/checkpoint-sync.html#use-infura-as-a-remote-beacon-node-provider

ulope · 2022-09-07T20:18:53Z

@fjl But at the point where you started both clients the beacon chain wasn’t finished syncing yet?

I’ll try replicating that tomorrow.

fjl · 2022-09-07T20:45:35Z

I start both clients in quick succession, and they both go into sync kind of quickly. This is with a completely blank DB.

Pretty sure this is an issue with prysm. Maybe it doesn't enable optimistic sync by default?

begetan · 2022-09-07T22:27:14Z

I confirm that Geth fresh sync is broken for Goerli on Geth v1.10.23 with Prysm v3.1.0
I feel like it was working with Prysm v3.0.0

We run the same setup as for Ropsten and Mainnet and they are fine! This issue is repeatable for different nodes!

Here are the Geth logs:
WARN [09-07|22:19:10.142] Beacon client online, but never received consensus updates. Please ensure your beacon client is operational to follow the chain!

Here are the Prysm logs:
time="2022-09-07 22:21:47" level=info msg="Processing block batch of size 63 starting from 0xd0c9baf6... 85760/3840108 - estimated time remaining 166h51m35s" blocksPerSecond=6.2 peers=200 prefix=initial-sync
time="2022-09-07 22:21:57" level=info msg="Ready for The Merge" latestDifficulty=1 network=prater prefix=powchain terminalDifficulty=10790000

Geth sync status:
curl -s -X POST -H "Content-Type:application/json" --data '{"jsonrpc":"2.0","method":"eth_syncing","id":1}' localhost:8545
{"jsonrpc":"2.0","id":1,"result":false}
There is no data in the Geth data directory.
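
Note that `"result":false` from `eth_syncing` is ambiguous on its own: it is returned both when a node is fully synced and when no sync is running at all, which is why the empty data directory matters here. A minimal sketch of the same check in Python follows; the endpoint URL and the function names are assumptions, while the `currentBlock` and `highestBlock` fields are the standard hex-encoded fields of an in-progress `eth_syncing` response.

```python
import json
import urllib.request


def eth_syncing_request(url="http://localhost:8545"):
    """Build the JSON-RPC POST request equivalent to the curl call above."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "method": "eth_syncing", "params": [], "id": 1}
    ).encode()
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )


def interpret(response_body):
    """Turn an eth_syncing response body into a human-readable status."""
    result = json.loads(response_body)["result"]
    if result is False:
        # Ambiguous: fully synced, or sync has never started.
        return "not syncing (either fully synced or sync never started)"
    current = int(result["currentBlock"], 16)
    highest = int(result["highestBlock"], 16)
    return f"syncing: block {current} of {highest}"


def query(url="http://localhost:8545"):
    """Send the request to a running node and interpret the answer."""
    with urllib.request.urlopen(eth_syncing_request(url), timeout=10) as resp:
        return interpret(resp.read())
```

To tell the two `false` cases apart, the status can be combined with `eth_blockNumber` or with a look at the data directory, as done above.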

We also know that sync status is broken for the latest Geth, because a bunch of our monitoring tools have issues

begetan · 2022-09-07T22:48:19Z

I've tried different versions of Prysm and it didn't help.

geth version 1.10.21 has started syncing immediately

begetan · 2022-09-07T23:02:28Z

@ulope I just want to say that infinite "State heal in progress" may be due to hardware issue. If you run on cloud, try to spin up a new machine. If you go with bare metal, you need probably better hardware. This is unrelated to the broken sync issue, but it's quite common. I am seeing it in 10-20% launches on low spec machines.

Snehapati11 · 2022-09-08T03:55:46Z

@ulope The beacon node sync is still in progress. Are you facing any issues with the geth Goerli syncing process? It shows that the beacon client is online but not passing the consensus updates.

Snehapati11 · 2022-09-08T09:27:28Z

@ulope We have just successfully tested Goerli version 1.10.23 with prysm version 3.0.0 and we are no longer seeing the issue that " Beacon client online, but never received consensus updates. Please ensure your beacon client is operational to follow the chain!".

It would appear prysm version 3.1.0 has an issue as you suggested.

begetan · 2022-09-09T12:36:07Z

Erigon + Prysm v3.1.0 successfully synced from scratch.

0xDualCube · 2022-09-12T21:02:14Z

checkpoint

the checkpoint sync was key for me to get lighthouse to poke geth and get it to start syncing

https://notes.ethereum.org/@launchpad/checkpoint-sync#EF-DevOps-Endpoints

MariusVanDerWijden · 2022-09-13T09:49:51Z

Looks like this issue is resolved, will close. Feel free to open a new issue if geth sync is broken for you

begetan · 2022-09-13T10:30:13Z

Why did you close a critical issue without a fix?

It should either be fixed, or it should be officially announced that old fresh synch method is deprecated.

MariusVanDerWijden · 2022-09-13T10:46:49Z

We've successfully synced multiple nodes on goerli and never ran into this issue.
What do you mean by "old fresh synch method is deprecated." ?

begetan · 2022-09-13T11:03:49Z

I reproduced this issue today on Goerli and on Ropsten as well. It does not happen every time: on Ropsten it occurred in 2 of 4 tries, and on Goerli in 3 of 4 tries.

This issue will probably appear on Mainnet after The Merge, because the sync conditions change for all post-merge networks.

fjl · 2022-09-13T12:36:34Z

The problem here is with the CL clients (Prysm, Lighthouse, etc.). The CL client needs to start syncing the beacon chain optimistically and start delivering ForkchoiceUpdated requests to geth, otherwise geth will not start syncing.

fjl · 2022-09-13T12:40:44Z

I have brought this up in chat with CL devs, let's see how they respond.

begetan · 2022-09-13T12:40:59Z

Replacing geth with version v1.10.21 always solves the problem.
Unfortunately, the logs provided in the first message are not quite relevant to this issue.

It would be better to open a new issue with more relevant details

fjl · 2022-09-13T12:49:34Z

geth v1.10.21 'works' because it always starts the legacy non-PoS sync. It's not a good fix long term.

begetan · 2022-09-13T13:36:03Z

@fjl this is a more relevant description of the fresh sync issue: #25753

ulope · 2022-09-13T15:09:23Z

Sorry for the late reply (with the merge looming time is a bit scarce). So with a checkpoint synced lighthouse and geth 1.10.22+ I was able to successfully sync.

So I'd say at least for me this was definitely (in part) user error.

However, having said that I do find that this is a very drastic change in behaviour esp. for a patch release. Syncing has always started on its own in the 7+ years history.

IMO this should have been geth 2.0.

fjl · 2022-09-15T11:28:35Z

After the merge, Geth requires input from the consensus layer to find the correct chain. There is no way for it to know the sync target without the CL. This is a protocol limitation, and it's why we changed it in the release after the merge on Goerli.

We are working on alternatives to the engine API connection, so Geth may potentially be able to sync on its own again in the future.

daedlock · 2022-09-18T19:06:03Z

I want to run EL without CL! So, I can confirm geth-v1.10.21 fixes the sync stalling. Looking forward to a long term fix in HEAD

thomaseth2 · 2022-10-12T12:23:59Z

With geth v1.10.25 I still see a similar issue: the beacon node takes ages to sync and doesn't show correct time estimates.

I keep seeing this from Prysm:
INFO powchain: Ready for The Merge latestDifficulty=17179869184 network=mainnet terminalDifficulty=58750000000000000000000

And this from geth:
Snapshot extension registration failed peer=a05e766c err="peer connected on snap without compatible eth support"

Does every new sync, even with a light node, start from the beginning by default?

I read that the fix seems to be the checkpoint sync.

As someone who was able to spin up a node quickly before, this seems like a downgrade in usability compared to past geth versions.

thomaseth2 · 2022-10-12T12:34:57Z

So for Prysm, is there really a checkpoint sync option other than local or testnet nodes or downloading a file? https://notes.ethereum.org/@launchpad/checkpoint-sync

thomaseth2 · 2022-10-13T11:03:12Z

this checkpoint is life changer: --checkpoint-sync-url=https://beaconstate.ethstaker.cc --genesis-beacon-api-url=https://beaconstate.ethstaker.cc

alperensozer · 2022-11-16T12:58:55Z

Geth/v1.10.26 + Prysm 3.1.2: still the same problem on goerli-prater from scratch. After 2 days, still no syncing.

But with

--checkpoint-sync-url=https://goerli.checkpoint-sync.ethpandaops.io
--genesis-beacon-api-url=https://goerli.checkpoint-sync.ethpandaops.io

it started syncing immediately. Thanks @thomaseth2 for https://notes.ethereum.org/@launchpad/checkpoint-sync.

icemagno · 2023-08-17T03:15:18Z

Any progress?
I can't get my private blockchain to sync. Same error.

My execution client can connect to another execution client, but it does not sync.

The beacon node keeps logging level=error msg="Unable to process past deposit contract logs, perhaps your execution client is not fully synced" error="no contract code at given address" prefix=powchain forever, and it receives a "goodbye" from the other consensus peer when it tries to connect.

ulope added the type:bug label Sep 6, 2022

ligi added the status:triage label Sep 6, 2022

ligi added the need:more-information label Sep 6, 2022

ligi removed the status:triage label Sep 6, 2022

MariusVanDerWijden closed this as completed Sep 13, 2022

fjl reopened this Sep 13, 2022

fjl closed this as completed Sep 15, 2022
