
geth 1.13.11 cannot sync with prysm 4.2.0 or above 4.2.0 #13557

Closed
tunlong opened this issue Jan 30, 2024 · 11 comments
Labels
Bug Something isn't working

Comments

@tunlong

tunlong commented Jan 30, 2024

Describe the bug

At first everything was normal, but the setup did not survive a reboot: once restarted and resyncing again, geth could not catch up. I tried the RC version of the beacon node, but that did not help either. Finally, I rolled back to version 4.1.1 and everything was fine again.

The beacon node logs show errors (see the Error section below).


geth gets stuck at age 12m in its logs (i.e. it stays roughly 12 minutes behind the head). Although geth appears stuck, the geth console command (eth.syncing) reports that synchronization is complete.

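For reference, the same check can be run over geth's JSON-RPC instead of the console. Below is a minimal Go sketch using go-ethereum's ethclient; the endpoint (the default http://127.0.0.1:8545) and the timeout are assumptions, so adjust them for your setup:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	// Assumption: geth's HTTP RPC is enabled at the default endpoint.
	client, err := ethclient.Dial("http://127.0.0.1:8545")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// SyncProgress wraps eth_syncing; a nil result means the node reports
	// itself as synced, which is the same answer eth.syncing gives in the console.
	progress, err := client.SyncProgress(ctx)
	if err != nil {
		log.Fatal(err)
	}
	if progress == nil {
		fmt.Println("node reports synced (eth.syncing == false)")
	} else {
		fmt.Printf("syncing: current=%d highest=%d\n", progress.CurrentBlock, progress.HighestBlock)
	}
}
```

A nil SyncProgress corresponds to eth.syncing returning false in the console, so the two checks should agree.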

Has this worked before in a previous version?

beacon 4.1.1 is working fine.

🔬 Minimal Reproduction

1. Stop the beacon node and geth.
2. Start syncing again.
3. geth gets stuck.

Error

[2024-01-30 20:05:03] ERROR blockchain: Could not process slots to get payload attribute error=could not process slots: context deadline exceeded

Platform(s)

Linux (x86)

What version of Prysm are you running? (Which release)

4.2.0

Anything else relevant (validator index / public key)?

No response

@tunlong tunlong added the Bug Something isn't working label Jan 30, 2024
@prestonvanloon
Member

This seems like a geth issue. In your geth logs, it is taking hundreds of milliseconds, sometimes more than a second, to process a single block. It should be much faster than that: typical processing is in the 40ms to 150ms range.

When I've seen this in the past, I deleted the geth db and resynced geth. It's possible that there was an improper shutdown and state healing is ongoing in the background? See ethereum/go-ethereum#28855 (comment)

@tunlong
Author

tunlong commented Jan 30, 2024


Thank you. But why did beacon 4.1.1 work fine? I only switched the beacon node back to 4.1.1 and did not change anything else. Before trying 4.1.1, restarting the machine and switching to the RC version did not help.

@keithchew
Contributor

I upgraded from v4.0.8 to v4.2.1 on a Goerli node for Dencun and everything went fine, so I performed the same upgrade on a mainnet node (running Erigon). I can confirm that after the upgrade the node cannot keep up with the head, and in Prysm I get the same error as above:

time="2024-02-03 10:11:28" level=error msg="Could not process slots to get payload attribute" error="could not process slots: context deadline exceeded" prefix=blockchain

I then downgraded to v4.1.1 and the node synced up to the head without any issues. I did notice that on v4.2.1 the CPU was constantly hitting 100%, while on v4.1.1 it was more under control (30-50%). I also noticed the active peer count was around 70 vs 45, which I believe comes from this PR:
#13278

but that probably has nothing to do with the issue.

@nisdas
Member

nisdas commented Feb 6, 2024

For those running into this issue, this PR should hopefully fix it. We will tag an RC soon, and if all is well it will make it into our next release.

@nisdas
Member

nisdas commented Feb 6, 2024

We have an RC here:
https://github.com/prysmaticlabs/prysm/releases/tag/v4.2.2-rc.0

If this goes well, it will be our next release. You can give it a try to see if it resolves your issue.

@keithchew
Contributor

Hi @nisdas

I am testing the RC, and it seems to have resolved the issue! It used to trickle payloads through one at a time, but now it pushes a whole batch through so the node can catch up. CPU usage is also back to normal, great work!

I have also tested this on the Goerli node and all is good there too. I did get the following in the logs on startup, but everything seems operational after that:

time="2024-02-06 20:44:56" level=error msg="Error encountered while warming up blob pruner cache." error="pruning failed for 1 root directories: blobs could not be pruned for some roots"

@prestonvanloon
Member

@keithchew Do you have any error level logs prior to that one? It should have printed at least one log immediately before that to explain why it was unable to prune a directory.

@keithchew
Contributor

@prestonvanloon you are right, sorry about that. Here are the two errors above it:

time="2024-02-06 20:44:21" level=error msg="Unable to prune directory" directory=0x9265c01e6d2fdf61df34b0b025b61a19a1040a78ff13748f634610d4342ac82d error="slot could not be read from blob file 3.ssz: EOF"
time="2024-02-06 20:44:25" level=error msg="Could not clean up dirty states" error="OriginBlockRoot: not found in db" prefix=state-gen
time="2024-02-06 20:44:56" level=error msg="Error encountered while warming up blob pruner cache." error="pruning failed for 1 root directories: blobs could not be pruned for some roots"

@prestonvanloon
Member

Thanks @keithchew. The "Unable to prune directory" issue is something we are debugging, and your log is very helpful. It shouldn't be a problem at runtime, and you can ignore it for now.

The workaround is to delete the directory $DATADIR/blobs/0x9265c01e6d2fdf61df34b0b025b61a19a1040a78ff13748f634610d4342ac82d
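A minimal Go sketch of that cleanup, assuming the DATADIR environment variable points at the Prysm data directory (the environment variable and path layout are assumptions; adjust for your setup):

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Assumption: DATADIR is set to your Prysm --datadir.
	datadir := os.Getenv("DATADIR")
	if datadir == "" {
		log.Fatal("DATADIR is not set")
	}
	// This is the root reported in the pruning error above.
	bad := filepath.Join(datadir, "blobs",
		"0x9265c01e6d2fdf61df34b0b025b61a19a1040a78ff13748f634610d4342ac82d")
	if err := os.RemoveAll(bad); err != nil {
		log.Fatalf("could not remove %s: %v", bad, err)
	}
	log.Printf("removed %s", bad)
}
```

It is probably safest to stop the beacon node before deleting the directory and restart it afterwards.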

@prestonvanloon
Member

prestonvanloon commented Feb 23, 2024

@keithchew Following up on this. We did find another bug where blobs were not being saved properly. The issue you mentioned in #13557 (comment) has been resolved in #13648.

Edit: #13648 stops the issue from happening again, but it does not clear bad blobs from disk. Either wipe the data and resync, or delete any zero-byte .ssz files from your blobs directory, to stop the log messages.
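A similar Go sketch for the second option, removing zero-byte .ssz files under the blobs directory (the blobsDir path is an assumption; point it at your own data directory):

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

func main() {
	// Assumption: adjust this to <your Prysm --datadir>/blobs.
	blobsDir := "/data/prysm/blobs"

	err := filepath.WalkDir(blobsDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		// Only look at .ssz blob files, skip directories.
		if d.IsDir() || filepath.Ext(path) != ".ssz" {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		// Zero-byte files are what the edit above suggests removing.
		if info.Size() == 0 {
			fmt.Println("removing zero-byte blob file:", path)
			return os.Remove(path)
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "cleanup failed:", err)
		os.Exit(1)
	}
}
```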
