
geth 1.13.11 cannot sync with prysm 4.2.0 or above 4.2.0 #13557

Closed
tunlong opened this issue Jan 30, 2024 · 11 comments
Labels
Bug Something isn't working

Comments

@tunlong

tunlong commented Jan 30, 2024

Describe the bug

At first everything was normal, but the setup did not survive a reboot: once restarted and resyncing again, geth could not catch up. I tried the RC version of the beacon node, but that did not help either. Finally, I rolled back to version 4.1.1 and everything was fine again.

The beacon node logs show errors (see the Error section below).


geth gets stuck at age 12m in its logs (i.e. it stays roughly 12 minutes behind the head). Although geth appears stuck, the geth console command (eth.syncing) reports that synchronization is complete.

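For reference, the same check can be run over geth's JSON-RPC instead of the console. Below is a minimal Go sketch using go-ethereum's ethclient; the endpoint (the default http://127.0.0.1:8545) and the timeout are assumptions, so adjust them for your setup:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	// Assumption: geth's HTTP RPC is enabled at the default endpoint.
	client, err := ethclient.Dial("http://127.0.0.1:8545")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// SyncProgress wraps eth_syncing; a nil result means the node reports
	// itself as synced, which is the same answer eth.syncing gives in the console.
	progress, err := client.SyncProgress(ctx)
	if err != nil {
		log.Fatal(err)
	}
	if progress == nil {
		fmt.Println("node reports synced (eth.syncing == false)")
	} else {
		fmt.Printf("syncing: current=%d highest=%d\n", progress.CurrentBlock, progress.HighestBlock)
	}
}
```

A nil SyncProgress corresponds to eth.syncing returning false in the console, so the two checks should agree.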

Has this worked before in a previous version?

beacon 4.1.1 is working fine.

🔬 Minimal Reproduction

1. Stop the beacon node and geth.
2. Start syncing again.
3. geth gets stuck.

Error

[2024-01-30 20:05:03] ERROR blockchain: Could not process slots to get payload attribute error=could not process slots: context deadline exceeded

Platform(s)

Linux (x86)

What version of Prysm are you running? (Which release)

4.2.0

Anything else relevant (validator index / public key)?

No response

@tunlong tunlong added the Bug Something isn't working label Jan 30, 2024
@prestonvanloon
Member

This seems like a geth issue. In your geth logs, it is taking hundreds of milliseconds, sometimes more than a second, to process a single block. It should be much faster than that: typical processing is in the 40ms to 150ms range.

When I've seen this in the past, I deleted the geth db and resynced geth. It's possible that there was an improper shutdown and state healing is ongoing in the background? See ethereum/go-ethereum#28855 (comment)

@tunlong
Author

tunlong commented Jan 30, 2024


Thank you. But why did beacon 4.1.1 work fine? I only switched the beacon node back to 4.1.1 and did not change anything else. Before trying 4.1.1, restarting the machine and switching to the RC version did not help.

@keithchew
Contributor

I upgraded from v4.0.8 to v4.2.1 on a Goerli node for Dencun and everything went fine, so I performed the same upgrade on a mainnet node (running Erigon). I can confirm that after the upgrade the node cannot keep up with the head, and in Prysm I get the same error as above:

time="2024-02-03 10:11:28" level=error msg="Could not process slots to get payload attribute" error="could not process slots: context deadline exceeded" prefix=blockchain

I then downgraded to v4.1.1 and the node synced up to the head without any issues. I did notice that on v4.2.1 the CPU was constantly hitting 100%, while on v4.1.1 it was more under control (30-50%). I also noticed the active peer count was around 70 vs 45, which I believe comes from this PR:
#13278

but that probably has nothing to do with the issue.

@nisdas
Member

nisdas commented Feb 6, 2024

For those running into this issue, this PR should hopefully fix it. We will tag an RC soon, and if all is well it will make it into our next release.

@nisdas
Member

nisdas commented Feb 6, 2024

We have an RC here:
https://github.com/prysmaticlabs/prysm/releases/tag/v4.2.2-rc.0

If this goes well, it will be our next release. You can give it a try to see if it resolves your issue.

@keithchew
Contributor

Hi @nisdas

I am testing the RC, and it seems to have resolved the issue! It used to trickle payloads through one at a time, but now it pushes a whole batch through so the node can catch up. CPU usage is also back to normal, great work!

I have also tested this on the Goerli node and all is good there too. I did get the following in the logs on startup, but everything seems operational after that:

time="2024-02-06 20:44:56" level=error msg="Error encountered while warming up blob pruner cache." error="pruning failed for 1 root directories: blobs could not be pruned for some roots"

@prestonvanloon
Member

@keithchew Do you have any error level logs prior to that one? It should have printed at least one log immediately before that to explain why it was unable to prune a directory.

@keithchew
Contributor

@prestonvanloon you are right, sorry about that. Here are the two errors above it:

time="2024-02-06 20:44:21" level=error msg="Unable to prune directory" directory=0x9265c01e6d2fdf61df34b0b025b61a19a1040a78ff13748f634610d4342ac82d error="slot could not be read from blob file 3.ssz: EOF"
time="2024-02-06 20:44:25" level=error msg="Could not clean up dirty states" error="OriginBlockRoot: not found in db" prefix=state-gen
time="2024-02-06 20:44:56" level=error msg="Error encountered while warming up blob pruner cache." error="pruning failed for 1 root directories: blobs could not be pruned for some roots"

@prestonvanloon
Member

Thanks @keithchew. The "Unable to prune directory" issue is something we are debugging, and your log is very helpful. It shouldn't be a problem at runtime, and you can ignore it for now.

The workaround is to delete the directory $DATADIR/blobs/0x9265c01e6d2fdf61df34b0b025b61a19a1040a78ff13748f634610d4342ac82d
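A minimal Go sketch of that cleanup, assuming the DATADIR environment variable points at the Prysm data directory (the environment variable and path layout are assumptions; adjust for your setup):

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Assumption: DATADIR is set to your Prysm --datadir.
	datadir := os.Getenv("DATADIR")
	if datadir == "" {
		log.Fatal("DATADIR is not set")
	}
	// This is the root reported in the pruning error above.
	bad := filepath.Join(datadir, "blobs",
		"0x9265c01e6d2fdf61df34b0b025b61a19a1040a78ff13748f634610d4342ac82d")
	if err := os.RemoveAll(bad); err != nil {
		log.Fatalf("could not remove %s: %v", bad, err)
	}
	log.Printf("removed %s", bad)
}
```

It is probably safest to stop the beacon node before deleting the directory and restart it afterwards.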

@prestonvanloon
Member

prestonvanloon commented Feb 23, 2024

@keithchew Following up on this. We did find another bug where blobs were not being saved properly. The issue you mentioned in #13557 (comment) has been resolved in #13648.

Edit: #13648 stops the issue from happening again, but it does not clear bad blobs from disk. Either wipe the data and resync, or delete any zero-byte .ssz files from your blobs directory, to stop the log messages.
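A similar Go sketch for the second option, removing zero-byte .ssz files under the blobs directory (the blobsDir path is an assumption; point it at your own data directory):

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

func main() {
	// Assumption: adjust this to <your Prysm --datadir>/blobs.
	blobsDir := "/data/prysm/blobs"

	err := filepath.WalkDir(blobsDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		// Only look at .ssz blob files, skip directories.
		if d.IsDir() || filepath.Ext(path) != ".ssz" {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		// Zero-byte files are what the edit above suggests removing.
		if info.Size() == 0 {
			fmt.Println("removing zero-byte blob file:", path)
			return os.Remove(path)
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "cleanup failed:", err)
		os.Exit(1)
	}
}
```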
