FYI - Sync mode SNAP does not work #23228
Comments
Compacting means geth is writing data faster than leveldb can handle, so it pushes back a bit. It's not a fault, it's the fact that snap sync delivers data a lot faster than 'fast'.
|
The nodes-count is increasing -- it's progressing.
|
Is the "Unexpected trienode heal packet" message normal? It is like 90% of my logs at this time. If it is normal, should we make it an "INFO" instead? |
No, that's not an expected error (not that many of them)
|
This is my second time doing a fresh sync, and I am getting many of these errors.
|
I checked again: 93% of my logs are "Unexpected trienode heal packet". If there is anything I can do to help gather more logs for understanding how to fix it, let me know. |
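One way to gather more detail is to raise geth's log verbosity for a while; a minimal sketch, where the data directory is a placeholder and 4 (debug) is just an example level above the default of 3 (info):

```sh
# Run with debug-level logging so the trienode heal traffic shows up with
# more context; --datadir is an assumed path, adjust to your own setup.
geth --datadir /data/ethereum --syncmode snap --verbosity 4
```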
The "State heal in progress" stage takes several hours depending on your hardware specs. How many CPU cores do you have? What is your network bandwidth? |
I have 4 CPUs, 16GB RAM, 2Gbit network on Azure VM https://azure.microsoft.com/en-us/blog/introducing-b-series-our-new-burstable-vm-size/ (Standard_B4ms) = $140/month |
AWS c6g.xlarge. That's 4 high-performance cores, 16GiB, and I added a 6000 IOPS, 1000MiB/s throughput EBS volume, up to 10 Gb network. One night of snapsync.
Same result as @mohamedmansour (also, hey, love watchtheburn ❤️ ! I happen to be building something very similar). If I'm reading the resource use correctly (everything way under capacity, chain data only 376GiB) while the node count keeps going up, then this node is now completely constrained by how quickly it can get the data it needs from other nodes? From the looks of it, I could go down to a machine half this size. Sounds like what @mohamedmansour and I are seeing is perhaps not entirely expected, but unrelated to the bug flagged in this issue. Correct? Unrelated to the issue, if it helps anyone: the machine is mostly CPU constrained during the sync (fascinating!), sits at 50% RAM use, and the IO I reserved was way overkill; it didn't cross 3000 IOPS / 180MiB/s. |
Heh, doing fast sync now: the machine above is averaging 80% CPU, 80% memory, IO 100-250MB/s, network 200Mb/s; it's basically on fire 🔥 . Hats off to the Geth team for the efficiency of snap sync 👌 . Let me know if I should open a new issue if fast sync ends up beating snap sync 😬 . |
I am doubling my cloud now to 20K IOPS and will see how state healing does when it comes to that stage. Bare metal VMs are perhaps going to be the solution. I wish Geth honestly printed logs saying the hardware is not good enough, if it can detect that. |
Snap sync done 👌 . 3 days it looks like. Surprisingly only 384G disk use. Seems to work fine for me with the above machine 🙌 . |
Snap sync took 1 day on the AWS c6g.xlarge machine, while it may run 3 days on the AWS a1.xlarge, whose CPU cores are only 20-30% slower. Four high-speed (not low-end) CPU cores are the bare minimum for the latest Geth version. |
What is the "target" number of accounts/slots/codes/nodes, currently? Interested to try to estimate my ETA ;-) |
FWIW: My last |
So is there some metric we can look at to understand (absolute) sync progress? Some target number it needs to hit? Which? Thank you! |
I believe snap sync requires at least 6 vCPUs and an SSD with high IOPS to work. |
This might be related:
|
Is the "accounts" number supposed to just keep increasing? I noticed in the logs that it reset itself to 0 several times in the last 6 days, I think whenenver the node was restarted. So-what so far:
This is what it shows during shutdown ... I'm not sure why it's not considered a clean shutdown; I send it gentle signals, just the default stuff that systemd sends.
Once it resumes, I see this:
and then it restarts here:
|
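On the clean-shutdown question above: systemd's default stop behaviour is SIGTERM followed by SIGKILL after roughly 90 seconds, and geth can need longer than that to flush its caches. A minimal sketch of a drop-in that raises the stop timeout; the unit name "geth.service" and the paths are assumptions:

```sh
# Give geth more time to shut down before systemd escalates to SIGKILL.
sudo mkdir -p /etc/systemd/system/geth.service.d
sudo tee /etc/systemd/system/geth.service.d/override.conf <<'EOF' >/dev/null
[Service]
# geth handles SIGTERM/SIGINT gracefully but may take minutes to flush state;
# the systemd default (~90s) can kill it mid-flush, which geth then reports
# as an unclean shutdown on the next start.
TimeoutStopSec=600
EOF
sudo systemctl daemon-reload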
I'm experiencing this same issue. Same setup as #23191 (comment) (Raspberry Pi 8GB system with SSD) running the latest Ethereum on ARM build. Geth version info:
Using the default snap sync and the log is full of "Unexpected trienode heal packet" entries.
|
I switched to fast sync (maintaining the already downloaded data), but it was still not done after 2-3 days. Thus, I rsync'd the whole geth folder to an aarch64 machine and completed the snap sync there, which took 1-2 days. Then I rsync'd everything back and it's running smoothly on the Pi now. Not ideal, but worked for me. |
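A sketch of that move-sync-move-back workflow; hostnames, paths, and the systemd unit name are assumptions, and geth must be stopped on both machines before copying:

```sh
# Stop geth on the Pi so the database is not copied mid-write.
sudo systemctl stop geth
# Copy the whole data directory to a faster aarch64 box.
rsync -aHv ~/.ethereum/ fastbox:/home/ubuntu/.ethereum/
# ...let the faster machine finish the snap sync, stop geth there, then:
rsync -aHv fastbox:/home/ubuntu/.ethereum/ ~/.ethereum/
sudo systemctl start geth
```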
How did you go back to fast sync without abandoning the data you had already downloaded? What's the command? Thanks! |
if my state healing looks like |
A Raspberry Pi with 4 cores is probably not capable of doing snap sync. Try to disable this feature and use regular sync. |
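For reference, "regular sync" here presumably means full sync; a minimal sketch with a placeholder data directory:

```sh
# Full sync executes every block instead of downloading state snapshots;
# it is much slower overall but avoids the snap-sync heal phase.
geth --datadir /data/ethereum --syncmode full
```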
If this helps someone: I had this issue, and in my case it was because I started geth from the config.toml file and the node was overriding the cache attribute back to its default. Starting it with --cache xxx --config /path/to/config solved it for me. |
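A sketch of that invocation; the cache value of 4096 MB is only an example and the config path is a placeholder:

```sh
# Passing --cache on the command line overrides the default that was being
# applied when only the TOML file was used.
geth --cache 4096 --config /path/to/config.toml
```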
Trying to do a snap sync on a recent Ryzen machine (not just a Pi, it should be performant enough) and it seems to be sitting at this for over a day now.
It seems to have imported receipts and headers to tip (or near to it, at least)
|
@holiman How high does the nodes-count need to go? I'm at ".. nodes=86,597,863" |
I'd also like to know this.. I'm (sadly) only on nodes=23,591,000. Wondering if I should let the state heal complete or switch to fast sync as it seems others have had success. Thank you for all the hard work you folks put in!! |
My Geth 1.10.16 node running snap sync mode also reports regular "State heal in progress" and a frequent stream of repeated "Unexpected trienode heal packet" log messages. It also reports regular "Snapshot extension registration failed" messages. I don't recall seeing these messages when I was running Geth 1.10.15. I switched Geth's sync mode from "fast" to "snap" starting with Geth 1.10.15 because "fast" sync mode was no longer available. |
After downgrading Geth from 1.10.16 to 1.10.15, Geth still reports frequent "Snapshot extension registration failed" and "State heal in progress" log messages, but reported no "Unexpected trienode heal packet" log messages until after I pressed Ctrl-c to shut down the node. |
I have been waiting for more than 1 day after a snap sync, on a beefy machine with a beefy internet connection. |
If this "heal" is gonna "fix" 175M Ethereum accounts (it is now on 103774), this will take more than a month to complete, what exactly is it going to go? |
Totally concur. If you are providing an app bound to hardware, you must benchmark the hardware before you start the app; otherwise your app won't work, and since you are the one responsible for providing the app, you are also (implicitly) guilty of letting the user run it on slow hardware. The Ethereum team has been giving us low-quality engineering since 2015. |
The state heal process is a little deceiving. I was at first worried that my Raspberry Pi 4B + 2TB SSD was having trouble syncing as during state heal the accounts, slots, and nodes were going up slowly relative to how many accounts/slots there are in total. My state heal finished in about 21 hours at 1,182,357 accounts, 2,327,247 slots, and 12,186,486 nodes. I also had many instances of |
That's good to know @leontodd . But the question remains: how do I get any indication of the progress/ETA of the state heal process? |
I would like to know this too. Output looks like this:
"pending" normally rises until it is somewhere between 110,000 and 130,000. Then a bunch of "Unexpected trienode heal packages" lines appear, and "pending" is restarted at 1. It is rare that this happens with "pending" being below 20k like above. I also get lines telling me that geth doesn't like nimbus. (It is running and fully synced. Connection to geth, according to nimbus output, works) What really puzzles me is the output And I am really anxious about the disk space used. This goes beyond anything I read online an eth1 node would need:
(But it has not been growing much in the last 2 weeks.) This is my current sync state, and while there is some progress, it has been minimal over the past 3 weeks.
I am really thinking about starting from scratch... |
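If you want to see where the space is going, a quick check, assuming the default data directory layout:

```sh
# Recent state and trie data lives in chaindata; historical blocks and
# receipts live in the "ancient" freezer underneath it.
du -sh ~/.ethereum/geth/chaindata
du -sh ~/.ethereum/geth/chaindata/ancient
```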
If it's been trying to heal for ... Now, while the state is healing, blocks also keep progressing, so the state keeps moving. Therefore, every ~192 blocks or so, we need to move the 'pivot point' along, to target a more recent state.
Whenever we do so, we need to start from the root again and go downwards while healing the trie. That is why "pending" restarts. But if the blockchain state changes more rapidly than you are able to heal, then you will never catch up. These are the things that factor in: your disk IOPS, your CPU speed, your network bandwidth, and the speed of the peers serving you data.
|
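Given the above, there is no fixed target node count to compare against, since the state keeps moving with the chain; the "State heal in progress" log fields (accounts/slots/codes/nodes/pending) are the main heal indicator, but you can at least watch the block-level position from the console. A hedged sketch, assuming the default IPC path:

```sh
# eth.syncing reports currentBlock/highestBlock while the node is syncing;
# during the heal phase the pivot stays close behind the chain head.
geth attach --exec 'eth.syncing' ~/.ethereum/geth.ipc
```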
Thanks for the explanation!
This all makes me think that there must be something wrong within my database which causes the heal to stall and never catch up. Best would be to throw it all away, I guess... 😒 It would be nice if |
I ran a full resync from scratch first, which worked. I tried this again, but this time only removed the local database and kept the ancient database. This time, when resyncing, the state heal went on for days without completing. I suspect that there are too many states that need healing when resyncing from the ancient database, especially on slower hardware. This fits with your endless state heal, @Yamakuzure, as you resynced by rewinding the chain. I don't know much about the technical aspects of the state heal process, but resyncing from scratch has definitely helped in my case. |
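A sketch of the two variants described above; the data directory is a placeholder, geth must be stopped first, and as far as I recall `geth removedb` prompts separately for the state/chain database and the ancient (freezer) database:

```sh
# Answering yes to both prompts is the full "from scratch" resync that worked
# here; keeping the ancient store is the variant that led to the long heal.
geth --datadir /data/ethereum removedb
# Then start syncing again.
geth --datadir /data/ethereum --syncmode snap
```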
Possible fix: #25651 . |
Oh, sorry, I broke off the (super-slow) rewinding, as a target of 0 meant an
No. |
If your healing is taking forever, this is how I solved it after weeks of trying:
|
This issue seems to be quite stale now. Snap sync has improved significantly since this issue was created. Compaction during sync is still an issue, but not as bad as it was, and it will be even less of a problem once we move to pebble. I am going to close this now. Thank you for reporting! |
I run several geth nodes, but I've always used sync mode fast. I am building a new machine, so I figured I would try snap (the new default). This is running on an HP DL380 G5 with 24 GB of RAM and a 1TB SSD.
Using SNAP sync, geth will go into these modes of compacting the database and it completely halts the sync process. It does this for over an hour sometimes. But, I let it continue. Now, this morning my sync appears to be stuck in "state heal in progress". It looks like it's been at this for hours. I give up. I am going back to Fast.