Low attestation success rate #6807
Comments
Not enough logs. Do:
Peer count is enough; I have more than 100 incoming + outgoing connections on port 30303. Where should I look to start troubleshooting why Erigon is so insufferably worse than Geth?
I have to restart it 3 or 4 times a day so that it can progress; otherwise it keeps doing nothing and missing attestations. On testnet it runs fine, though.
Start by showing logs: #6807 (comment)
There is not much more than what I printed in #6807 (comment), I'm afraid. It seems fine, then it stops processing everything until restarted. Here's a log from today; it spent about 6 hours doing nothing before I had to restart it:
It keeps printing a mix of [p2p] and [txpool] messages over and over.
Restarted, and after 3 hours it's back to doing nothing again. The machine is largely idle.
“It keeps printing a mix of [p2p] and [txpool] over and over” means it isn’t doing anything. So we need to find where it is stuck… maybe you can run “pkill -SIGUSR1 erigon” if it gets stuck again.
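For anyone unfamiliar with that trick: sending SIGUSR1 asks the process to dump its goroutine stacks so you can see where it is blocked. Below is a minimal sketch of how such a handler is typically wired up in a Go program; it is not Erigon's actual handler, just an illustration of the mechanism and the kind of output to expect.

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
	"time"
)

func main() {
	// Dump all goroutine stacks to stderr whenever SIGUSR1 is received.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)
	go func() {
		for range sigs {
			// debug=2 prints one full stack trace per goroutine, which is
			// what you need to see where a stuck process is blocked.
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()

	// Stand-in for the long-running node; run it and try `pkill -SIGUSR1` on it.
	time.Sleep(time.Hour)
}
```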
It's also possible that pruning is taking too long… try increasing the log verbosity.
Thanks for the tips, will try them when it stops again.
Also, I noticed a pattern: Erigon slowly keeps eating memory, and when it piles up near 9 GB of RAM (anonymous, non-shared memory, not reusable by the OS) it starts to fall behind and, in extreme cases, stops working entirely until restarted.
I have 32 GB of RAM, but I run other things on the server. The server load average hovers around 1.5 to 2 on a quad-CPU machine, so CPU is definitely not a bottleneck. On a 32 GB server, Erigon taking a 9 GB resident anonymous memory slice during normal operation is definitely a no-go. The reason I came to Erigon is that Geth was using 4 GB and I wanted to reduce that.
To understand where RPCDaemon is wasting RAM, you can do:
Also post some logs from when you see high RAM usage, to understand what Erigon is doing at that time.
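As a point of reference (not necessarily the exact command suggested above), a Go process can write out its own heap profile with the standard library; the file path below is just a placeholder, and the resulting file can be inspected with `go tool pprof`:

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// dumpHeapProfile writes the current heap allocation profile to path.
func dumpHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	runtime.GC() // refresh the allocation statistics before writing
	return pprof.WriteHeapProfile(f)
}

func main() {
	// Placeholder path; inspect afterwards with: go tool pprof /tmp/rpcdaemon-heap.pprof
	if err := dumpHeapProfile("/tmp/rpcdaemon-heap.pprof"); err != nil {
		log.Fatal(err)
	}
}
```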
That is perfectly fine if it needs this amount of memory, but then it should be documented; the current documentation suggests a much lower requirement.
Erigon does have 16 GB available. Of the 32 GB installed, I need about 10 GB for other things, which leaves roughly 20 GB for Erigon. It used to run fine, but as I mentioned at the top, it started struggling a few months ago. By the way, the top output I sent above was customised to include the RSAN (resident anonymous) column: memory that is private to the process, not shared, and not managed by the OS as page cache. Are you saying this number is wrong?
Here's the output. This is for testnet; it has been stuck for a couple of hours, not printing anything in the logs (apart from p2p and txpool stats) and not progressing either:
This doesn't make any sense lol
Your theory doesn't make any sense whatsoever. How on earth would it be able to run Erigon's non-single-threaded phases and still have spare memory and CPU, yet at the same time be so miserably slow that it fails to process phase 7 in time, all while handling everything else I run on the machine without breaking a sweat lol
Definitely not, please stop spamming the thread.
The Go profile was already posted above: #6807 (comment)
Doesn't matter what I am, but you are definitely a wannabe poster of random things in GitHub issues to brag about "being an open source contributor" lol
Bro, just leave the thread; your nonsensical comments are making it hard for developers to find the information that matters.
You are right. I will hide the other comments and sum up here. In the meantime, you should try to understand why those cycles are so long in the first place and bring them down if you can. Commit cycles are about validating new blocks and updating the state accordingly (replaying the transactions). That is 1) storage-bound, because there is a lot of random read/write on the state, and 2) CPU-bound, for signature verification (and maybe the hashes?).

Storage
Notice how there is nothing going on, and then suddenly, on a commit cycle, the node fires ~9,000 IOs (629 reads + 8,602 writes).
iostat reports 9,000 IOPS because it counted ~9,000 IOs in the 1-second timeframe of the report, but in fact those IOs happened within the less-than-100 ms of my commit cycle, so it is closer to 100,000 IOPS in my case. Also notice there is barely any read IO, mainly writes: that's because the state is well cached in memory (cached memory), so very few bytes need to be pulled from storage.

RAM

CPU

Regarding the fact that it was fine before and started to misbehave suddenly: I understand the urge to blame software changes, but there are still hundreds of Erigon nodes performing as expected, so... Ethereum load does change over time, with more and more complex smart-contract transactions that perform a lot of state mutation. So if your hardware was just enough in the past, it may well not be enough anymore. I hope this makes sense for you; I tried my best to be clear and precise in the explanation.
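To make the IOPS arithmetic above explicit (IO counts taken from the iostat sample quoted earlier; the ~100 ms burst duration is the commenter's own estimate):

```go
package main

import "fmt"

func main() {
	ios := 629 + 8602 // reads + writes counted by iostat in its 1 s interval
	burst := 0.1      // the commit cycle itself lasted roughly 100 ms

	fmt.Printf("iostat reports ~%d IOPS (IOs averaged over the 1 s interval)\n", ios)
	fmt.Printf("effective rate during the burst: ~%.0f IOPS\n", float64(ios)/burst)
}
```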
On the tip of the chain, execution takes ~10% of block processing time. Also, I suggest discussing exact numbers (from logs or metrics) instead of guessing where the bottleneck is.
Your PRs are welcome.
@gus4rs I see in the goroutines that one stops at SendPayloadStatus. If so, I guess it's some deadlock in the consensus layer. Do you run without the --externalcl flag?
@AskAlexSharov Here's another dump: my mainnet Erigon has been deadlocked for 3 hours now. The other dump was from testnet, which was stuck too.
@AskAlexSharov I use an external Nimbus consensus client and run with --externalcl. You can see all the flags I use here: #6807 (comment)
@AskAlexSharov Restarting Nimbus didn't unstick Erigon. The only way to "fix" it is to restart Erigon itself.
Did you miss a space here?
Yes, it got cut when pasting. Here's my full launch cmd:
I wasn't writing about Erigon, but about the suggestion that the execution stage could be a bottleneck in my case.
No, profiling shows 2.5 GB and the picture looks as expected. See also: https://github.com/ledgerwatch/erigon#htop-shows-incorrect-memory-usage
The memory usage is a "tangential" issue that I suspect may be what has been causing Erigon to get stuck constantly on my server since December 2022, but I think assuming the Go profile output is equal to the physical memory usage is not telling the whole story. To see who is right, I can print all the mappings of the Erigon process by doing:
You can observe that the last column shows whether each area is mapped to a file or not; if you sum all the anonymous ones you get the same figure I reported.

tl;dr: I believe Erigon is using a ton of memory and my stats are correct.
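If you want to cross-check that number independently of top/htop, here is a small sketch (Linux only, assuming a kernel that reports RssAnon/RssFile in /proc/<pid>/status) that prints the anonymous vs. file-backed resident memory of a process:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: rssanon <pid>")
	}
	f, err := os.Open("/proc/" + os.Args[1] + "/status")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		// RssAnon is private anonymous memory owned by the process;
		// RssFile is file-backed memory (page cache) the OS can reclaim.
		if strings.HasPrefix(line, "RssAnon:") || strings.HasPrefix(line, "RssFile:") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```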
Even if you were right, how exactly would that explain anything? I have never heard of a program entering a deadlock because it uses too much memory. OOM kills do happen because of Erigon if you don't have enough RAM, but that's not what you are seeing. Erigon slowing down other processes because it eats all the RAM could make some sense, but Erigon halting because it has too much RAM does not make any. Again, you have commit cycles literally 50 times slower than normal. You have to realize that your node will never be able to produce timely attestations if your execution layer takes 8 seconds to validate 3 blocks.
All your timings are kind of high too (nearly 2 seconds to update indexes; I mean, does nothing bother you here?).
Provided with enough resources, Erigon runs like clockwork. Your node is lacking resources; that seems kind of obvious.
The OS will always use 100% of the available RAM as page cache for Erigon's files, because Erigon's files are much larger than RAM. So this is expected. But that RAM is owned by the OS, and the OS can free it when needed without notifying Erigon. It's OK. This cache also survives Erigon restarts.

In the logs:

Deadlock: I think it's not related to RAM. It may also be a deadlock inside Nimbus, because…
Thanks, will give it a go.
Fair enough. I just brought memory to attention because I noticed its usage has been increasing for a few months, and when Erigon hangs it coincides with more than 8.0 GB being used.
Right, but I did try restarting Nimbus while Erigon was deadlocked, to no avail. I will try to gather a thread dump for it.
There could be a deadlock in Erigon between the lock taken in turbo/stages/headerdownload/header_algos.go:1273 (inside the goroutine of the PoSDownloader) and the syncCond in turbo/engineapi/request_list.go.

The PoSDownloader takes the lock in turbo/engineapi/request_list.go:202.

Let's imagine that the CL (Nimbus) has stopped querying the Engine API because it considers Erigon unsynced (I don't know if that is actually its behaviour); then the Engine API may be blocked waiting in turbo/engineapi/request_list.go:129.

That would mean the PoSDownloader is waiting for the Engine API to free the syncCond lock while keeping its own lock held. This theory could be compatible with @AskAlexSharov's observation that your StageLoop seemed stuck there.

At this point, having DEBUG logs of what happens just before the node gets stuck could be helpful to better understand the code paths your node takes.
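To make the shape of that theory concrete, here is a minimal, self-contained sketch (illustrative names only, not Erigon's actual types or code paths) of the kind of deadlock being described: one goroutine holds a mutex while waiting on a condition variable, and the only goroutine that could signal that condition first needs the very same mutex.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var headerLock sync.Mutex               // stands in for the header-downloader lock
	syncCond := sync.NewCond(&sync.Mutex{}) // stands in for the request list's syncCond

	// "Downloader": grabs the header lock, then parks on syncCond.
	go func() {
		headerLock.Lock()
		defer headerLock.Unlock()
		syncCond.L.Lock()
		syncCond.Wait() // never woken: the signaller below can't get headerLock
		syncCond.L.Unlock()
	}()

	// "Engine API handler": must take the header lock before it can signal.
	go func() {
		time.Sleep(100 * time.Millisecond)
		headerLock.Lock() // blocks forever: the downloader still holds it
		syncCond.L.Lock()
		syncCond.Signal()
		syncCond.L.Unlock()
		headerLock.Unlock()
	}()

	time.Sleep(500 * time.Millisecond)
	fmt.Println("both goroutines are now stuck, which is what a wedged stage loop looks like")
}
```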
Didn't look at the code yet, but I remember we recently worked on WaitForRequest.
Definitely sounds plausible, @crypto7world. The work around WaitForRequest did have some flaws in it when it originally went out; Hive tests would sometimes lock up, but this has been resolved now, so it could remedy the problem you're seeing if it's related to that. The work is in devel now if you want to try it, if you haven't done so already. In the code above I'd imagine the deferred unlock of the syncCond in…
I don't think it is related; the locks involved in this theory have been there for a very long time and have not been touched recently. @hexoscott I forgot to check out the latest code before analyzing, so what I looked at came from an older revision.

While I don't fully understand exactly why, obviously this deadlock does not happen under normal performing conditions. But in @gus4rs' logs we can see something that feels wrong to me (#6807 (comment)). This is what Erigon logged just before getting stuck:
This is logged when Erigon is in…
So the node received block 0x1307... as a payload and is in that state. I have a feeling this should not happen under optimal conditions, and I also have a strong feeling it is somehow related to the weird deadlock theory. But I'm not familiar enough with Go or the Erigon architecture to figure out exactly how (or if) it is related. DEBUG logs are the next step, unless one of you knows Erigon well enough to assess with only this information.
Thanks for the extra info there. I'll be free to take a deep dive into this on Monday next week. I'll see if I can create a test scenario that replicates the problem, and we can move towards a fix from there.
I solved my issue by moving to…
As a long-time user of Erigon (I started using it before the beacon chain even existed), I can say you guys are doing a stellar job on the archive node, but you have been dropping the ball for the small home staker by becoming a memory and disk hog lately. Maybe it's time to stop wasting time on garbage like BSC (it has no future whatsoever) and pay more attention to the little Ethereum guys ;) Also, please set the right expectations in the documentation: it does not need only 1 GB of memory, and the Go memory profile is bogus. All the best!
Happy to hear that you solved your problem.
Using the latest released version, 2.38.1, but it has been happening for months.
Attestation success is weak; my validator sits in the bottom 98% of all mainnet validators. There are no evident errors in the logs, it just stops being in sync.
Here are the most recent logs:
It is currently 1 hour without producing any attestation.
The consensus client (Nimbus) is OK, just waiting for Erigon.
Last but not least, the server has an NVMe SSD, 4 CPUs and a 1.15 load average, plus 4 GB of free memory. Erigon is eating 8 GB of non-shared anonymous memory.