RPC server not responsive, possible regression in 1.9.16 #24644
Comments
This issue appears when upgrading tokio to v1.16+.
Reverting this tokio commit prevents the stalls: tokio-rs/tokio@4eed411
Are you able to re-run the tests used to determine that the tokio dependency was the problem (if not, please provide instructions so I can), while adding the …
Any update on this? This is preventing us from updating our Solana libraries.
@fabioberger, just to close the circle on your post: non-RPC crates are now unpinned, which should enable your update: #26957
I've got a dependency specifically on the …
Same here; any plans to upgrade tokio?
The tokio maintainers are aware of the issue: tokio-rs/tokio#4873. They suspect, though, that it might be caused by incorrect usage of their library.
@CriesofCarrots, is someone still looking into one of these options? I could see how a Solana-forked tokio version could alleviate some of the short-term dependency issues.
@mschneider, I don't think anyone has taken this issue up recently. I did confirm that the stalling issue was still happening with newer versions of tokio, up to v1.20.
We're looking into following up with (1), the minimal repro. It looks like newer tokio versions are in general around 50% faster than 1.14, so there's a lot of value in moving toward a more recent release in the long term. I think a fork could cause some unforeseen issues, but this is roughly how I imagine it would work:
…
We managed to reproduce the issue in a minimal example: https://github.com/grooviegermanikus/solana-tokio-rpc-performance-issue/. It's very likely related to waiting on spin-locks inside the tokio runtime; waiting on tokio's native Mutex does not cause these issues and is the recommended approach. This of course creates some friction, as most of Solana's code base right now is not, and probably should not be, dependent on tokio. @CriesofCarrots & @sakridge, wdyt?
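For illustration, here is a minimal sketch of the pattern in question, assuming plain tokio tasks rather than the actual RPC code (the linked repo is the real repro; this is only a hedged illustration of the blocking-lock vs. tokio-Mutex contrast):

```rust
// Contrast a blocking std::sync::Mutex with tokio::sync::Mutex inside the runtime.
// A task that blocks on (or holds) a contended synchronous lock parks its worker
// thread, so that thread cannot drive any other tasks; awaiting tokio's Mutex
// suspends only the task, and the worker thread stays free.

use std::sync::{Arc, Mutex};
use std::time::Duration;

#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
async fn main() {
    let blocking = Arc::new(Mutex::new(0u64));
    let async_lock = Arc::new(tokio::sync::Mutex::new(0u64));

    // Anti-pattern inside async code: a slow critical section behind a std Mutex.
    // Any other task that tries to take this lock blocks its worker thread outright.
    let b = Arc::clone(&blocking);
    let t1 = tokio::spawn(async move {
        let mut guard = b.lock().unwrap();
        std::thread::sleep(Duration::from_millis(500)); // simulated slow, blocking work
        *guard += 1;
    });

    // Preferred inside the runtime: tokio's Mutex. Waiting on `lock().await`
    // yields back to the scheduler instead of parking the thread.
    let a = Arc::clone(&async_lock);
    let t2 = tokio::spawn(async move {
        let mut guard = a.lock().await;
        tokio::time::sleep(Duration::from_millis(500)).await; // yields while "working"
        *guard += 1;
    });

    let _ = tokio::join!(t1, t2);
}
```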
Some of the relevant crates, like …
@CriesofCarrots, should we continue looking into this issue? If so, I would appreciate some guidance so we can help effectively: are there any experiments or measurements we could run to help us form a better opinion on the further course of action?
Hi @mschneider. Unfortunately, my method for bisecting is not particularly generalizable... I have access to a node in the foundation RPC pool that evidences the stalls reliably, so I've been testing various …

It seems like a big project to de-risk tokio locks in all the places they are used in JsonRpcRequestProcessor. I have been weighing whether that's worth it, given that we believe we need to remove RPC from the validator software anyway. But RPC removal is also a big project, and not one that will be done quickly (it hasn't even been started). So I'm thinking that if there are one or two locks in JsonRpcRequestProcessor that are the most problematic, we could move those to tokio and update them as a short-term fix.

As to your question of how you can help experiment, the greatest help would be any diagnosis you could offer on which locks are spinning when discernible RPC stalls occur. My gut says BankForks or ClusterInfo.
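As a rough sketch of that short-term fix, moving one hot lock on the RPC read path to tokio's RwLock could look something like the following. The `RpcProcessor`/`SharedState` types and the `get_slot`/`set_root` methods are invented stand-ins for this example, not the actual JsonRpcRequestProcessor API or BankForks:

```rust
// Hedged sketch: replace a contended std::sync::RwLock on the RPC read path with
// tokio::sync::RwLock, so handlers awaiting the lock yield instead of blocking a
// runtime worker thread.

use std::sync::Arc;
use tokio::sync::RwLock;

#[derive(Default)]
struct SharedState {
    // Stand-in for something like the BankForks root slot.
    root_slot: u64,
}

#[derive(Clone)]
struct RpcProcessor {
    state: Arc<RwLock<SharedState>>,
}

impl RpcProcessor {
    // Read path of an RPC handler: `.read().await` suspends the task while the
    // lock is contended, so the worker thread can keep driving other requests.
    async fn get_slot(&self) -> u64 {
        self.state.read().await.root_slot
    }

    // Write path, e.g. driven by the replay/notification side.
    async fn set_root(&self, slot: u64) {
        self.state.write().await.root_slot = slot;
    }
}

#[tokio::main]
async fn main() {
    let rpc = RpcProcessor { state: Arc::new(RwLock::new(SharedState::default())) };
    rpc.set_root(42).await;
    println!("slot: {}", rpc.get_slot().await);
}
```

The non-async side of the codebase would still need some way to touch the same state, which is exactly the friction noted earlier; tokio's RwLock does expose blocking accessors for synchronous callers, if the pinned tokio version is recent enough.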
Would it be possible to get a load balancer log / replay for this node? It could be a good way to measure RPC performance.
Problem
Sometimes the RPC server is completely non-responsive; this is reported in 1.9.16.
1.9.14 is reportedly OK.
Proposed Solution
Debug and fix.