
RPC server not responsive possible regression in 1.9.16 #24644

Closed
sakridge opened this issue Apr 25, 2022 · 17 comments · Fixed by #32430

Comments

@sakridge
Member

sakridge commented Apr 25, 2022

Problem

The RPC server sometimes becomes completely non-responsive; this has been reported on 1.9.16.

1.9.14 is reportedly ok.

Proposed Solution

Debug and fix.

@CriesofCarrots
Contributor

Bisecting points to the etcd-client bump in #24159. Reverted in v1.9 here: #25003.
It is unlikely that the etcd-client bump itself is the problem; more likely it is one of the dependency changes that came with it. tokio and parking_lot both seem worth a closer look.

@CriesofCarrots
Contributor

This issue appears when upgrading tokio to v1.16+

@CriesofCarrots
Contributor

Reverting this tokio commit prevents the stalls: tokio-rs/tokio@4eed411

@bonedaddy

> Bisecting points to the etcd-client bump in #24159. Reverted in v1.9 here: #25003. It is unlikely that the etcd-client bump itself is the problem; more likely it is one of the dependency changes that came with it. tokio and parking_lot both seem worth a closer look.

> This issue appears when upgrading tokio to v1.16+

Are you able to re-run the tests used to determine that the tokio dependency was the problem, this time with the parking_lot feature flag added (or, if not, provide instructions so I can)? As the tokio docs mention, parking_lot is completely unused unless the parking_lot feature flag is enabled, so none of its optimizations apply. It would therefore be interesting to repeat whatever tests were done with that flag turned on.

@fabioberger

Any update on this? This is preventing us from updating our Solana libraries.

@CriesofCarrots
Contributor

> Any update on this? This is preventing us from updating our Solana libraries.

@fabioberger , just to close the circle on your post, non-rpc crates are now unpinned, which should enable your update: #26957

@tomjohn1028
Contributor

I've got a dependency specifically on the solana-rpc crate and this is blocking the use of another 3rd party crate. Any update here?

@bmuddha

bmuddha commented Sep 19, 2022

> I've got a dependency specifically on the solana-rpc crate and this is blocking the use of another 3rd party crate. Any update here?

Same here, any plans to upgrade tokio?

@mschneider
Contributor

mschneider commented Dec 21, 2022

The tokio maintainers are aware of the issue: tokio-rs/tokio#4873. They suspect, though, that it might be caused by incorrect usage of their library.

  1. They requested that someone write a simple demo program that demonstrates the usage.
  2. A third party also mentioned a known workaround (just revert the individual commit that causes the issue).

@CriesofCarrots is someone still looking into one of these options? I could see how a Solana-forked tokio version could alleviate some of the short-term dependency issues.

@CriesofCarrots
Contributor

@mschneider , I don't think anyone has taken this issue up recently. I did confirm that the stalling issue was still happening with newer versions of tokio, up to v1.20.
We have forked dependencies in the past, though we prefer to avoid it when possible. Would a fork actually solve the dependency issue, or would it force downstream users to use our fork?

@mschneider
Contributor

mschneider commented Dec 22, 2022

We're looking into following up with option 1, the minimal repro. It looks like newer tokio versions are in general around 50% faster than 1.14, so there's a lot of value in moving toward a more recent release in the long term.

I think a fork could cause some unforeseen issues, but this is roughly how I imagine it working:

  1. The solana-rpc packages use a forked or simply outdated runtime, which clients can use directly. This is basically what's happening right now.
  2. Clients can additionally run a modern tokio runtime, but the two runtimes will not be compatible, so they will need non-tokio libraries for message passing between them (a rough sketch follows this list). Maybe @tomjohn1028 & @bobs4462 can chime in on whether this would work for their use case.
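
A rough sketch of the two-runtime idea in point 2, under some assumptions: both sides here use the same tokio version (created via Runtime::new) for brevity, whereas a real setup would pair the pinned tokio used by solana-rpc with a newer one pulled in under a different crate name. The point is only that message passing between the runtimes goes through a runtime-agnostic channel, std::sync::mpsc in this sketch.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // Runtime-agnostic channel used as the bridge between the two runtimes.
    let (tx, rx) = mpsc::channel::<String>();

    // "Legacy" runtime, standing in for the pinned tokio used by solana-rpc.
    let legacy = thread::spawn(move || {
        let rt = tokio::runtime::Runtime::new().unwrap();
        rt.block_on(async move {
            tx.send("response from legacy runtime".to_string()).unwrap();
        });
    });

    // "Modern" runtime, standing in for the client's newer tokio.
    let modern = thread::spawn(move || {
        let rt = tokio::runtime::Runtime::new().unwrap();
        rt.block_on(async move {
            // A blocking recv is acceptable only because this example is
            // trivial; a real integration would hand this off to
            // spawn_blocking or use an async-friendly channel.
            let msg = rx.recv().unwrap();
            println!("modern runtime received: {msg}");
        });
    });

    legacy.join().unwrap();
    modern.join().unwrap();
}
```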

@mschneider
Contributor

We managed to reproduce the issue in a minimal example: https://github.com/grooviegermanikus/solana-tokio-rpc-performance-issue/. It is very likely related to waiting on spin-locks inside the tokio runtime; waiting on tokio's native Mutex does not cause these stalls and is the recommended approach. That of course causes some issues, as most of Solana's code base right now is not, and probably should not be, dependent on tokio. @CriesofCarrots & @sakridge wdyt? Measured throughput (a minimal sketch contrasting the two locking styles follows the numbers):

  • std::sync::Mutex
    • tokio 1.14: 620 rps
    • tokio 1.20: 240 rps
  • tokio::sync::Mutex
    • tokio 1.14: 1928 rps
    • tokio 1.20: 1903 rps
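
For reference, a minimal sketch (not the linked repro, and assuming tokio with the full feature set) of the two locking styles being compared: the synchronous guard must be scoped so it is never held across an .await, while tokio::sync::Mutex can simply be awaited and suspends the task under contention instead of blocking the worker thread.

```rust
use std::sync::Arc;
use std::time::Duration;

async fn with_std_mutex(counter: Arc<std::sync::Mutex<u64>>) {
    {
        // Scope the synchronous guard so it is released before the .await;
        // holding it across an .await would block the worker thread.
        let mut guard = counter.lock().unwrap();
        *guard += 1;
    }
    tokio::time::sleep(Duration::from_millis(1)).await;
}

async fn with_tokio_mutex(counter: Arc<tokio::sync::Mutex<u64>>) {
    // Awaiting the lock suspends the task under contention, so the runtime
    // can keep scheduling other work on the same thread.
    let mut guard = counter.lock().await;
    *guard += 1;
    tokio::time::sleep(Duration::from_millis(1)).await;
}

#[tokio::main]
async fn main() {
    let std_counter = Arc::new(std::sync::Mutex::new(0u64));
    let tokio_counter = Arc::new(tokio::sync::Mutex::new(0u64));

    let mut handles = Vec::new();
    for _ in 0..1_000 {
        handles.push(tokio::spawn(with_std_mutex(std_counter.clone())));
        handles.push(tokio::spawn(with_tokio_mutex(tokio_counter.clone())));
    }
    for handle in handles {
        handle.await.unwrap();
    }
    println!(
        "std: {}, tokio: {}",
        *std_counter.lock().unwrap(),
        *tokio_counter.lock().await
    );
}
```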

@CriesofCarrots
Contributor

CriesofCarrots commented Jan 12, 2023

> That of course causes some issues, as most of Solana's code base right now is not, and probably should not be, dependent on tokio.

Some of the relevant crates, like solana-core and solana-ledger, already depend on tokio. solana-runtime is probably the biggest concern.
Can you tell me, or add a brief readme on, the best way to see the effects in your example? @mschneider

@mschneider
Contributor

mschneider commented Jan 12, 2023

Run cargo run in each directory (rpc & test-client); you can edit the Cargo.toml of the rpc package to switch between tokio versions.

@mschneider
Contributor

mschneider commented Jan 31, 2023

@CriesofCarrots should we continue looking into this issue? If so, I would appreciate some guidance so we can help effectively:

Are there any experiments or measurements we could do to help us form a better opinion on the further course of action?
You mentioned you were bisecting before; do you have a solid way to identify the issue in a solana build? I'm asking because we don't, we just made the minimal example above.
If we had a way to verify the issue locally, we could try changes to the current code base and see where we end up. Right now there's no clear next step for us.

@CriesofCarrots
Contributor

Hi @mschneider. Unfortunately, my method for bisecting is not particularly generalizable... I have access to a node in the foundation RPC pool that exhibits the stalls reliably, so I've been testing various solana-validator builds. It's a cumbersome process.

It seems like a big project to de-risk tokio locks in all the places they are used in JsonRpcRequestProcessor. I have been weighing whether that's worth it, given that we believe we need to remove RPC from the validator software anyway. But RPC removal is also a big project, and not one that will be done quickly (not even started). So I'm thinking that if there are one or two locks in JsonRpcRequestProcessor that are the most problematic, we could move those to tokio and update as a short-term fix.
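
As a hedged illustration of that short-term idea (the struct and field below are hypothetical stand-ins, not the real JsonRpcRequestProcessor fields): converting one shared piece of state from a std lock to a tokio lock lets async RPC handlers await it instead of blocking a worker thread.

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

struct RequestProcessor {
    // Before: Arc<std::sync::RwLock<u64>> acquired with .read().unwrap(),
    // which blocks the worker thread while contended.
    // After: a tokio RwLock whose read() is awaited instead.
    highest_slot: Arc<RwLock<u64>>,
}

impl RequestProcessor {
    async fn get_slot(&self) -> u64 {
        // Yields to the scheduler while waiting for the lock.
        *self.highest_slot.read().await
    }
}

#[tokio::main]
async fn main() {
    let processor = RequestProcessor {
        highest_slot: Arc::new(RwLock::new(0)),
    };
    println!("slot = {}", processor.get_slot().await);
}
```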

As to your question of how you can help experiment, the greatest help would be any diagnosis you could offer on which locks are spinning when discernible RPC stalls occur. My gut says BankForks or ClusterInfo.
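
One way such a diagnosis could be approached, sketched under the assumption that a plain std RwLock wrapper is acceptable for instrumentation (the names and threshold below are illustrative, not from the Solana code base): time each lock acquisition and log the ones that wait suspiciously long, tagged with the lock's name (bank_forks, cluster_info, etc.).

```rust
use std::sync::RwLock;
use std::time::{Duration, Instant};

// Log read acquisitions that wait longer than a threshold, so the contended
// lock can be identified when an RPC stall is observed.
fn timed_read<'a, T>(name: &str, lock: &'a RwLock<T>) -> std::sync::RwLockReadGuard<'a, T> {
    let start = Instant::now();
    let guard = lock.read().unwrap();
    let waited = start.elapsed();
    if waited > Duration::from_millis(50) {
        eprintln!("slow lock acquisition: {name} waited {waited:?}");
    }
    guard
}

fn main() {
    // Stand-in for shared state such as BankForks or ClusterInfo.
    let bank_forks = RwLock::new(42u64);
    let slot = *timed_read("bank_forks", &bank_forks);
    println!("slot: {slot}");
}
```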

@mschneider
Contributor

Would it be possible to get a load balancer log / replay for this node? It could be a good way to measure RPC performance across different code versions. We could also synthesize a torture workload, but I would prefer to optimize against a recorded real-world workload; it is more effective in terms of engineering time and more future-proof.
