RPC server not responsive, possible regression in 1.9.16 #24644
Comments
This issue appears when upgrading tokio to v1.16+.
Reverting this tokio commit prevents the stalls: tokio-rs/tokio@4eed411
Are you able to re-run the tests used to determine that the tokio dependency was the problem (if not, please provide instructions so I can), while adding the …
Any update on this? This is preventing us from updating our Solana libraries.
@fabioberger, just to close the circle on your post: non-RPC crates are now unpinned, which should enable your update: #26957
I've got a dependency specifically on the …
Same here; any plans to upgrade tokio?
The tokio maintainers are aware of the issue: tokio-rs/tokio#4873. They suspect, though, that it might be caused by incorrect usage of their library.
@CriesofCarrots, is someone still looking into one of these options? I could see how a Solana-forked tokio version could alleviate some of the short-term dependency issues.
@mschneider, I don't think anyone has taken this issue up recently. I did confirm that the stalling issue was still happening with newer versions of tokio, up to v1.20.
We're looking into following up with (1), the minimal repro. It looks like newer tokio versions are in general around 50% faster than 1.14, so there's a lot of value in moving toward a more recent release in the long term. I think a fork could cause some unforeseen issues, but this is roughly how I imagine it would work:
…
We managed to reproduce the issue in a minimal example: https://github.com/grooviegermanikus/solana-tokio-rpc-performance-issue/. It's very likely related to waiting on spin-locks inside the tokio runtime; waiting on tokio's native Mutex does not cause these issues and is the recommended approach. This of course creates some friction, as most of Solana's code base right now is not, and probably should not be, dependent on tokio. @CriesofCarrots & @sakridge, wdyt?
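For illustration, here is a minimal sketch of the pattern in question, assuming plain tokio tasks rather than the actual RPC code (the linked repo is the real repro; this is only a hedged illustration of the blocking-lock vs. tokio-Mutex contrast):

```rust
// Contrast a blocking std::sync::Mutex with tokio::sync::Mutex inside the runtime.
// A task that blocks on (or holds) a contended synchronous lock parks its worker
// thread, so that thread cannot drive any other tasks; awaiting tokio's Mutex
// suspends only the task, and the worker thread stays free.

use std::sync::{Arc, Mutex};
use std::time::Duration;

#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
async fn main() {
    let blocking = Arc::new(Mutex::new(0u64));
    let async_lock = Arc::new(tokio::sync::Mutex::new(0u64));

    // Anti-pattern inside async code: a slow critical section behind a std Mutex.
    // Any other task that tries to take this lock blocks its worker thread outright.
    let b = Arc::clone(&blocking);
    let t1 = tokio::spawn(async move {
        let mut guard = b.lock().unwrap();
        std::thread::sleep(Duration::from_millis(500)); // simulated slow, blocking work
        *guard += 1;
    });

    // Preferred inside the runtime: tokio's Mutex. Waiting on `lock().await`
    // yields back to the scheduler instead of parking the thread.
    let a = Arc::clone(&async_lock);
    let t2 = tokio::spawn(async move {
        let mut guard = a.lock().await;
        tokio::time::sleep(Duration::from_millis(500)).await; // yields while "working"
        *guard += 1;
    });

    let _ = tokio::join!(t1, t2);
}
```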
Some of the relevant crates, like …
@CriesofCarrots, should we continue looking into this issue? If so, I would appreciate some guidance so we can help effectively: are there any experiments or measurements we could run to help us form a better opinion on the further course of action?
Hi @mschneider. Unfortunately, my method for bisecting is not particularly generalizable... I have access to a node in the foundation RPC pool that evidences the stalls reliably, so I've been testing various …

It seems like a big project to de-risk tokio locks in all the places they are used in JsonRpcRequestProcessor. I have been weighing whether that's worth it, given that we believe we need to remove RPC from the validator software anyway. But RPC removal is also a big project, and not one that will be done quickly (it hasn't even been started). So I'm thinking that if there are one or two locks in JsonRpcRequestProcessor that are the most problematic, we could move those to tokio and update them as a short-term fix.

As to your question of how you can help experiment, the greatest help would be any diagnosis you could offer on which locks are spinning when discernible RPC stalls occur. My gut says BankForks or ClusterInfo.
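As a rough sketch of that short-term fix, moving one hot lock on the RPC read path to tokio's RwLock could look something like the following. The `RpcProcessor`/`SharedState` types and the `get_slot`/`set_root` methods are invented stand-ins for this example, not the actual JsonRpcRequestProcessor API or BankForks:

```rust
// Hedged sketch: replace a contended std::sync::RwLock on the RPC read path with
// tokio::sync::RwLock, so handlers awaiting the lock yield instead of blocking a
// runtime worker thread.

use std::sync::Arc;
use tokio::sync::RwLock;

#[derive(Default)]
struct SharedState {
    // Stand-in for something like the BankForks root slot.
    root_slot: u64,
}

#[derive(Clone)]
struct RpcProcessor {
    state: Arc<RwLock<SharedState>>,
}

impl RpcProcessor {
    // Read path of an RPC handler: `.read().await` suspends the task while the
    // lock is contended, so the worker thread can keep driving other requests.
    async fn get_slot(&self) -> u64 {
        self.state.read().await.root_slot
    }

    // Write path, e.g. driven by the replay/notification side.
    async fn set_root(&self, slot: u64) {
        self.state.write().await.root_slot = slot;
    }
}

#[tokio::main]
async fn main() {
    let rpc = RpcProcessor { state: Arc::new(RwLock::new(SharedState::default())) };
    rpc.set_root(42).await;
    println!("slot: {}", rpc.get_slot().await);
}
```

The non-async side of the codebase would still need some way to touch the same state, which is exactly the friction noted earlier; tokio's RwLock does expose blocking accessors for synchronous callers, if the pinned tokio version is recent enough.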
Would it be possible to get a load balancer log / replay for this node? It could be a good way to measure RPC performance.
Problem
Sometimes the RPC server is completely non-responsive; this is reported in 1.9.16.
1.9.14 is reportedly OK.
Proposed Solution
Debug and fix.