tpu-client: Speed up performance by detaching tasks #32844
Conversation
lgtm. r+ @lijunwangs approval
The same problem was addressed in #32388 and #32638
Yeah I was following those PRs. I think they're great too! We can also do both at once. This PR pretty much restores the old behavior of sending quickly.
Kind of, this restores the old behavior in
For sure, let's get some tests going!
tpu-client/src/tpu_client.rs (outdated)

```rust
P: ConnectionPool<NewConnectionConfig = C> + 'static,
M: ConnectionManager<ConnectionPool = P, NewConnectionConfig = C> + 'static,
C: Sync + Send + 'static,
```
Wouldn't these `'static` constraints be too restrictive?
They're required in order to detach the task, unfortunately. And looking at https://docs.rs/tokio/latest/tokio/task/fn.spawn.html, I'm not sure how to make the future `T: Future + Send + 'static` without the inner types being `'static` -- do you have any suggestions?
I think this might be solved by using Arc?
This is around the types, not the instances of the types. I did some digging around; the requirement is pretty hard to avoid for now, unfortunately (tokio-rs/tokio#2385). Here's the error:

```
error[E0310]: the parameter type `C` may not live long enough
   --> tpu-client/src/nonblocking/tpu_client.rs:356:9
    |
356 | /         task::spawn(async move {
357 | |             try_send_wire_transaction(wire_transaction, &leaders, &connection_cache).await
358 | |         })
    | |__________^ ...so that the type `C` will meet its required lifetime bounds
    |
help: consider adding an explicit lifetime bound...
    |
344 |     C: Send + Sync + 'static,
```

And it repeats for `P` and `M`.
Here's the simplest example that shows the issue:
use {
futures::future::join_all,
std::{fmt::Debug, time::Duration},
tokio::task::JoinHandle,
};
trait Printable: Debug {}
impl Printable for String {}
#[derive(Debug)]
struct MyStruct<T: Printable> {
_field: T,
}
async fn print_struct<T: Printable>(instance: MyStruct<T>) {
tokio::time::sleep(Duration::from_millis(10)).await;
println!("{instance:?}");
}
fn make_temp_task<T: Printable + Send + Sync + 'static>(instance: MyStruct<T>) -> JoinHandle<()
> {
tokio::task::spawn(print_struct(instance))
}
#[tokio::main]
async fn main() {
#[derive(Debug)]
struct MyTempStruct;
impl Printable for MyTempStruct {}
let tasks = (0..10).map(|_| make_temp_task(MyStruct { _field: MyTempStruct })).collect::<Ve
c<_>>();
join_all(tasks).await;
}
If `make_temp_task` does not have `T: Send + Sync + 'static`, this will not work, because `T` is not ensured to live long enough.
As a better option, however, I added the required bounds only for the new functions, so you'll only need the types to be `'static` if you're calling the `detached` version. And I was able to relax `Send + Sync` in a lot of places. Let me know what you think!
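The split described here can be sketched roughly like this (illustrative names only, not the actual tpu-client API; `std::thread::spawn` stands in for `tokio::spawn`, since both impose the same `Send + 'static` bound on everything they capture). The borrowing variant needs no `'static` bound, while only the detached variant, which moves its data into a spawned task, requires it:

```rust
use std::fmt::Debug;

// Hypothetical sender, standing in for the real client types.
struct Sender<T: Debug> {
    payload: T,
}

impl<T: Debug> Sender<T> {
    // Borrowing variant: runs inline, so T may reference local data
    // and no 'static bound is needed.
    fn send(&self) -> String {
        format!("{:?}", self.payload)
    }
}

impl<T: Debug + Send + 'static> Sender<T> {
    // Detached variant: the closure is moved into a spawned task,
    // so the spawn API forces Send + 'static on everything it captures.
    fn send_detached(self) -> std::thread::JoinHandle<String> {
        std::thread::spawn(move || format!("{:?}", self.payload))
    }
}

fn main() {
    let local = String::from("hi");
    let borrowing = Sender { payload: &local }; // T = &String, not 'static
    assert_eq!(borrowing.send(), "\"hi\"");

    let owned = Sender { payload: String::from("hi") }; // T = String, 'static
    assert_eq!(owned.send_detached().join().unwrap(), "\"hi\"");
}
```

This way callers who never detach pay no extra bounds, which mirrors the "bounds only on the new functions" approach described above.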
Codecov Report

```
@@            Coverage Diff            @@
##           master    #32844    +/-  ##
=========================================
- Coverage    82.0%     82.0%   -0.1%
=========================================
  Files         785       785
  Lines      212096    212111     +15
=========================================
+ Hits       173982    173987      +5
- Misses      38114     38124     +10
```
```rust
/// Create a new client that disconnects when dropped
pub fn new_with_connection_cache(
    rpc_client: Arc<RpcClient>,
    websocket_url: &str,
    config: TpuClientConfig,
    connection_cache: Arc<BackendConnectionCache<P, M, C>>,
) -> Result<Self> {
    Ok(Self {
        tpu_client: BackendTpuClient::new_with_connection_cache(
            rpc_client,
            websocket_url,
            config,
            connection_cache,
        )?,
    })
}

pub fn rpc_client(&self) -> &RpcClient {
    self.tpu_client.rpc_client()
}
```
This is just moved from the other `impl` block
```diff
@@ -6,7 +6,7 @@ use {
 };

 #[async_trait]
-pub trait ClientConnection {
+pub trait ClientConnection: Send + Sync {
```
These are the only other required bounds in the end! Other than in the clients, of course
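A rough sketch of why the trait picks up `Send + Sync` supertraits (a simplified, hypothetical trait here, not the real `ClientConnection` API; a thread stands in for a tokio task): a trait object shared via `Arc` and used inside a detached task must itself be `Send + Sync` for the `Arc` to cross the task boundary.

```rust
use std::sync::Arc;

// Simplified stand-in for the connection trait. The Send + Sync
// supertraits make Arc<dyn Connection> usable from spawned tasks.
trait Connection: Send + Sync {
    fn send_data(&self, data: &[u8]) -> usize;
}

struct FakeConnection;
impl Connection for FakeConnection {
    fn send_data(&self, data: &[u8]) -> usize {
        data.len() // pretend the whole buffer was written
    }
}

fn send_detached(conn: Arc<dyn Connection>, data: Vec<u8>) -> std::thread::JoinHandle<usize> {
    // Without Send + Sync on the trait itself, Arc<dyn Connection>
    // would not be Send, and this spawn would not compile.
    std::thread::spawn(move || conn.send_data(&data))
}

fn main() {
    let conn: Arc<dyn Connection> = Arc::new(FakeConnection);
    let sent = send_detached(conn, vec![0u8; 8]).join().unwrap();
    assert_eq!(sent, 8);
}
```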
```rust
pub fn rpc_client(&self) -> &RpcClient {
    &self.rpc_client
}

pub async fn shutdown(&mut self) {
    self.exit.store(true, Ordering::Relaxed);
    self.leader_tpu_service.join().await;
}
```
These functions were moved from the other `impl` block
I launched this branch vs testnet several times and got the following times. Looks like it is close to what we had before in
This error's strange... did you see that with any other branch?
Ah yes, I bumped the number up to 1000 txs and now I'm starting to see it after many runs. Normally it's supposed to resend over RPC when the TPU fails, but this could be due to the error not getting propagated back up. I'll look into this, thanks for noticing.
This is unfortunately correct 😓 the percent only shows the "confirmed" transactions. Since we send everything and then confirm everything all at once, if everything is confirmed on your first check, you'll never see it change.
this error message is probably misleading... the sync rpc client papers over any actual
yeah this behavior is almost always improper error or blockhash handling. so long as this isn't worse than 1.14 it's an improvement. remember that we need to keep this solution backportable to 1.16, so avoid feature creep.
dropping the percentage and instead displaying something like
LGTM
@joncinque haven't seen your comment
Do we plan to backport these changes to 1.16?
Yes, but I have a less invasive fix which also addresses your error of
Edit: ...and close this one
Closing in favor of #32945
Problem
Program deployments have been slow starting in 1.16, as reported in #32595 and anecdotally by users. I was able to pinpoint this to the use of the full async API in 1.16, compared with the static async runtime in 1.14, which uses a form of `send_data_async` (solana/quic-client/src/quic_client.rs, line 78 at commit 0de8ccf).

I'm not 100% sure why it's slow, but I think it's because we always wait until sending to all leaders is done before moving to the next message, so we're limited by our slowest send, which can be up to 250ms on devnet.
Summary of Changes
This uses the idea from the 1.14 implementation, but in a full async context, by detaching sending packets into tasks that can all run concurrently.
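The change can be sketched as follows (illustrative only, not the PR's code; OS threads and a sleep stand in for tokio tasks and network latency): instead of awaiting each send before starting the next, every send is detached so the slow sends overlap and total time approaches the slowest single send rather than the sum of all of them.

```rust
use std::{
    thread,
    time::{Duration, Instant},
};

// Simulated send to one leader with fixed "network" latency.
fn send_to_leader(_leader: usize) {
    thread::sleep(Duration::from_millis(50));
}

fn main() {
    // Sequential: total time is roughly the sum of all sends.
    let start = Instant::now();
    for leader in 0..4 {
        send_to_leader(leader);
    }
    let sequential = start.elapsed();

    // Detached: sends run concurrently, total is roughly one send.
    let start = Instant::now();
    let handles: Vec<_> = (0..4)
        .map(|leader| thread::spawn(move || send_to_leader(leader)))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let detached = start.elapsed();

    assert!(detached < sequential);
}
```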
I tested this against a few networks, with devnet being the slowest. Without this PR, sending 100 self-transfers took 18s-28s, and with this PR it takes 2.5-3.5s.
Here's the test binary that I ran, adapted from https://github.com/mvines/solana-cli-template. I didn't think it was worth adding this as a test, since local validators don't show the problematic behavior. Let me know if you think otherwise!
After this lands, we can go through and replace usages of `send_transaction` with `try_send_transaction_detached` to get speed back in other places too.

Fixes #32595