
TransactionView: ReceiveAndBuffer #3820

Merged: 30 commits into anza-xyz:master on Jan 14, 2025

Conversation

apfitzge (Author):

Problem

  • Need to ingest new tx type into banking stage

Summary of Changes

  • Implement a new TransactionViewReceiveAndBuffer to allow banking stage to ingest TransactionView-based transactions
  • Add argument to BankingStage::new to switch between transaction representations
  • Hook up CLI for validator and relevant benchmarks
  • Run all BankingStage and SchedulerController tests using both options

Fixes #
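
To make the representation switch in the summary above concrete, here is a minimal, hypothetical sketch of a constructor argument selecting between the two ingestion paths; the enum name, variants, and function shape are illustrative assumptions, not the exact API introduced by this PR.

```rust
// Hedged sketch: one way BankingStage::new could switch between the sdk-packet
// path and the TransactionView path. `TransactionStructure` and the function
// shape are placeholders, not the actual code added here.
#[derive(Clone, Copy, Debug)]
enum TransactionStructure {
    Sdk,  // existing sdk-packet ingestion path
    View, // new TransactionViewReceiveAndBuffer-based ingestion path
}

fn spawn_receive_and_buffer(structure: TransactionStructure) {
    match structure {
        TransactionStructure::Sdk => {
            // construct the existing sdk-transaction receive-and-buffer here
            println!("using the sdk transaction representation");
        }
        TransactionStructure::View => {
            // construct the TransactionViewReceiveAndBuffer here
            println!("using the TransactionView representation");
        }
    }
}

fn main() {
    // e.g. chosen from a validator CLI flag and passed through BankingStage::new
    spawn_receive_and_buffer(TransactionStructure::View);
}
```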

@apfitzge apfitzge force-pushed the new_receive_and_buffer_2024_11_26 branch 4 times, most recently from 2876a3f to c868ff5 Compare November 27, 2024 17:17
```rust
type Transaction = RuntimeTransaction<ResolvedTransactionView<Bytes>>;
type Container = TransactionViewStateContainer;

fn receive_and_buffer_packets(
    // ... (excerpt truncated; the full signature appears in the diff later in the conversation)
```
apfitzge (Author):

NOTE: this does not run bank.check_transactions, because I prioritized safety around not leaking Bytes in the container, which is much trickier if we do batching.
We should probably follow this up with a call to clean immediately after, so that we check age on transactions we actually inserted, which is likely better anyway. However, I'd like to profile with and without that change first.

Reviewer:

We should also make sure we are checking the status cache, so that we're not scheduling transactions that have already been processed, right?

apfitzge (Author):

Bank::check_transactions checks both the age and the status cache, so it'll clean out both types before we get to processing.

Reviewer:

Yeah, sorry, my comment wasn't clear. I bring it up because it's kind of a bug not to check the status cache before scheduling, right?

apfitzge (Author):

We do check before scheduling. I'm just not checking before buffering in this PR.

I think with some of the other cleanups we did during review it may be easier to add in. Let me give it a shot this afternoon, and I'll comment here on what I get.

apfitzge (Author), Jan 9, 2025:

Okay, taking a look at this. I think we can do it relatively easily during buffering, but it may require some refactoring to do it at the optimal point.

We'd really want to check age/status immediately after we calculate the message hash, but the type we have at that point, RuntimeTransaction<SanitizedTransactionView>, doesn't implement SVMMessage (addresses are not resolved). Checking there means we don't need to reach into accounts-db (which is slower) before we discard too-old or already-processed txs.

I think it's probably better to refactor the internals of check_transactions so we can call it as long as we have the signature/message hash (for SVMMessage we can have a fn that just grabs those and calls the actual impl).

Given those additional changes, I think I'll leave it as a follow-up PR.
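
A minimal sketch of the refactor described in this comment, using placeholder types (Checker, HasKeys, and a HashSet standing in for the status cache) rather than the real Bank/SVMMessage API: the inner check needs only the signature and message hash, and the SVMMessage-style path is a thin wrapper that extracts those keys and defers to it.

```rust
use std::collections::HashSet;

/// Placeholder for anything that can hand over its signature and message hash.
trait HasKeys {
    fn signature(&self) -> &str;
    fn message_hash(&self) -> &str;
}

struct Checker {
    already_processed: HashSet<String>, // stand-in for the real status cache
    max_age: u64,
}

impl Checker {
    /// Inner impl: needs only the signature/message hash, so it could run right
    /// after the message hash is computed, before addresses are resolved.
    fn check_by_keys(&self, _signature: &str, message_hash: &str, age: u64) -> bool {
        age <= self.max_age && !self.already_processed.contains(message_hash)
    }

    /// Thin wrapper for fully resolved transactions (the SVMMessage-style path):
    /// it just extracts the keys and calls the actual impl.
    fn check<T: HasKeys>(&self, tx: &T, age: u64) -> bool {
        self.check_by_keys(tx.signature(), tx.message_hash(), age)
    }
}
```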

Reviewer:

Works for me

@apfitzge apfitzge marked this pull request as ready for review December 2, 2024 18:50
@apfitzge apfitzge requested a review from a team as a code owner December 2, 2024 18:50
@apfitzge apfitzge self-assigned this Dec 2, 2024
@apfitzge apfitzge requested a review from bw-solana December 2, 2024 20:23
```rust
if container.is_empty()
    && matches!(
        decision,
        BufferedPacketsDecision::Forward | BufferedPacketsDecision::ForwardAndHold
    )
```
apfitzge (Author), Dec 3, 2024:

Removed the Hold branch here. Sometimes after a leader slot we do not have a bank ready in Consume for our next leader slot; we should still avoid the recv_timeout in that case. Also added a check that the container is not empty: if the container has items, we should try to process them instead of sleeping while waiting for packets.
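
A hedged sketch of how a guard like this might drive the receive step, assuming the choice is between a blocking recv_timeout and a non-blocking poll; the receiver type, the unit-variant decision enum, and the function shape are placeholders, not the actual scheduler controller code.

```rust
// Hedged sketch, not the actual scheduler controller: only block on the packet
// channel when the container is empty and the decision is Forward/ForwardAndHold;
// otherwise poll without sleeping so buffered transactions get processed promptly.
use std::{sync::mpsc::Receiver, time::Duration};

enum BufferedPacketsDecision {
    Consume,
    Forward,
    ForwardAndHold,
    Hold,
}

fn receive_packets(
    rx: &Receiver<Vec<u8>>,
    container_is_empty: bool,
    decision: &BufferedPacketsDecision,
) -> Option<Vec<u8>> {
    let should_block = container_is_empty
        && matches!(
            decision,
            BufferedPacketsDecision::Forward | BufferedPacketsDecision::ForwardAndHold
        );
    if should_block {
        // Nothing buffered and not consuming: it is fine to sleep waiting for packets.
        rx.recv_timeout(Duration::from_millis(100)).ok()
    } else {
        // Buffered work exists, or we may need a bank soon: do not sleep here.
        rx.try_recv().ok()
    }
}
```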

@apfitzge apfitzge force-pushed the new_receive_and_buffer_2024_11_26 branch from f87bd2f to 2d9cc88 Compare December 3, 2024 17:54
pgarg66 left a comment:

Looks good from the SVM point of view.

@apfitzge apfitzge force-pushed the new_receive_and_buffer_2024_11_26 branch from 2d9cc88 to e310dd7 Compare December 5, 2024 17:06
@apfitzge apfitzge marked this pull request as draft December 5, 2024 22:06
@apfitzge apfitzge force-pushed the new_receive_and_buffer_2024_11_26 branch from 4e927a9 to a8c7a41 Compare December 6, 2024 14:00
@apfitzge apfitzge marked this pull request as ready for review December 6, 2024 15:29
bw-solana left a comment:

No major qualms with what you have. Left a few comments to consider

Resolved review threads: core/src/banking_stage.rs (outdated), core/src/banking_stage.rs, validator/src/main.rs
jstarry left a comment:

Still reviewing but posting some feedback along the way

@apfitzge apfitzge force-pushed the new_receive_and_buffer_2024_11_26 branch from 6ccd99a to 19fb288 Compare December 9, 2024 23:13
```diff
 fn receive_and_buffer_packets(
     &mut self,
     container: &mut Self::Container,
     timing_metrics: &mut SchedulerTimingMetrics,
     count_metrics: &mut SchedulerCountMetrics,
     decision: &BufferedPacketsDecision,
-) -> bool;
+) -> Result<usize, ()>;
```
apfitzge (Author):

This change was made to improve the testing, which was seeing some inconsistent failures due to timing out in this call.

Those tests now loop until this call returns Ok(0) or an error.

Natural question: why did this only pop up for TransactionView?

The receive_and_buffer timeout for the view type is handled differently than for the sdk types.
The sdk types have a receive timeout, i.e. a time limit on receiving packets, and then deserialize all the packets they received regardless of how long that takes.

The view type has an overall timeout, so that all operations together, receive AND parsing, do not take longer than the specified time.

Since receiving takes almost no time or work, the sdk variant would consistently receive all the batches sent in the test and then take however long it needed to deserialize them; whether or not the OS scheduler preempts the thread doesn't matter. If the view variant of the test gets preempted during parsing, it can time out and miss the next message.
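
A hedged illustration of the two timeout styles described above, using plain std channels and a made-up parse helper; these are not the actual receive_and_buffer implementations, just the shape of "receive-only deadline, then parse everything" versus "one overall deadline covering receive and parse".

```rust
use std::{
    sync::mpsc::Receiver,
    time::{Duration, Instant},
};

fn parse(raw: Vec<u8>) -> usize {
    raw.len() // stand-in for deserialization/sanitization work
}

// sdk-style: the timeout bounds only the receive; everything received is then
// parsed afterward, however long that takes.
fn sdk_style(rx: &Receiver<Vec<u8>>, recv_timeout: Duration) -> usize {
    let deadline = Instant::now() + recv_timeout;
    let mut batches = Vec::new();
    while let Some(remaining) = deadline.checked_duration_since(Instant::now()) {
        match rx.recv_timeout(remaining) {
            Ok(batch) => batches.push(batch),
            Err(_) => break,
        }
    }
    // Parsing happens outside the timeout window.
    batches.into_iter().map(parse).sum()
}

// view-style: one overall deadline covers both receiving and parsing.
fn view_style(rx: &Receiver<Vec<u8>>, overall_timeout: Duration) -> usize {
    let deadline = Instant::now() + overall_timeout;
    let mut parsed = 0;
    while let Some(remaining) = deadline.checked_duration_since(Instant::now()) {
        match rx.recv_timeout(remaining) {
            Ok(batch) => parsed += parse(batch), // parsing counts against the deadline
            Err(_) => break,
        }
    }
    parsed
}
```

Under the second style, time spent parsing (or lost to preemption) directly reduces the time left to receive, which is consistent with why only the view-based tests hit the timeout.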

Reviewer:

Why do we have different timeout implementations between the two transaction structs? It feels more correct to have the timeout apply to both receive and buffer, but it will likely mean that we buffer fewer transactions before processing, right?

apfitzge (Author):

I plan on also changing the sdk approach to time out on receive AND buffer, since I agree this definitely seems more correct, but only after forwarding and ImmutableDeserializedPacket are removed; all the current receiving code needs to be reworked, and it doesn't make sense to rework it until IDP is dead, imo.

But I've held off on removing forwarding until this gets merged, since when I posted this there was some other stuff that needed to get merged before forwarding removal; in hindsight it may have been the wrong order.

Reviewer:

Ok, thanks for the context. So is it ok that we start processing transactions without buffering as many? It seems like the main loop in the scheduler controller is fairly lightweight aside from the work done for the pre_graph_filter inside process_transactions. If we end up running more, shorter scheduler loops, we will still be putting later-received transactions at the top of the prio-graph if they pay high priority fees, right?

apfitzge (Author):

The vast majority of packets we receive come in slightly before we become leader.

I had a heatmap of this at one time, and things may have changed (I should probably grab some trace data and re-generate it!). There used to be a small steady state of packets coming in during the minute before the first leader slot. Then, around 5-10 slots before becoming leader, there is a huge spike of traffic, but it falls off very quickly, with far fewer packets arriving during our leader slots.

All to say, it should be fine to run shorter calls to schedule if our loop is fast enough at returning to it. Once we are leader, we essentially always want to keep some amount of transactions queued to each worker thread, so the calls between schedule just need to be frequent enough that the scheduler can keep up with how fast the workers process, queuing more work up as they complete it.

@apfitzge apfitzge force-pushed the new_receive_and_buffer_2024_11_26 branch from 2b8807d to d579b84 Compare December 11, 2024 18:32
@apfitzge apfitzge requested a review from jstarry December 11, 2024 20:46
@apfitzge apfitzge force-pushed the new_receive_and_buffer_2024_11_26 branch 2 times, most recently from 05779cd to f2c9d44 Compare December 13, 2024 16:26
apfitzge (Author):

@tao-stones @jstarry, if you are back from vacation(s), I would really appreciate review on this.
These are somewhat significant changes to banking, and I've been holding up some other stuff until this gets merged to avoid having 100 conflicts.

@apfitzge apfitzge force-pushed the new_receive_and_buffer_2024_11_26 branch from f2c9d44 to cee2ff5 Compare January 2, 2025 15:52
tao-stones previously approved these changes Jan 2, 2025
tao-stones left a comment:

lgtm

Resolved review thread: core/src/banking_stage/packet_filter.rs
@apfitzge apfitzge force-pushed the new_receive_and_buffer_2024_11_26 branch from 4c4bae1 to 7fe132f Compare January 9, 2025 20:02
@apfitzge apfitzge requested a review from a team as a code owner January 9, 2025 20:02
apfitzge (Author) commented Jan 9, 2025:

@jstarry - rebased to fix merge conflicts with #4123

@apfitzge apfitzge requested a review from jstarry January 13, 2025 15:21
@apfitzge apfitzge requested a review from pgarg66 January 14, 2025 13:54
apfitzge (Author):

@pgarg66 adding you to review for @anza-xyz/svm

@apfitzge apfitzge merged commit 535ea1e into anza-xyz:master Jan 14, 2025
52 checks passed
@apfitzge apfitzge deleted the new_receive_and_buffer_2024_11_26 branch January 14, 2025 16:42