Scheduler Frequency Fixes #4545

Merged

Conversation

apfitzge

Problem

  • Scheduler is doing work in chunks that are too large, causing issues with block-packing

Summary of Changes

  • Limit the number of transactions scanned per prio-graph scheduling pass to 1k (defaults sketched below)
  • Receive only up to 5k packets in any single receive call
  • Make the Hold decision behave like the Consume decision (no timeout if the container is non-empty)
  • Receive completed transactions before scheduling
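
A minimal sketch of the new default limits (1k scan cap per pass and 256 look-ahead, per the commit messages below). The field names here are illustrative rather than the exact identifiers in the diff; only the values come from this PR:

struct PrioGraphSchedulerConfig {
    max_scanned_transactions_per_scheduling_pass: usize,
    look_ahead_window_size: usize, // name assumed; "256 look ahead" in the commit message
}

impl Default for PrioGraphSchedulerConfig {
    fn default() -> Self {
        Self {
            max_scanned_transactions_per_scheduling_pass: 1_000,
            look_ahead_window_size: 256,
        }
    }
}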

Fixes #

@apfitzge apfitzge added v2.0 Backport to v2.0 branch v2.1 Backport to v2.1 branch labels Jan 20, 2025
@apfitzge apfitzge self-assigned this Jan 20, 2025

mergify bot commented Jan 20, 2025

Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.

mergify bot commented Jan 20, 2025

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

@apfitzge apfitzge marked this pull request as ready for review January 20, 2025 13:41
@apfitzge apfitzge requested a review from alessandrod January 20, 2025 13:41
Comment on lines +275 to +277
if num_scanned >= self.config.max_scanned_transactions_per_scheduling_pass {
    break;
}

Behavioral change here: break the loop once the number of scanned transactions reaches the configured maximum.
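
For context, a minimal sketch of where this cap sits in the scan loop; the prio-graph lookup/insertion work is elided and the function signature is illustrative, only the counting and break mirror the diff:

fn scan_transactions(
    ids: impl Iterator<Item = u64>,
    max_scanned_transactions_per_scheduling_pass: usize,
) -> usize {
    let mut num_scanned = 0usize;
    for _id in ids {
        num_scanned += 1;
        // ... look up the transaction and insert it into the prio-graph ...
        if num_scanned >= max_scanned_transactions_per_scheduling_pass {
            // stop the pass once the configured maximum is reached, so a
            // single scheduling pass cannot balloon into an oversized chunk
            break;
        }
    }
    num_scanned
}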

};

let (received_packet_results, receive_time_us) = measure_us!(self
    .packet_receiver
-   .receive_packets(recv_timeout, remaining_queue_capacity, |packet| {
+   .receive_packets(recv_timeout, MAX_RECEIVE_PACKETS, |packet| {

@apfitzge apfitzge Jan 21, 2025

Behavioral change: the maximum number of packets received per call is 5000. This is a "loose maximum": we keep receiving until we have received >= 5000 packets, because the smallest unit we receive is a Vec<PacketBatch>, which can contain more than one packet.
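
A sketch of that loose-maximum behavior, assuming the receiver hands back whole Vec<PacketBatch> units; PacketBatch and the closure signature are stubbed here and not the real types:

const MAX_RECEIVE_PACKETS: usize = 5_000;

struct PacketBatch(Vec<[u8; 0]>); // placeholder for the real packet type

fn receive_until_loose_max(
    mut recv_batches: impl FnMut() -> Option<Vec<PacketBatch>>,
) -> Vec<PacketBatch> {
    let mut received = Vec::new();
    let mut num_packets = 0usize;
    // Keep pulling whole Vec<PacketBatch> units until at least
    // MAX_RECEIVE_PACKETS packets have been seen; the last unit may
    // overshoot the cap, which is why it is only a "loose maximum".
    while num_packets < MAX_RECEIVE_PACKETS {
        let Some(batches) = recv_batches() else { break };
        num_packets += batches.iter().map(|b| b.0.len()).sum::<usize>();
        received.extend(batches);
    }
    received
}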

const MAX_PACKET_RECEIVE_TIME: Duration = Duration::from_millis(10);
let (recv_timeout, should_buffer) = match decision {
-   BufferedPacketsDecision::Consume(_) => (
+   BufferedPacketsDecision::Consume(_) | BufferedPacketsDecision::Hold => (

Behavioral change: Hold can now result in a 0 timeout if the container is not empty. This is intended to avoid long 10ms timeouts when a Hold decision pops up between our leader slots (which can happen).
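
A minimal sketch of the resulting timeout selection; the enum is stubbed down to the relevant variants, should_buffer and the real Consume payload are omitted, and the fallback arm is a placeholder rather than the actual handling of the other decisions:

use std::time::Duration;

const MAX_PACKET_RECEIVE_TIME: Duration = Duration::from_millis(10);

enum BufferedPacketsDecision {
    Consume(u64), // payload stubbed; the real variant carries bank state
    Hold,
    Forward,
}

fn recv_timeout(decision: &BufferedPacketsDecision, container_is_empty: bool) -> Duration {
    match decision {
        // Consume and Hold now share the same rule: block for up to 10ms only
        // when nothing is buffered, otherwise poll with a zero timeout so the
        // scheduler is not stalled when a Hold shows up between leader slots.
        BufferedPacketsDecision::Consume(_) | BufferedPacketsDecision::Hold => {
            if container_is_empty {
                MAX_PACKET_RECEIVE_TIME
            } else {
                Duration::ZERO
            }
        }
        // other decisions: placeholder, not taken from the diff
        _ => MAX_PACKET_RECEIVE_TIME,
    }
}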

In the profiles I've been taking so far, we seem to spend exactly 20ms doing nothing in between slots. Curious to see what happens now.

Didn't you recently profile on the box with this change in there? Did you see something there?

self.receive_completed()?;
self.process_transactions(&decision)?;

Behavioral change: receive completed transactions just before scheduling instead of just after.
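
In loop terms, a sketch of the new ordering; the controller and method bodies are stubbed and only the call order reflects the diff:

struct SchedulerController;

impl SchedulerController {
    fn receive_completed(&mut self) -> Result<(), ()> {
        // drain finished transactions from the workers back into the container
        Ok(())
    }

    fn process_transactions(&mut self, _decision: &()) -> Result<(), ()> {
        // run a scheduling pass over the now up-to-date container
        Ok(())
    }

    fn tick(&mut self, decision: &()) -> Result<(), ()> {
        // New ordering: pull in completed work immediately before scheduling,
        // so completed work is accounted for before the next pass.
        self.receive_completed()?;
        self.process_transactions(decision)?;
        Ok(())
    }
}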

@apfitzge apfitzge merged commit 6e93845 into anza-xyz:master Jan 22, 2025
49 checks passed
mergify bot pushed a commit that referenced this pull request Jan 22, 2025
* Change PrioGraphSchedulerConfig::default for 1k maxs, 256 look ahead

* Break loop on scanned transaction count

* make Hold decision behave same as Consume during receive

* receive maximum of 5_000 packets - loose max

* receive_completed before process_transactions

(cherry picked from commit 6e93845)

# Conflicts:
#	core/src/banking_stage/transaction_scheduler/prio_graph_scheduler.rs
#	core/src/banking_stage/transaction_scheduler/receive_and_buffer.rs
#	core/src/banking_stage/transaction_scheduler/scheduler_controller.rs
mergify bot pushed a commit that referenced this pull request Jan 22, 2025
* Change PrioGraphSchedulerConfig::default for 1k maxs, 256 look ahead

* Break loop on scanned transaction count

* make Hold decision behave same as Consume during receive

* receive maximum of 5_000 packets - loose max

* receive_completed before process_transactions

(cherry picked from commit 6e93845)

# Conflicts:
#	core/src/banking_stage/transaction_scheduler/prio_graph_scheduler.rs
#	core/src/banking_stage/transaction_scheduler/receive_and_buffer.rs
#	core/src/banking_stage/transaction_scheduler/scheduler_controller.rs
apfitzge added a commit that referenced this pull request Jan 23, 2025
* Change prio_graph_scheduler configurations for 1k maxs, 256 look ahead

* Break loop on scanned transaction count

* make Hold decision behave same as Consume during receive

* receive maximum of 5_000 packets - loose max

* receive_completed before process_transactions

---------

Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>
apfitzge added a commit that referenced this pull request Jan 23, 2025
* Change prio_graph_scheduler configurations for 1k maxs, 256 look ahead

* Break loop on scanned transaction count

* make Hold decision behave same as Consume during receive

* receive maximum of 5_000 packets - loose max

* receive_completed before process_transactions

---------

Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>
nibty added a commit to x1-labs/tachyon that referenced this pull request Feb 5, 2025
* v2.0: Reclaims more old accounts in `clean` (backport of anza-xyz#4044) (anza-xyz#4089)

* Reclaims more old accounts in `clean` (anza-xyz#4044)

(cherry picked from commit 3d43824)

# Conflicts:
#	accounts-db/src/accounts_db.rs
#	accounts-db/src/accounts_db/tests.rs

* fix merge conflicts

---------

Co-authored-by: Brooks <brooks@anza.xyz>

* v2.0: Fixes clean_old_storages_with_reclaims tests (backport of anza-xyz#4147) (anza-xyz#4166)

* Fixes clean_old_storages_with_reclaims tests (anza-xyz#4147)

(cherry picked from commit 4eabeed)

# Conflicts:
#	accounts-db/src/accounts_db/tests.rs

* fix merge conflicts

---------

Co-authored-by: Brooks <brooks@anza.xyz>

* v2.0: blockstore: mark slot as dead on data shred merkle root conflict (backport of anza-xyz#3970) (anza-xyz#4074)

* blockstore: mark slot as dead on data shred merkle root conflict (anza-xyz#3970)

(cherry picked from commit 5564a94)

# Conflicts:
#	ledger/src/blockstore.rs

* fix conflicts

---------

Co-authored-by: Ashwin Sekar <ashwin@anza.xyz>
Co-authored-by: Ashwin Sekar <ashwin@solana.com>

* Bump version to v2.0.22 (anza-xyz#4200)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* v2.0: hardcode rust version for publish-crate (anza-xyz#4228)

* Bump version to v2.0.23 (anza-xyz#4419)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* v2.0: rolls out chained Merkle shreds to ~21% of mainnet slots (backport of anza-xyz#4431) (anza-xyz#4434)

rolls out chained Merkle shreds to ~21% of mainnet slots (anza-xyz#4431)

(cherry picked from commit 9d09787)

Co-authored-by: behzad nouri <behzadnouri@gmail.com>

* v2.0: [rpc] Fatal `getSignaturesForAddress()` when Bigtable errors (backport of anza-xyz#3700) (anza-xyz#4442)

[rpc] Fatal `getSignaturesForAddress()` when Bigtable errors (anza-xyz#3700)

* Unindent code in `get_signatures_for_address`

* Add a custom JSON-RPC error to throw when long-term storage (ie. Bigtable) can't be reached

* When the `before`/`until` signatures can't be found, throw `SignatureNotFound` instead of `RowNotFound`

* Fatal `getSignaturesForAddress` calls when Bigtable must be queried but can't be reached

(cherry picked from commit 52f132c)

Co-authored-by: Steven Luscher <steveluscher@users.noreply.github.com>

* v2.0: ci: bump [upload|download]-artifact to v4 (anza-xyz#4501)

ci: bump [upload|download]-artifact to v4

* v2.0: ci: hardcode crate publishing version (anza-xyz#4515)

ci: hardcode rust version for publish-crate

* Bump version to v2.0.24 (anza-xyz#4528)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* v2.0: fix: reduce max packet receive time during leader window (backport of anza-xyz#2801) (anza-xyz#4544)

fix: reduce max packet receive time during leader window (anza-xyz#2801)

(cherry picked from commit 20e0df4)

Co-authored-by: cavemanloverboy <93507302+cavemanloverboy@users.noreply.github.com>

* v2.0: Scheduler Frequency Fixes (backport of anza-xyz#4545) (anza-xyz#4576)

* Change prio_graph_scheduler configurations for 1k maxs, 256 look ahead

* Break loop on scanned transaction count

* make Hold decision behave same as Consume during receive

* receive maximum of 5_000 packets - loose max

* receive_completed before process_transactions

---------

Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Brooks <brooks@anza.xyz>
Co-authored-by: Ashwin Sekar <ashwin@anza.xyz>
Co-authored-by: Ashwin Sekar <ashwin@solana.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Yihau Chen <yihau.chen@icloud.com>
Co-authored-by: behzad nouri <behzadnouri@gmail.com>
Co-authored-by: Steven Luscher <steveluscher@users.noreply.github.com>
Co-authored-by: cavemanloverboy <93507302+cavemanloverboy@users.noreply.github.com>
Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>
nibty added a commit to x1-labs/tachyon that referenced this pull request Feb 6, 2025
(same commit message as the commit above)