Introduce subsystem benchmarking tool #2528

sandreim · 2023-11-28T13:32:03Z

This tool makes it easy to run parachain consensus stress/performance testing on your development machine or in CI.

Motivation

The parachain consensus node implementation spans many modules, which we call subsystems. Each subsystem is responsible for a small part of the parachain consensus pipeline's logic, but most load and performance issues are concentrated in a few core subsystems such as availability-recovery, approval-voting and dispute-coordinator. Without such a tool, we would have to run large testnets to load/stress test these parts of the system. Setting up such a large test and making sense of the amount of data it produces is very expensive, hard to orchestrate and a huge development time sink.

PR contents

- CLI tool
- Data Availability Read test
- Reusable mock-ups and components needed so far
- Documentation on how to get started

Data Availability Read test

An overseer is built using a real availability-recovery subsystem instance, while the subsystems it depends on, such as av-store, network-bridge and runtime-api, are mocked. The mocked network bridge emulates all network peers and how they answer requests.
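The CLI output further down reports a `PeerLatency { min_latency: 1ms, max_latency: 100ms }` setting. One plausible way for an emulated peer to use it is to delay each answer by a latency drawn from that range. The Rust sketch below assumes a uniform distribution; `PeerLatency` here is a hypothetical mirror of the logged value, and `XorShift64` and `sample_latency` are illustrative helpers, not the tool's actual API.

```rust
use std::time::Duration;

/// Hypothetical mirror of the `PeerLatency` value printed in the CLI header.
#[derive(Clone, Copy, Debug)]
pub struct PeerLatency {
    pub min_latency: Duration,
    pub max_latency: Duration,
}

/// Minimal xorshift64 generator so the sketch needs no external crates.
/// Good enough for emulating load, not for anything security-related.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // A zero state would make xorshift output zeros forever.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Assumption: an emulated peer delays each answer by a latency drawn
/// uniformly (microsecond granularity) from [min_latency, max_latency].
pub fn sample_latency(cfg: &PeerLatency, rng: &mut XorShift64) -> Duration {
    let min = cfg.min_latency.as_micros() as u64;
    let max = cfg.max_latency.as_micros() as u64;
    if max <= min {
        return cfg.min_latency;
    }
    // Modulo bias is negligible for a ~100k-wide range and a 64-bit source.
    Duration::from_micros(min + rng.next_u64() % (max - min + 1))
}
```

A mocked peer would then wait `sample_latency(&cfg, &mut rng)` before answering; an async mock would use its runtime's timer rather than blocking a thread.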

The test runs for a number of blocks. For each block, it sends a “RecoverAvailableData” request for an arbitrary number of candidates and waits for the subsystem to respond to all of them before moving on to the next block.
At the same time, we collect the usual subsystem metrics and task CPU metrics, and show progress reports while running.
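The per-block flow just described can be modelled with plain threads and channels. This is a simplified sketch under stated assumptions, not the tool's implementation: `RecoveryRequest`, `spawn_mock_subsystem` and `run_blocks` are hypothetical names, the mock answers requests one at a time, and the target block time is passed in as a parameter.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a `RecoverAvailableData` request: the candidate
/// index plus a channel the mocked subsystem answers on.
struct RecoveryRequest {
    candidate: usize,
    respond_to: mpsc::Sender<usize>,
}

/// Spawns a toy "subsystem" thread that answers each request after `work`.
/// Unlike the real subsystem, it handles requests sequentially.
fn spawn_mock_subsystem(work: Duration) -> mpsc::Sender<RecoveryRequest> {
    let (tx, rx) = mpsc::channel::<RecoveryRequest>();
    thread::spawn(move || {
        for req in rx {
            thread::sleep(work); // stand-in for fetching and decoding chunks
            let _ = req.respond_to.send(req.candidate);
        }
    });
    tx
}

/// Runs `n_blocks` blocks. Each block sends `n_candidates` requests, waits
/// for every answer, then sleeps for whatever is left of `block_time`.
/// Returns the measured processing time of each block.
fn run_blocks(
    subsystem: &mpsc::Sender<RecoveryRequest>,
    n_blocks: usize,
    n_candidates: usize,
    block_time: Duration,
) -> Vec<Duration> {
    let mut block_times = Vec::with_capacity(n_blocks);
    for _ in 0..n_blocks {
        let start = Instant::now();
        let (resp_tx, resp_rx) = mpsc::channel();
        for candidate in 0..n_candidates {
            let req = RecoveryRequest { candidate, respond_to: resp_tx.clone() };
            subsystem.send(req).expect("mock subsystem stopped");
        }
        // Drop our own sender so `iter()` ends early if requests are dropped.
        drop(resp_tx);
        let answered = resp_rx.iter().take(n_candidates).count();
        assert_eq!(answered, n_candidates, "some recoveries never completed");
        let elapsed = start.elapsed();
        block_times.push(elapsed);
        // "Sleeping till end of block": zero when the block overran.
        thread::sleep(block_time.saturating_sub(elapsed));
    }
    block_times
}
```

Waiting for every response before starting the next block is what makes the measured block time a direct proxy for how long the subsystem needs per block; when that exceeds the target, the remaining sleep is zero, as in the `Sleeping till end of block (0ms)` lines of the CLI output.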

Here is what the CLI output looks like:

```
[2023-11-28T13:06:27Z INFO  subsystem_bench::core::display] n_validators = 1000, n_cores = 20, pov_size = 5120 - 5120, error = 3, latency = Some(PeerLatency { min_latency: 1ms, max_latency: 100ms })
[2023-11-28T13:06:27Z INFO  subsystem-bench::availability] Generating template candidate index=0 pov_size=5242880
[2023-11-28T13:06:27Z INFO  subsystem-bench::availability] Created test environment.
[2023-11-28T13:06:27Z INFO  subsystem-bench::availability] Pre-generating 60 candidates.
[2023-11-28T13:06:30Z INFO  subsystem-bench::core] Initializing network emulation for 1000 peers.
[2023-11-28T13:06:30Z INFO  subsystem-bench::availability] Current block 1/3
[2023-11-28T13:06:30Z INFO  substrate_prometheus_endpoint] 〽️ Prometheus exporter started at 127.0.0.1:9999
[2023-11-28T13:06:30Z INFO  subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:37Z INFO  subsystem_bench::availability] Block time 6262ms
[2023-11-28T13:06:37Z INFO  subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:37Z INFO  subsystem-bench::availability] Current block 2/3
[2023-11-28T13:06:37Z INFO  subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:43Z INFO  subsystem_bench::availability] Block time 6369ms
[2023-11-28T13:06:43Z INFO  subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:43Z INFO  subsystem-bench::availability] Current block 3/3
[2023-11-28T13:06:43Z INFO  subsystem_bench::availability] 20 recoveries pending
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] Block time 6194ms
[2023-11-28T13:06:49Z INFO  subsystem-bench::availability] Sleeping till end of block (0ms)
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] All blocks processed in 18829ms
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] Throughput: 102400 KiB/block
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] Block time: 6276 ms
[2023-11-28T13:06:49Z INFO  subsystem_bench::availability] 

    Total received from network: 415 MiB
    Total sent to network: 724 KiB
    Total subsystem CPU usage 24.00s
    CPU usage per block 8.00s
    Total test environment CPU usage 0.15s
    CPU usage per block 0.05s
```
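The summary lines at the end of the run appear to follow directly from the configuration: 20 cores with 5120 KiB PoVs give 102400 KiB per block, and dividing the overall 18829 ms by 3 blocks gives the reported 6276 ms block time (integer division). The Rust sketch below spells out that arithmetic; `RunSummary` and its fields are hypothetical names, and the formulas are inferred from the logged numbers rather than taken from the tool's source.

```rust
/// Hypothetical summary of one run; values are read off the CLI output.
struct RunSummary {
    n_blocks: u64,
    n_cores: u64,         // candidates recovered per block
    pov_size_kib: u64,    // 5120 KiB = 5_242_880 bytes
    total_ms: u64,        // "All blocks processed in ..."
    subsystem_cpu_s: f64, // "Total subsystem CPU usage"
    test_env_cpu_s: f64,  // "Total test environment CPU usage"
}

impl RunSummary {
    /// Assumed: every core's PoV is recovered once per block.
    fn throughput_kib_per_block(&self) -> u64 {
        self.n_cores * self.pov_size_kib
    }

    /// Overall run time divided by the block count (integer milliseconds).
    fn avg_block_time_ms(&self) -> u64 {
        self.total_ms / self.n_blocks
    }

    fn subsystem_cpu_per_block_s(&self) -> f64 {
        self.subsystem_cpu_s / self.n_blocks as f64
    }

    fn test_env_cpu_per_block_s(&self) -> f64 {
        self.test_env_cpu_s / self.n_blocks as f64
    }
}
```

With `n_blocks = 3`, `n_cores = 20`, `pov_size_kib = 5120`, `total_ms = 18829`, `subsystem_cpu_s = 24.0` and `test_env_cpu_s = 0.15`, these methods return 102400, 6276, 8.00 and (to two decimals) 0.05, matching the summary lines above.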

Prometheus/Grafana stack in action

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>

alexggh

Looks good to me!

polkadot/node/subsystem-bench/src/core/mock/network_bridge.rs

polkadot/node/network/availability-recovery/src/lib.rs

paritytech-cicd-pr · 2023-12-14T09:09:34Z

The CI pipeline was cancelled due to a failure in one of the required jobs.
Job name: cargo-clippy
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4723275

## Summary

Built on top of the tooling and ideas introduced in #2528, this PR introduces a synthetic benchmark for measuring and assessing the performance characteristics of the approval-voting and approval-distribution subsystems. Currently, this allows us to simulate the behaviour of these subsystems along the following dimensions:

```
TestConfiguration:
# Test 1
- objective: !ApprovalsTest
    last_considered_tranche: 89
    min_coalesce: 1
    max_coalesce: 6
    enable_assignments_v2: true
    send_till_tranche: 60
    stop_when_approved: false
    coalesce_tranche_diff: 12
    workdir_prefix: "/tmp"
    num_no_shows_per_candidate: 0
    approval_distribution_expected_tof: 6.0
    approval_distribution_cpu_ms: 3.0
    approval_voting_cpu_ms: 4.30
  n_validators: 500
  n_cores: 100
  n_included_candidates: 100
  min_pov_size: 1120
  max_pov_size: 5120
  peer_bandwidth: 524288000000
  bandwidth: 524288000000
  latency:
    min_latency:
      secs: 0
      nanos: 1000000
    max_latency:
      secs: 0
      nanos: 100000000
  error: 0
  num_blocks: 10
```

## The approach

1. We build a real overseer with the real implementations of the approval-voting and approval-distribution subsystems.
2. For a given network size, we pre-compute, for each validator, all potential assignments and approvals it would send. Because this is a computationally heavy operation, the result is cached in a file on disk and re-used as long as the generation parameters don't change.
3. The messages are sent according to the configured parameters, which are split into three main benchmarking scenarios.

## Benchmarking scenarios

### Best case scenario

*approvals_throughput_best_case.yaml*

It sends to approval-distribution only the minimum tranches required to gather `needed_approvals`, so that a candidate is approved.

### Behaviour in the presence of no-shows

*approvals_no_shows.yaml*

It sends the tranches needed to approve a candidate when there are at most *num_no_shows_per_candidate* tranches with no-shows for each candidate.
### Maximum throughput

*approvals_throughput.yaml*

It sends all the tranches for each block and measures the CPU usage and network bandwidth required by the approval-voting and approval-distribution subsystems.

## How to run it

```
cargo run -p polkadot-subsystem-bench --release -- test-sequence --path polkadot/node/subsystem-bench/examples/approvals_throughput.yaml
```

## Evaluating performance

### Use the real subsystem metrics

If you follow the steps in https://github.com/paritytech/polkadot-sdk/tree/master/polkadot/node/subsystem-bench#install-grafana to install Prometheus and Grafana locally, all real metrics for the `approval-distribution`, `approval-voting` and overseer are available, e.g.:

<img width="2149" alt="Screenshot 2023-12-05 at 11 07 46" src="https://github.com/paritytech/polkadot-sdk/assets/49718502/cb8ae2dd-178b-4922-bfa4-dc37e572ed38">
<img width="2551" alt="Screenshot 2023-12-05 at 11 09 42" src="https://github.com/paritytech/polkadot-sdk/assets/49718502/8b4542ba-88b9-46f9-9b70-cc345366081b">
<img width="2154" alt="Screenshot 2023-12-05 at 11 10 15" src="https://github.com/paritytech/polkadot-sdk/assets/49718502/b8874d8d-632e-443a-9840-14ad8e90c54f">
<img width="2535" alt="Screenshot 2023-12-05 at 11 10 52" src="https://github.com/paritytech/polkadot-sdk/assets/49718502/779a439f-fd18-4985-bb80-85d5afad78e2">

### Profile with pyroscope

1. Set up Pyroscope following the steps in https://github.com/paritytech/polkadot-sdk/tree/master/polkadot/node/subsystem-bench#install-pyroscope, then run any of the benchmark scenarios with the `--profile` argument.
2. Open the Pyroscope dashboard in Grafana, e.g.:

<img width="2544" alt="Screenshot 2024-01-09 at 17 09 58" src="https://github.com/paritytech/polkadot-sdk/assets/49718502/58f50c99-a910-4d20-951a-8b16639303d9">

### Useful logs

1. Network bandwidth requirements:
```
Payload bytes received from peers: 503993 KiB total, 50399 KiB/block
Payload bytes sent to peers: 629971 KiB total, 62997 KiB/block
```
2. CPU usage by the approval-distribution/approval-voting subsystems:
```
approval-distribution CPU usage 84.061s
approval-distribution CPU usage per block 8.406s
approval-voting CPU usage 96.532s
approval-voting CPU usage per block 9.653s
```
3. Time elapsed until a given block is approved:
```
Chain selection approved after 3500 ms hash=0x0101010101010101010101010101010101010101010101010101010101010101
Chain selection approved after 4500 ms hash=0x0202020202020202020202020202020202020202020202020202020202020202
```

### Using the benchmark to quantify improvements from #1178 + #1191

Using a Versi node, we compare the scenario where all new optimisations are disabled with a scenario where tranche-0 assignments are sent in a single message and approval coalescing is conservatively simulated as giving only a 50% reduction in the number of messages we send. Overall, we see a speedup of around 30-40% in the time it takes to process the necessary messages and a 30-40% reduction in the necessary bandwidth.

#### Best-case scenario comparison (minimum required tranches sent)
Unoptimised:
```
Number of blocks: 10
Payload bytes received from peers: 53289 KiB total, 5328 KiB/block
Payload bytes sent to peers: 52489 KiB total, 5248 KiB/block
approval-distribution CPU usage 6.732s
approval-distribution CPU usage per block 0.673s
approval-voting CPU usage 9.523s
approval-voting CPU usage per block 0.952s
```
vs. optimisations enabled:
```
Number of blocks: 10
Payload bytes received from peers: 32141 KiB total, 3214 KiB/block
Payload bytes sent to peers: 37314 KiB total, 3731 KiB/block
approval-distribution CPU usage 4.658s
approval-distribution CPU usage per block 0.466s
approval-voting CPU usage 6.236s
approval-voting CPU usage per block 0.624s
```

#### Worst case, all tranches sent (very unlikely; happens when sharding breaks)

Unoptimised:
```
Number of blocks: 10
Payload bytes received from peers: 746393 KiB total, 74639 KiB/block
Payload bytes sent to peers: 729151 KiB total, 72915 KiB/block
approval-distribution CPU usage 118.681s
approval-distribution CPU usage per block 11.868s
approval-voting CPU usage 124.118s
approval-voting CPU usage per block 12.412s
```
vs. optimised:
```
Number of blocks: 10
Payload bytes received from peers: 503993 KiB total, 50399 KiB/block
Payload bytes sent to peers: 629971 KiB total, 62997 KiB/block
approval-distribution CPU usage 84.061s
approval-distribution CPU usage per block 8.406s
approval-voting CPU usage 96.532s
approval-voting CPU usage per block 9.653s
```

## TODOs

- [x] Polish implementation.
- [x] Use what we have so far to evaluate #1191 before merging.
- [x] List of features and additional dimensions we want to use for benchmarking.
- [x] Run benchmark on hardware similar to Versi and Kusama nodes.
- [ ] Add benchmark to be run in CI for catching regression in performance.
- [ ] Rebase on latest changes for network emulation.
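The relative changes behind the optimised vs. unoptimised comparison can be recomputed from the logged totals. The Rust sketch below does that as `100 * (before - after) / before` per metric, and also shows that the logged per-block values are simply the 10-block totals divided by 10. `Comparison`, `best_case`, `worst_case` and `report` are hypothetical helpers; the numbers are copied from the logs above.

```rust
/// One metric from the logs, with optimisations disabled and enabled.
struct Comparison {
    metric: &'static str,
    unoptimised: f64,
    optimised: f64,
}

/// Relative reduction in percent: 100 * (before - after) / before.
fn reduction_pct(c: &Comparison) -> f64 {
    100.0 * (c.unoptimised - c.optimised) / c.unoptimised
}

/// Per-block value as printed in the logs: run total / number of blocks.
fn per_block(total: f64, n_blocks: u32) -> f64 {
    total / f64::from(n_blocks)
}

/// 10-block totals, best-case scenario (minimum required tranches sent).
fn best_case() -> Vec<Comparison> {
    vec![
        Comparison { metric: "payload received (KiB)", unoptimised: 53289.0, optimised: 32141.0 },
        Comparison { metric: "payload sent (KiB)", unoptimised: 52489.0, optimised: 37314.0 },
        Comparison { metric: "approval-distribution CPU (s)", unoptimised: 6.732, optimised: 4.658 },
        Comparison { metric: "approval-voting CPU (s)", unoptimised: 9.523, optimised: 6.236 },
    ]
}

/// 10-block totals, worst-case scenario (all tranches sent).
fn worst_case() -> Vec<Comparison> {
    vec![
        Comparison { metric: "payload received (KiB)", unoptimised: 746393.0, optimised: 503993.0 },
        Comparison { metric: "payload sent (KiB)", unoptimised: 729151.0, optimised: 629971.0 },
        Comparison { metric: "approval-distribution CPU (s)", unoptimised: 118.681, optimised: 84.061 },
        Comparison { metric: "approval-voting CPU (s)", unoptimised: 124.118, optimised: 96.532 },
    ]
}

/// One formatted line per metric with the reduction to one decimal place.
fn report(rows: &[Comparison]) -> Vec<String> {
    rows.iter()
        .map(|c| format!("{}: {:.1}% lower", c.metric, reduction_pct(c)))
        .collect()
}
```

For example, `report(&best_case())[1]` is `payload sent (KiB): 28.9% lower`, and `per_block(503993.0, 10)` gives 50399.3, which the logs print truncated as `50399 KiB/block`.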

---------
Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Co-authored-by: Andrei Sandu <andrei-mihail@parity.io>
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>

Polkadot-Forum · 2024-05-21T14:59:54Z

This pull request has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/what-are-subsystem-benchmarks/8212/1

sandreim added 30 commits October 25, 2023 11:50

- skeleton (01af630)
- wip (7c22abe)
- measure tput and fixes (c3adc77)
- add network emulation (31b0351)
- cleanup (e4bb037)
- Add latency emulation (a694924)
- support multiple pov sizes (7ca4dba)
- new metric in recovery and more testing (0430b5b)
- CLI update and fixes (027bcd8)
- peer stats (5a05da0)
- Switch stats to atomics (895e8d6)
- add more network metrics, new load generator (a2fb0c9)
- refactor (d1b9fa3)
- pretty cli + minor refactor + remove unused (c5937ab)
- update (d6c259d)
- remove comment (050529b)
- separate cli options for availability (cb38be5)
- implement unified and extensible configuration (24a736a)
- Prepare to swtich to overseer (2843865)
- Merge branch 'master' of github.com:paritytech/polkadot-sdk into sandreim/subsystem-bench (fd4620e)
- add mocked subsystems (b17a147)
- full overseer based implementation complete (4724d8c)
- make clean (7aed30f)
- more cleaning (b51485b)
- more cleaning (7e46444)
- proper overseer control (d3df927)
- refactor CLI display of env stats (7557768)
- Add grafana dashboards for DA read (787dc00)
- network stats fixes (cd18f8d)
- move examples and grafana (e8506b3)
- fix comment (29d80fa)

alexggh mentioned this pull request Dec 5, 2023

Introduce approval-voting/distribution benchmark #2621

Merged

alexggh approved these changes Dec 5, 2023


sandreim added 4 commits December 8, 2023 13:35

- Merge branch 'master' of github.com:paritytech/polkadot-sdk into sandreim/subsystem-bench (74e68bb)
- cargo lock (4d21e5b)
- more review feedback (3e25fdc)
- change back to debug (1458a73)

alindima approved these changes Dec 8, 2023

sandreim added the R0-silent Changes should not be mentioned in any release notes label Dec 11, 2023

sandreim added 7 commits December 12, 2023 14:11

- fix test build (baa124e)
- fix markdown (fde982f)
- fix test (47c2643)
- taplo fix (8b49077)
- Merge branch 'master' of github.com:paritytech/polkadot-sdk into sandreim/subsystem-bench (42f6834)
- cargo lock (4c86691)
- clippy (bd128b3)
- more clippy (1021efb)

sandreim merged commit 8a6e9ef into master Dec 14, 2023
114 of 116 checks passed

sandreim deleted the sandreim/subsystem-bench branch December 14, 2023 10:57

github-actions bot mentioned this pull request Feb 19, 2024

Update substrate/polkadot/cumulus from v1.3.0 to v1.6.0 moondance-labs/tanssi#419

Closed

github-actions bot mentioned this pull request Mar 13, 2024

Update polkadot-sdk from v1.3.0 to v1.7.2 moonbeam-foundation/moonbeam#2703

Closed

bkchr pushed a commit that referenced this pull request Apr 10, 2024

bridges subtree fixes (#2528)

4dd35b6

This was referenced Jun 5, 2024

Update polkadot-sdk from v1.7.0 to v1.11.0 moondance-labs/tanssi#573

Closed

Update polkadot-sdk from v1.10.0 to v1.11.0 moondance-labs/tanssi#577

Closed
