[forge] custom profiles & add processor latency metrics #14924

rustielin · 2024-10-10T00:47:12Z

Description

Add indexer processor metrics to Forge. This does not include metrics from the indexer processors written with the processor SDK.

Also add support for providing a non-default forge deployer profile. By default tests use the forge profile, which is configured here: https://github.com/aptos-labs/internal-ops/blob/main/infra/cli/commands/forge/profiles.ts. For components outside of aptos-core like the indexer processors, it uses the build tagged main. However there are cases where we want to test experimental processors as well. In this case to keep the configuration easy to manage, we recommend creating a new test profile i.e. forge-indexer-sdk-test-metrics (some concise descriptive name), which specifies an exact build

After this PR lands, devs can use the FORGE_INDEXER_DEPLOYER_PROFILE var in adhoc forge to trigger their tests with new profiles, which can include experimental builds of indexer processors

In the future we might want to expose the entire deployer config to aptos-core too, and expose all test configuration in a config file rather than rust in main.rs of the forge-cli itself.

Bonus:

Start attempting to tune the indexer_test suite success thresholds

How Has This Been Tested?

CI + workflow dispatch once it lands

Also spin up using the create command with a different profile minimal:

cargo run -p aptos-forge-cli -- operator create --namespace forge-rustielin-test --num-validators 1 --enable-indexer --deployer-profile minimal

Key Areas to Review

Type of Change

Which Components or Systems Does This Change Impact?

Checklist

I have read and followed the CONTRIBUTING doc
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I identified and added all stakeholders and component owners affected by this change as reviewers
I tested both happy and unhappy path of the functionality
I have made corresponding changes to the documentation

trunk-io · 2024-10-10T00:47:15Z

⏱️ 5h 25m total CI duration on this PR

Slowest 15 Jobs	Cumulative Duration	Recent Runs
execution-performance / single-node-performance	1h 29m	🟥 🟩 🟩 🟩
dispatch_event	19m	🟩
test-target-determinator	15m	🟩 🟩 🟩 🟩
dispatch_event	15m	🟩
check	15m	🟩 🟩 🟩 🟩
dispatch_event	14m	🟩
rust-cargo-deny	14m	🟩 🟩 🟩 🟩 🟩 (+3 more)
execution-performance / test-target-determinator	14m	🟩 🟩 🟩 🟩
dispatch_event	13m	🟥
check-dynamic-deps	11m	🟩 🟩 🟩 🟩 🟩 (+3 more)
forge-framework-upgrade-test / forge	10m	🟥
rust-move-tests	10m	🟩
rust-move-tests	10m	🟩
rust-move-tests	9m	🟩
rust-move-tests	9m	🟩

🚨 2 jobs on the last run were significantly faster/slower than expected

Job	Duration	vs 7d avg	Delta
execution-performance / test-target-determinator	3m	4m
test-target-determinator	3m	4m

_{settings ⋅ feedback ⋅ docs ⋅ learn more about trunk.io}

graphite-app · 2024-10-10T00:47:42Z

testsuite/forge/src/interface/prometheus_metrics.rs

+        samples.insert(
+            LatencyBreakdownSlice::IndexerProcessorLatency,
+            MetricSamples::new(indexer_processor_latency_samples),
+        );


Suggested change

samples.insert(

LatencyBreakdownSlice::IndexerProcessorLatency,

MetricSamples::new(indexer_processor_latency_samples),

);

Remove these lines as they duplicate the insertion of `IndexerProcessorLatency` samples. This insertion has already been performed at lines 277-280.

This duplication could lead to unexpected behavior or data inconsistencies. Consider removing one of the insertions to maintain a single, accurate representation of the IndexerProcessorLatency metrics in the samples collection.

Spotted by Graphite Reviewer

Is this helpful? React 👍 or 👎 to let us know.

rtso · 2024-10-15T22:14:42Z

testsuite/forge-cli/src/main.rs

@@ -2400,7 +2427,7 @@ fn indexer_test() -> ForgeConfig {
        (
            TransactionWorkload::new(TransactionTypeArg::TokenV2AmbassadorMint, 20000)
                .with_unique_senders(),
-            (3200, 0.5, 0.5, 0.5),
+            (2000, 0.5, 0.5, 0.5),


If I want to run an adhoc forge with only one of the workloads, would I need to create a PR commenting out the other workloads and run the adhoc from that PR?

yea, or the better way is to create (in a separate PR) a testsuite for each that you want to run independently. So instead of using the indexer_test suite maybe you have like indexer_{p2p, graffio, ...}

github-actions · 2024-10-17T02:52:14Z

✅ Forge suite `realistic_env_max_load` success on `06dd8be368239cd186c92476aad9bbc61220ac67`

two traffics test: inner traffic : committed: 13799.55 txn/s, submitted: 13800.03 txn/s, expired: 0.47 txn/s, latency: 2877.75 ms, (p50: 2700 ms, p70: 3000, p90: 3100 ms, p99: 7600 ms), latency samples: 5246940
two traffics test : committed: 99.98 txn/s, latency: 2182.26 ms, (p50: 1500 ms, p70: 2500, p90: 2900 ms, p99: 9100 ms), latency samples: 1760
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.254, avg: 0.220", "QsPosToProposal: max: 0.837, avg: 0.718", "ConsensusProposalToOrdered: max: 0.323, avg: 0.304", "ConsensusOrderedToCommit: max: 0.498, avg: 0.466", "ConsensusProposalToCommit: max: 0.797, avg: 0.769"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.81s no progress at version 2498572 (avg 0.21s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 8.23s no progress at version 2498570 (avg 8.23s) [limit 15].
Test Ok

github-actions · 2024-10-17T02:54:54Z

✅ Forge suite `compat` success on `b29f09f57e898d8d211c8bc3e303f6e50bba2266` ==> `06dd8be368239cd186c92476aad9bbc61220ac67`

Compatibility test results for b29f09f57e898d8d211c8bc3e303f6e50bba2266 ==> 06dd8be368239cd186c92476aad9bbc61220ac67 (PR)
1. Check liveness of validators at old version: b29f09f57e898d8d211c8bc3e303f6e50bba2266
compatibility::simple-validator-upgrade::liveness-check : committed: 12298.09 txn/s, latency: 2315.98 ms, (p50: 1700 ms, p70: 1900, p90: 2300 ms, p99: 27000 ms), latency samples: 491620
2. Upgrading first Validator to new version: 06dd8be368239cd186c92476aad9bbc61220ac67
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7249.39 txn/s, latency: 3870.10 ms, (p50: 4300 ms, p70: 4500, p90: 4800 ms, p99: 5000 ms), latency samples: 133480
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6701.62 txn/s, latency: 4760.87 ms, (p50: 5100 ms, p70: 5500, p90: 6500 ms, p99: 6700 ms), latency samples: 221660
3. Upgrading rest of first batch to new version: 06dd8be368239cd186c92476aad9bbc61220ac67
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 6591.04 txn/s, latency: 4152.84 ms, (p50: 4400 ms, p70: 4700, p90: 5700 ms, p99: 6000 ms), latency samples: 124960
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 7341.66 txn/s, latency: 4373.50 ms, (p50: 4500 ms, p70: 4600, p90: 6300 ms, p99: 6500 ms), latency samples: 245000
4. upgrading second batch to new version: 06dd8be368239cd186c92476aad9bbc61220ac67
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 10661.15 txn/s, latency: 2517.23 ms, (p50: 2500 ms, p70: 2900, p90: 3400 ms, p99: 3700 ms), latency samples: 189560
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 9836.65 txn/s, latency: 3088.10 ms, (p50: 2700 ms, p70: 3600, p90: 4400 ms, p99: 5600 ms), latency samples: 328060
5. check swarm health
Compatibility test for b29f09f57e898d8d211c8bc3e303f6e50bba2266 ==> 06dd8be368239cd186c92476aad9bbc61220ac67 passed
Test Ok

github-actions · 2024-10-17T02:55:11Z

✅ Forge suite `framework_upgrade` success on `b29f09f57e898d8d211c8bc3e303f6e50bba2266` ==> `06dd8be368239cd186c92476aad9bbc61220ac67`

Compatibility test results for b29f09f57e898d8d211c8bc3e303f6e50bba2266 ==> 06dd8be368239cd186c92476aad9bbc61220ac67 (PR)
Upgrade the nodes to version: 06dd8be368239cd186c92476aad9bbc61220ac67
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1212.93 txn/s, submitted: 1216.07 txn/s, failed submission: 3.13 txn/s, expired: 3.13 txn/s, latency: 2560.89 ms, (p50: 2400 ms, p70: 2700, p90: 3800 ms, p99: 4900 ms), latency samples: 108340
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1097.50 txn/s, submitted: 1100.47 txn/s, failed submission: 2.98 txn/s, expired: 2.98 txn/s, latency: 2757.64 ms, (p50: 2400 ms, p70: 2900, p90: 5100 ms, p99: 7000 ms), latency samples: 95900
5. check swarm health
Compatibility test for b29f09f57e898d8d211c8bc3e303f6e50bba2266 ==> 06dd8be368239cd186c92476aad9bbc61220ac67 passed
Upgrade the remaining nodes to version: 06dd8be368239cd186c92476aad9bbc61220ac67
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1101.11 txn/s, submitted: 1104.25 txn/s, failed submission: 3.15 txn/s, expired: 3.15 txn/s, latency: 2697.95 ms, (p50: 2400 ms, p70: 3000, p90: 4500 ms, p99: 5800 ms), latency samples: 97920
Test Ok

graphite-app bot reviewed Oct 10, 2024

View reviewed changes

rustielin force-pushed the rustielin/indexer-forge-processor-latency branch 2 times, most recently from 2d2228e to cc13744 Compare October 10, 2024 21:46

rustielin changed the title ~~[WIP][forge] add swarm metrics for indexer processor latency~~ [WIP][forge] custom profiles & add processor latency metrics Oct 11, 2024

rustielin changed the title ~~[WIP][forge] custom profiles & add processor latency metrics~~ [forge] custom profiles & add processor latency metrics Oct 11, 2024

rustielin force-pushed the rustielin/indexer-forge-processor-latency branch from cc13744 to c1f430d Compare October 11, 2024 19:44

rustielin marked this pull request as ready for review October 11, 2024 19:45

rustielin requested a review from a team as a code owner October 11, 2024 19:45

rustielin requested review from rtso and a team October 11, 2024 19:46

rustielin added the CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR label Oct 11, 2024