Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[forge] custom profiles & add processor latency metrics #14924

Merged
merged 4 commits into from
Oct 17, 2024

Conversation

rustielin
Copy link
Contributor

@rustielin rustielin commented Oct 10, 2024

Description

Add indexer processor metrics to Forge. This does not include metrics from the indexer processors written with the processor SDK.

Also add support for providing a non-default forge deployer profile. By default tests use the forge profile, which is configured here: https://github.com/aptos-labs/internal-ops/blob/main/infra/cli/commands/forge/profiles.ts. For components outside of aptos-core like the indexer processors, it uses the build tagged main. However there are cases where we want to test experimental processors as well. In this case to keep the configuration easy to manage, we recommend creating a new test profile i.e. forge-indexer-sdk-test-metrics (some concise descriptive name), which specifies an exact build

After this PR lands, devs can use the FORGE_INDEXER_DEPLOYER_PROFILE var in adhoc forge to trigger their tests with new profiles, which can include experimental builds of indexer processors

In the future we might want to expose the entire deployer config to aptos-core too, and expose all test configuration in a config file rather than rust in main.rs of the forge-cli itself.

Bonus:

  • Start attempting to tune the indexer_test suite success thresholds

How Has This Been Tested?

CI + workflow dispatch once it lands

Also spin up using the create command with a different profile minimal:

cargo run -p aptos-forge-cli -- operator create --namespace forge-rustielin-test --num-validators 1 --enable-indexer --deployer-profile minimal     

Key Areas to Review

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Move Compiler
  • Other (specify)

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

Copy link

trunk-io bot commented Oct 10, 2024

⏱️ 5h 25m total CI duration on this PR
Slowest 15 Jobs Cumulative Duration Recent Runs
execution-performance / single-node-performance 1h 29m 🟥🟩🟩🟩
dispatch_event 19m 🟩
test-target-determinator 15m 🟩🟩🟩🟩
dispatch_event 15m 🟩
check 15m 🟩🟩🟩🟩
dispatch_event 14m 🟩
rust-cargo-deny 14m 🟩🟩🟩🟩🟩 (+3 more)
execution-performance / test-target-determinator 14m 🟩🟩🟩🟩
dispatch_event 13m 🟥
check-dynamic-deps 11m 🟩🟩🟩🟩🟩 (+3 more)
forge-framework-upgrade-test / forge 10m 🟥
rust-move-tests 10m 🟩
rust-move-tests 10m 🟩
rust-move-tests 9m 🟩
rust-move-tests 9m 🟩

🚨 2 jobs on the last run were significantly faster/slower than expected

Job Duration vs 7d avg Delta
execution-performance / test-target-determinator 3m 4m -26%
test-target-determinator 3m 4m -26%

settingsfeedbackdocs ⋅ learn more about trunk.io

Comment on lines +294 to +295
samples.insert(
LatencyBreakdownSlice::IndexerProcessorLatency,
MetricSamples::new(indexer_processor_latency_samples),
);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
samples.insert(
LatencyBreakdownSlice::IndexerProcessorLatency,
MetricSamples::new(indexer_processor_latency_samples),
);
Remove these lines as they duplicate the insertion of `IndexerProcessorLatency` samples. This insertion has already been performed at lines 277-280.

This duplication could lead to unexpected behavior or data inconsistencies. Consider removing one of the insertions to maintain a single, accurate representation of the IndexerProcessorLatency metrics in the samples collection.

Spotted by Graphite Reviewer

Is this helpful? React 👍 or 👎 to let us know.

@rustielin rustielin force-pushed the rustielin/indexer-forge-processor-latency branch 2 times, most recently from 2d2228e to cc13744 Compare October 10, 2024 21:46
@rustielin rustielin changed the title [WIP][forge] add swarm metrics for indexer processor latency [WIP][forge] custom profiles & add processor latency metrics Oct 11, 2024
@rustielin rustielin changed the title [WIP][forge] custom profiles & add processor latency metrics [forge] custom profiles & add processor latency metrics Oct 11, 2024
@rustielin rustielin force-pushed the rustielin/indexer-forge-processor-latency branch from cc13744 to c1f430d Compare October 11, 2024 19:44
@rustielin rustielin marked this pull request as ready for review October 11, 2024 19:45
@rustielin rustielin requested a review from a team as a code owner October 11, 2024 19:45
@rustielin rustielin requested review from rtso and a team October 11, 2024 19:46
@rustielin rustielin added the CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR label Oct 11, 2024

This comment has been minimized.

This comment has been minimized.

@rustielin rustielin force-pushed the rustielin/indexer-forge-processor-latency branch from c1f430d to 1729917 Compare October 15, 2024 17:49

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

@@ -2400,7 +2427,7 @@ fn indexer_test() -> ForgeConfig {
(
TransactionWorkload::new(TransactionTypeArg::TokenV2AmbassadorMint, 20000)
.with_unique_senders(),
(3200, 0.5, 0.5, 0.5),
(2000, 0.5, 0.5, 0.5),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I want to run an adhoc forge with only one of the workloads, would I need to create a PR commenting out the other workloads and run the adhoc from that PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, or the better way is to create (in a separate PR) a testsuite for each that you want to run independently. So instead of using the indexer_test suite maybe you have like indexer_{p2p, graffio, ...}

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

@rustielin rustielin enabled auto-merge (squash) October 16, 2024 23:40

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

Copy link
Contributor

✅ Forge suite realistic_env_max_load success on 06dd8be368239cd186c92476aad9bbc61220ac67

two traffics test: inner traffic : committed: 13799.55 txn/s, submitted: 13800.03 txn/s, expired: 0.47 txn/s, latency: 2877.75 ms, (p50: 2700 ms, p70: 3000, p90: 3100 ms, p99: 7600 ms), latency samples: 5246940
two traffics test : committed: 99.98 txn/s, latency: 2182.26 ms, (p50: 1500 ms, p70: 2500, p90: 2900 ms, p99: 9100 ms), latency samples: 1760
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.254, avg: 0.220", "QsPosToProposal: max: 0.837, avg: 0.718", "ConsensusProposalToOrdered: max: 0.323, avg: 0.304", "ConsensusOrderedToCommit: max: 0.498, avg: 0.466", "ConsensusProposalToCommit: max: 0.797, avg: 0.769"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.81s no progress at version 2498572 (avg 0.21s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 8.23s no progress at version 2498570 (avg 8.23s) [limit 15].
Test Ok

Copy link
Contributor

✅ Forge suite compat success on b29f09f57e898d8d211c8bc3e303f6e50bba2266 ==> 06dd8be368239cd186c92476aad9bbc61220ac67

Compatibility test results for b29f09f57e898d8d211c8bc3e303f6e50bba2266 ==> 06dd8be368239cd186c92476aad9bbc61220ac67 (PR)
1. Check liveness of validators at old version: b29f09f57e898d8d211c8bc3e303f6e50bba2266
compatibility::simple-validator-upgrade::liveness-check : committed: 12298.09 txn/s, latency: 2315.98 ms, (p50: 1700 ms, p70: 1900, p90: 2300 ms, p99: 27000 ms), latency samples: 491620
2. Upgrading first Validator to new version: 06dd8be368239cd186c92476aad9bbc61220ac67
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7249.39 txn/s, latency: 3870.10 ms, (p50: 4300 ms, p70: 4500, p90: 4800 ms, p99: 5000 ms), latency samples: 133480
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6701.62 txn/s, latency: 4760.87 ms, (p50: 5100 ms, p70: 5500, p90: 6500 ms, p99: 6700 ms), latency samples: 221660
3. Upgrading rest of first batch to new version: 06dd8be368239cd186c92476aad9bbc61220ac67
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 6591.04 txn/s, latency: 4152.84 ms, (p50: 4400 ms, p70: 4700, p90: 5700 ms, p99: 6000 ms), latency samples: 124960
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 7341.66 txn/s, latency: 4373.50 ms, (p50: 4500 ms, p70: 4600, p90: 6300 ms, p99: 6500 ms), latency samples: 245000
4. upgrading second batch to new version: 06dd8be368239cd186c92476aad9bbc61220ac67
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 10661.15 txn/s, latency: 2517.23 ms, (p50: 2500 ms, p70: 2900, p90: 3400 ms, p99: 3700 ms), latency samples: 189560
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 9836.65 txn/s, latency: 3088.10 ms, (p50: 2700 ms, p70: 3600, p90: 4400 ms, p99: 5600 ms), latency samples: 328060
5. check swarm health
Compatibility test for b29f09f57e898d8d211c8bc3e303f6e50bba2266 ==> 06dd8be368239cd186c92476aad9bbc61220ac67 passed
Test Ok

Copy link
Contributor

✅ Forge suite framework_upgrade success on b29f09f57e898d8d211c8bc3e303f6e50bba2266 ==> 06dd8be368239cd186c92476aad9bbc61220ac67

Compatibility test results for b29f09f57e898d8d211c8bc3e303f6e50bba2266 ==> 06dd8be368239cd186c92476aad9bbc61220ac67 (PR)
Upgrade the nodes to version: 06dd8be368239cd186c92476aad9bbc61220ac67
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1212.93 txn/s, submitted: 1216.07 txn/s, failed submission: 3.13 txn/s, expired: 3.13 txn/s, latency: 2560.89 ms, (p50: 2400 ms, p70: 2700, p90: 3800 ms, p99: 4900 ms), latency samples: 108340
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1097.50 txn/s, submitted: 1100.47 txn/s, failed submission: 2.98 txn/s, expired: 2.98 txn/s, latency: 2757.64 ms, (p50: 2400 ms, p70: 2900, p90: 5100 ms, p99: 7000 ms), latency samples: 95900
5. check swarm health
Compatibility test for b29f09f57e898d8d211c8bc3e303f6e50bba2266 ==> 06dd8be368239cd186c92476aad9bbc61220ac67 passed
Upgrade the remaining nodes to version: 06dd8be368239cd186c92476aad9bbc61220ac67
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1101.11 txn/s, submitted: 1104.25 txn/s, failed submission: 3.15 txn/s, expired: 3.15 txn/s, latency: 2697.95 ms, (p50: 2400 ms, p70: 3000, p90: 4500 ms, p99: 5800 ms), latency samples: 97920
Test Ok

@rustielin rustielin merged commit 7d3d83c into main Oct 17, 2024
50 checks passed
@rustielin rustielin deleted the rustielin/indexer-forge-processor-latency branch October 17, 2024 02:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants