Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make CostTracker aware of inflight transactions #437

Merged
merged 7 commits into from
Mar 29, 2024

Conversation

steviez
Copy link

@steviez steviez commented Mar 26, 2024

Problem

When a leader is packing a Bank, transactions are added to the cost-tracker and then later updated or removed, depending on whether the transactions were committed or not. However, it is also possible for a Bank to get frozen while these transactions are "in-flight".

After the bank is frozen, the bank is sent over to CostUpdateService almost immediately. The result is that we've observed a Bank having its' cost details reported to metrics BEFORE the updates / removals were made to the CostTracker for these in-flight transactions. This causes a leader to submit metrics with an over-reported cost in comparison to what we calculate from replaying the block.

Taking over work done by @apfitzge in #398

Summary of Changes

  • Add in-flight transaction count to CostTracker
  • Increment / decrement the in-flight transaction count in BankingStage
  • Make CostUpdateService wait (with a timeout) for in-flight transaction count to settle to zero

Fixes #366

@steviez steviez marked this pull request as ready for review March 26, 2024 17:58
@steviez
Copy link
Author

steviez commented Mar 26, 2024

Some alternatives that were considered:

  • Andrew pursued an alternative option in CostTracker - wait for in-flight transactions before reporting #373 that would put a lock in Bank and have the updating / reporting thread synchronize through that
  • Another possible alternative is holding the freeze lock longer. Currently, it is dropped here inside Consumer::execute_and_commit_transactions_locked():
    drop(freeze_lock);

    If we returned the held freeze lock to the caller (Consumer::process_and_record_transactions_with_pre_results()), we could process the cost updates/removals with the lock held (ie release at the end of this block):
    let mut execute_and_commit_transactions_output =
    self.execute_and_commit_transactions_locked(bank, &batch);
    // Once the accounts are new transactions can enter the pipeline to process them
    let (_, unlock_us) = measure_us!(drop(batch));
    let ExecuteAndCommitTransactionsOutput {
    ref mut retryable_transaction_indexes,
    ref execute_and_commit_timings,
    ref commit_transactions_result,
    ..
    } = execute_and_commit_transactions_output;
    // Costs of all transactions are added to the cost_tracker before processing.
    // To ensure accurate tracking of compute units, transactions that ultimately
    // were not included in the block should have their cost removed.
    QosService::remove_costs(
    transaction_qos_cost_results.iter(),
    commit_transactions_result.as_ref().ok(),
    bank,
    );
    // once feature `apply_cost_tracker_during_replay` is activated, leader shall no longer
    // adjust block with executed cost (a behavior more inline with bankless leader), it
    // should use requested, or default `compute_unit_limit` as transaction's execution cost.
    if !bank
    .feature_set
    .is_active(&feature_set::apply_cost_tracker_during_replay::id())
    {
    QosService::update_costs(
    transaction_qos_cost_results.iter(),
    commit_transactions_result.as_ref().ok(),
    bank,
    );
    }

    It doesn't look like a lot of stuff to hold the lock for, but in any case, does seem a bit riskier than the approach this PR does

@steviez steviez requested a review from apfitzge March 26, 2024 18:15
@apfitzge
Copy link

I don't think holding the freeze lock for longer will work, we'd also have to acquire the lock earlier. Right now we only hold freeze lock during (record, commit), but not execution.

I think that approach is riskier and not sure we should be backporting that.
A long execution, possibly due to bug?, would then prevent other threads from moving on to the next bank.

apfitzge
apfitzge previously approved these changes Mar 26, 2024
Copy link

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm - let's get approval from second set of eyes before merging.

@steviez
Copy link
Author

steviez commented Mar 26, 2024

I don't think holding the freeze lock for longer will work, we'd also have to acquire the lock earlier. Right now we only hold freeze lock during (record, commit), but not execution.

Ah right, thank you. So yeah, VERY significantly extending the duration of holding that lock

I think that approach is riskier and not sure we should be backporting that. A long execution, possibly due to bug?, would then prevent other threads from moving on to the next bank.

Yep, 100% agreed on that approach being much riskier

@steviez steviez requested a review from t-nelson March 26, 2024 18:32
@steviez
Copy link
Author

steviez commented Mar 26, 2024

lgtm - let's get approval from second set of eyes before merging.

@t-nelson - I think you've followed along with the issue this PR aims to address so adding you as a reviewer with Tao being OOO this week

@codecov-commenter
Copy link

codecov-commenter commented Mar 26, 2024

Codecov Report

Attention: Patch coverage is 96.29630% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 81.9%. Comparing base (182d27f) to head (c30bbbb).
Report is 6 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #437   +/-   ##
=======================================
  Coverage    81.8%    81.9%           
=======================================
  Files         842      842           
  Lines      228492   228518   +26     
=======================================
+ Hits       187104   187165   +61     
+ Misses      41388    41353   -35     

@steviez steviez changed the title Make CostTracker are of inflight transactions Make CostTracker aware of inflight transactions Mar 27, 2024
@steviez steviez requested a review from bw-solana March 27, 2024 19:43
Copy link

@bw-solana bw-solana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. I think we could optimize the read locking on the cost reporting

apfitzge
apfitzge previously approved these changes Mar 28, 2024
bw-solana
bw-solana previously approved these changes Mar 28, 2024
Copy link

@bw-solana bw-solana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@steviez steviez dismissed stale reviews from bw-solana and apfitzge via c30bbbb March 28, 2024 21:26
@steviez
Copy link
Author

steviez commented Mar 28, 2024

c30bbbb 😬

Copy link

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm - how did I misss the report remove 🤦

@steviez steviez merged commit 9076348 into anza-xyz:master Mar 29, 2024
37 checks passed
@steviez steviez deleted the cost_tracker_inflight branch March 29, 2024 12:34
@steviez steviez added the v1.18 label Mar 29, 2024
Copy link

mergify bot commented Mar 29, 2024

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

mergify bot pushed a commit that referenced this pull request Mar 29, 2024
When a leader is packing a Bank, transactions costs are added to the
CostTracker and then later updated or removed, depending on if the
tx is committed. However, it is possible for a Bank to be frozen while
there are several tx's in flight.

CostUpdateService submits a metric with cost information almost
immediately after a Bank has been frozen. The result is that we have
observed cost details being submitted before some cost removals take
place, which causes a massive over-reporting of the block cost
compared to actual.

This PR adds a field to track the number of transactions that are
inflight, and adds a simple mechanism to try to allow that value to
settle to zero before submitting the datapoint. The number of inflight
tx's is submitted with the datapoint, so even if the value does not
settle to zero, we can still detect this case and know the metric is
tainted.

Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>
(cherry picked from commit 9076348)
steviez added a commit that referenced this pull request Mar 29, 2024
When a leader is packing a Bank, transactions costs are added to the
CostTracker and then later updated or removed, depending on if the
tx is committed. However, it is possible for a Bank to be frozen while
there are several tx's in flight.

CostUpdateService submits a metric with cost information almost
immediately after a Bank has been frozen. The result is that we have
observed cost details being submitted before some cost removals take
place, which causes a massive over-reporting of the block cost
compared to actual.

This PR adds a field to track the number of transactions that are
inflight, and adds a simple mechanism to try to allow that value to
settle to zero before submitting the datapoint. The number of inflight
tx's is submitted with the datapoint, so even if the value does not
settle to zero, we can still detect this case and know the metric is
tainted.

Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>
(cherry picked from commit 9076348)
steviez added a commit that referenced this pull request Apr 16, 2024
When a leader is packing a Bank, transactions costs are added to the
CostTracker and then later updated or removed, depending on if the
tx is committed. However, it is possible for a Bank to be frozen while
there are several tx's in flight.

CostUpdateService submits a metric with cost information almost
immediately after a Bank has been frozen. The result is that we have
observed cost details being submitted before some cost removals take
place, which causes a massive over-reporting of the block cost
compared to actual.

This PR adds a field to track the number of transactions that are
inflight, and adds a simple mechanism to try to allow that value to
settle to zero before submitting the datapoint. The number of inflight
tx's is submitted with the datapoint, so even if the value does not
settle to zero, we can still detect this case and know the metric is
tainted.

Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>
(cherry picked from commit 9076348)
@steviez steviez added the v1.17 label Apr 17, 2024
Copy link

mergify bot commented Apr 17, 2024

Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.

mergify bot pushed a commit that referenced this pull request Apr 17, 2024
When a leader is packing a Bank, transactions costs are added to the
CostTracker and then later updated or removed, depending on if the
tx is committed. However, it is possible for a Bank to be frozen while
there are several tx's in flight.

CostUpdateService submits a metric with cost information almost
immediately after a Bank has been frozen. The result is that we have
observed cost details being submitted before some cost removals take
place, which causes a massive over-reporting of the block cost
compared to actual.

This PR adds a field to track the number of transactions that are
inflight, and adds a simple mechanism to try to allow that value to
settle to zero before submitting the datapoint. The number of inflight
tx's is submitted with the datapoint, so even if the value does not
settle to zero, we can still detect this case and know the metric is
tainted.

Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>
(cherry picked from commit 9076348)
steviez added a commit that referenced this pull request Apr 24, 2024
When a leader is packing a Bank, transactions costs are added to the
CostTracker and then later updated or removed, depending on if the
tx is committed. However, it is possible for a Bank to be frozen while
there are several tx's in flight.

CostUpdateService submits a metric with cost information almost
immediately after a Bank has been frozen. The result is that we have
observed cost details being submitted before some cost removals take
place, which causes a massive over-reporting of the block cost
compared to actual.

This PR adds a field to track the number of transactions that are
inflight, and adds a simple mechanism to try to allow that value to
settle to zero before submitting the datapoint. The number of inflight
tx's is submitted with the datapoint, so even if the value does not
settle to zero, we can still detect this case and know the metric is
tainted.

Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>
(cherry picked from commit 9076348)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cost Tracker - stats reporting bug
4 participants