Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replay: do not recheck duplicate confirmation if already confirmed #1237

Merged
merged 3 commits into from
Aug 2, 2024

Conversation

AshwinSekar
Copy link

Problem

We perform the replay duplicate confirmed check as long as the slot is not supermajority confirmed.
This can lead to the same slot being checked multiple times, notifying state machine multiple times, and the log / metric being spammed as well.

Summary of Changes

Separately track the duplicate confirmed status and only check slots that are not duplicate confirmed.

Fixes #1236

@AshwinSekar AshwinSekar requested review from carllin and wen-coding May 7, 2024 20:37
Comment on lines 4065 to 4104
if progress.is_duplicate_confirmed(*slot).unwrap_or(false) {
// Already duplicate confirmed, nothing left to do
continue;
}

progress.set_duplicate_confirmed_slot(*slot);
assert!(*frozen_hash != Hash::default());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, would this existing

agave/core/src/replay_stage.rs

Lines 4084 to 4088 in abdfbca

if let Some(prev_hash) = duplicate_confirmed_slots.insert(*slot, *frozen_hash) {
assert_eq!(prev_hash, *frozen_hash);
// Already processed this signal
return;
}
check be sufficient if we updated duplicate_confirmed_slots in confirm_forks?

Better yet, can we just remove duplicate_confirmed_slots and replace it with the progress map like you have here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the idea for splitting this up into confirm_forks and mark_slot_confirmed was so confirm_forks is side effect free for unit testing.

And we can't replace duplicate_confirmed_slots with progress_map unfortunately because progress_map only holds replayed forks, however a slot could be duplicate_confirmed through gossip, so we have to track it separately.

Copy link

@carllin carllin Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm is there a reason not to have progress map also track gossip duplicate confirmed forks?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave it a try here: AshwinSekar@9d14e0d

The main issue is that ProgressMap is a wrapper around this HashMap:

pub struct ProgressMap {
progress_map: HashMap<Slot, ForkProgress>,
}

We insert new entries based on the bank:

agave/core/src/replay_stage.rs

Lines 2932 to 2934 in 3ca6719

let bank_progress = progress.entry(bank.slot()).or_insert_with(|| {
ForkProgress::new_from_bank(
&bank,

Because ForkProgress and it's sub struct ForkStats are mostly comprised of information we can only get from replay:

pub struct ForkProgress {
pub is_dead: bool,
pub fork_stats: ForkStats,
pub propagated_stats: PropagatedStats,
pub replay_stats: Arc<RwLock<ReplaySlotStats>>,
pub replay_progress: Arc<RwLock<ConfirmationProgress>>,
pub retransmit_info: RetransmitInfo,
// Note `num_blocks_on_fork` and `num_dropped_blocks_on_fork` only
// count new blocks replayed since last restart, which won't include
// blocks already existing in the ledger/before snapshot at start,
// so these stats do not span all of time
pub num_blocks_on_fork: u64,
pub num_dropped_blocks_on_fork: u64,
}

When we root we use bank forks to retain:

pub fn handle_new_root(&mut self, bank_forks: &BankForks) {
self.progress_map
.retain(|k, _| bank_forks.get(*k).is_some());
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we really wanted to put it in progress_map, we would have to put it at the top level something like:

 pub struct ProgressMap { 
     progress_map: HashMap<Slot, ForkProgress>, 
     duplicate_confirmed_slots:  BTreeMap<Slot, Hash>
} 

But at that point we're just tracking two separate things in a struct called ProgressMap which isn't really a progress indicator anymore.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, essentially duplicate_confirmed_slots right now tracks slots that might not be in ProgressMap at all because we haven't replayed the slot, but have detected the slot via gossip

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.1%. Comparing base (4ae2ca1) to head (abdfbca).
Report is 11 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #1237     +/-   ##
=========================================
- Coverage    82.1%    82.1%   -0.1%     
=========================================
  Files         886      886             
  Lines      236439   236499     +60     
=========================================
+ Hits       194252   194272     +20     
- Misses      42187    42227     +40     

@AshwinSekar AshwinSekar force-pushed the duplicate-confirmed branch from abdfbca to 4a142ca Compare July 24, 2024 20:28
@AshwinSekar
Copy link
Author

I've also removed the previous SupermajorityVoted level, since it actually isn't used anywhere and only creates confusing logs, as OC cannot roll up votes through replay.

@AshwinSekar AshwinSekar requested a review from carllin July 24, 2024 21:40
core/src/replay_stage.rs Show resolved Hide resolved
core/src/replay_stage.rs Outdated Show resolved Hide resolved
if !bank.is_frozen() {
continue;
}
if tower.is_slot_duplicate_confirmed(*slot, voted_stakes, total_stake) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

realized the is_slot_duplicate_confirmed detection uses voted_stakes which pops off some slots due to the simulated vote. Probably could improve that a bit by having a separate version of voted_stakes that doesn't simulate the vote

Would also make the

datapoint_info!(
                        "bank_weight",
                        ("slot", bank_slot, i64),
                        ("fork_stake", stats.fork_stake, i64),
                        ("fork_weight", stats.fork_weight(), f64),
                    );

log a bit more accurate

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, actually what do you think about using fork choice for DC? Can address in a future PR.

It seems like it should be fine to aggregate DC amongst children forks? It would also save us from the double iteration, rn we iterate over all the newly frozen banks to grab a voted stakes from each and then iterate over all unrooted slot to see if any voted stakes DC's the unrooted slot.

Instead we could just iterate once over all unrooted slots and check fork choice to see if it's DC.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only downside of using fork choice is it only tracks the latest vote, so if people jump around I think it can miss DC

Comment on lines 4065 to 4104
if progress.is_duplicate_confirmed(*slot).unwrap_or(false) {
// Already duplicate confirmed, nothing left to do
continue;
}

progress.set_duplicate_confirmed_slot(*slot);
assert!(*frozen_hash != Hash::default());
Copy link

@carllin carllin Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm is there a reason not to have progress map also track gossip duplicate confirmed forks?

@AshwinSekar AshwinSekar force-pushed the duplicate-confirmed branch from 4a142ca to 636a001 Compare August 1, 2024 17:31
@AshwinSekar AshwinSekar merged commit 749d6fd into anza-xyz:master Aug 2, 2024
42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fix spammy duplicate confirmation log
3 participants