
feat(resharding): delay split shard of flat store until resharding block is final #12415

Merged

Conversation

Contributor

@Trisfald Trisfald commented Nov 8, 2024

Contents of this PR:

  • New features
    • The actual splitting of flat storage is delayed until the target resharding block becomes final
    • A scheduled resharding event can be overridden. This makes resharding work in many chain fork scenarios (though not all of them)
    • Added FlatStorageReshardingTaskSchedulingStatus to express the current state of scheduled tasks waiting for resharding block finality (a rough sketch of the idea follows this list)
  • Changes
    • Shard catchup no longer waits for resharding block finality; this follows from the fact that the shard split now happens on a final block
    • FlatStorageReshardingTaskStatus renamed to FlatStorageReshardingTaskResult for clarity
    • ReshardingActor now takes care of retrying "postponed" tasks

Part of #12174
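
As a rough sketch of the scheduling idea above (not the actual nearcore code): the CanStart and Failed variants appear in the review diff further down, while the Postponed variant and the helper function here are assumptions made purely for illustration.

    // Sketch only: names other than CanStart and Failed are assumptions.
    enum FlatStorageReshardingTaskSchedulingStatus {
        /// The resharding block is final: the flat store split may run.
        CanStart,
        /// The resharding block is not final yet: retry the task later.
        Postponed,
        /// The resharding block is not part of the canonical chain: give up.
        Failed,
    }

    /// Hypothetical scheduling check: allow the split only once the block at
    /// `resharding_height` is at or below the last final height and is canonical.
    fn scheduling_status(
        resharding_height: u64,
        last_final_height: u64,
        block_is_canonical: bool,
    ) -> FlatStorageReshardingTaskSchedulingStatus {
        if resharding_height > last_final_height {
            FlatStorageReshardingTaskSchedulingStatus::Postponed
        } else if block_is_canonical {
            FlatStorageReshardingTaskSchedulingStatus::CanStart
        } else {
            FlatStorageReshardingTaskSchedulingStatus::Failed
        }
    }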

@Trisfald Trisfald requested a review from a team as a code owner November 8, 2024 11:45

codecov bot commented Nov 8, 2024

Codecov Report

Attention: Patch coverage is 92.76808% with 29 lines in your changes missing coverage. Please review.

Project coverage is 71.74%. Comparing base (a03f42c) to head (a2d8bd3).
Report is 6 commits behind head on master.

Files with missing lines                          Patch %    Lines
chain/chain/src/flat_storage_resharder.rs         92.95%     14 Missing and 12 partials ⚠️
chain/chain/src/resharding/resharding_actor.rs    90.62%     3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #12415       +/-   ##
===========================================
+ Coverage   39.74%   71.74%   +32.00%     
===========================================
  Files         842      843        +1     
  Lines      170756   170977      +221     
  Branches   170756   170977      +221     
===========================================
+ Hits        67872   122675    +54803     
+ Misses      98642    42931    -55711     
- Partials     4242     5371     +1129     
Flag Coverage Δ
backward-compatibility 0.16% <0.00%> (-0.01%) ⬇️
db-migration 0.16% <0.00%> (-0.01%) ⬇️
genesis-check 1.27% <0.00%> (-0.01%) ⬇️
integration-tests 39.33% <31.42%> (-0.03%) ⬇️
linux 71.06% <85.53%> (+32.35%) ⬆️
linux-nightly 71.32% <92.76%> (+32.05%) ⬆️
macos 50.78% <85.53%> (?)
pytests 1.57% <0.00%> (-0.01%) ⬇️
sanity-checks 1.38% <0.00%> (-0.01%) ⬇️
unittests 64.36% <85.53%> (?)
upgradability 0.21% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


Contributor

@wacban wacban left a comment


LGTM

Comment on lines +211 to +212
/// Returns `Err` if:
/// - a resharding event is in progress.
Contributor


Can you add under what circumstances this might happen and/or why you think it should never happen?

Comment on lines 753 to 765
    match chain_store.get_block_hash_by_height(resharding_height) {
        Ok(hash) => {
            if hash != *resharding_hash {
                error!(target: "resharding", ?resharding_height, ?resharding_hash, ?hash, "resharding block not in canonical chain!");
                return FlatStorageReshardingTaskSchedulingStatus::Failed;
            }
        }
        Err(err) => {
            error!(target: "resharding", ?resharding_height, ?resharding_hash, ?err, "can't find resharding block hash by height!");
            return FlatStorageReshardingTaskSchedulingStatus::Failed;
        }
    }
    FlatStorageReshardingTaskSchedulingStatus::CanStart
Contributor


nit: Maybe the following would be a bit nicer? Up to you.

        match chain_store.get_block_hash_by_height(resharding_height) {
            Ok(hash) if hash == *resharding_hash => {
                FlatStorageReshardingTaskSchedulingStatus::CanStart
            }
            Ok(hash) => {
                error!(target: "resharding", ?resharding_height, ?resharding_hash, ?hash, "resharding block not in canonical chain!");
                FlatStorageReshardingTaskSchedulingStatus::Failed
            }
            Err(err) => {
                error!(target: "resharding", ?resharding_height, ?resharding_hash, ?err, "can't find resharding block hash by height!");
                FlatStorageReshardingTaskSchedulingStatus::Failed
            }
        }

      col::BUFFERED_RECEIPT_INDICES | col::BUFFERED_RECEIPT => {
          copy_kv_to_left_child(&status, key, value, store_update)
      }
-     _ => unreachable!(),
+     _ => unreachable!("key: {:?} should not appear in flat store!", key),
Contributor


I wonder how we can make it so that when a new key is added the compiler would force the developer to support it here. If I recall correctly those are integers, so it's not exactly possible to list all possible values in the match patterns. Perhaps you can still find some clever way to do it here using conditional patterns or something with the max of col? If not, then maybe a unit test would work?

No need to do it in this PR.

Contributor Author


Unit tests should catch it once the new column starts to be used, as happened in this case with the bandwidth scheduler. Otherwise there might be some trick with ALL_COLUMNS_WITH_NAMES... I'll think about that!
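
For reference, a sketch of the unit-test idea. All names here (ALL_COLUMNS, handle_column) are hypothetical stand-ins for the real column list and the match under discussion, not actual nearcore identifiers:

    // Hypothetical test: iterate over every known trie column tag and make sure
    // the split-shard handler accepts it, so a newly added column fails CI
    // instead of hitting unreachable!() at runtime.
    const ALL_COLUMNS: &[u8] = &[0, 1, 2]; // stand-in for the real column list

    fn handle_column(column: u8) -> Result<(), String> {
        match column {
            0 | 1 | 2 => Ok(()),
            other => Err(format!("unhandled trie column: {other}")),
        }
    }

    #[test]
    fn split_shard_handles_every_trie_column() {
        for &column in ALL_COLUMNS {
            handle_column(column).expect("split shard must handle every trie column");
        }
    }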

@@ -828,18 +925,60 @@ fn copy_kv_to_left_child(
  pub enum FlatStorageReshardingEventStatus {
      /// Split a shard.
      /// Includes the parent shard uid and the operation' status.
-     SplitShard(ShardUId, SplittingParentStatus),
+     SplitShard(ShardUId, SplittingParentStatus, ExecutionStatus),
Contributor


Can you clarify what the difference is, or rename the different status fields? Is the first one more of a Prepare / Preprocess / Schedule to the latter's Execution?

Comment on lines 42 to 47
+     FlatStorageReshardingTaskResult::Cancelled => {
+         panic!("shard catchup task should never be cancelled!")
      }
-     FlatStorageReshardingTaskStatus::Postponed => {
-         // The task has been postponed for later execution. Nothing to do.
+     FlatStorageReshardingTaskResult::Postponed => {
+         panic!("shard catchup task should never be postponed!")
      }
Contributor


Is it possible for those cases to be triggered? If not perhaps you can introduce a smaller type for the catchup result?

Contributor Author


Indeed, I'll introduce another type to avoid these 'impossible' cases
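
Something along these lines, presumably (the type and variant names are assumptions, not necessarily what the follow-up change uses):

    // A narrower result type for the shard catchup task, so the 'impossible'
    // Cancelled and Postponed cases cannot be constructed at all.
    enum FlatStorageShardCatchupTaskResult {
        /// Catchup finished successfully.
        Successful,
        /// Catchup hit an unrecoverable error.
        Failed,
    }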

Comment on lines 39 to 47
+     FlatStorageReshardingTaskResult::Failed => {
+         panic!("impossible to recover from a flat storage shard catchup failure!")
      }
-     FlatStorageReshardingTaskStatus::Cancelled => {
-         // The task has been cancelled. Nothing else to do.
+     FlatStorageReshardingTaskResult::Cancelled => {
+         panic!("shard catchup task should never be cancelled!")
      }
-     FlatStorageReshardingTaskStatus::Postponed => {
-         // The task has been postponed for later execution. Nothing to do.
+     FlatStorageReshardingTaskResult::Postponed => {
+         panic!("shard catchup task should never be postponed!")
      }
Contributor


Re panics: it's true that it's impossible to recover, but the node can also continue operating using the memtrie. Perhaps we can just write an error (or even a repeating error) and mark the flat storage as corrupted? Not a biggie actually; the only benefit would be delaying the failure.

Contributor Author


Initially, I had the impression it might be possible to retry some failed operations, but I haven't put that much thought into it yet.

@Trisfald Trisfald added this pull request to the merge queue Nov 11, 2024
Merged via the queue into near:master with commit ba0b237 Nov 11, 2024
29 checks passed
@Trisfald Trisfald deleted the flat-storage-resharding-delay-split-shard branch November 11, 2024 20:05