feat(resharding): delay split shard of flat store until resharding block is final #12415
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master   #12415      +/-   ##
===========================================
+ Coverage   39.74%   71.74%   +32.00%
===========================================
  Files         842      843        +1
  Lines      170756   170977      +221
  Branches   170756   170977      +221
===========================================
+ Hits        67872   122675    +54803
+ Misses      98642    42931    -55711
- Partials     4242     5371     +1129
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
LGTM
```rust
/// Returns `Err` if:
/// - a resharding event is in progress.
```
Can you add under what circumstances this might happen, and/or why you think it should never happen?
```rust
match chain_store.get_block_hash_by_height(resharding_height) {
    Ok(hash) => {
        if hash != *resharding_hash {
            error!(target: "resharding", ?resharding_height, ?resharding_hash, ?hash, "resharding block not in canonical chain!");
            return FlatStorageReshardingTaskSchedulingStatus::Failed;
        }
    }
    Err(err) => {
        error!(target: "resharding", ?resharding_height, ?resharding_hash, ?err, "can't find resharding block hash by height!");
        return FlatStorageReshardingTaskSchedulingStatus::Failed;
    }
}
FlatStorageReshardingTaskSchedulingStatus::CanStart
```
nit: Maybe the following would be a bit nicer? Up to you.
```rust
match chain_store.get_block_hash_by_height(resharding_height) {
    Ok(hash) if hash == *resharding_hash => {
        FlatStorageReshardingTaskSchedulingStatus::CanStart
    }
    Ok(hash) => {
        error!(target: "resharding", ?resharding_height, ?resharding_hash, ?hash, "resharding block not in canonical chain!");
        FlatStorageReshardingTaskSchedulingStatus::Failed
    }
    Err(err) => {
        error!(target: "resharding", ?resharding_height, ?resharding_hash, ?err, "can't find resharding block hash by height!");
        FlatStorageReshardingTaskSchedulingStatus::Failed
    }
}
```
```diff
  col::BUFFERED_RECEIPT_INDICES | col::BUFFERED_RECEIPT => {
      copy_kv_to_left_child(&status, key, value, store_update)
  }
- _ => unreachable!(),
+ _ => unreachable!("key: {:?} should not appear in flat store!", key),
```
I wonder how we can make it so that, when a new key is added, the compiler would force the developer to support it here. If I recall correctly those are integers, so it's not exactly possible to list all possible values in the match patterns. Perhaps you can still find some clever way to do it here using conditional patterns or something with the max of `col`? If not, then maybe a unit test would work?
No need to do it in this PR.
Unit tests should catch it once the new column starts to be used, as happened in this case with the bandwidth scheduler. Otherwise there can be some trick with `ALL_COLUMNS_WITH_NAMES`… I'll think about that!
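The unit-test idea above can be sketched as follows. This is only an illustration: the `col` module with `u8` tags, the `MAX_COL` bound, and the `handles_key_tag` helper (mirroring the match in the split logic) are hypothetical names, not nearcore's actual API.

```rust
// Hypothetical column tags; in nearcore these would be the real `col` constants.
mod col {
    pub const DELAYED_RECEIPT_OR_INDICES: u8 = 0;
    pub const PROMISE_YIELD_INDICES: u8 = 1;
    pub const BUFFERED_RECEIPT_INDICES: u8 = 2;
    pub const BUFFERED_RECEIPT: u8 = 3;
    /// One past the highest tag in use; must be bumped when a column is added.
    pub const MAX_COL: u8 = 4;
}

/// Mirrors the match arms of the split routine: true iff the tag is
/// explicitly supported (the `_ => unreachable!()` arm is excluded).
fn handles_key_tag(tag: u8) -> bool {
    matches!(
        tag,
        col::DELAYED_RECEIPT_OR_INDICES
            | col::PROMISE_YIELD_INDICES
            | col::BUFFERED_RECEIPT_INDICES
            | col::BUFFERED_RECEIPT
    )
}

fn main() {
    // If a new column is added and MAX_COL bumped, this loop fails until
    // the match above gains an arm for it.
    for tag in 0..col::MAX_COL {
        assert!(handles_key_tag(tag), "column tag {tag} not handled in flat store split");
    }
    println!("all column tags handled");
}
```

The test only enforces exhaustiveness if `MAX_COL` is kept in sync, which is where the `ALL_COLUMNS_WITH_NAMES` trick would come in.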
```diff
@@ -828,18 +925,60 @@ fn copy_kv_to_left_child(
 pub enum FlatStorageReshardingEventStatus {
     /// Split a shard.
     /// Includes the parent shard uid and the operation's status.
-    SplitShard(ShardUId, SplittingParentStatus),
+    SplitShard(ShardUId, SplittingParentStatus, ExecutionStatus),
```
Can you clarify what's the difference, or rename the different status fields? Is the first one more of a Prepare/Preprocess/Schedule to the latter's Execution?
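One possible disambiguation along the lines of this question: keep the split's static parameters and its runtime progress in clearly named fields. All names below are illustrative suggestions, not the PR's final API.

```rust
/// Hypothetical runtime-progress type (the "Execution" half).
#[derive(Debug, PartialEq)]
enum ExecutionStatus {
    NotStarted,
    InProgress,
    Finished,
}

/// Placeholder for the static parameters of the split (the "Prepare" half).
struct SplittingParentStatus;

enum FlatStorageReshardingEventStatus {
    // `params` describes *what* to split; `execution` tracks *how far* it got.
    SplitShard {
        parent: String, // placeholder for ShardUId
        params: SplittingParentStatus,
        execution: ExecutionStatus,
    },
}

fn main() {
    let event = FlatStorageReshardingEventStatus::SplitShard {
        parent: "s4.v3".to_string(),
        params: SplittingParentStatus,
        execution: ExecutionStatus::NotStarted,
    };
    match event {
        FlatStorageReshardingEventStatus::SplitShard { execution, .. } => {
            assert_eq!(execution, ExecutionStatus::NotStarted);
        }
    }
}
```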
```diff
 FlatStorageReshardingTaskResult::Cancelled => {
     panic!("shard catchup task should never be cancelled!")
 }
-FlatStorageReshardingTaskStatus::Postponed => {
-    // The task has been postponed for later execution. Nothing to do.
+FlatStorageReshardingTaskResult::Postponed => {
+    panic!("shard catchup task should never be postponed!")
 }
```
Is it possible for those cases to be triggered? If not, perhaps you can introduce a smaller type for the catchup result?
Indeed, I'll introduce another type to avoid these 'impossible' cases
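A minimal sketch of such a narrower type, making the "impossible" `Cancelled`/`Postponed` arms unrepresentable. The type and variant names are illustrative, not nearcore's actual API.

```rust
/// A catchup task can only succeed or fail, so it gets its own
/// two-variant result type (hypothetical name).
#[derive(Debug, PartialEq)]
enum FlatStorageShardCatchupTaskResult {
    Successful { num_batches_done: usize },
    Failed,
}

// The handler now only needs to consider outcomes that can really occur;
// no panicking arms for cancellation or postponement.
fn describe(result: &FlatStorageShardCatchupTaskResult) -> String {
    match result {
        FlatStorageShardCatchupTaskResult::Successful { num_batches_done } => {
            format!("catchup finished in {num_batches_done} batches")
        }
        FlatStorageShardCatchupTaskResult::Failed => "catchup failed".to_string(),
    }
}

fn main() {
    let ok = FlatStorageShardCatchupTaskResult::Successful { num_batches_done: 3 };
    println!("{}", describe(&ok));
}
```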
```diff
 FlatStorageReshardingTaskResult::Failed => {
     panic!("impossible to recover from a flat storage shard catchup failure!")
 }
-FlatStorageReshardingTaskStatus::Cancelled => {
-    // The task has been cancelled. Nothing else to do.
+FlatStorageReshardingTaskResult::Cancelled => {
+    panic!("shard catchup task should never be cancelled!")
 }
-FlatStorageReshardingTaskStatus::Postponed => {
-    // The task has been postponed for later execution. Nothing to do.
+FlatStorageReshardingTaskResult::Postponed => {
+    panic!("shard catchup task should never be postponed!")
 }
```
Re panics: it's true that it's impossible to recover, but the node can also continue operating using the memtrie. Perhaps we can just write an error (or even a repeating error) and mark the flat storage as corrupted? Not a biggie actually; the only benefit would be delaying the failure.
Initially I had the impression it might be possible to retry some failed operations, but I haven't put much thought into it yet.
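The "mark as corrupted instead of panicking" alternative could look roughly like this. `FlatStorageStatus::Corrupted` and the handler are hypothetical; the real nearcore types differ.

```rust
/// Hypothetical flat storage health marker.
#[derive(Debug, PartialEq)]
enum FlatStorageStatus {
    Ready,
    Corrupted,
}

/// Instead of panicking on a catchup failure, record the corruption so
/// the node can keep serving reads from the memtrie while flat storage
/// is rebuilt or the operator intervenes.
fn on_catchup_failure(status: &mut FlatStorageStatus) {
    // In the real node this could be a repeating error log.
    eprintln!("flat storage shard catchup failed; marking flat storage corrupted");
    *status = FlatStorageStatus::Corrupted;
}

fn main() {
    let mut status = FlatStorageStatus::Ready;
    on_catchup_failure(&mut status);
    assert_eq!(status, FlatStorageStatus::Corrupted);
}
```

As the review notes, this only delays the failure, but it keeps the node alive for reads.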
Contents of this PR:

- New `FlatStorageReshardingTaskSchedulingStatus` to express the current state of scheduled tasks waiting for resharding block finality
- `FlatStorageReshardingTaskStatus` renamed into `FlatStorageReshardingTaskResult` for clarity
- `ReshardingActor` now takes care of re-trying "postponed" tasks

Part of #12174
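The retry behavior described above can be sketched as a loop that re-attempts a postponed task once the chain has advanced. Everything here (`TaskResult`, `SchedulerStub`, `run_until_done`) is an illustrative stand-in for the actual `ReshardingActor` message flow, which would re-send a delayed message rather than busy-loop.

```rust
/// Simplified task result, mirroring the renamed FlatStorageReshardingTaskResult.
enum TaskResult {
    Successful,
    Failed,
    Postponed,
}

/// Stub scheduler: the split can only start once the final head reaches
/// the resharding height (i.e. the resharding block is final).
struct SchedulerStub {
    final_height: u64,
    resharding_height: u64,
}

impl SchedulerStub {
    fn try_split(&self) -> TaskResult {
        if self.final_height >= self.resharding_height {
            TaskResult::Successful
        } else {
            TaskResult::Postponed
        }
    }
}

/// Re-tries the task until it completes; returns the number of attempts.
fn run_until_done(scheduler: &mut SchedulerStub) -> u32 {
    let mut attempts = 0;
    loop {
        attempts += 1;
        match scheduler.try_split() {
            TaskResult::Successful | TaskResult::Failed => return attempts,
            TaskResult::Postponed => {
                // Stand-in for "wait for the next block and re-send the
                // task message": the chain finalizes one more block.
                scheduler.final_height += 1;
            }
        }
    }
}

fn main() {
    let mut scheduler = SchedulerStub { final_height: 0, resharding_height: 3 };
    let attempts = run_until_done(&mut scheduler);
    println!("split scheduled after {attempts} attempts");
}
```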