Make unified scheduler's new task code fallible #1071
> (also this pr contains bunch of minor unrelated clean ups...)
I think these may be distracting me from the actual functional diff here.
In what manner would we expect the scheduling of new tasks to fail? afaict this does not ever return an error in this PR?
I thought the major issue here is that failed transaction results (not scheduling) do not get properly propagated back to the replay thread.
29ca732 to 410dbc5
fair point. i think i stuffed too much in this prep pr...
again, thanks for raising a good question... That led me to rethink the impl to begin with (thus the delayed reply...). I'm reorganizing the pr queue. The renewed first prep pr is this: #1126. Also, I started to draft up the rationale of this seemingly odd function signature here: #1122. I'll close this pr for now.
4e18ba0 to 71e36c9
/// That said, calling this multiple times is completely acceptable after the error observation
/// from `schedule_execution()`. While it's not guaranteed, the same `.clone()`-ed errors of
/// the first bad transaction are usually returned across invocations,
fn recover_error_after_abort(&mut self) -> TransactionError;
this is the new fn for fallible new task code-path
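To make the shape of this interface concrete, here is a minimal sketch of a fallible scheduling call paired with a separate error-recovery call. Only `recover_error_after_abort(&mut self) -> TransactionError` and the name `schedule_execution()` come from the hunk above; `SchedulerAborted`, `Transaction`, `ToyScheduler`, the enum variant, and the "odd ids fail" rule are hypothetical stand-ins for illustration, not the PR's code.

```rust
use std::sync::Mutex;

// Hypothetical stand-ins for the real types in the codebase.
#[derive(Debug, Clone, PartialEq)]
pub enum TransactionError {
    AccountNotFound,
}

pub struct Transaction(pub u64);

/// Hypothetical marker: scheduling was refused because of an earlier failure.
#[derive(Debug, PartialEq)]
pub struct SchedulerAborted;

pub trait InstalledScheduler {
    /// Fallible: returns `Err` once a previously-scheduled transaction has failed.
    fn schedule_execution(&self, tx: Transaction) -> Result<(), SchedulerAborted>;
    /// Returns a clone of the first bad transaction's error; safe to call repeatedly.
    fn recover_error_after_abort(&mut self) -> TransactionError;
}

/// Toy scheduler that "executes" inline and treats odd ids as failing (assumption).
pub struct ToyScheduler {
    first_error: Mutex<Option<TransactionError>>,
}

impl ToyScheduler {
    pub fn new() -> Self {
        Self { first_error: Mutex::new(None) }
    }
}

impl InstalledScheduler for ToyScheduler {
    fn schedule_execution(&self, tx: Transaction) -> Result<(), SchedulerAborted> {
        let mut first_error = self.first_error.lock().unwrap();
        if first_error.is_some() {
            // The error reported here belongs to an earlier transaction, not to `tx`.
            return Err(SchedulerAborted);
        }
        if tx.0 % 2 == 1 {
            // The bad transaction itself is accepted; only later calls observe `Err`.
            *first_error = Some(TransactionError::AccountNotFound);
        }
        Ok(())
    }

    fn recover_error_after_abort(&mut self) -> TransactionError {
        self.first_error
            .lock()
            .unwrap()
            .clone()
            .expect("must be called only after schedule_execution() returned Err")
    }
}
```

In this sketch the error surfaces one call late, which mirrors the doc comment's point that the returned error is almost never caused by the transaction currently being scheduled.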
52cd340 to 89461f5
89461f5 to 27922d5
// Lastly, this non-atomic nature is intentional for optimizing the fast code-path
let mut scheduler_guard = self.inner.scheduler.write().unwrap();
let scheduler = scheduler_guard.as_mut().unwrap();
return Err(scheduler.recover_error_after_abort());
while the current actual wip `.recover_error_after_abort()` impl panics with `todo!()`, this code is never reached because the current actual wip `.schedule_execution()` impl never returns `Err(_)` to begin with.
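The locking pattern in the hunk above can be sketched as follows: the fast path schedules under a shared read lock, and only after observing `Err` does the caller re-lock as a writer to recover the error. Everything except the `write().unwrap()` / `as_mut().unwrap()` / `recover_error_after_abort()` sequence is an assumption: `BankWithScheduler`, `schedule_transaction`, and `ToyScheduler` are hypothetical names used only for this illustration.

```rust
use std::sync::RwLock;

#[derive(Debug, Clone, PartialEq)]
pub struct TransactionError(pub &'static str);

/// Hypothetical marker error returned by the fast path.
pub struct SchedulerAborted;

/// Hypothetical scheduler whose abort state is fixed up front for the demo.
pub struct ToyScheduler {
    pub aborted_with: Option<TransactionError>,
}

impl ToyScheduler {
    fn schedule_execution(&self, _tx: u64) -> Result<(), SchedulerAborted> {
        if self.aborted_with.is_some() { Err(SchedulerAborted) } else { Ok(()) }
    }

    fn recover_error_after_abort(&mut self) -> TransactionError {
        self.aborted_with.clone().expect("only called after an abort")
    }
}

pub struct BankWithScheduler {
    scheduler: RwLock<Option<ToyScheduler>>,
}

impl BankWithScheduler {
    pub fn new(scheduler: ToyScheduler) -> Self {
        Self { scheduler: RwLock::new(Some(scheduler)) }
    }

    pub fn schedule_transaction(&self, tx: u64) -> Result<(), TransactionError> {
        {
            // Fast path: a shared read lock is enough to submit a task.
            let guard = self.scheduler.read().unwrap();
            let scheduler = guard.as_ref().unwrap();
            if scheduler.schedule_execution(tx).is_ok() {
                return Ok(());
            }
        } // The read lock is released here, before the write lock is taken.

        // Slow path: re-lock exclusively. The gap between the two locks is the
        // "non-atomic nature" that keeps the fast path cheap.
        let mut guard = self.scheduler.write().unwrap();
        let scheduler = guard.as_mut().unwrap();
        Err(scheduler.recover_error_after_abort())
    }
}
```

Releasing the read guard before calling `write()` matters: taking the write lock while the same thread still holds a read guard can deadlock.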
I changed my mind yet again. I reopened this pr and rebooted it for yet another code-review.
This revival is mainly because the parent pr (#1122) got too big.
hope i documented the context in detail in the source code this time in this pr...
I reverted them in this pr now.
It's difficult to tell if this is the correct interface without seeing the implementation of how we check for errors. If the former, I can see how this interface makes sense. If we're checking some shared variable for an error, it seems it'd make sense to have separate calls
thanks for trying to review this pr again. seems i failed to begin a constructive code-review session by teasing too much with this interface-only split pr... Thanks for your patience and let's pivot the review style. I created #1211 as a full-blown review-ready pr, which contains this pr's changes and the actual implementation.
the actual implementation is kind of a hybrid: its initial error-condition detection is piggybacked on sending txs via a channel, and the actual error retrieval (and internal thread joining) is done by checking (or memoizing) some shared variable as a separate call (…).
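The hybrid approach described in this comment can be illustrated roughly as below. This is not the implementation from #1211; it is a sketch under assumptions, using `std::sync::mpsc` and a single worker thread, where `HybridScheduler`, the `u64` transaction ids, and the "odd ids fail" rule are all hypothetical.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread::{self, JoinHandle};

#[derive(Debug, Clone, PartialEq)]
pub struct TransactionError(pub u64);

pub struct HybridScheduler {
    sender: mpsc::Sender<u64>,
    handler: Option<JoinHandle<()>>,
    // Shared slot where the worker memoizes the first error.
    first_error: Arc<Mutex<Option<TransactionError>>>,
}

impl HybridScheduler {
    pub fn new() -> Self {
        let (sender, receiver) = mpsc::channel::<u64>();
        let first_error = Arc::new(Mutex::new(None));
        let slot = Arc::clone(&first_error);
        let handler = thread::spawn(move || {
            for tx in receiver {
                if tx % 2 == 1 {
                    // Pretend odd ids fail: record the error and exit, which
                    // drops the receiver and makes later sends fail.
                    *slot.lock().unwrap() = Some(TransactionError(tx));
                    return;
                }
            }
        });
        Self { sender, handler: Some(handler), first_error }
    }

    /// Detection is piggybacked on the send: it fails once the worker is gone.
    pub fn schedule_execution(&self, tx: u64) -> Result<(), mpsc::SendError<u64>> {
        self.sender.send(tx)
    }

    /// Separate call: join the worker (first time only), then read the
    /// memoized error. Blocks if the worker has not aborted yet.
    pub fn recover_error_after_abort(&mut self) -> TransactionError {
        if let Some(handler) = self.handler.take() {
            handler.join().unwrap();
        }
        self.first_error
            .lock()
            .unwrap()
            .clone()
            .expect("worker exited without recording an error")
    }
}
```

Because the send only fails after the worker has actually exited, a few transactions may still be accepted after the bad one; the caller learns about the abort slightly late but never needs an extra check on the success path.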
}

fn recover_error_after_abort(&mut self) -> TransactionError {
    todo!("in later pr...");
I am not a fan of letting `todo!` into the master branch. I'd much rather see the error recovery code in this PR. OR in some initial PR with dead_code, and then this PR simply uses those changes.
It's too easy to forget todo! as a reviewer, since github's view is often limited
> I am not a fan of letting `todo!` into the master branch. I'd much rather see the error recovery code in this PR.
hmm, how about closing this pr and switching to review #1211? As that pr is a superset of this pr, there's no `todo!()` there. Or, ...
> OR in some initial PR with dead_code, and then this PR simply uses those changes.
... if the size of that pr isn't acceptable for you to review in one go, I can chunk the pr accordingly.
> It's too easy to forget todo! as a reviewer, since github's view is often limited
I think we can leave some check-boxes in the pr description, if that helps us not to forget.
imo, an explicit `todo!()` isn't so different from implicit (undocumented-but-definitely-existing) leak sources in master. And the remaining todos exist only in my mind at the moment... ;) speaking of which, i can dump them somewhere and maintain the list if that's helpful.
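The two options debated here differ in what happens if the code path is ever reached. A small sketch, with hypothetical function names, contrasts a `todo!()` stub with an implementation landed ahead of its callers under `#[allow(dead_code)]`:

```rust
#[derive(Debug, PartialEq)]
pub struct TransactionError;

// Option A: a stub. It compiles, but panics at runtime if anything reaches it.
fn recover_with_todo() -> TransactionError {
    todo!("in later pr...")
}

// Option B: the real logic lands first, unused for now. The attribute silences
// the unused-function warning until a follow-up PR wires in the callers.
#[allow(dead_code)]
fn recover_implemented(first_error: Option<TransactionError>) -> TransactionError {
    first_error.expect("called only after an abort was observed")
}
```

Option A relies on reviewers remembering that the stub exists; Option B costs a larger initial PR but leaves no panicking path in master.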
alright, that's fine. let's just move to #1211
/// previously-scheduled bad transaction, which terminates further block verification. So,
/// almost always, the returned error isn't due to the merely scheduling of the current
/// transaction itself. At this point, calling this does nothing anymore while it's still safe
/// to do. As soon as notified, callers is expected to stop processing upcoming transactions of
- /// to do. As soon as notified, callers is expected to stop processing upcoming transactions of
+ /// to do. As soon as notified, callers are expected to stop processing upcoming transactions of
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@           Coverage Diff            @@
##           master    #1071    +/-   ##
=========================================
- Coverage    82.1%    82.1%   -0.1%
=========================================
  Files         880      880
  Lines      235665   235714     +49
=========================================
+ Hits       193716   193736     +20
- Misses      41949    41978     +29
Problem
Currently, there's no way for the unified scheduler to propagate errors back to the callers (the replay stage) until bank freezing.
So, dead-block marking by the replay stage could be delayed by maliciously crafted blocks.
Summary of Changes
Make the new task code-path return `Result`s, so that a previously-scheduled transaction's error is forcibly returned when new tasks are about to be submitted to the unified scheduler. This notifies the replay stage earlier than reaching block boundaries.
This pr is the preparation for the last major functionality of the unified scheduler: proper shutdown.
EDIT: Note that this pr only changes the interfaces. The actual implementation still doesn't return errors, so there is no functional change in this pr. The immediately following pr will actually implement the shutdown (warn: the impl is as robust as i could make it, but it is quite complex at the same time...).
(also this pr contains bunch of minor unrelated clean ups...) (EDIT: this is reverted for ease of review)
context: extracted from #1122
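The intended effect on the caller side can be sketched as a replay loop that stops at the first `Err` instead of submitting the whole block and learning about the failure only at the block boundary. This is an illustration under assumptions: `ToyScheduler`, `replay_block`, and the rule that the transaction at `bad_index` fails are hypothetical, and the real replay stage is considerably more involved.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct TransactionError(pub usize);

/// Toy scheduler: the transaction at `bad_index` fails, and every submission
/// after it is refused with that transaction's error.
pub struct ToyScheduler {
    pub bad_index: usize,
    pub submitted: usize,
}

impl ToyScheduler {
    pub fn schedule(&mut self, _tx: u64) -> Result<(), TransactionError> {
        if self.submitted > self.bad_index {
            // Error of a previously-scheduled transaction, not of `_tx` itself.
            return Err(TransactionError(self.bad_index));
        }
        self.submitted += 1;
        Ok(())
    }
}

/// Returns how many transactions were submitted and the block's result.
pub fn replay_block(
    scheduler: &mut ToyScheduler,
    txs: &[u64],
) -> (usize, Result<(), TransactionError>) {
    for (i, tx) in txs.iter().enumerate() {
        if let Err(e) = scheduler.schedule(*tx) {
            // Stop right away so the block can be marked dead early, rather
            // than waiting for the end of the block.
            return (i, Err(e));
        }
    }
    (txs.len(), Ok(()))
}
```

With a 1000-transaction block whose second transaction is bad, this loop stops after two submissions, which is the kind of early exit that limits how long a maliciously crafted block can delay dead-block marking.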