Support submit checkpoint in parallel #840
Conversation
```rust
"submitted bottom up checkpoint({}) in parent at height {}",
event.height,
epoch
.unwrap();
```
Just curious about the behaviour when the number of checkpoints to submit (say 10) is more than the number of permits (say 2). So `acquire_owned` will actually block while 2 checkpoints are being submitted, until one of the 2 actually finishes? And even if `submit_checkpoint` fails and the thread crashes, `drop(submission_permit)` will still be executed, so there's no deadlock here, right?
> So acquire_owned will actually block if 2 checkpoints are being submitted until one of 2 actually finishes?

Yes, it will literally block (the `for` loop just pauses).

> Even if submit_checkpoint fails and the thread crashes, drop(submission_permit) will be executed and no dead lock here right?

My last version had an issue where the async task could panic before dropping the permit (because of the use of `unwrap()`). It's been fixed here, so the permit is dropped whether submitting the checkpoint succeeds or fails.
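The permit pattern under discussion can be sketched with a std-only analogue of tokio's semaphore, using a channel of tokens in place of `Semaphore::acquire_owned` (the names and the "odd heights fail" rule below are invented for illustration, not the real relayer code):

```rust
use std::sync::mpsc;
use std::thread;

// A permit that returns its token to the channel when dropped, even if the
// owning thread unwinds from a panic (mirrors dropping a tokio
// OwnedSemaphorePermit on any exit path).
struct Permit(mpsc::Sender<()>);

impl Drop for Permit {
    fn drop(&mut self) {
        let _ = self.0.send(());
    }
}

// Stand-in for the real submission; odd heights fail (arbitrary sketch rule).
fn submit_checkpoint(height: u64) -> Result<(), String> {
    if height % 2 == 1 {
        Err(format!("failed at height {height}"))
    } else {
        Ok(())
    }
}

/// Submit `heights` with at most `max_parallel` submissions in flight.
/// Returns (successes, failures).
fn submit_all(heights: Vec<u64>, max_parallel: usize) -> (usize, usize) {
    let (permit_tx, permit_rx) = mpsc::channel::<()>();
    for _ in 0..max_parallel {
        permit_tx.send(()).unwrap(); // seed the initial permits
    }
    let mut handles = Vec::new();
    for h in heights {
        // Like acquire_owned: the loop pauses here until a permit frees up.
        permit_rx.recv().unwrap();
        let permit = Permit(permit_tx.clone());
        handles.push(thread::spawn(move || {
            let _permit = permit; // returned on drop, success or failure
            submit_checkpoint(h).is_ok()
        }));
    }
    let (mut ok, mut err) = (0, 0);
    for handle in handles {
        if handle.join().unwrap() { ok += 1 } else { err += 1 }
    }
    (ok, err)
}
```

Because the token is returned from `Drop`, a failed (or even panicking) submission still frees its slot, which is why the loop cannot deadlock.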
ipc/provider/src/checkpoint.rs
Outdated
```rust
epoch
.unwrap();
all_submit_tasks.push(tokio::task::spawn(async move {
    Self::submit_checkpoint(parent_handler_clone, submitter, bundle, event).await;
```
Will this actually work in parallel? There is a sequential enforcement on the checkpoint height, so say we are submitting two checkpoints at heights 100 and 130. If height 100 is not submitted, 130 cannot be submitted. My question is really about a sort of race: if height 100 is still in the memory pool and the transaction for height 130 is submitted, will it pass the gas estimation without errors? Or will it be rejected because the nonce is not set incrementally?
I'm not entirely sure whether the submission of 130 can succeed while the submission of 100 is still in progress... @aakoshh, can you share your opinion here?
I don't know how Lotus works exactly, but let's say that it:
- limits the number of pending transactions in the mempool to 4
- orders the pending transactions by nonce
- applies them to an in-memory check-state like fendermint
If that were true, then as long as the transactions are created in a logical order, their order could be restored by Lotus when they are included in a block. As for the checks, it probably depends on whether that particular node has seen the preceding transactions or not.
Note that we would still send the transactions sequentially, we just don't want to wait for the receipt between each submission.
Indeed, blockchain nodes accept gapped transactions via their RPC and via the mpool, because they live in an eventually consistent universe. Note that any of these can happen, in addition to other scenarios, of course:
- The user submits transactions in sequence to a load-balanced RPC endpoint, and they land gapped in the backend nodes.
- The user submits out-of-order transactions to the same node, which is just a subcase of the gapped situation.
- The user submits many transactions, all in order, to the same backend node, but each tx takes a different gossip propagation route through the network and arrives at various nodes in various non-sequential and gapped orders.
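The nonce-ordering behaviour described in this thread can be sketched with a toy single-account mempool (the type and method names are invented for illustration): gapped transactions are accepted and parked rather than rejected, and only a consecutive run starting at the account's next nonce becomes executable.

```rust
use std::collections::BTreeMap;

/// Toy mempool for a single account: accepts transactions in any order,
/// but only releases a consecutive run starting at `next_nonce`.
struct Mempool {
    next_nonce: u64,
    pending: BTreeMap<u64, String>, // nonce -> payload
}

impl Mempool {
    fn new(next_nonce: u64) -> Self {
        Self { next_nonce, pending: BTreeMap::new() }
    }

    /// Gapped transactions are parked, not rejected.
    fn add(&mut self, nonce: u64, payload: &str) {
        self.pending.insert(nonce, payload.to_string());
    }

    /// Drain the executable prefix in nonce order; anything behind a gap
    /// stays pending until the missing nonce arrives.
    fn take_executable(&mut self) -> Vec<String> {
        let mut out = Vec::new();
        while let Some(p) = self.pending.remove(&self.next_nonce) {
            out.push(p);
            self.next_nonce += 1;
        }
        out
    }
}
```

Under this model, submitting the height-130 transaction while height 100 is still pending is harmless: it simply waits in the pool until the lower-nonce transaction lands, which matches the "eventually consistent" behaviour described above.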
ipc/provider/src/checkpoint.rs
Outdated
```rust
Err(_err) => {
    log::debug!("Failed to submit checkpoint");
```
Suggested change:

```diff
-Err(_err) => {
-    log::debug!("Failed to submit checkpoint");
+Err(err) => {
+    log::error!(error = err, height = bundle.height, "Failed to submit checkpoint");
```
Not sure how to log the height, but ignoring errors and even demoting the log to `debug` level is a recipe for frustration.
Agreed. Changed to `error` level logging.
The `error = err`, `height = bundle.height` do not seem to work. I just log them in a string instead.
@aakoshh key values were added to the `log` crate in v0.4.21, but our `Cargo.lock` is pinned on v0.4.20. We'll need to upgrade before we can adopt them without unlocking the `unstable_kv` feature.
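Until then, a plain workaround is to fold the fields into the message string before handing it to the logging macro; a minimal sketch (the helper name is invented for illustration):

```rust
// log 0.4.20 has no stable key-value support, so the structured fields
// are formatted into the message string instead.
fn format_submit_error(height: u64, err: &str) -> String {
    format!("Failed to submit checkpoint at height {height}: {err}")
}
```

The real code would then pass the resulting string to `log::error!`.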
ipc/provider/src/checkpoint.rs
Outdated
```rust
.await
.map_err(|e| {
    anyhow!(
        "cannot submit bottom up checkpoint at height {} due to: {e:}",
```
Suggested change:

```diff
-"cannot submit bottom up checkpoint at height {} due to: {e:}",
+"cannot submit bottom up checkpoint at height {} due to: {e:?}",
```
Not sure what `{e:}` is, but I suspect it's the same as `{e}`. There is also `{e:#}`.
This is from the original code... I changed it to just `{}`.
```rust
"submitted bottom up checkpoint({}) in parent at height {}",
event.height,
epoch
```
Suggested change:

```diff
-"submitted bottom up checkpoint({}) in parent at height {}",
-event.height,
-epoch
+height = event.height, epoch, "submitted bottom up checkpoint"
```
This is also from the original code. Your suggestion doesn't compile: `height` is not recognized in this `log::info!` macro. I ended up putting them in the string.
This is OK for a start, but it's closer to a batched submission than to actual parallelism. The goal was to have `max-parallelism` active threads submitting checkpoints at all times.

EDIT: apologies, I think I misread a loop condition. This is indeed performing parallel submissions!

Could you please comment on how this was tested?
ipc/provider/src/checkpoint.rs
Outdated
```rust
log::error!("Fail to submit checkpoint at height {height}: {e}");
drop(submission_permit);
Err(e)
} else {
    drop(submission_permit);
    Ok(())
}
```
If all we're doing is logging, we can use `inspect_err`.
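As a sketch of that suggestion (function names invented for illustration; `Result::inspect_err` requires Rust 1.76+), logging without consuming or remapping the error looks like:

```rust
// Stand-in for the real submission call; height 130 fails in this sketch.
fn submit(height: u64) -> Result<(), String> {
    if height == 130 {
        Err("nonce gap".to_string())
    } else {
        Ok(())
    }
}

// inspect_err logs the error as a side effect and passes the Result
// through unchanged, so the caller's permit-dropping logic stays the same.
fn submit_logged(height: u64) -> Result<(), String> {
    submit(height)
        .inspect_err(|e| eprintln!("Failed to submit checkpoint at height {height}: {e}"))
}
```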
I tested it by running the relayer locally with a lot of extra debug logging added (not checked in), and had these observations:

I think these observations are what we expected from the change.
This closes ENG-767.

We now support submitting bottom-up checkpoints with limited parallelism, configured by the `--max-parallel-submission` flag in the `ipc-cli checkpoint relayer` command.