Support submit checkpoint in parallel #840
Conversation
```rust
"submitted bottom up checkpoint({}) in parent at height {}",
event.height,
epoch
.unwrap();
```
Just curious about the behaviour when the number of checkpoints to submit (say 10) is more than the number of permits (say 2). So `acquire_owned` will actually block while 2 checkpoints are being submitted, until one of the 2 actually finishes? And even if `submit_checkpoint` fails and the thread crashes, `drop(submission_permit)` will still be executed, so there's no deadlock here, right?
> So acquire_owned will actually block if 2 checkpoints are being submitted until one of 2 actually finishes?

Yes, it will literally block (the `for` loop just pauses).

> Even if submit_checkpoint fails and the thread crashes, drop(submission_permit) will be executed and no dead lock here right?

My last version had an issue where the async task could panic before dropping the permit (because of the use of `unwrap()`). It's been fixed here, so the permit is dropped whether submitting the checkpoint succeeds or fails.
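The permit pattern under discussion can be sketched with a std-only analogue of tokio's semaphore, using a channel of tokens in place of `Semaphore::acquire_owned` (the names and the "odd heights fail" rule below are invented for illustration, not the real relayer code):

```rust
use std::sync::mpsc;
use std::thread;

// A permit that returns its token to the channel when dropped, even if the
// owning thread unwinds from a panic (mirrors dropping a tokio
// OwnedSemaphorePermit on any exit path).
struct Permit(mpsc::Sender<()>);

impl Drop for Permit {
    fn drop(&mut self) {
        let _ = self.0.send(());
    }
}

// Stand-in for the real submission; odd heights fail (arbitrary sketch rule).
fn submit_checkpoint(height: u64) -> Result<(), String> {
    if height % 2 == 1 {
        Err(format!("failed at height {height}"))
    } else {
        Ok(())
    }
}

/// Submit `heights` with at most `max_parallel` submissions in flight.
/// Returns (successes, failures).
fn submit_all(heights: Vec<u64>, max_parallel: usize) -> (usize, usize) {
    let (permit_tx, permit_rx) = mpsc::channel::<()>();
    for _ in 0..max_parallel {
        permit_tx.send(()).unwrap(); // seed the initial permits
    }
    let mut handles = Vec::new();
    for h in heights {
        // Like acquire_owned: the loop pauses here until a permit frees up.
        permit_rx.recv().unwrap();
        let permit = Permit(permit_tx.clone());
        handles.push(thread::spawn(move || {
            let _permit = permit; // returned on drop, success or failure
            submit_checkpoint(h).is_ok()
        }));
    }
    let (mut ok, mut err) = (0, 0);
    for handle in handles {
        if handle.join().unwrap() { ok += 1 } else { err += 1 }
    }
    (ok, err)
}
```

Because the token is returned from `Drop`, a failed (or even panicking) submission still frees its slot, which is why the loop cannot deadlock.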
ipc/provider/src/checkpoint.rs
Outdated
```rust
epoch
.unwrap();
all_submit_tasks.push(tokio::task::spawn(async move {
    Self::submit_checkpoint(parent_handler_clone, submitter, bundle, event).await;
```
Will this actually work in parallel? There is a sequential enforcement on the checkpoint height, so say we are submitting two checkpoints at heights 100 and 130. If height 100 is not submitted, 130 cannot be submitted. My question is really about a sort of race: if height 100 is still in the memory pool and the transaction for height 130 is submitted, will it pass the gas estimation without errors? Or will it be rejected because the nonce is not set incrementally?
I'm not entirely sure whether the submission of 130 can succeed while the submission of 100 is still in progress... @aakoshh, can you share your opinion here?
I don't know how Lotus works exactly, but let's say that it:
- limits the number of pending transactions in the mempool to 4
- orders the pending transactions by nonce
- applies them to an in-memory check-state like fendermint
If that were true, then as long as the transactions are created in a logical order, their order could be restored by Lotus when they are included in a block. As for the checks, it probably depends on whether that particular node has seen the preceding transactions or not.
Note that we would still send the transactions sequentially, we just don't want to wait for the receipt between each submission.
Indeed, blockchain nodes accept gapped transactions via their RPC and via the mpool, because they live in an eventually consistent universe. Note that any of these can happen, in addition to other scenarios, of course:
- The user submits transactions in sequence to a load-balanced RPC endpoint, and they land gapped in the backend nodes.
- The user submits out-of-order transactions to the same node, which is just a subcase of the gapped situation.
- The user submits many transactions, all in order, to the same backend node, but each tx takes a different gossip propagation route through the network and arrives at various nodes in various non-sequential and gapped orders.
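The nonce-ordering behaviour described in this thread can be sketched with a toy single-account mempool (the type and method names are invented for illustration): gapped transactions are accepted and parked rather than rejected, and only a consecutive run starting at the account's next nonce becomes executable.

```rust
use std::collections::BTreeMap;

/// Toy mempool for a single account: accepts transactions in any order,
/// but only releases a consecutive run starting at `next_nonce`.
struct Mempool {
    next_nonce: u64,
    pending: BTreeMap<u64, String>, // nonce -> payload
}

impl Mempool {
    fn new(next_nonce: u64) -> Self {
        Self { next_nonce, pending: BTreeMap::new() }
    }

    /// Gapped transactions are parked, not rejected.
    fn add(&mut self, nonce: u64, payload: &str) {
        self.pending.insert(nonce, payload.to_string());
    }

    /// Drain the executable prefix in nonce order; anything behind a gap
    /// stays pending until the missing nonce arrives.
    fn take_executable(&mut self) -> Vec<String> {
        let mut out = Vec::new();
        while let Some(p) = self.pending.remove(&self.next_nonce) {
            out.push(p);
            self.next_nonce += 1;
        }
        out
    }
}
```

Under this model, submitting the height-130 transaction while height 100 is still pending is harmless: it simply waits in the pool until the lower-nonce transaction lands, which matches the "eventually consistent" behaviour described above.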
ipc/provider/src/checkpoint.rs
Outdated
```rust
Err(_err) => {
    log::debug!("Failed to submit checkpoint");
```
Suggested change:

```diff
-Err(_err) => {
-    log::debug!("Failed to submit checkpoint");
+Err(err) => {
+    log::error!(error = err, height = bundle.height, "Failed to submit checkpoint");
```
Not sure how to log the height, but ignoring errors and even demoting the log to `debug` level is a recipe for frustration.
Agreed. Changed to `error` level logging.
The `error = err`, `height = bundle.height` do not seem to work. I just log them in a string instead.
@aakoshh key values were added to the `log` crate in v0.4.21, but our `Cargo.lock` is pinned on v0.4.20. We'll need to upgrade before we can adopt them without unlocking the `unstable_kv` feature.
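Until then, a plain workaround is to fold the fields into the message string before handing it to the logging macro; a minimal sketch (the helper name is invented for illustration):

```rust
// log 0.4.20 has no stable key-value support, so the structured fields
// are formatted into the message string instead.
fn format_submit_error(height: u64, err: &str) -> String {
    format!("Failed to submit checkpoint at height {height}: {err}")
}
```

The real code would then pass the resulting string to `log::error!`.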
ipc/provider/src/checkpoint.rs
Outdated
```rust
.await
.map_err(|e| {
    anyhow!(
        "cannot submit bottom up checkpoint at height {} due to: {e:}",
```
Suggested change:

```diff
-"cannot submit bottom up checkpoint at height {} due to: {e:}",
+"cannot submit bottom up checkpoint at height {} due to: {e:?}",
```
Not sure what `{e:}` is, but I suspect it's the same as `{e}`. There is also `{e:#}`.
This is from the original code... I changed it to just `{}`.
```rust
"submitted bottom up checkpoint({}) in parent at height {}",
event.height,
epoch
```
Suggested change:

```diff
-"submitted bottom up checkpoint({}) in parent at height {}",
-event.height,
-epoch
+height = event.height, epoch, "submitted bottom up checkpoint"
```
This is also from the original code. Your suggestion doesn't compile: `height` is not recognized in this `log::info!` macro. I ended up putting them in the string.
This is OK for a start, but it's closer to a batched submission than to actual parallelism. The goal was to have `max-parallelism` active threads submitting checkpoints at all times.

EDIT: apologies, I think I misread a loop condition. This is indeed performing parallel submissions!

Could you please comment on how this was tested?
ipc/provider/src/checkpoint.rs
Outdated
```rust
log::error!("Fail to submit checkpoint at height {height}: {e}");
drop(submission_permit);
Err(e)
} else {
    drop(submission_permit);
    Ok(())
}
```
If all we're doing is logging, we can use `inspect_err`.
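As a sketch of that suggestion (function names invented for illustration; `Result::inspect_err` requires Rust 1.76+), logging without consuming or remapping the error looks like:

```rust
// Stand-in for the real submission call; height 130 fails in this sketch.
fn submit(height: u64) -> Result<(), String> {
    if height == 130 {
        Err("nonce gap".to_string())
    } else {
        Ok(())
    }
}

// inspect_err logs the error as a side effect and passes the Result
// through unchanged, so the caller's permit-dropping logic stays the same.
fn submit_logged(height: u64) -> Result<(), String> {
    submit(height)
        .inspect_err(|e| eprintln!("Failed to submit checkpoint at height {height}: {e}"))
}
```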
I tested it by running the relayer locally with a lot of extra debug logging added (not checked in), and had these observations:

I think these observations are what we expected from the change.
This closes ENG-767.

We now support submitting bottom-up checkpoints with limited parallelism, configured by the `--max-parallel-submission` flag in the `ipc-cli checkpoint relayer` command.