
[State Sync] Support transaction syncing from the account bootstrap version #318

Merged 2 commits into aptos-labs:main on Mar 30, 2022

Conversation

@JoshLind (Contributor, Author) commented Mar 25, 2022:

Motivation

This PR updates the state sync driver to support transaction/output syncing from the account bootstrap version. For example, if a node downloads all account states at a version V, the driver will begin syncing transactions/outputs from V+1. The flow is as follows:

  1. The state sync bootstrapper will identify the most recent epoch ending version V and fetch the epoch ending ledger info at V.
  2. The bootstrapper will then fetch a transaction output list with proof at version V (to identify the expected state root hash and bootstrap the account merkle accumulator).
  3. The bootstrapper will fetch and stream all account states at version V and write the accounts to the database via the state_snapshot_receiver. Once all account states are written, the chunk executor will be reset (to force a read from the db).
  4. The continuous syncer will see that the node has bootstrapped at version V and start syncing from V+1.
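The flow above can be sketched as follows. All types and names here are illustrative stand-ins, not the actual aptos-core APIs; the sketch only models the key invariant that after snapshotting all account states at version V, continuous syncing resumes at V+1:

```rust
// Illustrative sketch of the bootstrap flow described above.
// None of these types come from aptos-core; they only model the idea
// that after downloading all account states at version V, the
// continuous syncer starts from V + 1.

type Version = u64;

/// A toy bootstrapper that records the snapshot version it synced to.
struct Bootstrapper {
    snapshot_version: Option<Version>,
}

impl Bootstrapper {
    fn new() -> Self {
        Bootstrapper { snapshot_version: None }
    }

    /// Steps 1-3: identify the latest epoch ending version V and
    /// (conceptually) fetch the output proof at V, stream all account
    /// states, write them via the state snapshot receiver, and reset
    /// the chunk executor.
    fn bootstrap_at(&mut self, latest_epoch_ending_version: Version) {
        // ... fetching, verification and storage writes elided ...
        self.snapshot_version = Some(latest_epoch_ending_version);
    }

    /// Step 4: the continuous syncer starts from V + 1.
    fn next_version_to_sync(&self) -> Option<Version> {
        self.snapshot_version.map(|v| v + 1)
    }
}

fn main() {
    let mut bootstrapper = Bootstrapper::new();
    bootstrapper.bootstrap_at(1_000_000);
    assert_eq!(bootstrapper.next_version_to_sync(), Some(1_000_001));
}
```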

The PR offers the following commits:

  1. Add a reset() method to the ChunkExecutorTrait. This is so that we can reset the executor after performing an account state sync.
  2. Refactor the DB backup functionality into a restore_utils module so that we can share it between db-restore and state sync and update the state sync driver to start syncing from the bootstrapping version.
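The refactor in the second commit can be sketched roughly as below. The module and function names are simplified stand-ins (not the real aptos-core code); the point is only that the database write helpers move into a shared module so both db-restore and state sync can call them:

```rust
// Illustrative sketch of commit 2: database write helpers pulled into a
// shared `restore_utils` module so both db-restore and state sync v2
// reuse them. All names are simplified stand-ins for the real code.

mod restore_utils {
    /// A stand-in ledger info: just (epoch, version).
    pub type LedgerInfo = (u64, u64);

    /// Shared helper: persist a batch of epoch ending ledger infos.
    /// (The database is modeled as a plain Vec here.)
    pub fn save_ledger_infos(db: &mut Vec<LedgerInfo>, infos: &[LedgerInfo]) {
        db.extend_from_slice(infos);
    }
}

// Both call sites now reuse the same helper instead of duplicating it:
fn db_restore(db: &mut Vec<restore_utils::LedgerInfo>) {
    restore_utils::save_ledger_infos(db, &[(0, 0)]);
}

fn state_sync_bootstrap(db: &mut Vec<restore_utils::LedgerInfo>) {
    restore_utils::save_ledger_infos(db, &[(1, 100)]);
}

fn main() {
    let mut db = Vec::new();
    db_restore(&mut db);
    state_sync_bootstrap(&mut db);
    assert_eq!(db.len(), 2);
}
```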

Remaining steps before this is finalized:

  • Update the state sync code to ensure all proofs are being checked correctly and add a test to ensure bootstrapped nodes can serve correct proofs.
  • Various small clean-ups and improvements (e.g., to the Aptos Data Client polling logic)
  • Chaos, failure and adversary testing (!)

Have you read the Contributing Guidelines on pull requests?

Yes.

Test Plan

There is a smoke test that covers some of this functionality. I've also manually inspected the logs to ensure the correct execution paths are taken. However, there's still a bunch of testing that needs to happen in relation to proofs and malicious data responses.

Related PRs

None, but this PR relates to: #245

) -> Result<Box<dyn StateSnapshotReceiver<AccountStateBlob>>> {
gauged_api("get_state_snapshot_receiver", || {
// Get the target version and expected root hash
@JoshLind (Contributor, Author) commented:

@msmouse -- this is the code that I mostly want to ensure is correct, i.e., once we've identified a version V to download all account states, we update the various data stores and start downloading all the states. Once all states are downloaded, the db should be in the correct state to move forward. From the smoke test, I don't see any complaints from the executor, but let me know if there's a data store or something that's missed 😄

Contributor commented:

The added logic doesn't really belong here (it's more specific to state sync).

At least give it the right name, if you think this whole set of logic indeed belongs in the same function?

Is it that the helpers require access to the stores, so you have to do it on the DB side? How about refactoring the restore_handler and using it from state sync? (I think the handler was created so that DB internal details are not exposed to the restore tooling.)

Contributor commented:

Also I don't quite understand what this is doing -- the txns are AT OR AFTER the state snapshot version, right? In my mind, we should do the snapshot first and then the ledger. Also strictly speaking, the txn, txn_out and events AT the version should be skipped, right? (you don't need them to start applying the next version)

@JoshLind (Contributor, Author) replied:

As discussed offline, I'll handle this by exposing an additional method in DbWriter that initializes all state once the accounts have all successfully synced 😄

@@ -261,6 +256,11 @@ impl<V: VMExecutor> ChunkExecutorTrait for ChunkExecutor<V> {
)?;
self.commit_chunk()
}

fn reset(&self) -> Result<()> {
Contributor commented:

I've always hated the reset() interface, and was planning to kill it once the legacy x_and_y() interfaces are removed together with state sync v1. Is it possible on the call site to just recreate the ChunkExecutor?

(Doesn't feel too strong either, probably not worth a huge refactor.)

@JoshLind (Contributor, Author) replied:

Yeah, I think it's going to be a big-ish refactor to recreate it. If you're okay with it, let's leave it as is for now. We can clean it up when we come to handle the fine-grained storage changes.
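The trade-off being discussed can be sketched minimally as below. The trait and struct are simplified stand-ins for the real ChunkExecutorTrait/ChunkExecutor (which have many more methods and real storage behind them); the sketch only shows why reset() on the trait is the cheap option versus recreating the executor:

```rust
use std::cell::RefCell;

// Simplified stand-in for ChunkExecutorTrait; the real trait lives in
// aptos-core and has many more methods.
trait ChunkExecutorTrait {
    /// Drop any cached in-memory state so the next operation re-reads
    /// from the database (the method added in this PR's first commit).
    fn reset(&self) -> Result<(), String>;
    fn cached_chunks(&self) -> usize;
}

struct ChunkExecutor {
    // Cached, speculative state that becomes stale after an account
    // state snapshot is written directly to the database.
    cache: RefCell<Vec<u64>>,
}

impl ChunkExecutorTrait for ChunkExecutor {
    fn reset(&self) -> Result<(), String> {
        self.cache.borrow_mut().clear();
        Ok(())
    }
    fn cached_chunks(&self) -> usize {
        self.cache.borrow().len()
    }
}

fn main() {
    let executor = ChunkExecutor { cache: RefCell::new(vec![1, 2, 3]) };
    // The alternative discussed above is to recreate the executor
    // instead, but that requires threading constructor arguments
    // through the call site.
    executor.reset().unwrap();
    assert_eq!(executor.cached_chunks(), 0);
}
```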

Comment on lines 20 to 22
/// This file contains utilities that are helpful for performing
/// database restore operations, as required by db-restore and
/// state sync v2.
Contributor commented:

nit: comments for the whole file are better written with `//!` (and maybe moved above the `use` statements?)

Contributor commented:

nit: maybe call it "restore_utils" to be a bit more explicit.

// The channel through which to notify the state snapshot receiver of new data chunks
state_snapshot_notifier: Option<mpsc::Sender<StorageDataChunk>>,

// The writer to storage (required for account state syncing)
storage: Arc<dyn DbWriter>,
}

impl StorageSynchronizer {
pub fn new<ChunkExecutor: ChunkExecutorTrait + 'static>(
// TODO(joshlind): the compiler isn't happy with automatically deriving this...
Contributor commented:

lol.. any idea why?

@JoshLind (Contributor, Author) replied Mar 28, 2022:

Yeah, it's because deriving Clone only really works for simple cases. There are a bunch of cases where it doesn't, e.g., rust-lang/rust#64417 and rust-lang/rust#26925. In our case, we have to manually define it. I'll update the comment to mention this.
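The issue referenced above (rust-lang/rust#26925) is that `#[derive(Clone)]` adds a `T: Clone` bound on every type parameter, even when the field only holds an `Arc<T>`, which is always cloneable. A standalone illustration of the manual impl (the struct here is a simplified stand-in, not the actual StorageSynchronizer):

```rust
use std::sync::Arc;

// `#[derive(Clone)]` would require `T: Clone` here, which fails for
// trait objects like `dyn DbWriter`. Cloning the Arc is all we need,
// so Clone is written by hand with no bound on T.

trait DbWriter {} // stand-in trait; object-safe and not Clone

struct Synchronizer<T: ?Sized> {
    storage: Arc<T>,
}

// Manual impl: no `T: Clone` bound; cloning the Arc bumps the refcount.
impl<T: ?Sized> Clone for Synchronizer<T> {
    fn clone(&self) -> Self {
        Synchronizer { storage: Arc::clone(&self.storage) }
    }
}

struct DummyWriter;
impl DbWriter for DummyWriter {}

fn main() {
    let s: Synchronizer<dyn DbWriter> = Synchronizer { storage: Arc::new(DummyWriter) };
    let s2 = s.clone();
    // Both handles share the same underlying storage.
    assert_eq!(Arc::strong_count(&s.storage), 2);
    let _ = s2;
}
```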

Comment on lines 1351 to 1356
// Save the target ledger info
utils::save_ledger_infos(
self.db.clone(),
self.ledger_store.clone(),
&[target_ledger_info],
)?;
Contributor commented:

At this point we should really make sure all the epoch ending ledger infos are available on every node -- assuming we don't publish any waypoints other than the genesis one, trust from any downstream node can only be established if all of those are available.

(Or maybe I misread it, and you did that outside of this?)

@JoshLind (Contributor, Author) replied Mar 28, 2022:

Yes, this already happens in the bootstrapper. Before we get to syncing any data, the bootstrapper will fetch all epoch changes and verify them, e.g.: https://github.com/aptos-labs/aptos-core/blob/main/state-sync/state-sync-v2/state-sync-driver/src/bootstrapper.rs#L32. It will also make sure the waypoint is verified along the way.

This function is only given the most recent ledger info to write to the DB. So, we're holding all epoch changes in memory but only writing the latest when we do the account state sync. Does this make sense?


@netlify netlify bot commented Mar 28, 2022:

Deploy Preview for aptos-developer-docs canceled.

🔨 Latest commit: 57d8f54
🔍 Latest deploy log: https://app.netlify.com/sites/aptos-developer-docs/deploys/6243a5a6fde48600085cc9f3

@JoshLind (Contributor, Author) commented Mar 29, 2022:

Thanks @msmouse, this is ready for another look. 😄 As discussed, the notable changes:

  • get_state_snapshot_receiver just returns a receiver (as it is, today).
  • Once all account states are written to the db, we call a new finalize method in the DbWriter. We then also save all epoch ending ledger infos from the current epoch to the synced epoch (so we ensure all nodes have all epoch changes).
  • I've renamed utils -> restore_utils.

Let me know if I've missed anything!

@JoshLind JoshLind requested a review from msmouse March 29, 2022 01:14
version: Version,
output_with_proof: TransactionOutputListWithProof,
) -> Result<()> {
// Update the merkle accumulator using the given proof
Contributor commented:

All I ask is to assert that output_with_proof has length 1 at this point, and to document that in the interface.

Better if we pass in types that imply that.. :D
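The requested guard could look roughly like this. The type below is a simplified stand-in for TransactionOutputListWithProof, and the function name is hypothetical; only the length-1 precondition comes from the review comment:

```rust
// Simplified stand-in for TransactionOutputListWithProof: a list of
// (transaction, output) pairs, modeled as strings, with proof data elided.
struct TransactionOutputListWithProof {
    transactions_and_outputs: Vec<(String, String)>,
}

/// Save the single transaction output at the snapshot version.
///
/// Precondition (documented in the interface, as the reviewer asks):
/// the caller fetched the proof for exactly one version, so the list
/// must have length 1.
fn save_transaction_at_snapshot_version(
    output_with_proof: &TransactionOutputListWithProof,
) -> Result<(), String> {
    let num_outputs = output_with_proof.transactions_and_outputs.len();
    if num_outputs != 1 {
        return Err(format!("Expected exactly 1 output, found {}!", num_outputs));
    }
    // ... update the merkle accumulator using the given proof ...
    Ok(())
}

fn main() {
    let ok = TransactionOutputListWithProof {
        transactions_and_outputs: vec![("txn".into(), "output".into())],
    };
    assert!(save_transaction_at_snapshot_version(&ok).is_ok());

    let bad = TransactionOutputListWithProof {
        transactions_and_outputs: vec![],
    };
    assert!(save_transaction_at_snapshot_version(&bad).is_err());
}
```

A stronger version of the reviewer's "types that imply that" suggestion would be to pass a single (transaction, output) pair instead of a list, making the invariant unrepresentable rather than asserted.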

/// Get a (stateful) state snapshot receiver.
///
/// Chunks of accounts need to be added via `add_chunk()` before finishing up with `finish_box()`
fn get_state_snapshot_receiver(
Contributor commented:

@JoshLind - You might want to rebase, as this API has changed; we no longer expose AccountStateBlob in the DbWriter.

@JoshLind (Contributor, Author) commented:

Thanks, all!

/land

@aptos-bot (Contributor) commented:

@JoshLind ❗ Unable to run the provided command on a closed PR

@JoshLind (Contributor, Author) commented:

What you talking about Aptos-bot? :P

/land

@aptos-bot (Contributor) commented:

Forge run: https://circleci.com/gh/aptos-labs/aptos-core/6436
Forge Test Result: all up: 1714 TPS, 2547 ms latency, 3750 ms p99 latency, no expired txns

@aptos-bot aptos-bot closed this in 57d8f54 Mar 30, 2022
@aptos-bot aptos-bot merged commit 57d8f54 into aptos-labs:main Mar 30, 2022
@aptos-bot (Contributor) commented:

Forge run: https://circleci.com/gh/aptos-labs/aptos-core/6457
Forge Test Result: all up: 1790 TPS, 2444 ms latency, 4150 ms p99 latency, no expired txns

@aptos-bot (Contributor) commented:

Forge run: https://circleci.com/gh/aptos-labs/aptos-core/6470
Forge Test Result: ``
