
[State Sync] Support transaction syncing from the account bootstrap version #318

Merged 2 commits into aptos-labs:main on Mar 30, 2022

Conversation

@JoshLind (Contributor, Author) commented Mar 25, 2022:

Motivation

This PR updates the state sync driver to support transaction/output syncing from the account bootstrap version. For example, if a node downloads all account states at a version V, the driver will begin syncing transactions/outputs from V+1. The flow is as follows:

  1. The state sync bootstrapper will identify the most recent epoch ending version V and fetch the epoch ending ledger info at V.
  2. The bootstrapper will then fetch a transaction output list with proof at version V (to identify the expected state root hash and bootstrap the account merkle accumulator).
  3. The bootstrapper will fetch and stream all account states at version V and write the accounts to the database via the state_snapshot_receiver. Once all account states are written, the chunk executor will be reset (to force a read from the db).
  4. The continuous syncer will see that the node has bootstrapped at version V and start syncing from V+1.
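The flow above can be sketched as follows. All types and names here are illustrative stand-ins, not the actual aptos-core APIs; the sketch only models the key invariant that after snapshotting all account states at version V, continuous syncing resumes at V+1:

```rust
// Illustrative sketch of the bootstrap flow described above.
// None of these types come from aptos-core; they only model the idea
// that after downloading all account states at version V, the
// continuous syncer starts from V + 1.

type Version = u64;

/// A toy bootstrapper that records the snapshot version it synced to.
struct Bootstrapper {
    snapshot_version: Option<Version>,
}

impl Bootstrapper {
    fn new() -> Self {
        Bootstrapper { snapshot_version: None }
    }

    /// Steps 1-3: identify the latest epoch ending version V and
    /// (conceptually) fetch the output proof at V, stream all account
    /// states, write them via the state snapshot receiver, and reset
    /// the chunk executor.
    fn bootstrap_at(&mut self, latest_epoch_ending_version: Version) {
        // ... fetching, verification and storage writes elided ...
        self.snapshot_version = Some(latest_epoch_ending_version);
    }

    /// Step 4: the continuous syncer starts from V + 1.
    fn next_version_to_sync(&self) -> Option<Version> {
        self.snapshot_version.map(|v| v + 1)
    }
}

fn main() {
    let mut bootstrapper = Bootstrapper::new();
    bootstrapper.bootstrap_at(1_000_000);
    assert_eq!(bootstrapper.next_version_to_sync(), Some(1_000_001));
}
```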

The PR offers the following commits:

  1. Add a reset() method to the ChunkExecutorTrait. This is so that we can reset the executor after performing an account state sync.
  2. Refactor the DB backup functionality into a restore_utils module so that we can share it between db-restore and state sync and update the state sync driver to start syncing from the bootstrapping version.
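The refactor in the second commit can be sketched roughly as below. The module and function names are simplified stand-ins (not the real aptos-core code); the point is only that the database write helpers move into a shared module so both db-restore and state sync can call them:

```rust
// Illustrative sketch of commit 2: database write helpers pulled into a
// shared `restore_utils` module so both db-restore and state sync v2
// reuse them. All names are simplified stand-ins for the real code.

mod restore_utils {
    /// A stand-in ledger info: just (epoch, version).
    pub type LedgerInfo = (u64, u64);

    /// Shared helper: persist a batch of epoch ending ledger infos.
    /// (The database is modeled as a plain Vec here.)
    pub fn save_ledger_infos(db: &mut Vec<LedgerInfo>, infos: &[LedgerInfo]) {
        db.extend_from_slice(infos);
    }
}

// Both call sites now reuse the same helper instead of duplicating it:
fn db_restore(db: &mut Vec<restore_utils::LedgerInfo>) {
    restore_utils::save_ledger_infos(db, &[(0, 0)]);
}

fn state_sync_bootstrap(db: &mut Vec<restore_utils::LedgerInfo>) {
    restore_utils::save_ledger_infos(db, &[(1, 100)]);
}

fn main() {
    let mut db = Vec::new();
    db_restore(&mut db);
    state_sync_bootstrap(&mut db);
    assert_eq!(db.len(), 2);
}
```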

Remaining steps before this is finalized:

  • Update the state sync code to ensure all proofs are being checked correctly and add a test to ensure bootstrapped nodes can serve correct proofs.
  • Various small clean-ups and improvements (e.g., to the Aptos Data Client polling logic)
  • Chaos, failure and adversary testing (!)

Have you read the Contributing Guidelines on pull requests?

Yes.

Test Plan

There is a smoke test that covers some of this functionality. I've also manually inspected the logs to ensure the correct execution paths are taken. However, there's still a bunch of testing that needs to happen in relation to proofs and malicious data responses.

Related PRs

None, but this PR relates to: #245

) -> Result<Box<dyn StateSnapshotReceiver<AccountStateBlob>>> {
gauged_api("get_state_snapshot_receiver", || {
// Get the target version and expected root hash
@JoshLind (Contributor, Author) commented:

@msmouse -- this is the code that I mostly want to ensure is correct, i.e., once we've identified a version V to download all account states, we update the various data stores and start downloading all the states. Once all states are downloaded, the db should be in the correct state to move forward. From the smoke test, I don't see any complaints from the executor, but let me know if there's a data store or something that's missed 😄

Contributor commented:

The added logic doesn't really belong here (it's more specific to state sync).

At least give it the right name, if you think this whole set of logic indeed belongs in the same function?

Is it that the helpers require access to the stores, so you have to do it on the DB side? How about refactoring the restore_handler and using it from state sync? (I think the handler was created so that DB internal details are not exposed to the restore tooling.)

Contributor commented:

Also I don't quite understand what this is doing -- the txns are AT OR AFTER the state snapshot version, right? In my mind, we should do the snapshot first and then the ledger. Also strictly speaking, the txn, txn_out and events AT the version should be skipped, right? (you don't need them to start applying the next version)

@JoshLind (Contributor, Author) replied:

As discussed offline, I'll handle this by exposing an additional method in DbWriter that initializes all state once the accounts have all successfully synced 😄

@@ -261,6 +256,11 @@ impl<V: VMExecutor> ChunkExecutorTrait for ChunkExecutor<V> {
)?;
self.commit_chunk()
}

fn reset(&self) -> Result<()> {
Contributor commented:

I've always hated the reset() interface, and was planning to kill it once the legacy x_and_y() interfaces are removed together with state sync v1. Is it possible on the call site to just recreate the ChunkExecutor?

(Doesn't feel too strong either, probably not worth a huge refactor.)

@JoshLind (Contributor, Author) replied:

Yeah, I think it's going to be a big-ish refactor to recreate it. If you're okay with it, let's leave it as is for now. We can clean it up when we come to handle the fine-grained storage changes.
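The trade-off being discussed can be sketched minimally as below. The trait and struct are simplified stand-ins for the real ChunkExecutorTrait/ChunkExecutor (which have many more methods and real storage behind them); the sketch only shows why reset() on the trait is the cheap option versus recreating the executor:

```rust
use std::cell::RefCell;

// Simplified stand-in for ChunkExecutorTrait; the real trait lives in
// aptos-core and has many more methods.
trait ChunkExecutorTrait {
    /// Drop any cached in-memory state so the next operation re-reads
    /// from the database (the method added in this PR's first commit).
    fn reset(&self) -> Result<(), String>;
    fn cached_chunks(&self) -> usize;
}

struct ChunkExecutor {
    // Cached, speculative state that becomes stale after an account
    // state snapshot is written directly to the database.
    cache: RefCell<Vec<u64>>,
}

impl ChunkExecutorTrait for ChunkExecutor {
    fn reset(&self) -> Result<(), String> {
        self.cache.borrow_mut().clear();
        Ok(())
    }
    fn cached_chunks(&self) -> usize {
        self.cache.borrow().len()
    }
}

fn main() {
    let executor = ChunkExecutor { cache: RefCell::new(vec![1, 2, 3]) };
    // The alternative discussed above is to recreate the executor
    // instead, but that requires threading constructor arguments
    // through the call site.
    executor.reset().unwrap();
    assert_eq!(executor.cached_chunks(), 0);
}
```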

Comment on lines 20 to 22
/// This file contains utilities that are helpful for performing
/// database restore operations, as required by db-restore and
/// state sync v2.
Contributor commented:

nit: comments for the whole file are better written with `//!` (and maybe moved above the `use` statements?)

Contributor commented:

nit: maybe call it "restore_utils" to be a bit more explicit.

// The channel through which to notify the state snapshot receiver of new data chunks
state_snapshot_notifier: Option<mpsc::Sender<StorageDataChunk>>,

// The writer to storage (required for account state syncing)
storage: Arc<dyn DbWriter>,
}

impl StorageSynchronizer {
pub fn new<ChunkExecutor: ChunkExecutorTrait + 'static>(
// TODO(joshlind): the compiler isn't happy with automatically deriving this...
Contributor commented:

lol.. any idea why?

@JoshLind (Contributor, Author) replied Mar 28, 2022:

Yeah, it's because deriving Clone only really works for simple cases. There are a bunch of cases where it doesn't, e.g., rust-lang/rust#64417 and rust-lang/rust#26925. In our case, we have to manually define it. I'll update the comment to mention this.
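The issue referenced above (rust-lang/rust#26925) is that `#[derive(Clone)]` adds a `T: Clone` bound on every type parameter, even when the field only holds an `Arc<T>`, which is always cloneable. A standalone illustration of the manual impl (the struct here is a simplified stand-in, not the actual StorageSynchronizer):

```rust
use std::sync::Arc;

// `#[derive(Clone)]` would require `T: Clone` here, which fails for
// trait objects like `dyn DbWriter`. Cloning the Arc is all we need,
// so Clone is written by hand with no bound on T.

trait DbWriter {} // stand-in trait; object-safe and not Clone

struct Synchronizer<T: ?Sized> {
    storage: Arc<T>,
}

// Manual impl: no `T: Clone` bound; cloning the Arc bumps the refcount.
impl<T: ?Sized> Clone for Synchronizer<T> {
    fn clone(&self) -> Self {
        Synchronizer { storage: Arc::clone(&self.storage) }
    }
}

struct DummyWriter;
impl DbWriter for DummyWriter {}

fn main() {
    let s: Synchronizer<dyn DbWriter> = Synchronizer { storage: Arc::new(DummyWriter) };
    let s2 = s.clone();
    // Both handles share the same underlying storage.
    assert_eq!(Arc::strong_count(&s.storage), 2);
    let _ = s2;
}
```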

Comment on lines 1351 to 1356
// Save the target ledger info
utils::save_ledger_infos(
self.db.clone(),
self.ledger_store.clone(),
&[target_ledger_info],
)?;
Contributor commented:

At this point we should really make sure all the epoch ending ledger infos are available on every node -- assuming we don't publish any waypoints other than the genesis one, trust from any downstream node can only be established if all of those are available.

(Or maybe I misread it, and you did that outside of this?)

@JoshLind (Contributor, Author) replied Mar 28, 2022:

Yes, this already happens in the bootstrapper. Before we get to syncing any data, the bootstrapper will fetch all epoch changes and verify them, e.g.: https://github.com/aptos-labs/aptos-core/blob/main/state-sync/state-sync-v2/state-sync-driver/src/bootstrapper.rs#L32. It will also make sure the waypoint is verified along the way.

This function is only given the most recent ledger info to write to the DB. So, we're holding all epoch changes in memory but only writing the latest when we do the account state sync. Does this make sense?


@netlify netlify bot commented Mar 28, 2022:

Deploy Preview for aptos-developer-docs canceled.

🔨 Latest commit: 57d8f54
🔍 Latest deploy log: https://app.netlify.com/sites/aptos-developer-docs/deploys/6243a5a6fde48600085cc9f3

@JoshLind (Contributor, Author) commented Mar 29, 2022:

Thanks @msmouse, this is ready for another look. 😄 As discussed, the notable changes:

  • get_state_snapshot_receiver just returns a receiver (as it is, today).
  • Once all account states are written to the db, we call a new finalize method in the DbWriter. We then also save all epoch ending ledger infos from the current epoch to the synced epoch (so we ensure all nodes have all epoch changes).
  • I've renamed utils -> restore_utils.

Let me know if I've missed anything!

@JoshLind JoshLind requested a review from msmouse March 29, 2022 01:14
version: Version,
output_with_proof: TransactionOutputListWithProof,
) -> Result<()> {
// Update the merkle accumulator using the given proof
Contributor commented:

All I ask is to assert that output_with_proof has length 1 at this point, and to document that in the interface.

Better if we pass in types that imply that.. :D
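The requested guard could look roughly like this. The type below is a simplified stand-in for TransactionOutputListWithProof, and the function name is hypothetical; only the length-1 precondition comes from the review comment:

```rust
// Simplified stand-in for TransactionOutputListWithProof: a list of
// (transaction, output) pairs, modeled as strings, with proof data elided.
struct TransactionOutputListWithProof {
    transactions_and_outputs: Vec<(String, String)>,
}

/// Save the single transaction output at the snapshot version.
///
/// Precondition (documented in the interface, as the reviewer asks):
/// the caller fetched the proof for exactly one version, so the list
/// must have length 1.
fn save_transaction_at_snapshot_version(
    output_with_proof: &TransactionOutputListWithProof,
) -> Result<(), String> {
    let num_outputs = output_with_proof.transactions_and_outputs.len();
    if num_outputs != 1 {
        return Err(format!("Expected exactly 1 output, found {}!", num_outputs));
    }
    // ... update the merkle accumulator using the given proof ...
    Ok(())
}

fn main() {
    let ok = TransactionOutputListWithProof {
        transactions_and_outputs: vec![("txn".into(), "output".into())],
    };
    assert!(save_transaction_at_snapshot_version(&ok).is_ok());

    let bad = TransactionOutputListWithProof {
        transactions_and_outputs: vec![],
    };
    assert!(save_transaction_at_snapshot_version(&bad).is_err());
}
```

A stronger version of the reviewer's "types that imply that" suggestion would be to pass a single (transaction, output) pair instead of a list, making the invariant unrepresentable rather than asserted.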

/// Get a (stateful) state snapshot receiver.
///
/// Chunks of accounts need to be added via `add_chunk()` before finishing up with `finish_box()`
fn get_state_snapshot_receiver(
Contributor commented:

@JoshLind - You might want to rebase, as this API has changed; we no longer expose AccountStateBlob in the DbWriter.

@JoshLind (Contributor, Author) commented:

Thanks, all!

/land

@aptos-bot (Contributor) commented:

@JoshLind ❗ Unable to run the provided command on a closed PR

@JoshLind (Contributor, Author) commented:

What you talking about Aptos-bot? :P

/land

@aptos-bot (Contributor) commented:

Forge run: https://circleci.com/gh/aptos-labs/aptos-core/6436
Forge Test Result: all up: 1714 TPS, 2547 ms latency, 3750 ms p99 latency, no expired txns

@aptos-bot aptos-bot closed this in 57d8f54 Mar 30, 2022
@aptos-bot aptos-bot merged commit 57d8f54 into aptos-labs:main Mar 30, 2022
@aptos-bot (Contributor) commented:

Forge run: https://circleci.com/gh/aptos-labs/aptos-core/6457
Forge Test Result: all up: 1790 TPS, 2444 ms latency, 4150 ms p99 latency, no expired txns

@aptos-bot (Contributor) commented:

Forge run: https://circleci.com/gh/aptos-labs/aptos-core/6470
Forge Test Result: ``
