agd does not support joining with state sync #3769

dckc · 2021-08-27T04:27:47Z

Describe the bug

While there is a practice of sharing informal snapshots, the only in-protocol way to join an Agoric chain, currently, is to replay all transactions from genesis; this may take days or weeks. Contrast this with the norm in the Cosmos community:

With block sync a node is downloading all of the data of an application from genesis and verifying it. With state sync your node will download data related to the head or near the head of the chain and verify the data. This leads to drastically shorter times for joining a network.
-- State Sync | Tendermint Core

Other blockchain systems have similar features. In Bitcoin and Ethereum, software releases include a hash of a known-good state; this way, new nodes can download a state that is not more than a few months old and start verifying from there.

Design Notes

consensus on swingset kernel DB state: currently, the swingset DB state is not part of consensus; only the sequence of messages. Mnemonic: "you can think whatever you want, as long as you say the same thing that everyone else says". Fast sync most likely requires including (most of) the KB state in consensus Merkle tree proofs.
consensus on xsnap snapshots: currently, XS snapshots are not part of consensus; we don't require that all validators deterministically get exactly the same bytes in their snapshots. (In particular, xsnap + SES boot not deterministic #2776 is an observed case of non-determinism in snapshots). Fast sync most likely requires that we include snapshots in consensus.
cosmos-sdk hooks to publish swingset state: baking our own system is undesirable; we just need some hooks to be able to leverage the Cosmos/Tendermint mechanisms for shipping states from select RPC nodes to the joining node (Add hooks to allow app modules to add things to state-sync cosmos/cosmos-sdk#7340 (comment))

cc @michaelfig @erights

rowgraus · 2021-08-27T17:04:16Z

Conservatively, I am putting it on phase 1. Feel free to postpone it as you see fit.

Tend to agree unless there are good arguments for postponing this

dckc · 2021-10-22T18:01:44Z

@dtribble suggests that as long as catching up is 3x to 5x faster than the running chain, we can postpone this to a later milestone.

possible optimization:

replay from snapshot in parallel

(I think we only replay on-line vats, of which there is a bounded number, so that optimization doesn't seem worthwhile)

Tartuffo · 2022-01-27T00:57:10Z

First need to measure how long it currently takes based on estimated number of blocks, and or calc the blocks / time of recovery. Create sub-ticket for this initial measurement.

dckc · 2022-03-11T02:59:50Z

some mainnet0 data shows new nodes should eventually catch up, but it takes a long time.
#4106 (comment)

The validator community seems to prefer informal snapshot sharing.
Agoric/testnet-notes#42

mhofman · 2022-03-14T18:43:56Z

A few quick thoughts:

Without state sync, catching up a new validator relies on the same execution paths as a live execution. Whatever work we would do to optimize the time it takes to catch up from scratch, especially at the SwingSet / JS level, would be optimization we'd do for normal execution.
That means if we ever get to a fairly high utilization* (regardless of parallelization), it won't be possible to catch up by mere replay in a meaningful amount of time (unless you throw a lot more compute power at the catchup node, to reduce utilization)
During the catch up time, the chain continues to make progress, which has to be caught up to as well. IOU a formula

*: My understanding is that the meaningful number to measure is the amount of time spent in Swingset, compared to time elapsed since genesis. That gives Swingset utilization. If you consider the cosmos processing to be negligible comparatively, you can then calculate time it'd take to rebuild all the JS state through catch up. It also gives a lower bound.

erights · 2022-03-14T20:13:29Z

Can we replay each vat separately and in parallel?

michaelfig · 2022-03-15T04:02:51Z

Can we replay each vat separately and in parallel?

We need "state sync" to jump to a snapshot of the kernel data close to the current block or else we can only replay then verify a single block at a time since genesis. That's really slow, even if we do more in parallel.

dckc · 2022-03-15T18:36:28Z

@warner further to the discussion we just had about trade-offs between performance and integrity of snapshots, as I mentioned, our validator community is doing some informal snapshot sharing currently: Agoric/testnet-notes#42.

I looked around and found that it seems to take about 3.5min of downtime to do a daily mainnet0 snapshot.
Some follow-up step to make the snapshot available seems to take significantly longer; I'm not sure what going on in there...

---------------------------

|2022-03-14_01:00:01| LAST_BLOCK_HEIGHT 4103449
|2022-03-14_01:00:01| Stopping agoric.service
0
|2022-03-14_01:00:01| Creating new snapshot
|2022-03-14_01:03:33| Starting agoric.service
0
|2022-03-14_01:03:33| Moving new snapshot to /home/snapshots/data/agoric
155G	/home/snapshots/snaps/agoric_2022-03-14.tar
|2022-03-14_02:19:47| Done
---------------------------

|2022-03-15_01:00:01| LAST_BLOCK_HEIGHT 4116868
|2022-03-15_01:00:01| Stopping agoric.service
0
|2022-03-15_01:00:01| Creating new snapshot
|2022-03-15_01:03:27| Starting agoric.service
0
|2022-03-15_01:03:27| Moving new snapshot to /home/snapshots/data/agoric
156G	/home/snapshots/snaps/agoric_2022-03-15.tar
|2022-03-15_03:06:14| Done
---------------------------

-- https://snapshots.stake2.me/agoric/agoric_log.txt

dckc · 2022-11-02T19:02:28Z

one validator notes:

the post upgrade mainnet snapshot is already nearly 2GB in size.

warner · 2022-11-03T21:51:28Z

One data point: I watched a node crash today, it missed about 200s before getting restarted. The restart took 2m10s to replay vat transcripts enough to begin processing blocks again, then took another 33s to replay the 95-ish (empty) missed blocks, after which is was caught up and following properly again.

The vat-transcript replay time is roughly bounded by the frequency of our heap snapshots: we take a heap snapshot every 2000 deliveries, so no single vat should ever need to replay more than 2000 deliveries at reboot time, so reboot time will be random but roughly constant (depends on deliveryNum % 2000 summed across all vats).

Note that this doesn't tell us anything about how long it takes to start up a whole new validator from scratch.

mhofman · 2022-11-14T00:24:10Z

After discussing state-sync the other day, @arirubinstein mentioned validators leverage state sync to work around a cosmos DB pruning issue: they start a new node state syncing from their existing node to prune their DB.

In case for some reason we can't figure out state sync by the time the DBs grows out too large, we should check if the following rough hack may work:

shut down node at height of a block N that won't be pruned from cosmos DB.
Make copy of swingset state dir
Restart node
Start new node from swingset state copy, and using cosmos state-sync at same block height N to re-populate the IAVL tree.

For consistency protection, Swingset saves the block height it last committed, and checks that the next block it sees is either the next block N + 1, or the same block N (in which case it doesn't execute anything but simply replays calls it previously made back to the go / cosmos side).

dckc · 2022-11-28T18:16:25Z

... as long as catching up is 3x to 5x faster than the running chain ...

a recent data point: 26hrs to catch up on 26 chain days. So 24x.

dckc added bug Something isn't working SwingSet package: SwingSet xsnap the XS execution tool labels Aug 27, 2021

dckc added this to the Mainnet: Phase 1 - Treasury Launch milestone Aug 27, 2021

dckc assigned warner, dckc and rowgraus Aug 27, 2021

dckc removed their assignment Aug 30, 2021

michaelfig changed the title ~~joining as a new validator is too slow without fast state sync~~ joining as a new validator is too slow without "state sync" Sep 5, 2021

michaelfig mentioned this issue Sep 5, 2021

cannot restart validator: not connecting to peers Agoric/testnet-notes#26

Closed

warner mentioned this issue Jan 26, 2022

add "flight recorder", triggered by consensus failure #3742

Closed

Tartuffo added the MN-1 label Feb 2, 2022

warner added the needs-design label Feb 7, 2022

Tartuffo removed the MN-1 label Feb 7, 2022

Tartuffo removed this from the Mainnet: Phase 1 - RUN Protocol milestone Feb 8, 2022

Tartuffo unassigned rowgraus Feb 11, 2022

warner mentioned this issue Feb 16, 2022

maybe: HostDB.publishStore: public/provable merkle-tree storage, liveslots-supported Notifiers #4559

Open

dckc mentioned this issue Mar 14, 2022

Need at least one Canary (follower) node on Mainnet #4478

Closed

warner mentioned this issue Mar 21, 2022

kernel API for upgrading vats #1848

Closed

Tartuffo added this to the Mainnet 1 milestone Mar 23, 2022

Tartuffo modified the milestones: Mainnet 1, RUN Protocol RC0 Apr 5, 2022

dckc changed the title ~~joining as a new validator is too slow without "state sync"~~ agd does not support joining with state sync Oct 15, 2022

dckc mentioned this issue Oct 15, 2022

Validator guides for AgD, release notes #6276

Closed

dckc added performance Performance related issues cosmic-swingset package: cosmic-swingset labels Nov 2, 2022

This was referenced Nov 14, 2022

API to get summary of swingstore block changes #6562

Closed

chore(deps): update source-map@^0.7.3 to 0.7.4 #6560

Merged

Tartuffo added migrate-product-backlog and removed migrate-product-backlog labels Nov 17, 2022

mhofman mentioned this issue Jan 10, 2023

Include extra replay data in vat transcripts #6770

Closed

warner mentioned this issue Jan 10, 2023

swing-store export/restore API (for state-sync) #6773

Closed

warner mentioned this issue Jan 25, 2023

VM "bulldozer" upgrade #6644

Closed

warner assigned mhofman and unassigned warner Jan 30, 2023

rowgraus added this to the Vaults EVP milestone Feb 6, 2023

rowgraus added the vaults_triage DO NOT USE label Feb 6, 2023

mhofman added the Epic label Feb 8, 2023

This was referenced Feb 24, 2023

feat: agoric-sdk state-sync support agoric-labs/cosmos-sdk#297

Merged

Keep pending actions in actionQueue #7102

Closed

dckc mentioned this issue Mar 8, 2023

Office Hours: Agoric Dev Agoric/documentation#576

Closed

This was referenced Mar 13, 2023

idea for giving each worker its own (local) SQLite vatStore DB #6254

Open

feat(runner): Enable starting testnet follower from state-sync Agoric/testnet-load-generator#105

Merged

This was referenced Mar 23, 2023

feat: Support state-sync #7225

Merged

refactor: move around some helpers before state-sync #7258

Merged

dckc mentioned this issue Apr 14, 2023

add state sync to node start documentation Agoric/documentation#792

Closed

mergify bot closed this as completed in #7225 Apr 19, 2023

mhofman mentioned this issue Aug 19, 2023

Persistent cosmic-swingset state #8222

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

agd does not support joining with state sync #3769

agd does not support joining with state sync #3769

dckc commented Aug 27, 2021 •

edited

Loading

rowgraus commented Aug 27, 2021

dckc commented Oct 22, 2021

Tartuffo commented Jan 27, 2022 •

edited

Loading

dckc commented Mar 11, 2022

mhofman commented Mar 14, 2022

erights commented Mar 14, 2022 via email •

edited by dckc

Loading

michaelfig commented Mar 15, 2022

dckc commented Mar 15, 2022 •

edited

Loading

dckc commented Nov 2, 2022

warner commented Nov 3, 2022

mhofman commented Nov 14, 2022

dckc commented Nov 28, 2022

agd does not support joining with state sync #3769

agd does not support joining with state sync #3769

Comments

dckc commented Aug 27, 2021 • edited Loading

Describe the bug

Design Notes

rowgraus commented Aug 27, 2021

dckc commented Oct 22, 2021

Tartuffo commented Jan 27, 2022 • edited Loading

dckc commented Mar 11, 2022

mhofman commented Mar 14, 2022

erights commented Mar 14, 2022 via email • edited by dckc Loading

michaelfig commented Mar 15, 2022

dckc commented Mar 15, 2022 • edited Loading

dckc commented Nov 2, 2022

warner commented Nov 3, 2022

mhofman commented Nov 14, 2022

dckc commented Nov 28, 2022

dckc commented Aug 27, 2021 •

edited

Loading

Tartuffo commented Jan 27, 2022 •

edited

Loading

erights commented Mar 14, 2022 via email •

edited by dckc

Loading

dckc commented Mar 15, 2022 •

edited

Loading