Interrupted branch exploration prevents any future syncing of that branch #23

Closed
hsanjuan opened this issue Feb 18, 2020 · 1 comment
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@hsanjuan
Collaborator

When a new head is received, syncing of that branch proceeds until the DAGSyncer returns HasBlock() = true for the current block.

This means that the block we want to process is already local, so we can assume that it was already processed in the past (along with everything below).

However, it may be that we did not finish fully processing a branch, and while we have some of the newer blocks, we never fetched the old ones.
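A rough sketch of the walk described above (illustrative names only, not the actual go-ds-crdt code; the dagSyncer interface and the process callback are assumptions):

```go
package crdtsketch

import (
	"context"

	cid "github.com/ipfs/go-cid"
	ipld "github.com/ipfs/go-ipld-format"
)

// dagSyncer loosely mirrors the interface described in this issue:
// a node getter plus a check against the local blockstore.
type dagSyncer interface {
	ipld.NodeGetter
	HasBlock(c cid.Cid) (bool, error)
}

// walkBranch sketches the current behaviour: the walk stops as soon as a
// block is local, so a branch whose lower blocks were never fetched is
// never completed.
func walkBranch(ctx context.Context, dsync dagSyncer, c cid.Cid, process func(ipld.Node) error) error {
	have, err := dsync.HasBlock(c)
	if err != nil {
		return err
	}
	if have {
		// Already local: assumed to be already processed, along with
		// everything below it. This is the assumption that breaks when a
		// previous branch exploration was interrupted.
		return nil
	}
	nd, err := dsync.Get(ctx, c) // may fetch from the network
	if err != nil {
		return err
	}
	if err := process(nd); err != nil {
		return err
	}
	for _, l := range nd.Links() {
		if err := walkBranch(ctx, dsync, l.Cid, process); err != nil {
			return err
		}
	}
	return nil
}
```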

Raw thinking: Currently:

  • There is no way to force a re-walk of the full DAG.
  • There is no way to know that we really have the full DAG under a given block (and therefore we've no need to walk it).
  • Batching the walked nodes and only committing on success does not work for really large syncs (it would require really large in-memory batches).
  • Marking blocks as "we really have everything below" does not work very well (pretty much every block would need to be marked because branches can join the DAG at any point), and we would need to walk back the branches and do the marking once we are finished.
  • We can have a regular "check" thread that re-walks the DAG, potentially trying to complete any pending branches, but it should probably not interfere with other syncing (see the sketch after this list).
    • Once it finishes successfully, it can mark the current heads as green: any DAG built on top of them will not have to sync to the bottom. It could also run only when syncing errors have happened.
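A minimal sketch of that last idea, a periodic check that only kicks in after syncing errors (all names here, isDirty, repairDAG and markClean, are hypothetical):

```go
package crdtsketch

import (
	"context"
	"time"
)

// repairLoop re-walks the DAG every interval, but only when previous
// syncing errors left the store in a "dirty" state. It stays out of the
// way of normal syncing by doing nothing while the store is clean.
func repairLoop(ctx context.Context, interval time.Duration,
	isDirty func() bool, repairDAG func(context.Context) error, markClean func()) {

	if interval == 0 {
		return // periodic checking disabled
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if !isDirty() {
				continue // no syncing errors recorded; nothing to do
			}
			if err := repairDAG(ctx); err != nil {
				continue // try again on the next tick
			}
			// The walk reached the bottom: the current heads can be
			// considered "green" and the dirty state cleared.
			markClean()
		}
	}
}
```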
@hsanjuan hsanjuan changed the title Interrupted branch exploration prevents any future syncing Interrupted branch exploration prevents any future syncing of that branch Feb 18, 2020
@hsanjuan hsanjuan added the kind/bug A bug in existing code (including security flaws) label Feb 18, 2020
hsanjuan added a commit that referenced this issue Feb 1, 2022
* Allow self-repair of the store

Per issue #23, any error while processing a branch of the DAG results in that
branch never being processed again, since we already have its blocks in the
blockstore and we assume that every block we have has been processed.

This PR adds a repair goroutine that runs whenever the crdt-store is marked as dirty.

The crdt-store gets marked as dirty when errors occur while "processing
nodes" or "sending new jobs" (which usually means fetching nodes from the network).

This task, which runs every RepairInterval, starts walking down the DAG from
the heads and retriggers re-processing for every node that was not processed.

It does not stop until the bottom has been reached.

In order to know which DAG nodes have been processed and which ones haven't,
we need to explicitly store the list of processed nodes (which is different
from the list of nodes that we have). A node becomes processed as soon as its
delta has been merged.

Repair can be an expensive operation, so we should avoid marking the
datastore as dirty too easily, which may currently happen when we shut down
while still processing.

Once repair reaches the bottom, the dirty bit is cleared.

Datastores created with previous versions will reprocess the whole store on
their first run, as they need to mark all the nodes in the DAG as processed.
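A rough sketch of that bookkeeping (assuming the context-aware go-datastore interface; the key layout and helper names are illustrative, not the store's actual namespace):

```go
package crdtsketch

import (
	"context"

	cid "github.com/ipfs/go-cid"
	ds "github.com/ipfs/go-datastore"
)

var (
	processedNS = ds.NewKey("/processed") // one key per processed node
	dirtyKey    = ds.NewKey("/dirty")     // present while errors are outstanding
)

type bookkeeper struct {
	store ds.Datastore
}

// markProcessed records a node as processed as soon as its delta has been
// merged; merely having the block is no longer treated as "processed".
func (b *bookkeeper) markProcessed(ctx context.Context, c cid.Cid) error {
	return b.store.Put(ctx, processedNS.ChildString(c.String()), []byte{1})
}

func (b *bookkeeper) isProcessed(ctx context.Context, c cid.Cid) (bool, error) {
	return b.store.Has(ctx, processedNS.ChildString(c.String()))
}

// markDirty and markClean maintain the flag that triggers (and stops) the
// repair task.
func (b *bookkeeper) markDirty(ctx context.Context) error {
	return b.store.Put(ctx, dirtyKey, []byte{1})
}

func (b *bookkeeper) markClean(ctx context.Context) error {
	return b.store.Delete(ctx, dirtyKey)
}

func (b *bookkeeper) isDirty(ctx context.Context) (bool, error) {
	return b.store.Has(ctx, dirtyKey)
}
```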

* DAGSyncer can become a DAGService

It is no longer necessary to check whether a block is in the blockstore, as we
keep track of processed nodes separately.
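Approximately, the requirement goes from the first shape below to a plain ipld.DAGService (the exact signatures are an assumption, not copied from the library):

```go
package crdtsketch

import (
	cid "github.com/ipfs/go-cid"
	ipld "github.com/ipfs/go-ipld-format"
)

// Old shape (approximate): a DAGService plus a blockstore check that
// doubled as "this block was already processed".
type DAGSyncer interface {
	ipld.DAGService
	HasBlock(c cid.Cid) (bool, error)
}

// New shape: processed nodes are tracked in the datastore instead, so a
// plain DAGService is all that is needed.
type dagService = ipld.DAGService
```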

* Allow disabling auto-repair when interval set to 0

* Add public Repair method and improve repair logic and logging

* Improve node processing logic and queue caching for deduplication

I believe nodes could be removed from the queue by sendJobs and then
re-added to the queue before they were processed, potentially resulting in
several workers reprocessing the same nodes.
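A sketch of the deduplication idea (the jobQueue type and its method names are made up for illustration): a set of CIDs that are queued or in flight prevents the same node from being handed to several workers at once.

```go
package crdtsketch

import (
	"sync"

	cid "github.com/ipfs/go-cid"
)

// jobQueue deduplicates work: a CID can only sit in the queue (or be in
// flight with a worker) once at any given time.
type jobQueue struct {
	mu     sync.Mutex
	queued map[cid.Cid]struct{}
	jobs   chan cid.Cid
}

func newJobQueue(size int) *jobQueue {
	return &jobQueue{
		queued: make(map[cid.Cid]struct{}),
		jobs:   make(chan cid.Cid, size),
	}
}

// enqueue adds a CID only if it is not already queued or being processed.
func (q *jobQueue) enqueue(c cid.Cid) {
	q.mu.Lock()
	if _, ok := q.queued[c]; ok {
		q.mu.Unlock()
		return // already queued or in flight; do not hand it out twice
	}
	q.queued[c] = struct{}{}
	q.mu.Unlock()
	q.jobs <- c
}

// done is called by a worker once the node's delta has been merged, so the
// CID can be re-enqueued later if needed (for example by repair).
func (q *jobQueue) done(c cid.Cid) {
	q.mu.Lock()
	delete(q.queued, c)
	q.mu.Unlock()
}
```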

* Update globaldb example

* Enable processing of multiple heads in parallel

Now that the improved processing logic keeps branches from constantly
stepping on each other's feet, we can re-enable processing multiple heads in
parallel and take advantage of the multiple workers that normally sit idle
while a single head is processed.
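A sketch of fanning heads out to a bounded worker pool (processHead stands in for walking one branch; all names are illustrative):

```go
package crdtsketch

import (
	"context"
	"sync"

	cid "github.com/ipfs/go-cid"
)

// processHeads walks several heads concurrently, bounded by the number of
// workers. Any branch error is reported so the store can be marked dirty
// and repaired later.
func processHeads(ctx context.Context, heads []cid.Cid, workers int,
	processHead func(context.Context, cid.Cid) error) error {

	if workers < 1 {
		workers = 1
	}
	sem := make(chan struct{}, workers) // limits concurrent branches
	errCh := make(chan error, len(heads))
	var wg sync.WaitGroup

	for _, h := range heads {
		wg.Add(1)
		sem <- struct{}{}
		go func(h cid.Cid) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := processHead(ctx, h); err != nil {
				errCh <- err
			}
		}(h)
	}
	wg.Wait()
	close(errCh)
	return <-errCh // first error, or nil if every branch completed
}
```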

* Keep improving the repair process and the syncing

Ensure that re-processed branches are setting the right heads.

Tighten up logic to avoid duplication of processed nodes.

* Do not rebroadcast anything if there are no heads to rebroadcast

* Fix repair notifications

* Add helper functions to the processedBlock namespace

* Repair: Just keep a full set of visited nodes while walking

In memory, unfortunately.
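A sketch of that walk with the in-memory visited set (isProcessed and enqueue correspond to the bookkeeping and queueing sketched earlier; all names are illustrative):

```go
package crdtsketch

import (
	"context"

	cid "github.com/ipfs/go-cid"
	ipld "github.com/ipfs/go-ipld-format"
)

// repairWalk walks down from the heads, re-enqueueing every node that was
// never processed. The visited set lives in memory and keeps shared
// sub-branches from being walked more than once per repair run.
func repairWalk(ctx context.Context, getter ipld.NodeGetter, heads []cid.Cid,
	isProcessed func(cid.Cid) (bool, error), enqueue func(cid.Cid)) error {

	visited := make(map[cid.Cid]struct{})
	queue := append([]cid.Cid(nil), heads...)
	for len(queue) > 0 {
		c := queue[0]
		queue = queue[1:]
		if _, ok := visited[c]; ok {
			continue
		}
		visited[c] = struct{}{}

		done, err := isProcessed(c)
		if err != nil {
			return err
		}
		if !done {
			enqueue(c) // retrigger processing of this node
		}
		nd, err := getter.Get(ctx, c) // the block should already be local
		if err != nil {
			return err
		}
		for _, l := range nd.Links() {
			queue = append(queue, l.Cid)
		}
	}
	return nil
}
```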
@hsanjuan
Collaborator Author

This was fixed by #144
