Interrupted branch exploration prevents any future syncing of that branch #23

Closed
hsanjuan opened this issue Feb 18, 2020 · 1 comment
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@hsanjuan
Collaborator

When a new head is received, syncing of that branch proceeds until the DAGSyncer returns HasBlock() = true for the current block.

This means that the block we want to process is already local, so we can assume that it was already processed in the past (along with everything below).

However, it may be that we did not finish fully processing a branch, and while we have some of the newer blocks, we never fetched the old ones.
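A rough sketch of the walk described above (illustrative names only, not the actual go-ds-crdt code; the dagSyncer interface and the process callback are assumptions):

```go
package crdtsketch

import (
	"context"

	cid "github.com/ipfs/go-cid"
	ipld "github.com/ipfs/go-ipld-format"
)

// dagSyncer loosely mirrors the interface described in this issue:
// a node getter plus a check against the local blockstore.
type dagSyncer interface {
	ipld.NodeGetter
	HasBlock(c cid.Cid) (bool, error)
}

// walkBranch sketches the current behaviour: the walk stops as soon as a
// block is local, so a branch whose lower blocks were never fetched is
// never completed.
func walkBranch(ctx context.Context, dsync dagSyncer, c cid.Cid, process func(ipld.Node) error) error {
	have, err := dsync.HasBlock(c)
	if err != nil {
		return err
	}
	if have {
		// Already local: assumed to be already processed, along with
		// everything below it. This is the assumption that breaks when a
		// previous branch exploration was interrupted.
		return nil
	}
	nd, err := dsync.Get(ctx, c) // may fetch from the network
	if err != nil {
		return err
	}
	if err := process(nd); err != nil {
		return err
	}
	for _, l := range nd.Links() {
		if err := walkBranch(ctx, dsync, l.Cid, process); err != nil {
			return err
		}
	}
	return nil
}
```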

Raw thinking: Currently:

  • There is no way to force a re-walk of the full DAG.
  • There is no way to know that we really have the full DAG under a given block (and therefore we've no need to walk it).
  • Batching the walked nodes and only committing on success does not work for really large syncs (it would require really large in-memory batches).
  • Marking blocks as "we really have everything below" does not work very well (pretty much every block would need to be marked because branches can join the DAG at any point), and we would need to walk back the branches and do the marking once we are finished.
  • We can have a regular "check" thread that re-walks the DAG, potentially trying to complete any pending branches, but it should probably not interfere with other syncing (see the sketch after this list).
    • Once it finishes successfully, it can mark the current heads as green: any DAG built on top of them will not have to sync to the bottom. It could also run only when syncing errors have happened.
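A minimal sketch of that last idea, a periodic check that only kicks in after syncing errors (all names here, isDirty, repairDAG and markClean, are hypothetical):

```go
package crdtsketch

import (
	"context"
	"time"
)

// repairLoop re-walks the DAG every interval, but only when previous
// syncing errors left the store in a "dirty" state. It stays out of the
// way of normal syncing by doing nothing while the store is clean.
func repairLoop(ctx context.Context, interval time.Duration,
	isDirty func() bool, repairDAG func(context.Context) error, markClean func()) {

	if interval == 0 {
		return // periodic checking disabled
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if !isDirty() {
				continue // no syncing errors recorded; nothing to do
			}
			if err := repairDAG(ctx); err != nil {
				continue // try again on the next tick
			}
			// The walk reached the bottom: the current heads can be
			// considered "green" and the dirty state cleared.
			markClean()
		}
	}
}
```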
@hsanjuan hsanjuan changed the title Interrupted branch exploration prevents any future syncing Interrupted branch exploration prevents any future syncing of that branch Feb 18, 2020
@hsanjuan hsanjuan added the kind/bug A bug in existing code (including security flaws) label Feb 18, 2020
hsanjuan added a commit that referenced this issue Feb 1, 2022
* Allow self-repair of the store

Per issue #23, any error while processing a branch of the DAG results in that
branch never being processed again, since we already have its blocks in the
blockstore and we assume that every block we have has been processed.

This PR adds a repair goroutine that runs whenever the crdt-store is marked as dirty.

The crdt-store gets marked as dirty when errors occur while "processing
nodes" or "sending new jobs" (which usually means fetching nodes from the network).

This task, which runs every RepairInterval, starts walking down the DAG from
the heads and retriggers re-processing for every node that was not processed.

It does not stop until the bottom has been reached.

In order to know which DAG nodes have been processed and which ones haven't,
we need to explicitly store the list of processed nodes (which is different
from the list of nodes that we have). A node becomes processed as soon as its
delta has been merged.

Repair can be an expensive operation, so we should avoid marking the
datastore as dirty too easily, which may currently happen when we shut down
while still processing.

Once repair reaches the bottom, the dirty bit is cleared.

Datastores created with previous versions will reprocess the whole store on
their first run, as they need to mark all the nodes in the DAG as processed.
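A rough sketch of that bookkeeping (assuming the context-aware go-datastore interface; the key layout and helper names are illustrative, not the store's actual namespace):

```go
package crdtsketch

import (
	"context"

	cid "github.com/ipfs/go-cid"
	ds "github.com/ipfs/go-datastore"
)

var (
	processedNS = ds.NewKey("/processed") // one key per processed node
	dirtyKey    = ds.NewKey("/dirty")     // present while errors are outstanding
)

type bookkeeper struct {
	store ds.Datastore
}

// markProcessed records a node as processed as soon as its delta has been
// merged; merely having the block is no longer treated as "processed".
func (b *bookkeeper) markProcessed(ctx context.Context, c cid.Cid) error {
	return b.store.Put(ctx, processedNS.ChildString(c.String()), []byte{1})
}

func (b *bookkeeper) isProcessed(ctx context.Context, c cid.Cid) (bool, error) {
	return b.store.Has(ctx, processedNS.ChildString(c.String()))
}

// markDirty and markClean maintain the flag that triggers (and stops) the
// repair task.
func (b *bookkeeper) markDirty(ctx context.Context) error {
	return b.store.Put(ctx, dirtyKey, []byte{1})
}

func (b *bookkeeper) markClean(ctx context.Context) error {
	return b.store.Delete(ctx, dirtyKey)
}

func (b *bookkeeper) isDirty(ctx context.Context) (bool, error) {
	return b.store.Has(ctx, dirtyKey)
}
```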

* DAGSyncer can become a DAGService

It is no longer necessary to check whether a block is in the blockstore, as we
keep track of processed nodes separately.
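Approximately, the requirement goes from the first shape below to a plain ipld.DAGService (the exact signatures are an assumption, not copied from the library):

```go
package crdtsketch

import (
	cid "github.com/ipfs/go-cid"
	ipld "github.com/ipfs/go-ipld-format"
)

// Old shape (approximate): a DAGService plus a blockstore check that
// doubled as "this block was already processed".
type DAGSyncer interface {
	ipld.DAGService
	HasBlock(c cid.Cid) (bool, error)
}

// New shape: processed nodes are tracked in the datastore instead, so a
// plain DAGService is all that is needed.
type dagService = ipld.DAGService
```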

* Allow disabling auto-repair when interval set to 0

* Add public Repair method and improve repair logic and logging

* Improve node processing logic and queue caching for deduplication

I believe nodes could be removed from the queue by sendJobs and then
re-added to the queue before they were processed, potentially resulting in
several workers reprocessing the same nodes.
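A sketch of the deduplication idea (the jobQueue type and its method names are made up for illustration): a set of CIDs that are queued or in flight prevents the same node from being handed to several workers at once.

```go
package crdtsketch

import (
	"sync"

	cid "github.com/ipfs/go-cid"
)

// jobQueue deduplicates work: a CID can only sit in the queue (or be in
// flight with a worker) once at any given time.
type jobQueue struct {
	mu     sync.Mutex
	queued map[cid.Cid]struct{}
	jobs   chan cid.Cid
}

func newJobQueue(size int) *jobQueue {
	return &jobQueue{
		queued: make(map[cid.Cid]struct{}),
		jobs:   make(chan cid.Cid, size),
	}
}

// enqueue adds a CID only if it is not already queued or being processed.
func (q *jobQueue) enqueue(c cid.Cid) {
	q.mu.Lock()
	if _, ok := q.queued[c]; ok {
		q.mu.Unlock()
		return // already queued or in flight; do not hand it out twice
	}
	q.queued[c] = struct{}{}
	q.mu.Unlock()
	q.jobs <- c
}

// done is called by a worker once the node's delta has been merged, so the
// CID can be re-enqueued later if needed (for example by repair).
func (q *jobQueue) done(c cid.Cid) {
	q.mu.Lock()
	delete(q.queued, c)
	q.mu.Unlock()
}
```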

* Update globaldb example

* Enable processing of multiple heads in parallel

Now that the improved processing logic keeps branches from constantly
stepping on each other's feet, we can re-enable processing multiple heads in
parallel and take advantage of the multiple workers that normally sit idle
while a single head is processed.
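A sketch of fanning heads out to a bounded worker pool (processHead stands in for walking one branch; all names are illustrative):

```go
package crdtsketch

import (
	"context"
	"sync"

	cid "github.com/ipfs/go-cid"
)

// processHeads walks several heads concurrently, bounded by the number of
// workers. Any branch error is reported so the store can be marked dirty
// and repaired later.
func processHeads(ctx context.Context, heads []cid.Cid, workers int,
	processHead func(context.Context, cid.Cid) error) error {

	if workers < 1 {
		workers = 1
	}
	sem := make(chan struct{}, workers) // limits concurrent branches
	errCh := make(chan error, len(heads))
	var wg sync.WaitGroup

	for _, h := range heads {
		wg.Add(1)
		sem <- struct{}{}
		go func(h cid.Cid) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := processHead(ctx, h); err != nil {
				errCh <- err
			}
		}(h)
	}
	wg.Wait()
	close(errCh)
	return <-errCh // first error, or nil if every branch completed
}
```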

* Keep improving the repair process and the syncing

Ensure that re-processed branches are setting the right heads.

Tighten up logic to avoid duplication of processed nodes.

* Do not rebroadcast anything if there are no heads to rebroadcast

* Fix repair notifications

* Add helper functions to the processedBlock namespace

* Repair: Just keep a full set of visited nodes while walking

In memory, unfortunately.
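A sketch of that walk with the in-memory visited set (isProcessed and enqueue correspond to the bookkeeping and queueing sketched earlier; all names are illustrative):

```go
package crdtsketch

import (
	"context"

	cid "github.com/ipfs/go-cid"
	ipld "github.com/ipfs/go-ipld-format"
)

// repairWalk walks down from the heads, re-enqueueing every node that was
// never processed. The visited set lives in memory and keeps shared
// sub-branches from being walked more than once per repair run.
func repairWalk(ctx context.Context, getter ipld.NodeGetter, heads []cid.Cid,
	isProcessed func(cid.Cid) (bool, error), enqueue func(cid.Cid)) error {

	visited := make(map[cid.Cid]struct{})
	queue := append([]cid.Cid(nil), heads...)
	for len(queue) > 0 {
		c := queue[0]
		queue = queue[1:]
		if _, ok := visited[c]; ok {
			continue
		}
		visited[c] = struct{}{}

		done, err := isProcessed(c)
		if err != nil {
			return err
		}
		if !done {
			enqueue(c) // retrigger processing of this node
		}
		nd, err := getter.Get(ctx, c) // the block should already be local
		if err != nil {
			return err
		}
		for _, l := range nd.Links() {
			queue = append(queue, l.Cid)
		}
	}
	return nil
}
```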
@hsanjuan
Collaborator Author

This was fixed by #144
