
Conversation

lejeunerenard
Contributor

This PR is currently only the first step of moving the bitfield into the replicator, since the bitfield is only needed by the replicator for checking local vs. remote blocks.

This refactor will require updating the following known locations:

lib/core.js Outdated

this.replicator.onupgrade()
this.replicator.onhave(start, length, drop)
this.replicator.onupgrade()
Contributor

Should be before `onhave` so it doesn't signal out-of-bounds blocks

mafintosh and others added 20 commits October 7, 2025 13:32
Everywhere that updates or reads the bitfield now needs to be async, since
it goes to storage. As a result, many internal functions inherited the
bitfield's asynchrony.
Awaiting to reduce potential timing issues for reading from the bitfield.
`_updateNonPrimary()` reads from the bitfield to clamp the range
request via `clampRange()`. This clamping can hit a race condition where
the bitfield is updated by `onhave`'s `_setBitfieldRanges()` mid-read. This
causes the clamp to resolve the requests before the "primary" can
respond etc.
With the bitfield now asynchronous, the implicit batching of reads &
writes in the previous synchronous implementation no longer guarantees
that writes can't conflict with in-progress reads. To enforce that
operations are sequential, an internal lock was added, so that if
operations are called without awaiting, or are called between event
loops because of an external message etc., they will still not execute
simultaneously.

To protect against read & write operations interleaving when they are
intended to be sequential, an external lock was added to claim the
bitfield roughly per protocol message. Because this theoretically will
cover the internal lock scenario by keeping access to a single chain of
async calls, it might be possible to remove the internal lock in the
future.

Ideally these won't be necessary, but currently they solve the above
issues.
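The locking described above can be sketched as a promise-chain mutex: each caller queues behind the previous one, so operations stay sequential even when callers forget to await or arrive from separate protocol messages. This is an illustrative sketch, not the actual implementation — the class and function names here are hypothetical.

```javascript
// Minimal promise-chain mutex sketch. `Mutex` and `withLock` are
// hypothetical names; the real implementation may differ.
class Mutex {
  constructor () {
    this._tail = Promise.resolve()
  }

  // Resolves to a release function once the lock is held. Callers
  // queue in call order even if they never await in between.
  lock () {
    const prev = this._tail
    let release
    this._tail = new Promise(resolve => { release = resolve })
    return prev.then(() => release)
  }
}

// Serialize bitfield access so reads and writes never interleave.
async function withLock (mutex, fn) {
  const release = await mutex.lock()
  try {
    return await fn()
  } finally {
    release()
  }
}
```

A second, coarser mutex claimed around each protocol message handler gives the "external lock" behavior: the whole message's read/write chain completes before the next message's chain starts.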
Enabling locks here fixes the test `bigger download range` in
`test/replicate.js`. This test would flake when the download event
triggered after the request had already been resolved. This happened
because `_updateNonPrimary()` resolved the range request before the
primary processing could emit, due to a race condition where the
bitfield was updated mid-read, yielding a falsely clamped range.

The bitfield was primarily updated in `core.verify()` and setting a lock
around the entire `ondata` call chain enabled other `data` messages to
queue up instead of verifying blocks and updating the bitfield before
previous requests could respond.
Used for bitfield locks.
The `_request*` methods are assumed to be synchronous in the replication
state machine. To avoid converting the entire state machine to async,
the pages are loaded preemptively and checked synchronously.

This currently happens in `_requestSeek()` & `_requestRangeBlock()`.
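The preemptive loading can be sketched as a two-phase pattern: an async prefetch of the pages a request might touch, followed by purely synchronous bit checks against that snapshot inside the state machine. Everything here is an assumption for illustration — `PAGE_SIZE`, `prefetchPages`, `hasBlock`, and the `loadPage` callback are not the real bitfield API.

```javascript
// Hypothetical two-phase pattern: prefetch asynchronously, check
// synchronously. `loadPage` stands in for an async storage read.
const PAGE_SIZE = 32768 // assumed bits per page

async function prefetchPages (loadPage, start, end) {
  const pages = new Map()
  const first = Math.floor(start / PAGE_SIZE)
  const last = Math.floor(end / PAGE_SIZE)
  for (let i = first; i <= last; i++) {
    pages.set(i, await loadPage(i)) // async: may hit storage
  }
  return pages
}

// Synchronous check, safe to call from the replication state machine
// because every page it can touch was loaded up front.
function hasBlock (pages, index) {
  const page = pages.get(Math.floor(index / PAGE_SIZE))
  if (!page) return false
  const bit = index % PAGE_SIZE
  return (page[bit >> 3] & (1 << (bit & 7))) !== 0
}
```

The design choice is that only the prefetch step is async; the `_request*` methods then run against an in-memory snapshot, so the rest of the state machine stays synchronous.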
Ensures the bitfield remains unchanged while iterating within the want
call.
Part of the previous commit to await all `replicator.onupgrade()` calls.
Without this lock, the `_update()` call in the `onsync` event doesn't
know about the remote bitfield update. So while `onsync` doesn't access
the `localBitfield`, it does rely on updates to the `remoteBitfield`,
which happen along with the `localBitfield` updates.

Fixes the 'restore after cancelled block request' test, which could fail
because the sync after the new append wouldn't cause a `block` request,
since the `b` peer assumed the `a` peer didn't have the block. The test
waits for an append event (which doesn't guarantee the `upload` event on
`a` has been triggered) and the connection is destroyed afterwards.
Because `.broadcastRange()` called during `onopen` is now async, the
peer can be added to the replicator after a synchronous close is called
on the protomux channel. Since `onclose` assumed synchronous calls, it
expects the peer to already be added before it's closed. With the peer
not added yet, it isn't removed from the replicator and
`replicator.destroy()` will loop forever.

Added a destroy method to the bitfield for destroying the locks. Not
required for the fix, but it's reasonable to destroy them regardless.
Now that bitfield operations go to disk and not just to memory, they are
slower and need more time for larger numbers of blocks.
This caused timing errors when the bitfield attempted to read from
storage after it had been closed.
Now that checking the bitfield is async which makes iterating through
ranges async, the `_ranges` array can be mutated elsewhere while
awaiting. This means the current index of the range request can be
inaccurate when resolving the request.

To prevent this, the request index is looked up synchronously when the
request is resolved. This way the index is accurate and another request
isn't potentially clobbered by the popped head.

Since this logic already exists to unref the request (for GC'ing &
cancelling), `_unref()` is reused. A success boolean is needed to
update the index in the `_updateNonPrimary()` ranges loop, so all
request `_unref()`s are updated to return a success bool.
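The resolve-time lookup can be sketched as a swap-remove that reports whether it actually removed anything. The names mirror those in the text, but the body is illustrative, not the actual code:

```javascript
// Illustrative sketch: remove a request from a ranges array by looking
// up its index synchronously at resolve time, so an index captured
// before an `await` can never be stale. Returns a success bool, as the
// updated `_unref()`s do.
function unrefRange (ranges, req) {
  const i = ranges.indexOf(req) // fresh, synchronous lookup
  if (i === -1) return false    // already removed (gc'd or cancelled)
  const head = ranges.pop()     // swap-remove: O(1), order not preserved
  if (i < ranges.length) ranges[i] = head // move popped head into the gap
  return true
}
```

Because the index is recomputed at the moment of removal, a range popped by a concurrent mutation can no longer clobber an unrelated request, and the `false` return lets the caller keep its loop index accurate.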