-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
raft: fetch from log storage without holding Replica.mu #130955
Labels
A-kv
Anything in KV that doesn't belong in a more specific category.
A-kv-replication
Relating to Raft, consensus, and coordination.
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
C-performance
Perf of queries or internals. Solution not expected to change functional behavior.
Comments
This was referenced Sep 18, 2024
craig bot
pushed a commit
that referenced
this issue
Sep 18, 2024
130980: raft: clean up raftLog entry slice indexing r=tbg a=pav-kv This PR cleans up various `raftLog/unstable` methods, and converts the sub-slicing methods to `(lo, hi]` semantics instead of `[lo, hi)`. All raft log slices are constructed in context of being appended after a certain index, so `(lo, hi]` addressing makes more sense, and simplifies a bunch of +-1 arithmetics. Related to #130967, #130955 Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
This was referenced Sep 20, 2024
craig bot
pushed a commit
that referenced
this issue
Sep 24, 2024
131041: kvserver: document raft Storage mental model r=nvanbenschoten a=pav-kv This PR documents the use of `Replica.{raftMu,mu}` mutexes in `raft.{RawNode,Storage}`, and fixes a few minor things along the way. Part of #130955 Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
craig bot
pushed a commit
that referenced
this issue
Sep 25, 2024
130967: raft: introduce LogSnapshot API r=tbg a=pav-kv The API allows reading from raft log while `RawNode` continues making progress. Practically, this means we can unlock `Replica.mu` after having obtained the log snapshot, and can safely read from it while only holding `Replica.raftMu`. To be used for: - Fetching committed entries in `Ready` handler while only holding `raftMu` (instead of both `mu` and `raftMu`). - Building leader->follower `MsgApp` messages on demand, driven by RACv2. Today these messages can be constructed (and incur IO) eagerly on any `Step` while holding `mu`. Part of #128779 Part of #130955 Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
craig bot
pushed a commit
that referenced
this issue
Sep 30, 2024
131393: kvserver: implement raftMu-based raft log snapshot r=tbg,sumeerbhola a=pav-kv This PR implements the `raft.LogStorageSnapshot` interface in `kvserver`. The implementation behaves like a consistent log storage snapshot while `Replica.raftMu` is held, relying on the fact that all raft log storage writes are synchronized via this mutex. Part of #130955 Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
This is done for RACv2 purposes. The |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-kv
Anything in KV that doesn't belong in a more specific category.
A-kv-replication
Relating to Raft, consensus, and coordination.
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
C-performance
Perf of queries or internals. Solution not expected to change functional behavior.
Raft fetches entries from log storage when:
Ready
/MsgStorageApply
containing the committed entries to be applied to the state machine.MsgApp
messages that leader wants to send to its peers. These mostly happen in Ready handler too, but can also happen when flushing the proposal buffer [kv: grow propBufArray on full, instead of flushing #128313].In all these cases, we hold
Replica.mu
. In most cases, we also holdReplica.raftMu
.Replica.mu
is considered narrow and should not be held while IO with the log/state-machine storage occurs (it has performance implications). This is possible to achieve.To support
Replica.mu
-less log fetch, we should keep the following mental model in mind:raft.RawNode
must always be accessed while holdingReplica.mu
(can be locked for reads if we access read-only methods).raftMu
(except maybe asynchronous log fsyncs).RawNode
is consistent with the storage.Replica.mu
, but still holdraftMu
, the storage is consistent with theRawNode
state before themu
release.This mental model enables us to build
RawNode
API for accessing the log without holdingReplica.mu
, while still holdingraftMu
.The obstacle today is that the log fetch function (which loads a prefix from the storage, and stitches it together with the in-memory
unstable
suffix) is insideraft
internals. It necessitates us to lockReplica.mu
for calling it. To avoid themu
lock, we should externalize a log API so that this function be called while holding onlyraftMu
, on a raft log snapshot obtained when we were previously holding bothmu
andraftMu
.Prototype: #130932
Related to #128779
Jira issue: CRDB-42305
The text was updated successfully, but these errors were encountered: