Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: fetch from log storage without holding Replica.mu #130955

Closed
pav-kv opened this issue Sep 18, 2024 · 1 comment
Closed

raft: fetch from log storage without holding Replica.mu #130955

pav-kv opened this issue Sep 18, 2024 · 1 comment
Assignees
Labels
A-kv Anything in KV that doesn't belong in a more specific category. A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-performance Perf of queries or internals. Solution not expected to change functional behavior.

Comments

@pav-kv
Copy link
Collaborator

pav-kv commented Sep 18, 2024

Raft fetches entries from log storage when:

  • Constructing Ready / MsgStorageApply containing the committed entries to be applied to the state machine.
  • Constructing MsgApp messages that leader wants to send to its peers. These mostly happen in Ready handler too, but can also happen when flushing the proposal buffer [kv: grow propBufArray on full, instead of flushing #128313].
  • There are corner cases when this needs to be done, e.g. the log scan when a node campaigns.

In all these cases, we hold Replica.mu. In most cases, we also hold Replica.raftMu. Replica.mu is considered narrow and should not be held while IO with the log/state-machine storage occurs (it has performance implications). This is possible to achieve.


To support Replica.mu-less log fetch, we should keep the following mental model in mind:

  • The raft.RawNode must always be accessed while holding Replica.mu (can be locked for reads if we access read-only methods).
  • All raft-related IO occurs under raftMu (except maybe asynchronous log fsyncs).
  • If we hold both mutexes, the RawNode is consistent with the storage.
  • If we release Replica.mu, but still hold raftMu, the storage is consistent with the RawNode state before the mu release.

This mental model enables us to build RawNode API for accessing the log without holding Replica.mu, while still holding raftMu.

The obstacle today is that the log fetch function (which loads a prefix from the storage, and stitches it together with the in-memory unstable suffix) is inside raft internals. It necessitates us to lock Replica.mu for calling it. To avoid the mu lock, we should externalize a log API so that this function be called while holding only raftMu, on a raft log snapshot obtained when we were previously holding both mu and raftMu.

Prototype: #130932
Related to #128779

Jira issue: CRDB-42305

@pav-kv pav-kv added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-replication Relating to Raft, consensus, and coordination. A-kv Anything in KV that doesn't belong in a more specific category. labels Sep 18, 2024
@pav-kv pav-kv self-assigned this Sep 18, 2024
craig bot pushed a commit that referenced this issue Sep 18, 2024
130980: raft: clean up raftLog entry slice indexing r=tbg a=pav-kv

This PR cleans up various `raftLog/unstable` methods, and converts the sub-slicing methods to `(lo, hi]` semantics instead of `[lo, hi)`.

All raft log slices are constructed in context of being appended after a certain index, so `(lo, hi]` addressing makes more sense, and simplifies a bunch of +-1 arithmetics.

Related to #130967, #130955

Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Sep 24, 2024
131041: kvserver: document raft Storage mental model r=nvanbenschoten a=pav-kv

This PR documents the use of `Replica.{raftMu,mu}` mutexes in `raft.{RawNode,Storage}`, and fixes a few minor things along the way.

Part of #130955

Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Sep 25, 2024
130967: raft: introduce LogSnapshot API r=tbg a=pav-kv

The API allows reading from raft log while `RawNode` continues making progress. Practically, this means we can unlock `Replica.mu` after having obtained the log snapshot, and can safely read from it while only holding `Replica.raftMu`.

To be used for:
- Fetching committed entries in `Ready` handler while only holding `raftMu` (instead of both `mu` and `raftMu`).
- Building leader->follower `MsgApp` messages on demand, driven by RACv2. Today these messages can be constructed (and incur IO) eagerly on any `Step` while holding `mu`.

Part of #128779
Part of #130955

Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Sep 30, 2024
131393: kvserver: implement raftMu-based raft log snapshot  r=tbg,sumeerbhola a=pav-kv

This PR implements the `raft.LogStorageSnapshot` interface in `kvserver`. The implementation behaves like a consistent log storage snapshot while `Replica.raftMu` is held, relying on the fact that all raft log storage writes are synchronized via this mutex.

Part of #130955

Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
@pav-kv pav-kv self-assigned this Oct 2, 2024
@pav-kv
Copy link
Collaborator Author

pav-kv commented Oct 2, 2024

This is done for RACv2 purposes. The LogSnapshot + SengMsgApp API allows reading from the log under raftMu only. There are more cases when raft pulls from storage under Replica.mu unrelated to RACv2, we may tackle them later.

@pav-kv pav-kv closed this as completed Oct 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv Anything in KV that doesn't belong in a more specific category. A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-performance Perf of queries or internals. Solution not expected to change functional behavior.
Projects
None yet
Development

No branches or pull requests

1 participant