kvserver: implement raftMu-based raft log snapshot #131393

pav-kv · 2024-09-25T22:50:44Z

This PR implements the raft.LogStorageSnapshot interface in kvserver. The implementation behaves like a consistent log storage snapshot while Replica.raftMu is held, relying on the fact that all raft log storage writes are synchronized via this mutex.

Part of #130955

blathers-crl · 2024-09-25T22:50:48Z

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

pav-kv · 2024-09-25T23:33:54Z

@tbg @sumeerbhola This is a draft, but should be working. Only the last commit is interesting, others are mechanical / clean-ups. Could you please do an initial pass to make sure this makes sense?

The usage will be:

// --- in handleRaftReadyRaftMuLocked
r.mu.Lock()
// getting stuff from RawNode
r.withRaftGroupLocked(... {
	snap = raftGroup.LogSnapshot()
})
r.mu.Unlock()
// snap is still valid
// pass it to RACv2 via RaftEvent
// RACv2 calls snap.LogSlice() to construct appends
// pass them back to raft via "send MsgApp" method (next PR)

// --- in "I want to send appends" RACv2 handler (raftMu held) on raft scheduler
r.mu.Lock()
snap := LogSnapshot()
// read other relevant stuff from RawNode
r.mu.Unlock()
snap.LogSlice()
// pass the slices to raft via "send MsgApp" method (next PR)

NB: the replicaLogSnap type is an alias to Replica so that converting it to raft.LogStorageSnapshot interface does not incur an allocation.
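The allocation point in the NB can be illustrated with a small standalone sketch. It is not the code from this PR: `LogSnapshot`, the toy `Replica` with a single `lastIndex` field, and `asLogSnap` are hypothetical stand-ins. It only shows the Go mechanism: a type defined over `Replica` lets a `*Replica` be reinterpreted as a pointer to the snapshot type, and a pointer stored in an interface value does not require a new heap object.

```go
package main

import "fmt"

// LogSnapshot is a hypothetical, reduced stand-in for raft.LogStorageSnapshot.
type LogSnapshot interface {
	LastIndex() uint64
}

// Replica is a toy stand-in for kvserver.Replica.
type Replica struct {
	lastIndex uint64
}

// replicaLogSnap has the same memory layout as Replica but its own method set.
type replicaLogSnap Replica

// LastIndex reads through to the underlying Replica fields.
func (s *replicaLogSnap) LastIndex() uint64 { return s.lastIndex }

// asLogSnap reinterprets the *Replica pointer. No new object is created, so
// placing the result in an interface value needs no extra allocation.
func asLogSnap(r *Replica) LogSnapshot { return (*replicaLogSnap)(r) }

func main() {
	r := &Replica{lastIndex: 10}
	snap := asLogSnap(r)
	r.lastIndex = 11 // the "snapshot" is a view of the same Replica, not a copy
	fmt.Println(snap.LastIndex())
}
```

Because the snapshot is only a view, it stays consistent solely because writers are excluded while raftMu is held, which is the property the PR description relies on.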

tbg

The "intended usage" makes sense to me. As you can tell from my comments, they're mostly concerned with making the locking (both the one we already had in replicaRaftStorage and now the log storage snapshot) "obviously correct". I have a hunch that removing the storage methods from raft and passing them in where needed will get us most of the way there, and that this is not a prohibitive amount of work. I'd like us to at least discuss whether this is actually true, and if so, find a way to get this done in a way that doesn't slow RACv2 momentum (perhaps by merging a slight variation of this PR first and then doing the simplification, which should not change the integration points much). Happy to chat today.

pkg/kv/kvserver/replica_raftlog.go

tbg · 2024-09-26T08:06:12Z

+// replicaLogSnap implements the raft.LogStorageSnapshot interface.
+//
+// The type implements a limited version of a raft log storage snapshot, without
+// needing a storage engine snapshot. It relies on the fact that all raft writes


Related to my comment below - since raftMu being held throughout is what makes this a snapshot, should we rename this to replicaRaftMuLogSnap? I think that drives home nicely the crucial bit of information which I worry will be lost on readers or drive confusion, which is that this kind of snapshot is essentially an "optimization" over the "obvious" implementation which would have us grab a consistent iterator upon construction: instead we make use of the fact that raftMu prevents mutation and so any "actual" snapshotting can be elided. (We should talk about it like that too in the type comments etc, btw, unless I'm wrong).
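To make the contrast in this comment concrete, here is a minimal sketch that is not taken from the PR. The `logStore` type and its methods are hypothetical: `copySnapshot` plays the role of the "obvious" implementation that captures state at construction, while `raftMuSnapshot` elides the capture and is only valid while the caller holds `raftMu`.

```go
package main

import (
	"fmt"
	"sync"
)

// logStore is a toy log whose writes are all serialized by raftMu.
type logStore struct {
	raftMu  sync.Mutex
	entries []string
}

// appendEntry is the only write path and always takes raftMu.
func (s *logStore) appendEntry(e string) {
	s.raftMu.Lock()
	defer s.raftMu.Unlock()
	s.entries = append(s.entries, e)
}

// copySnapshot is the "obvious" snapshot: it captures the state up front, so
// the result stays valid after the lock is released.
func (s *logStore) copySnapshot() []string {
	s.raftMu.Lock()
	defer s.raftMu.Unlock()
	return append([]string(nil), s.entries...)
}

// raftMuSnapshot requires the caller to hold raftMu. It returns a view with no
// copying; the view is consistent only until the caller releases raftMu.
func (s *logStore) raftMuSnapshot() []string {
	return s.entries[:len(s.entries):len(s.entries)]
}

func main() {
	s := &logStore{}
	s.appendEntry("a")
	s.appendEntry("b")

	s.raftMu.Lock()
	view := s.raftMuSnapshot() // no writer can run until Unlock
	fmt.Println(len(view))
	s.raftMu.Unlock()
}
```

The sketch has no way to check that the caller really holds the mutex, which is the readability concern raised here: the correctness condition lives in a comment rather than in the types.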

The part that I still find tricky is that we're then "hiding" this raftMu-edness by going through these interfaces. We have to somehow make it obvious that raft has this interface method to get a snapshot, but that we know so much about the circumstances in which this will be called (raftMu is held throughout) that we can plop our "shim" implementation in that doesn't actually need real snapshotting.

I think there is something worth cleaning up here. The raft interface currently obfuscates what is going on rather than helping to explain it, and I would prefer to address this or at least understand how we could. I think this problem is similar to the one we have with all of the existing methods on raft.Storage, so there is a good payoff beyond just raft storage snapshots.

The core problem seems to be that currently, RawNode obtains the snapshot and we rely on our knowledge that it only calls this particular method in the LogStorage interface in methods that we call with certain mutexes held. If we always passed in the storage with each call that needs it, this problem would go away or at least become significantly more transparent.

For example, instead of rawNode.Ready() calling rawNode.storage.Entries() which in turn is our *Replica-level implementation that assumes certain locks are held by the caller of Ready, we would have something like

r.raftMu.Lock()
r.mu.Lock()
rd := rawNode.Ready(raftMuAndMuHeldRaftStorage(*Replica))
// ...
r.mu.Unlock()
r.raftMu.Unlock()

It also seems like a somewhat moderate clean-up that is mostly mechanical, but still has the potential to make this whole locking business vastly more understandable, perhaps to the point of it being "obviously correct".
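A fuller, runnable version of this idea might look like the sketch below. It is an illustration under assumptions, not the raft API: `LogStorage`, the toy `RawNode` with a `sent` counter, the extra `last` parameter to `Ready`, and `handleReady` are all hypothetical. The point it demonstrates is that the storage argument is constructed at the call site, right where both mutexes are visibly held.

```go
package main

import (
	"fmt"
	"sync"
)

// LogStorage is a hypothetical, reduced stand-in for the raft storage interface.
type LogStorage interface {
	Entries(lo, hi uint64) []string
}

// RawNode is a toy raft node that keeps no storage of its own.
type RawNode struct {
	sent uint64 // number of entries already handed out
}

// Ready returns the entries in [sent, last), reading only from the storage
// passed in by the caller.
func (n *RawNode) Ready(st LogStorage, last uint64) []string {
	ents := st.Entries(n.sent, last)
	n.sent = last
	return ents
}

// Replica is a toy stand-in for kvserver.Replica.
type Replica struct {
	raftMu sync.Mutex
	mu     sync.Mutex
	log    []string
	node   RawNode
}

// raftMuAndMuHeldStorage is a view of Replica whose name states its locking
// precondition: both raftMu and mu must be held by the caller.
type raftMuAndMuHeldStorage Replica

// Entries returns log entries in [lo, hi) without taking any locks.
func (s *raftMuAndMuHeldStorage) Entries(lo, hi uint64) []string {
	return s.log[lo:hi]
}

// handleReady takes both mutexes and hands raft a storage view for this call only.
func (r *Replica) handleReady() []string {
	r.raftMu.Lock()
	defer r.raftMu.Unlock()
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.node.Ready((*raftMuAndMuHeldStorage)(r), uint64(len(r.log)))
}

func main() {
	r := &Replica{log: []string{"a", "b", "c"}}
	fmt.Println(r.handleReady())      // first call hands out all three entries
	fmt.Println(len(r.handleReady())) // nothing new on the second call
}
```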

pav-kv · 2024-09-26

I agree that making things more explicit / less callback-ey would be great. I'll explore this in a follow-up after getting something working for RACv2.

There is a slight benefit in the current approach, in that there is one LogStorage object inside raft, so we have a guarantee that all snapshots are for the same object. By passing the snapshot into Ready, we would "denormalize" this (raft has LogStorage internally, but also receives snapshots from the caller, and they must match what it has internally). So if we flip the API to accept storage from outside, we should do it all the way: raft would have no storage to read from except when we tell it. Or we could take it to the next level and always call log storage methods on our end explicitly, with raft only instructing us which log slices it's interested in.

The heavy users of the log storage are MsgApp, local log appends, local state machine application. Eventually we want to flip control and pace all of them (via the enhanced Ready API or 3 parallel smaller ones). So flipping the storage interface ownership/passing fits.

After this, LogStorage can be removed from RawNode because nothing will read from it anymore. There are a few caveats, such as the uncontrolled scan that happens in campaigning steps, but that can be optimized out (will file an issue).
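The "raft only instructs us which log slices it's interested in" variant could be sketched roughly as follows. This is a hypothetical shape, not an API from this PR or from the raft package: `LogSpan`, `WantedSpans` and `Sent` are invented names, and followers are identified by plain strings.

```go
package main

import "fmt"

// LogSpan is a half-open index range [Lo, Hi) into the log (hypothetical type).
type LogSpan struct{ Lo, Hi uint64 }

// RawNode is a toy raft node with no storage handle at all.
type RawNode struct {
	last uint64            // index one past the last log entry
	next map[string]uint64 // per-follower index of the next entry to send
}

// WantedSpans reports, per follower, which span of the log should be sent next.
func (n *RawNode) WantedSpans() map[string]LogSpan {
	out := make(map[string]LogSpan)
	for peer, next := range n.next {
		if next < n.last {
			out[peer] = LogSpan{Lo: next, Hi: n.last}
		}
	}
	return out
}

// Sent records that the caller shipped the given span to the follower.
func (n *RawNode) Sent(peer string, span LogSpan) { n.next[peer] = span.Hi }

func main() {
	log := []string{"a", "b", "c"}
	n := &RawNode{last: 3, next: map[string]uint64{"f1": 1, "f2": 3}}
	for peer, span := range n.WantedSpans() {
		ents := log[span.Lo:span.Hi] // the caller, not raft, reads the log
		fmt.Println(peer, ents)
		n.Sent(peer, span)
	}
	fmt.Println(len(n.WantedSpans())) // no follower is behind anymore
}
```

In this shape every storage read happens in caller code, so the locks held during the read are visible at the read site rather than implied by which raft method was called.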

sumeerbhola

> Could you please do an initial pass to make sure this makes sense?

Looks fine to me from a usage perspective.

pav-kv · 2024-09-30T13:03:27Z

bors r=tbg,sumeerbhola

craig · 2024-09-30T13:31:51Z

Build succeeded:

pav-kv requested review from tbg and sumeerbhola September 25, 2024 23:34

tbg reviewed Sep 26, 2024

sumeerbhola requested a review from tbg September 26, 2024 17:43

sumeerbhola reviewed Sep 26, 2024

pav-kv added 3 commits September 28, 2024 18:57

kvserver: move raft log storage to a file

d41ac44

This commit mechanically moves the implementation of the raft LogStorage interface into a new file. Epic: none Release note: none

kvserver: introduce replicaLogStorage type alias

42a23b2

Epic: none Release note: none

kvserver: convention renames and commenting

4994648

Epic: none Release note: none

pav-kv force-pushed the raft-log-storage-snapshot branch from c41d4a7 to 3202012 Compare September 28, 2024 19:12

pav-kv force-pushed the raft-log-storage-snapshot branch from 3202012 to 61f7b3c Compare September 28, 2024 19:15

pav-kv marked this pull request as ready for review September 28, 2024 19:15

pav-kv requested a review from a team as a code owner September 28, 2024 19:15

tbg approved these changes Sep 30, 2024

craig bot merged commit 63bfd0e into cockroachdb:master Sep 30, 2024
23 checks passed

pav-kv deleted the raft-log-storage-snapshot branch September 30, 2024 13:45
