kvserver: implement raftMu-based raft log snapshot #131393
Conversation
🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf. It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?
@tbg @sumeerbhola This is a draft, but it should be working. Only the last commit is interesting; the others are mechanical clean-ups. Could you please do an initial pass to make sure this makes sense? The intended usage will be:

```go
// --- in handleRaftReadyRaftMuLocked
r.mu.Lock()
// get stuff from RawNode
r.withRaftGroupLocked(... {
	snap = raftGroup.LogSnapshot()
})
r.mu.Unlock()
// snap is still valid
// pass it to RACv2 via RaftEvent
// RACv2 calls snap.LogSlice() to construct appends
// pass them back to raft via the "send MsgApp" method (next PR)

// --- in the "I want to send appends" RACv2 handler (raftMu held) on the raft scheduler
r.mu.Lock()
snap := LogSnapshot()
// read other relevant stuff from RawNode
r.mu.Unlock()
snap.LogSlice()
// pass the slices to raft via the "send MsgApp" method (next PR)
```
The "intended usage" makes sense to me. As you can tell from my comments, they're mostly concerned with making the locking (both the one we already had in replicaRaftStorage and now the log storage snapshot) "obviously correct". I have a hunch that removing the storage methods from raft and passing them in where needed will get us most of the way there, and that this is not a prohibitive amount of work. I'd like us to at least discuss whether this is actually true, and if so, find a way to get this done without slowing RACv2 momentum (perhaps by merging a slight variation of this PR first and then doing the simplification, which should not change the integration points much). Happy to chat today.
pkg/kv/kvserver/replica_raftlog.go (outdated)

```go
// replicaLogSnap implements the raft.LogStorageSnapshot interface.
//
// The type implements a limited version of a raft log storage snapshot, without
// needing a storage engine snapshot. It relies on the fact that all raft writes
```
Related to my comment below: since raftMu being held throughout is what makes this a snapshot, should we rename this to replicaRaftMuLogSnap? I think that nicely drives home the crucial bit of information, which I worry will be lost on readers or cause confusion: this kind of snapshot is essentially an "optimization" over the "obvious" implementation, which would have us grab a consistent iterator upon construction. Instead we make use of the fact that raftMu prevents mutation, so any "actual" snapshotting can be elided. (We should talk about it like that in the type comments etc. too, btw, unless I'm wrong.)
The part that I still find tricky is that we're then "hiding" this raftMu-edness by going through these interfaces. We have to somehow make it obvious that raft has this interface method to get a snapshot, but that we know so much about the circumstances in which it will be called (raftMu is held throughout) that we can plop in our "shim" implementation, which doesn't actually need real snapshotting.

I think there is something worth cleaning up here. The raft interface currently obfuscates what is going on rather than helping to explain it, and I would prefer to address this or at least understand how we could. This problem is similar to the one we have with all of the existing methods on raft.Storage, so there is a good payoff beyond just raft storage snapshots.

The core problem seems to be that currently, RawNode obtains the snapshot, and we rely on our knowledge that it only calls this particular method of the LogStorage interface from methods that we call with certain mutexes held. If we always passed in the storage with each call that needs it, this problem would go away, or at least become significantly more transparent.
For example, instead of rawNode.Ready() calling rawNode.storage.Entries(), which in turn is our *Replica-level implementation that assumes certain locks are held by the caller of Ready, we would have something like:

```go
r.raftMu.Lock()
r.mu.Lock()
rd := rawNode.Ready(raftMuAndMuHeldRaftStorage(*Replica))
// ...
r.mu.Unlock()
r.raftMu.Unlock()
```
It also seems like a somewhat moderate clean-up that is mostly mechanical, but still has the potential to make this whole locking business vastly more understandable, perhaps to the point of it being "obviously correct".
I agree that making things more explicit / less callback-ey would be great; I'll explore this in a follow-up after getting something working for RACv2.

There is a slight benefit in the current approach, in that there is one LogStorage object inside raft, so we have a guarantee that all snapshots are for the same object. By passing the snapshot into Ready, we would "denormalize" this (raft has LogStorage internally, but also receives snapshots from the caller, and they must match what it has internally). So if we flip the API to accept storage from outside, we should do it all the way: raft would have no storage to read from except when we tell it. Or take it to the next level, and always call log storage methods on our end explicitly; raft would only instruct us which log slices it's interested in.

The heavy users of the log storage are MsgApp construction, local log appends, and local state machine application. Eventually we want to flip control and pace all of them (via the enhanced Ready API or 3 parallel smaller ones), so flipping the storage interface ownership/passing fits.

After this, LogStorage can be removed from RawNode because it won't have readers. There are a few caveats, like this uncontrolled scan happening in campaigning steps, but it can be optimized out (will file an issue).
> Could you please do an initial pass to make sure this makes sense?
Looks fine to me from a usage perspective.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @pav-kv and @tbg)
This commit mechanically moves the implementation of the raft LogStorage interface into a new file. Epic: none Release note: none
Force-pushed from c41d4a7 to 3202012.
This commit implements the raft.LogStorageSnapshot interface in kvserver. The implementation behaves like a consistent log storage snapshot while Replica.raftMu is held, relying on the fact that all raft log storage writes are synchronized via this mutex. Epic: none Release note: none
Force-pushed from 3202012 to 61f7b3c.
bors r=tbg,sumeerbhola
This PR implements the raft.LogStorageSnapshot interface in kvserver. The implementation behaves like a consistent log storage snapshot while Replica.raftMu is held, relying on the fact that all raft log storage writes are synchronized via this mutex.

Part of #130955