Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: implement raftMu-based raft log snapshot #131393

Merged
merged 4 commits into from
Sep 30, 2024

Conversation

pav-kv
Copy link
Collaborator

@pav-kv pav-kv commented Sep 25, 2024

This PR implements the raft.LogStorageSnapshot interface in kvserver. The implementation behaves like a consistent log storage snapshot while Replica.raftMu is held, relying on the fact that all raft log storage writes are synchronized via this mutex.

Part of #130955

Copy link

blathers-crl bot commented Sep 25, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@pav-kv
Copy link
Collaborator Author

pav-kv commented Sep 25, 2024

@tbg @sumeerbhola This is a draft, but should be working. Only the last commit is interesting, others are mechanical / clean-ups. Could you please do an initial pass to make sure this makes sense?

The usage will be:

// --- in handleRaftReadyRaftMuLocked
r.mu.Lock()
// getting stuff from RawNode
r.withRaftGroupLocked(... {
	snap = raftGroup.LogSnapshot()
})
r.mu.Unlock()
// snap is still valid
// pass it to RACv2 via RaftEvent
// RACv2 calls snap.LogSlice() to construct appends
// pass them back to raft via "send MsgApp" method (next PR)

// --- in "I want to send appends" RACv2 handler (raftMu held) on raft scheduler
r.mu.Lock()
snap := LogSnapshot()
// read other relevant stuff from RawNode
r.mu.Unlock()
snap.LogSlice()
// pass the slices to raft via "send MsgApp" method (next PR)

NB: the replicaLogSnap type is an alias to Replica so that converting it to raft.LogStorageSnapshot interface does not incur an allocation.

@pav-kv pav-kv requested review from tbg and sumeerbhola September 25, 2024 23:34
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "intended usage" makes sense to me. As you can tell from my comments, they're mostly concerned with making the locking (both the one we already had in replicaRaftStorage and now the log storage snapshot) "obviously correct". I have hunch that removing the storage methods from raft and passing them in where needed will get us most of the way there, and that this is not a prohibitive amount of work. I'd like us to at least discuss whether this is actually true, and if so, find a way to get this done in a way that doesn't slow RACv2 momentum (perhaps by merging a slight variation of this PR first and then doing the simplification, which should not change the integration points much). Happy to chat today.

// replicaLogSnap implements the raft.LogStorageSnapshot interface.
//
// The type implements a limited version of a raft log storage snapshot, without
// needing a storage engine snapshot. It relies on the fact that all raft writes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to my comment below - since raftMu being held throughout is what makes this a snapshot, should we rename this to replicaRaftMuLogSnap? I think that drives home nicely the crucial bit of information which I worry will be lost on readers or drive confusion, which is that this kind of snapshot is essentially an "optimization" over the "obvious" implementation which would have us grab a consistent iterator upon construction: instead we make use of the fact that raftMu prevents mutation and so any "actual" snapshotting can be elided. (We should talk about it like that too in the type comments etc, btw, unless I'm wrong).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The part that I still find tricky is that we're then "hiding" this raftMu-edness by going through these interfaces. We have to somehow make it obvious that raft has this interface method to get a snapshot, but that we know so much about the circumstances in which this will be called (raftMu is held throughout) that we can plop our "shim" implementation in that doesn't actually need real snapshotting.

I think there is something worth cleaning up here. The raft interface currently is obfuscating what is going on, rather than help explain it and I would prefer to address this or at least understand how we could. I think this problem is similar to the one we have with all of the existing methods on raft.Storage so there is a good payoff beyond just raft storage snapshots.

The core problem seems to be that currently, RawNode obtains the snapshot and we rely on our knowledge that it only calls this particular method in the LogStorage interface in methods that we call with certain mutexes held. If we always passed in the storage with each call that needs it, this problem would go away or at least become significantly more transparent.

For example, instead of rawNode.Ready() calling rawNode.storage.Entries() which in turn is our *Replica-level implementation that assumes certain locks are held by the caller of Ready, we would have something like

r.raftMu.Lock()
r.mu.Lock()

rd := rawNode.Ready(raftMuAndMuHeldRaftStorage(*Replica))
// ...

r.mu.Unlock()
r.raftMu.Unlock()

It also seems like a somewhat moderate clean-up that is mostly mechanical, but still has the potential to make this whole locking business vastly more understandable, perhaps to the point of it being "obviously correct".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that making things more explicit / less callback-ey would be great, I'll explore in follow-up after getting something working for RACv2.

There is a slight benefit in the current approach, in that there is one LogStorage object inside raft, so we have a guarantee that all snapshots are for the same object. By passing the snapshot into Ready, we would "denormalize" this (raft has LogStorage internally, but also receives snapshots from the caller - and they must match what it has internally). So if we flip the API to accept storage from outside, we should do it all the way: raft would have no storage to read from except when we tell it. Or bring it to next level, and always call log storage methods on our end explicitly, and raft would only instruct us which log slices it's interested in.

The heavy users of the log storage are MsgApp, local log appends, local state machine application. Eventually we want to flip control and pace all of them (via the enhanced Ready API or 3 parallel smaller ones). So flipping the storage interface ownership/passing fits.

After this, LogStorage can be removed from RawNode because it won't have readers from it. There are a few caveats like this uncontrolled scan happening in campaigning steps. But it can be optimized out (will file an issue).

@sumeerbhola sumeerbhola requested a review from tbg September 26, 2024 17:43
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please do an initial pass to make sure this makes sense?

Looks fine to me from a usage perspective.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv and @tbg)

This commit mechanically moves the implementation of the raft LogStorage
interface into a new file.

Epic: none
Release note: none
Epic: none
Release note: none
@pav-kv pav-kv force-pushed the raft-log-storage-snapshot branch from c41d4a7 to 3202012 Compare September 28, 2024 19:12
This commit implements the raft.LogStorageSnapshot interface in
kvserver. The implementation behaves like a consistent log storage
snapshot while Replica.raftMu is held, relying on the fact that all raft
log storage writes are synchronized via this mutex.

Epic: none
Release note: none
@pav-kv pav-kv force-pushed the raft-log-storage-snapshot branch from 3202012 to 61f7b3c Compare September 28, 2024 19:15
@pav-kv pav-kv marked this pull request as ready for review September 28, 2024 19:15
@pav-kv pav-kv requested a review from a team as a code owner September 28, 2024 19:15
@pav-kv
Copy link
Collaborator Author

pav-kv commented Sep 30, 2024

bors r=tbg,sumeerbhola

@craig craig bot merged commit 63bfd0e into cockroachdb:master Sep 30, 2024
23 checks passed
@pav-kv pav-kv deleted the raft-log-storage-snapshot branch September 30, 2024 13:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants