
raft: support sending MsgApp from a log snapshot #130932

Closed
wants to merge 1 commit (from the raft-log-snapshot branch)

Conversation

@pav-kv (Collaborator) commented Sep 18, 2024

Epic: none
Release note: none

@cockroach-teamcity (Member) commented:

This change is Reviewable

@sumeerbhola (Collaborator) left a comment

Reviewed 1 of 3 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv)


pkg/kv/kvserver/replica_raft.go line 1471 at r4 (raw file):

		// FIXME: should know the parameters, as instructed by the send streams.
		var after, last, maxSize uint64
		slices[peer] = logSnap.LogSlice(after, last, maxSize)

This will happen inside the replicaSendStream and not centralized here. But the LogSlice API looks good (reminder: we want at least one entry in LogSlice even if it exceeds maxSize).
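For illustration, a minimal sketch of the call shape being discussed, with stand-in types and assumed field names (this is not the PR's code):

// Stand-ins so the sketch is self-contained; replicaSendStream's real fields
// and the LogSlice return type are assumptions here.
type entrySlice struct{} // placeholder for whatever LogSlice returns

type logSnapshot interface {
	// Returns entries in (after, last] capped at maxSize bytes, but always
	// at least one entry even if that single entry exceeds maxSize.
	LogSlice(after, last, maxSize uint64) entrySlice
}

type sendStream struct {
	nextIndex, lastIndex uint64 // assumed bookkeeping in replicaSendStream
}

func (s *sendStream) nextAppend(snap logSnapshot) entrySlice {
	const maxSize = 1 << 20 // hypothetical per-MsgApp byte budget
	return snap.LogSlice(s.nextIndex-1, s.lastIndex, maxSize)
}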


pkg/raft/log.go line 533 at r3 (raw file):

	return LogSnapshot{
		first:    l.firstIndex(),
		unstable: l.unstable.logSlice,

is a shallow copy of logSlice.entries ok because no one modifies the indices after appending?


pkg/raft/raft.go line 673 at r4 (raw file):

	commit := r.raftLog.committed
	// Send the MsgApp, and update the progress accordingly.
	r.send(pb.Message{

This will delay sending until the next Ready, which is not desirable. We have already deducted tokens for it and we don't want to have to schedule on the raft scheduler again. I would like it to behave as makeMsgAppAndAssumeSent(...) raftpb.Message so that caller (replicaSendStream) can immediately send. The replicaSendStream will hold Replica.mu when calling this, since there is no IO happening.
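As a rough sketch of that shape, with stand-in types rather than an actual API from this PR:

// The point is the flow: build the MsgApp, account for it in Progress as if
// it had already been sent, and hand it back so replicaSendStream can push it
// to the transport immediately under Replica.mu, with no extra Ready cycle.
type msgApp struct{ to uint64 } // placeholder for raftpb.Message

type appendSource interface {
	// Builds the message from the log snapshot and updates the follower's
	// Progress as though it had already been sent.
	MakeMsgAppAndAssumeSent(to uint64) msgApp
}

type msgTransport interface{ SendAsync(m msgApp) }

func sendNow(src appendSource, tr msgTransport, to uint64) {
	tr.SendAsync(src.MakeMsgAppAndAssumeSent(to)) // no second scheduler pass
}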

@pav-kv (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/replica_raft.go line 1471 at r4 (raw file):

Previously, sumeerbhola wrote…

This will happen inside the replicaSendStream and not centralized here. But the LogSlice API looks good (reminder: we want at least one entry in LogSlice even if it exceeds maxSize).

Ack. Yeah, the at-least-one-entry semantics works; I just didn't spell it out in the comment.


pkg/raft/log.go line 533 at r3 (raw file):

Previously, sumeerbhola wrote…

is a shallow copy of logSlice.entries ok because no one modifies the indices after appending?

Yes. Internally, raft only appends to the unstable slice. If it ever needs to regress / truncate the log and overwrite an index, it reallocates.

// Use the full slice expression to cause copy-on-write on this or a
// subsequent (if a.entries is empty) append to u.entries. The truncated
// part of the old slice can still be referenced elsewhere.
keep := u.entries[:a.prev.index-u.prev.index]
u.entries = append(keep[:len(keep):len(keep)], a.entries...)
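For context, a self-contained Go snippet showing why that full (three-index) slice expression forces copy-on-write on the next append (variable names here are made up):

package main

import "fmt"

func main() {
	entries := []int{1, 2, 3, 4} // stands in for u.entries
	keep := entries[:2]          // "truncate" after the first two entries

	// keep[:len(keep):len(keep)] caps the capacity at the length, so the
	// append below must allocate a new backing array instead of overwriting
	// entries[2], which may still be referenced elsewhere.
	fresh := append(keep[:len(keep):len(keep)], 9, 9)

	fmt.Println(entries) // [1 2 3 4] -- old backing array untouched
	fmt.Println(fresh)   // [1 2 9 9] -- new backing array
}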


pkg/raft/raft.go line 673 at r4 (raw file):

Previously, sumeerbhola wrote…

This will delay sending until the next Ready, which is not desirable. We have already deducted tokens for it and we don't want to have to schedule on the raft scheduler again. I would like it to behave as makeMsgAppAndAssumeSent(...) raftpb.Message so that caller (replicaSendStream) can immediately send. The replicaSendStream will hold Replica.mu when calling this, since there is no IO happening.

This is doable. Though who runs this? Is it run from within the raft scheduler thread? If so, this can be done right before Ready, or inlined in the Ready handler, and we wouldn't need a second scheduling event.

@pav-kv (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/raft/log.go line 533 at r3 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

Yes. Internally, raft only appends to the unstable slice. If it ever needs to regress / truncate the log and overwrite an index, it reallocates.

// Use the full slice expression to cause copy-on-write on this or a
// subsequent (if a.entries is empty) append to u.entries. The truncated
// part of the old slice can still be referenced elsewhere.
keep := u.entries[:a.prev.index-u.prev.index]
u.entries = append(keep[:len(keep):len(keep)], a.entries...)

This safety is pre-existing. We already return unstable sub-slices via Ready(), which then get sent to async log storage. So unstable protects itself and the caller from mutating the slice.

@tbg (Member) left a comment

The broad strokes of this look good!

pkg/raft/log.go (outdated diff):

	type LogSnapshot struct {
		first    uint64
		unstable logSlice
		storage  Storage
@tbg (Member):

It's unexpected to me that the Storage is part of the snapshot. The snapshot provides a "view" into the Storage, assuming the Storage isn't mutated. But the snapshot doesn't "snapshot" the Storage. So it's worth considering whether all the methods on a LogSnapshot that actually need to do IO should take the Storage as a parameter.

@tbg (Member):

(That said, if this is all complicated to disentangle, what we have here is fine; since we need to pass this LogSnapshot around anyway, it could be difficult to do the plumbing for Storage.)
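To make the two options concrete, a hedged sketch with stand-in types (not the PR's code):

type entry struct{}

// Entries mirrors the shape of raft's Storage.Entries.
type storageView interface {
	Entries(lo, hi, maxSize uint64) ([]entry, error)
}

// Shape in the PR: the snapshot holds the Storage and relies on it not being
// mutated while the snapshot is in use.
type snapHoldingStorage struct {
	first   uint64
	storage storageView
}

func (s snapHoldingStorage) slice(lo, hi, maxSize uint64) ([]entry, error) {
	return s.storage.Entries(lo, hi, maxSize)
}

// Alternative raised above: only the methods that actually do IO take the
// Storage, making explicit that the snapshot is a view of it, not a copy.
type snapAsView struct{ first uint64 }

func (s snapAsView) slice(st storageView, lo, hi, maxSize uint64) ([]entry, error) {
	return st.Entries(lo, hi, maxSize)
}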

@pav-kv (Collaborator, Author) commented Sep 18, 2024

I wonder if we could make a Storage wrapper which asserts that it wasn't mutated. E.g. I could imagine incrementing some (in-memory only) "epoch" int in it whenever we mutate the log storage. Then make a lean wrapper around our Storage impl which, on every call, asserts that the epoch hasn't changed.

type StorageRO interface {
	// everything we have today in Storage
}

type Storage interface {
	StorageRO
	Snapshot() StorageRO
}


// somewhere in replica_raftstorage.go
func (r *replicaRaftStorage) Snapshot() raft.StorageRO {
	return wrapper{r: r, epoch: r.epoch}
}

func (w wrapper) Entries(...) {
	if w.r.epoch != w.epoch {
		log.Fatalf()
	}
	return w.r.Entries(...)
}

@pav-kv (Collaborator, Author) commented Sep 18, 2024

We are faking storage snapshots by just requiring that storage is not mutated for a while (abusing the fact that raftMu is held). But raft doesn't need to know that :)

@pav-kv (Collaborator, Author) commented Sep 18, 2024

There might be some confusion between the "log/storage snapshot" and the "state machine snapshot". Might be a good time to complete this TODO.

blathers-crl bot commented Sep 28, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@pav-kv force-pushed the raft-log-snapshot branch from f624f34 to 7748863 on October 2, 2024 18:06
@pav-kv closed this on Oct 2, 2024
@pav-kv deleted the raft-log-snapshot branch on October 2, 2024 22:53