
raft: support sending MsgApp from a log snapshot #130932

Closed
wants to merge 1 commit (from the raft-log-snapshot branch)

Conversation

@pav-kv (Collaborator) commented Sep 18, 2024

Epic: none
Release note: none

@cockroach-teamcity (Member) commented:

This change is Reviewable

@sumeerbhola (Collaborator) left a comment

Reviewed 1 of 3 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv)


pkg/kv/kvserver/replica_raft.go line 1471 at r4 (raw file):

		// FIXME: should know the parameters, as instructed by the send streams.
		var after, last, maxSize uint64
		slices[peer] = logSnap.LogSlice(after, last, maxSize)

This will happen inside the replicaSendStream and not centralized here. But the LogSlice API looks good (reminder: we want at least one entry in LogSlice even if it exceeds maxSize).
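For illustration, a minimal sketch of the call shape being discussed, with stand-in types and assumed field names (this is not the PR's code):

// Stand-ins so the sketch is self-contained; replicaSendStream's real fields
// and the LogSlice return type are assumptions here.
type entrySlice struct{} // placeholder for whatever LogSlice returns

type logSnapshot interface {
	// Returns entries in (after, last] capped at maxSize bytes, but always
	// at least one entry even if that single entry exceeds maxSize.
	LogSlice(after, last, maxSize uint64) entrySlice
}

type sendStream struct {
	nextIndex, lastIndex uint64 // assumed bookkeeping in replicaSendStream
}

func (s *sendStream) nextAppend(snap logSnapshot) entrySlice {
	const maxSize = 1 << 20 // hypothetical per-MsgApp byte budget
	return snap.LogSlice(s.nextIndex-1, s.lastIndex, maxSize)
}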


pkg/raft/log.go line 533 at r3 (raw file):

	return LogSnapshot{
		first:    l.firstIndex(),
		unstable: l.unstable.logSlice,

is a shallow copy of logSlice.entries ok because no one modifies the indices after appending?


pkg/raft/raft.go line 673 at r4 (raw file):

	commit := r.raftLog.committed
	// Send the MsgApp, and update the progress accordingly.
	r.send(pb.Message{

This will delay sending until the next Ready, which is not desirable. We have already deducted tokens for it and we don't want to have to schedule on the raft scheduler again. I would like it to behave as makeMsgAppAndAssumeSent(...) raftpb.Message so that caller (replicaSendStream) can immediately send. The replicaSendStream will hold Replica.mu when calling this, since there is no IO happening.
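As a rough sketch of that shape, with stand-in types rather than an actual API from this PR:

// The point is the flow: build the MsgApp, account for it in Progress as if
// it had already been sent, and hand it back so replicaSendStream can push it
// to the transport immediately under Replica.mu, with no extra Ready cycle.
type msgApp struct{ to uint64 } // placeholder for raftpb.Message

type appendSource interface {
	// Builds the message from the log snapshot and updates the follower's
	// Progress as though it had already been sent.
	MakeMsgAppAndAssumeSent(to uint64) msgApp
}

type msgTransport interface{ SendAsync(m msgApp) }

func sendNow(src appendSource, tr msgTransport, to uint64) {
	tr.SendAsync(src.MakeMsgAppAndAssumeSent(to)) // no second scheduler pass
}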

@pav-kv (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/kv/kvserver/replica_raft.go line 1471 at r4 (raw file):

Previously, sumeerbhola wrote…

This will happen inside the replicaSendStream and not centralized here. But the LogSlice API looks good (reminder: we want at least one entry in LogSlice even if it exceeds maxSize).

Ack. Yeah, the at-least-one-entry semantics works; I just didn't spell it out in the comment.


pkg/raft/log.go line 533 at r3 (raw file):

Previously, sumeerbhola wrote…

is a shallow copy of logSlice.entries ok because no one modifies the indices after appending?

Yes. Internally, raft only appends to the unstable slice. If it ever needs to regress / truncate the log and overwrite an index, it reallocates.

// Use the full slice expression to cause copy-on-write on this or a
// subsequent (if a.entries is empty) append to u.entries. The truncated
// part of the old slice can still be referenced elsewhere.
keep := u.entries[:a.prev.index-u.prev.index]
u.entries = append(keep[:len(keep):len(keep)], a.entries...)
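For context, a self-contained Go snippet showing why that full (three-index) slice expression forces copy-on-write on the next append (variable names here are made up):

package main

import "fmt"

func main() {
	entries := []int{1, 2, 3, 4} // stands in for u.entries
	keep := entries[:2]          // "truncate" after the first two entries

	// keep[:len(keep):len(keep)] caps the capacity at the length, so the
	// append below must allocate a new backing array instead of overwriting
	// entries[2], which may still be referenced elsewhere.
	fresh := append(keep[:len(keep):len(keep)], 9, 9)

	fmt.Println(entries) // [1 2 3 4] -- old backing array untouched
	fmt.Println(fresh)   // [1 2 9 9] -- new backing array
}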


pkg/raft/raft.go line 673 at r4 (raw file):

Previously, sumeerbhola wrote…

This will delay sending until the next Ready, which is not desirable. We have already deducted tokens for it and we don't want to have to schedule on the raft scheduler again. I would like it to behave as makeMsgAppAndAssumeSent(...) raftpb.Message so that caller (replicaSendStream) can immediately send. The replicaSendStream will hold Replica.mu when calling this, since there is no IO happening.

This is doable. Though who runs this? Is it run from within the raft scheduler thread? If so, this can be done right before Ready, or inlined in the Ready handler, and we wouldn't need a second scheduling event.

@pav-kv (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/raft/log.go line 533 at r3 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

Yes. Internally, raft only appends to the unstable slice. If it ever needs to regress / truncate the log and overwrite an index, it reallocates.

// Use the full slice expression to cause copy-on-write on this or a
// subsequent (if a.entries is empty) append to u.entries. The truncated
// part of the old slice can still be referenced elsewhere.
keep := u.entries[:a.prev.index-u.prev.index]
u.entries = append(keep[:len(keep):len(keep)], a.entries...)

This safety is pre-existing. We already return unstable sub-slices via Ready(), which then get sent to async log storage. So unstable protects itself and the caller from mutating the slice.

@tbg (Member) left a comment

The broad strokes of this look good!

pkg/raft/log.go (outdated diff):

	type LogSnapshot struct {
		first    uint64
		unstable logSlice
		storage  Storage
@tbg (Member):

It's unexpected to me that the Storage is part of the snapshot. The snapshot provides a "view" into the Storage, assuming the Storage isn't mutated. But the snapshot doesn't "snapshot" the Storage. So it's worth considering whether all the methods on a LogSnapshot that actually need to do IO should take the Storage as a parameter.

@tbg (Member):

(That said, if this is all complicated to disentangle, what we have here is fine; since we need to pass this LogSnapshot around anyway, it could be difficult to do the plumbing for Storage.)
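To make the two options concrete, a hedged sketch with stand-in types (not the PR's code):

type entry struct{}

// Entries mirrors the shape of raft's Storage.Entries.
type storageView interface {
	Entries(lo, hi, maxSize uint64) ([]entry, error)
}

// Shape in the PR: the snapshot holds the Storage and relies on it not being
// mutated while the snapshot is in use.
type snapHoldingStorage struct {
	first   uint64
	storage storageView
}

func (s snapHoldingStorage) slice(lo, hi, maxSize uint64) ([]entry, error) {
	return s.storage.Entries(lo, hi, maxSize)
}

// Alternative raised above: only the methods that actually do IO take the
// Storage, making explicit that the snapshot is a view of it, not a copy.
type snapAsView struct{ first uint64 }

func (s snapAsView) slice(st storageView, lo, hi, maxSize uint64) ([]entry, error) {
	return st.Entries(lo, hi, maxSize)
}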

@pav-kv (Collaborator, Author) commented Sep 18, 2024

I wonder if we could make a Storage wrapper which asserts that it wasn't mutated. E.g. I could imagine incrementing some (in-memory only) "epoch" int in it whenever we mutate the log storage. Then make a lean wrapper around our Storage impl which, on every call, asserts that the epoch hasn't changed.

type StorageRO interface {
	// everything we have today in Storage
}

type Storage interface {
	StorageRO
	Snapshot() StorageRO
}


// somewhere in replica_raftstorage.go
func (r *replicaRaftStorage) Snapshot() raft.StorageRO {
	return wrapper{r: r, epoch: r.epoch}
}

func (w wrapper) Entries(...) {
	if w.r.epoch != w.epoch {
		log.Fatalf()
	}
	return w.r.Entries(...)
}

@pav-kv (Collaborator, Author) commented Sep 18, 2024

We are faking storage snapshots by just requiring that storage is not mutated for a while (abusing the fact that raftMu is held). But raft doesn't need to know that :)

@pav-kv (Collaborator, Author) commented Sep 18, 2024

There might be some confusion between the "log/storage snapshot" and the "state machine snapshot". Might be a good time to complete this TODO.

blathers-crl bot commented Sep 28, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@pav-kv force-pushed the raft-log-snapshot branch from f624f34 to 7748863 on October 2, 2024 18:06
@pav-kv closed this on Oct 2, 2024
@pav-kv deleted the raft-log-snapshot branch on October 2, 2024 22:53