raft: support sending MsgApp from a log snapshot #130932
Conversation
Reviewed 1 of 3 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv)
pkg/kv/kvserver/replica_raft.go
line 1471 at r4 (raw file):
// FIXME: should know the parameters, as instructed by the send streams.
var after, last, maxSize uint64
slices[peer] = logSnap.LogSlice(after, last, maxSize)
This will happen inside the replicaSendStream and not centralized here. But the LogSlice API looks good (reminder: we want at least one entry in LogSlice even if it exceeds maxSize).
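For illustration, a toy sketch (not the raft package's actual implementation) of the at-least-one-entry contract being discussed: the slice is cut at the byte budget, but it never comes back empty while entries remain, so the sender always makes progress even when a single entry exceeds maxSize.

// Toy sketch of the "at least one entry even if it exceeds maxSize" semantics.
// Indices and types are illustrative, not the real LogSlice signature.
func logSliceToy(entries [][]byte, after, last int, maxSize uint64) [][]byte {
    var out [][]byte
    var size uint64
    for _, e := range entries[after+1 : last+1] {
        if len(out) > 0 && size+uint64(len(e)) > maxSize {
            break // budget exhausted, but at least one entry was already taken
        }
        out = append(out, e)
        size += uint64(len(e))
    }
    return out
}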
pkg/raft/log.go
line 533 at r3 (raw file):
return LogSnapshot{
    first:    l.firstIndex(),
    unstable: l.unstable.logSlice,
Is a shallow copy of logSlice.entries ok because no one modifies the indices after appending?
pkg/raft/raft.go
line 673 at r4 (raw file):
commit := r.raftLog.committed
// Send the MsgApp, and update the progress accordingly.
r.send(pb.Message{
This will delay sending until the next Ready, which is not desirable. We have already deducted tokens for it and we don't want to have to schedule on the raft scheduler again. I would like it to behave as makeMsgAppAndAssumeSent(...) raftpb.Message so that the caller (replicaSendStream) can immediately send. The replicaSendStream will hold Replica.mu when calling this, since there is no IO happening.
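A minimal, self-contained sketch of the calling pattern being requested here, with stand-in types rather than the real raft/kvserver APIs (the real shape is exactly what this thread is deciding): the send stream builds the message under Replica.mu and hands it to the transport itself, so no extra raft-scheduler event is needed.

// Stand-in types, for illustration only.
type message struct {
    to      uint64
    entries [][]byte
    commit  uint64
}

type msgAppMaker interface {
    // Builds the MsgApp from the log snapshot and updates the peer's progress
    // as if the message had already been sent.
    makeMsgAppAndAssumeSent(to uint64, after, last, maxSize uint64) (message, bool)
}

type transport interface {
    send(message)
}

// Hypothetical replicaSendStream step: flow-control tokens were already
// deducted, so the message is sent immediately instead of waiting for Ready.
func sendOne(m msgAppMaker, tr transport, to, after, last, maxSize uint64) {
    if msg, ok := m.makeMsgAppAndAssumeSent(to, after, last, maxSize); ok {
        tr.send(msg)
    }
}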
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/kv/kvserver/replica_raft.go
line 1471 at r4 (raw file):
Previously, sumeerbhola wrote…
This will happen inside the replicaSendStream and not centralized here. But the LogSlice API looks good (reminder: we want at least one entry in LogSlice even if it exceeds maxSize).
Ack. Yeah, the at-least-one-entry semantics works; I just didn't spell it out in the comment.
pkg/raft/log.go
line 533 at r3 (raw file):
Previously, sumeerbhola wrote…
Is a shallow copy of logSlice.entries ok because no one modifies the indices after appending?
Yes. Internally, raft only appends to the unstable slice. If it ever needs to regress / truncate the log and overwrite an index, it reallocates.
cockroach/pkg/raft/log_unstable.go
Lines 321 to 325 in 1e27189
// Use the full slice expression to cause copy-on-write on this or a
// subsequent (if a.entries is empty) append to u.entries. The truncated
// part of the old slice can still be referenced elsewhere.
keep := u.entries[:a.prev.index-u.prev.index]
u.entries = append(keep[:len(keep):len(keep)], a.entries...)
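To make the mechanism concrete, here is a small standalone Go example of the full (three-index) slice expression used above: capping the capacity forces the next append to allocate a new backing array, so a truncated tail that others may still reference is never overwritten.

package main

import "fmt"

func main() {
    base := []int{1, 2, 3, 4}

    // Full slice expression: len=2, cap=2. Appending must reallocate, so the
    // shared backing array (and base[2], base[3]) is left untouched.
    keep := base[:2:2]
    safe := append(keep, 99)
    fmt.Println(base, safe) // [1 2 3 4] [1 2 99]

    // Without capping the capacity, the append writes into the shared array.
    clobbered := append(base[:2], 77)
    fmt.Println(base, clobbered) // [1 2 77 4] [1 2 77]
}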
pkg/raft/raft.go
line 673 at r4 (raw file):
Previously, sumeerbhola wrote…
This will delay sending until the next Ready, which is not desirable. We have already deducted tokens for it and we don't want to have to schedule on the raft scheduler again. I would like it to behave as makeMsgAppAndAssumeSent(...) raftpb.Message so that the caller (replicaSendStream) can immediately send. The replicaSendStream will hold Replica.mu when calling this, since there is no IO happening.
This is doable. Though who runs this? Is it run from within the raft scheduler thread? If so, this can be done right before Ready or inlined in the Ready handler, and we wouldn't need a second scheduling event.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/raft/log.go
line 533 at r3 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
Yes. Internally, raft only appends to the unstable slice. If it ever needs to regress / truncate the log and overwrite an index, it reallocates.
cockroach/pkg/raft/log_unstable.go
Lines 321 to 325 in 1e27189
// Use the full slice expression to cause copy-on-write on this or a
// subsequent (if a.entries is empty) append to u.entries. The truncated
// part of the old slice can still be referenced elsewhere.
keep := u.entries[:a.prev.index-u.prev.index]
u.entries = append(keep[:len(keep):len(keep)], a.entries...)
This safety is pre-existing. We already return unstable sub-slices via Ready(), which then get sent to async log storage. So unstable protects both itself and the caller from slice mutation.
The broad strokes of this look good!
pkg/raft/log.go
Outdated
type LogSnapshot struct {
    first    uint64
    unstable logSlice
    storage  Storage
It's unexpected to me that the Storage is part of the snapshot. The snapshot provides a "view" into the Storage, assuming the Storage isn't mutated. But the snapshot doesn't "snapshot" the Storage. So it's worth considering whether all the methods on a LogSnapshot that actually need to do IO should take the Storage as a parameter.
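A minimal sketch of that alternative shape, using stand-in types rather than the PR's actual ones: LogSnapshot stays a pure view over the in-memory tail, and the method that needs IO receives the Storage explicitly.

// Stand-in types, for illustration only.
type entry struct {
    index, term uint64
    data        []byte
}

type storage interface {
    // entries returns persisted entries in [lo, hi), subject to maxSize.
    entries(lo, hi, maxSize uint64) ([]entry, error)
}

// logSnapshotAlt holds no storage; it only captures the in-memory view.
type logSnapshotAlt struct {
    first    uint64
    unstable []entry
}

// logSlice takes the storage as a parameter, which makes explicit that the
// snapshot merely "views" stable storage under the assumption it isn't
// mutated while the snapshot is in use. The real implementation would stitch
// storage entries together with the unstable tail; that is elided here.
func (s logSnapshotAlt) logSlice(st storage, after, last, maxSize uint64) ([]entry, error) {
    return st.entries(after+1, last+1, maxSize)
}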
(That said, if this is all complicated to disentangle, what we have here is fine, and since we need to pass this LogSnapshot around, it could be difficult to do the plumbing for Storage.)
I wonder if we could make a Storage wrapper which asserts that it wasn't mutated. E.g. I could imagine incrementing some (in-memory only) "epoch" int in it whenever we mutate the log storage. Then make a lean wrapper around our Storage impl which, on every call, asserts that the epoch hasn't changed.
type StorageRO interface {
    // everything we have today in Storage
}

type Storage interface {
    StorageRO
    Snapshot() StorageRO
}

// somewhere in replica_raftstorage.go

// wrapper pins the epoch observed when the read-only snapshot was taken.
type wrapper struct {
    r     *replicaRaftStorage
    epoch int
}

func (r *replicaRaftStorage) Snapshot() raft.StorageRO {
    return wrapper{r: r, epoch: r.epoch}
}

func (w wrapper) Entries(...) {
    // Assert the log storage hasn't been mutated since the snapshot was taken.
    if w.r.epoch != w.epoch {
        log.Fatalf()
    }
    return w.r.Entries(...)
}
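On the mutation side, the epoch bump could look something like this (hypothetical helper and field; not part of the PR or the sketch above):

// Hypothetical sketch: increment the in-memory epoch wherever the raft log
// storage is mutated (append, truncation) while raftMu is held, so a stale
// StorageRO wrapper fails fast on its next call instead of returning an
// inconsistent view.
func (r *replicaRaftStorage) onLogStorageMutation() {
    r.epoch++ // in-memory only; never persisted
}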
We are faking storage snapshots by just requiring that storage is not mutated for a while (abusing the fact that raftMu is held). But raft doesn't need to know that :)
There might be some confusion between the "log/storage snapshot" and the "state machine snapshot". Might be a good time to complete this TODO.
Force-pushed from 84ae1cd to 96219c0
Force-pushed from 96219c0 to 82f263a
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed from 82f263a to f624f34
Epic: none
Release note: none
Force-pushed from f624f34 to 7748863
Epic: none
Release note: none