
kvserver: update TruncatedState before writing #131063

Open · pav-kv wants to merge 4 commits into master from update-truncated-state-before-writing

Conversation

@pav-kv (Collaborator) commented Sep 20, 2024

This PR fixes a truncation "race" that causes raft log entries to spuriously appear not found (ErrCompacted) during normal operation.


Previously, truncations were carried out as follows:

  1. Under Replica.raftMu, write a batch to Pebble that removes a prefix of the log and updates the RaftTruncatedState in the log storage.
  2. Under Replica.{raftMu,mu}, move the in-memory RaftTruncatedState forward.

Between steps 1 and 2, another goroutine can acquire Replica.mu and read the log based on the previous RaftTruncatedState. It can inadvertently observe that the entries in the (previous.Index, next.Index] interval are missing (the raft log returns ErrCompacted for them).

There are at least 2 Replica.mu-only RawNode.Step paths affected by this.

This PR swaps the order of the updates: the truncated state is moved forward first (along with the log size stats), signifying a logical deletion; only then are the entries physically deleted from storage. This removes the possibility of the race, and eliminates the need to handle ErrCompacted, as long as the reader respects the RaftTruncatedState when accessing the log. The raft log is then always consistent under Replica.mu and/or Replica.raftMu.
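
A minimal sketch of the new ordering, with hypothetical names (replica, truncatedState, and commitBatch stand in for Replica, RaftTruncatedState, and the Pebble batch commit):

```go
package main

import (
	"fmt"
	"sync"
)

type truncatedState struct {
	Index, Term uint64
}

type replica struct {
	mu         sync.Mutex
	truncState truncatedState // logical start of the raft log; guarded by mu
}

// truncateLog carries out a truncation in the order this PR establishes:
// logical deletion first, physical deletion second.
func (r *replica) truncateLog(next truncatedState, commitBatch func() error) {
	// Step 1: move the in-memory TruncatedState forward under mu. From this
	// point, readers holding mu treat entries at indices <= next.Index as
	// deleted and never ask storage for them, so the race window is gone.
	r.mu.Lock()
	r.truncState = next
	r.mu.Unlock()

	// Step 2: physically delete the log prefix from storage. A failure here
	// must be fatal (see the review discussion below): we must not keep
	// running while believing the log is truncated when it is not.
	if err := commitBatch(); err != nil {
		panic(fmt.Sprintf("raft log truncation failed: %v", err))
	}
}

func main() {
	r := &replica{}
	r.truncateLog(truncatedState{Index: 10, Term: 3}, func() error { return nil })
	fmt.Println("log logically truncated through index", r.truncState.Index)
}
```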


Part of #132114, #143355
Related to #130955, #131041

@pav-kv pav-kv requested a review from a team as a code owner September 20, 2024 00:07

@tbg (Member) commented Sep 24, 2024

I feel much more enlightened by our conversation than I am by the commit message here. Below is my attempt at capturing the motivation and impact of this PR more fully; feel free to use pieces of it for an updated commit message as you deem appropriate.


Log truncations delete a prefix of the raft log. Currently, we commit the write batch that contains the truncation to pebble (under raftMu) prior to updating the in-memory metadata about the "beginning" of the log, the TruncatedState (under raftMu and replicaMu).
However, access to RawNode (the raft instance) is possible while holding only replicaMu.
We generally aim to perform no I/O under replicaMu, but as part of RACv2 (TODO issue ref) we are introducing read "snapshots" that a RawNode can provide while holding only replicaMu and which can be read from at leisure outside of replicaMu (as long as another mechanism, in practice holding raftMu, prevents mutations of the log within the boundaries of the snapshot).
Having to consider a risk that RawNode attempts to access log entries that should still be present according to the TruncatedState but which have already been removed from the storage engine is undesirable complexity.
RawNode handles a "missing" log prefix somewhat gracefully (see ErrCompacted handling), but it is desirable to constrain the set of allowable behaviors as this significantly cuts down on the complexity of the system, which is important in light of the current RACv2 workstream and beyond.
To this end, this change updates the in-memory TruncatedState before committing the deletion, ensuring that RawNode under replicaMu will only ever assume the presence of portions of the log that aren't currently in the midst of a mutation.
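
Continuing the hypothetical sketch above, this is the reader-side payoff: a reader that checks TruncatedState under the same mutex can bounds-check up front, so an out-of-range index means a genuine compaction rather than a mid-truncation race (readStorage is an assumed helper):

```go
import "errors"

// errCompacted is a local stand-in for raft's ErrCompacted.
var errCompacted = errors.New("requested entry compacted away")

// readStorage is an assumed helper that reads entries from the engine.
func (r *replica) readStorage(lo, hi uint64) [][]byte { return nil }

// entriesLocked returns log entries in [lo, hi); r.mu must be held.
func (r *replica) entriesLocked(lo, hi uint64) ([][]byte, error) {
	if lo <= r.truncState.Index {
		// The prefix is logically deleted: a genuine compaction, not a race.
		return nil, errCompacted
	}
	// Entries above truncState.Index are guaranteed present in storage,
	// because the physical deletion never runs ahead of the logical one.
	return r.readStorage(lo, hi), nil
}
```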

@tbg (Member) left a comment


Question about locking, but generally looks good!

@sumeerbhola sumeerbhola requested a review from tbg September 25, 2024 18:09
@sumeerbhola (Collaborator) left a comment


We are going to grab LogSnapshot while holding raftMu (and holding Replica.mu) and will continue holding raftMu until we use the LogSnapshot. So it's unclear to me why this change is necessary. If this change were straightforward, it would definitely make sense, but given possible errors when committing the truncation batch, it doesn't seem so.



pkg/kv/kvserver/raft_log_truncator.go line 593 at r1 (raw file):

		r.setTruncatedState(trunc.RaftTruncatedState, trunc.expectedFirstIndex, trunc.isDeltaTrusted)
		r.setTruncationDelta(trunc.logDeltaBytes)
	})

What if the batch.Commit fails? Now we think the log is truncated when it is not.


pkg/kv/kvserver/raft_log_truncator.go line 608 at r1 (raw file):

			return
		}
		r.applySideEffects(ctx, &trunc.RaftTruncatedState)

Here we are passing a *kvserverpb.RaftTruncatedState and in setTruncatedState we pass a kvserverpb.RaftTruncatedState. Why this inconsistency?
We always used to pass a pointer before.


pkg/kv/kvserver/replica_application_result.go line 506 at r1 (raw file):

func (r *Replica) handleTruncatedStateResult(
	ctx context.Context, t *kvserverpb.RaftTruncatedState,
) (raftLogDelta int64) {

why do we delay clearing the cache entries if we've already updated Replica.mu.state.TruncatedState?

@pav-kv (Collaborator, Author) left a comment


We are going to grab LogSnapshot while holding raftMu (and holding Replica.mu) and will continue holding raftMu until we use the LogSnapshot. So it's unclear to me why this change is necessary

"We" being RACv2 - yes. There is still a class of log storage reads done while only holding Replica.mu, 2 linked from the PR description. Since they don't lock raftMu before mu, they can load an outdated TruncatedState and observe a gap in the log.

This race is the only thing that can cause ErrCompacted errors in the raft codebase, and the error is then also exposed via the API, etc. I don't think it's worth keeping that handling for the sake of one race condition / quirk.

The way truncations are done is also inconsistent with how snapshots are handled (snapshots are a special kind of truncation that writes to both the "log storage" and the "state machine storage"). With snapshots, raft knows first, and registers an in-memory "intent" not to read below the snapshot index (see the unstable.snapshot field and the unstable.maybeFirstIndex method, correspondingly). Only then is the snapshot written/synced, the log wiped, and the result acked back to raft.

Log truncations invert this order: first the truncation is enacted, and only then is TruncatedState updated (which is the equivalent of notifying raft of the intent). When we get closer to the separate-raft-log project again, this discrepancy will become more pressing, so I thought it not worth waiting and fixed it now.
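
For reference, the snapshot "intent" mechanism described above looks roughly like this in etcd-style raft; this is a paraphrase of log_unstable.go with the types reduced to local stubs:

```go
// Stubs for the raft-internal types (the real ones live in raftpb).
type snapshotMetadata struct{ Index uint64 }
type snapshot struct{ Metadata snapshotMetadata }

// unstable holds raft state that is not yet persisted to stable storage.
type unstable struct {
	// snapshot is the in-memory intent: once set, raft will not attempt to
	// read log entries at or below snapshot.Metadata.Index.
	snapshot *snapshot
	// ... entries, offset, etc.
}

// maybeFirstIndex returns the first index of the log, if known from a
// pending snapshot. Raft consults this before touching storage, so it never
// reads a prefix that the snapshot has logically superseded.
func (u *unstable) maybeFirstIndex() (uint64, bool) {
	if u.snapshot != nil {
		return u.snapshot.Metadata.Index + 1, true
	}
	return 0, false
}
```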


@pav-kv (Collaborator, Author) left a comment



pkg/kv/kvserver/raft_log_truncator.go line 593 at r1 (raw file):

Previously, sumeerbhola wrote…

What if the batch.Commit fails? Now we think the log is truncated when it is not.

In the synchronous truncations flow, applies can't fail; if they do, we panic. The decoupled truncations flow seems more liberal about this, so it's a valid question.

What could be a reason for a failure here? Any legitimate ones?

I think: if we have already planned to truncate the log at an index, it doesn't matter whether the storage write fails; logically, the prefix is already unused. If updating TruncatedState first seems risky, we should make raft aware of truncations (it's a matter of adding one integer to the unstable struct) and notify raft about this intent first thing.


pkg/kv/kvserver/raft_log_truncator.go line 608 at r1 (raw file):

Previously, sumeerbhola wrote…

Here we are passing a *kvserverpb.RaftTruncatedState and in setTruncatedState we pass a kvserverpb.RaftTruncatedState. Why this inconsistency?
We always used to pass a pointer before.

No reason, will fix.


pkg/kv/kvserver/replica_application_result.go line 506 at r1 (raw file):

Previously, sumeerbhola wrote…

why do we delay clearing the cache entries if we've already updated Replica.mu.state.TruncatedState?

Yeah, this seems movable a bit up the stack; I'll consider it. Though not critical.

@sumeerbhola (Collaborator) left a comment


There is still a class of log storage reads done while holding only Replica.mu; two of them are linked from the PR description.

thanks for the pointer.



pkg/kv/kvserver/raft_log_truncator.go line 593 at r1 (raw file):

What could be a reason for a failure here? Any legitimate ones?

I couldn't think of one, and neither could Jackson. So adding a panic here is fine, which should unblock this PR. We just don't want a situation where we keep running without the log truncated while thinking it is truncated.
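
A sketch of the agreed-upon handling, as a hypothetical wrapper (batchCommit stands in for committing the synced Pebble write batch; not the PR's actual code):

```go
// commitTruncation commits the truncation batch or crashes the process.
func commitTruncation(batchCommit func(sync bool) error) {
	// A failed commit would leave the in-memory TruncatedState claiming a
	// truncation that never happened on disk. Per the discussion, there is
	// no legitimate cause for a failure, so treat it as fatal.
	if err := batchCommit(true /* sync */); err != nil {
		panic("unable to commit raft log truncation: " + err.Error())
	}
}
```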

@tbg tbg requested review from tbg and removed request for tbg September 26, 2024 07:28
@pav-kv pav-kv force-pushed the update-truncated-state-before-writing branch from 4fe206e to 2561983 Compare March 21, 2025 17:41
@pav-kv pav-kv marked this pull request as draft March 21, 2025 17:42
@pav-kv pav-kv removed the request for review from nvanbenschoten March 21, 2025 17:42
@pav-kv (Collaborator, Author) commented Mar 21, 2025

This PR is now placed on top of the latest truncation stack clean-ups, #143271 and #143249. The last commit is the change itself; it's much more digestible now. Hold off on reviews until the other PRs are done.

@pav-kv pav-kv force-pushed the update-truncated-state-before-writing branch 4 times, most recently from b8e0954 to 9a3d8f0 Compare March 22, 2025 00:42
@pav-kv (Collaborator, Author) left a comment




pkg/kv/kvserver/raft_log_truncator.go line 593 at r1 (raw file):

Previously, sumeerbhola wrote…

What could be a reason for a failure here? Any legitimate ones?

I couldn't think of one, and neither could Jackson. So adding a panic here is fine, which should unblock this PR. We just don't want a situation where we keep running without the log truncated while thinking it is truncated.

Done.


pkg/kv/kvserver/replica_application_result.go line 506 at r1 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

Yeah, this seems movable a bit up the stack; I'll consider it. Though not critical.

This is now in the right place.

@pav-kv (Collaborator, Author) commented Mar 22, 2025

@tbg @sumeerbhola All the pieces are now in the right places. There is some renaming / commenting to be done, but this PR is reviewable now. I also updated the PR description.

@pav-kv pav-kv marked this pull request as ready for review March 22, 2025 00:50
@pav-kv pav-kv force-pushed the update-truncated-state-before-writing branch 3 times, most recently from bff7f34 to 407e68a Compare March 22, 2025 17:43
pav-kv added 4 commits March 22, 2025 22:19
Update the Replica's in-memory TruncatedState before applying the write
batch to storage. Readers of the raft log storage who synchronize with
it via Replica.mu, and read TruncatedState, will then expect to find
entries at indices > TruncatedState.Index in the log. If we write the
batch first, and only then update TruncatedState, there is a time window
during which the log storage appears to have a gap.

Epic: none
Release note: none
@pav-kv pav-kv force-pushed the update-truncated-state-before-writing branch from 4139769 to 90dd7ee Compare March 22, 2025 22:20
@tbg tbg self-requested a review March 24, 2025 12:27
@pav-kv pav-kv requested a review from sumeerbhola March 24, 2025 12:40
@tbg (Member) left a comment


⚡⚡⚡


// finalizeTruncationRaftMuLocked is a post-apply handler for the raft log
@tbg (Member) left a comment

stage/apply sounds reasonable for naming at first glance, but then there are questions:

  • stage doesn't actually "stage" anything that really deletes the log entries: you'd expect it to populate a batch, or something like that, to be committed in finalize. The real staging either happens in the app batch (tight) or in the truncator (loose).
  • finalize similarly only deals with sideloaded entries

I can see how this all fits into how things currently work (the truncator needs a slim handle to a Replica to update the cache etc), but it would be useful to leave a blurb on these methods that hints at the bigger picture.

It might also be helpful to untether these methods from *Replica. Instead, raftTruncatorReplica could inline these methods directly, and in replicaAppBatch we could cast the *Replica as a raftTruncatorReplica for access to these methods. That way, we don't end up with methods that are confusing when considered as methods on *Replica in isolation.

Just suggestions; take any or leave them. Once we go back to only having "one kind of truncation", it will be easier to streamline this.
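
For illustration, the conversion suggested here is cheap in Go because a named type defined over another struct type shares its layout; a minimal sketch with illustrative method names:

```go
type Replica struct{ /* ... */ }

// raftTruncatorReplica narrows *Replica to the truncator's view of it.
type raftTruncatorReplica Replica

// stagePendingTruncation lives on the narrowed type, keeping the method set
// of *Replica itself uncluttered; the body is illustrative only.
func (t *raftTruncatorReplica) stagePendingTruncation() {
	r := (*Replica)(t) // recover the underlying *Replica when needed
	_ = r
	// ... update TruncatedState, log size stats, entry cache, etc.
}
```

A call site that holds a *Replica converts explicitly: `(*raftTruncatorReplica)(r).stagePendingTruncation()`.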

@pav-kv (Collaborator, Author) commented Mar 24, 2025

Naming is hard :) So we have 2.5 steps, essentially:

  1. Logical truncation (called stagePendingTruncationRaftMuLocked here). Updates the in-memory RaftTruncatedState, size stats, etc. After this step, the truncation appears to have been applied.
  2. Physical truncation (could be fully asynchronous).
    a. Writes a batch that carries out the same truncation in log storage.
    b. More physical truncation (called finalize here). Removes sideloaded files after (a) is synced.

On where to put the methods: I'd like them to be part of replicaLogStorage / logstore or something like that. Need to think more about how to consolidate things: #136109.
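
A rough skeleton of that split, reusing the hypothetical types from the sketch in the PR description (commitDeletionBatch and removeSideloadedFiles are assumed helpers, not the actual code):

```go
// Step 1: logical truncation, under mu; from here the truncation appears to
// have been applied, and readers see the new log bounds immediately.
func (r *replica) stagePendingTruncationRaftMuLocked(next truncatedState) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.truncState = next
	// ... also adjust the in-memory log size stats here.
}

// Steps 2a and 2b: physical truncation; may run fully asynchronously.
func (r *replica) finalizeTruncation(
	commitDeletionBatch func() error, // 2a: delete the log prefix (synced)
	removeSideloadedFiles func(), // 2b: remove sideloaded entry files
) {
	if err := commitDeletionBatch(); err != nil {
		panic(err) // see the batch.Commit discussion above
	}
	removeSideloadedFiles()
}
```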

r.handleRaftLogDeltaResultRaftMuLockedReplicaMuLocked(pt.logDeltaBytes, isDeltaTrusted)
// Ensure the raft log size is not negative since it isn't persisted between
// server restarts.
// TODO(pav-kv): should we distrust the log size if it goes negative?
@tbg (Member) left a comment

That makes sense to me at least. But (outside of the expectedFirstIndex==0 case where all bets are off) it should not be possible to arrive at a negative number through trusted updates, correct?

@pav-kv (Collaborator, Author) commented Mar 24, 2025

Yes, we shouldn't see a negative here, post all the cleanups. There are still a couple of possibilities:

  1. The sideloaded storage size tracking is faulty and can silently skew the log size: logstore: sideloaded storage is not atomic #136416.
  2. I don't have full confidence that the leader/leaseholder-evaluated size delta exactly matches what the local replica would have computed. Maybe not today, but I could imagine mixed-version scenarios in which raft log encodings could differ.

In the future, I would like the local replica to keep track of its own size precisely and in a self-contained way. One way to achieve this is #136358, and a more general approach is here.
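
The clamping discussed in this thread, as a small self-contained sketch (the real code tracks this state on Replica.mu; names here are illustrative):

```go
// applyLogDelta folds a truncation's size delta into the in-memory estimate.
// The estimate is not persisted across restarts, so it can drift; clamp it
// at zero rather than letting it go negative.
func applyLogDelta(size int64, sizeTrusted bool, delta int64, deltaTrusted bool) (int64, bool) {
	size += delta
	if size < 0 {
		size = 0
	}
	if !deltaTrusted {
		// An untrusted delta poisons the overall estimate until recomputed.
		sizeTrusted = false
	}
	return size, sizeTrusted
}
```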

@sumeerbhola (Collaborator) left a comment

:lgtm:

