storage: snapshot/log truncation/replica GC badness tracking issue #32046
This seems like a symptom rather than a root cause. Allowing more concurrency here can lead to death spirals when we overwhelm the receiver. You're much closer to this now than I am. Perhaps you have some interesting thoughts about something that can be changed.
The problem I see here is not that there's throttling at the receiver, but that the throttling will back up the sender, who could spend the time sending snapshots to other nodes that need them. I don't fully understand why this semaphore gets that backed up (lack of introspection is one part of the problem), but if we sent 64mb snapshots at 8mb/s we'd clock in at 8s/snapshot. Being backed up after 3-4 of them would easily do it, and how much you can be backed up by scales with the number of nodes, which is typically 8-11 in my experiments.
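For a rough sense of how quickly a serialized sender backs up, here is a back-of-the-envelope sketch in Go using the numbers from the comment above (64MB snapshots at 8MB/s). The constants are illustrative, not the actual cluster settings:

```go
package main

import (
	"fmt"
	"time"
)

// Rough numbers from the discussion above; illustrative only.
const (
	snapshotSizeMB = 64.0 // size of a single snapshot
	rateMBPerSec   = 8.0  // per-stream snapshot send rate
)

func main() {
	perSnapshot := time.Duration(snapshotSizeMB / rateMBPerSec * float64(time.Second))

	// With a single send slot, a snapshot that arrives behind `queued`
	// in-flight snapshots waits for all of them to finish first.
	for _, queued := range []int{1, 3, 4, 8, 11} {
		wait := time.Duration(queued) * perSnapshot
		fmt.Printf("queued=%2d  wait≈%s\n", queued, wait)
	}
}
```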
Ah, I missed that we're only refusing preemptive snapshots. That defuses the situation, though the check doesn't need to compute the Raft log size in the first place (assuming we trust our computed number, which ... well we don't).
This could be merge related. I don't think the merge queue has any protection against causing this situation. |
Merge queue is off, but hold on to that thought. Also, if you want to get rid of a follower, our strategy now is to upreplicate then downreplicate. This means upreplicating when there's a lame duck in the group is necessary. We just have to make sure that we're not losing the preemptive snapshot, which brings us back to reining in the overly aggressive replicaGC. I have a (ticks-based) WIP in my latest round of experiments.
This could be addressed with either a deadline (to unblock the thread and allow it to try a different range/replica) or more parallelism on the sender (run the queue with a higher concurrency, but then limit it back down with a sender-side semaphore once the recipient has accepted the snapshot reservation). There's also #14768: Preemptive snapshots are limited to 2MiB/sec while raft snapshots get 8MiB/sec. However, we don't queue them separately, so a higher-priority raft snapshot may be blocked until a throttled preemptive snapshot has completed. |
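A minimal sketch of the second option (run the queue with more concurrency, bound the actual streaming with a sender-side semaphore, and put a deadline on the reservation step so a worker can move on to another range/replica). The timeout, the semaphore capacity, and the `reserve`/`stream` callbacks are hypothetical stand-ins for the real snapshot path, not the shipped code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sendSem bounds how many snapshots this node streams concurrently once a
// recipient has accepted the reservation. The queue itself can then run with
// higher concurrency, so a slow recipient only ties up one worker briefly.
var sendSem = make(chan struct{}, 2) // capacity is an illustrative choice

// reserveAndSend is a hypothetical stand-in for the snapshot send path:
// reserve models the receiver-side admission step, stream the transfer.
func reserveAndSend(ctx context.Context, reserveTimeout time.Duration,
	reserve, stream func(context.Context) error) error {
	// Bound how long we wait on the recipient's reservation so the worker
	// can try a different range/replica instead of blocking indefinitely.
	reserveCtx, cancel := context.WithTimeout(ctx, reserveTimeout)
	defer cancel()
	if err := reserve(reserveCtx); err != nil {
		return fmt.Errorf("reservation not granted in time: %w", err)
	}

	// Only now take a sender-side slot; streaming is the expensive part.
	select {
	case sendSem <- struct{}{}:
		defer func() { <-sendSem }()
	case <-ctx.Done():
		return ctx.Err()
	}
	return stream(ctx)
}

func main() {
	// A recipient whose reservation semaphore is backed up for a long time.
	slowReserve := func(ctx context.Context) error {
		select {
		case <-time.After(time.Hour):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	err := reserveAndSend(context.Background(), 50*time.Millisecond,
		slowReserve, func(context.Context) error { return nil })
	fmt.Println(errors.Is(err, context.DeadlineExceeded), err)
}
```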
This probably isn't going to be close to the code that will eventually get checked in, but I wanted to get the conversation started. I don't have concrete evidence that this problem is a root cause of cockroachdb#32046, but I want to address it (at the very least for cockroachdb#31875).

I have to dig in more, but what I'm seeing in various import/tpch-flavor tests is that the split-scatter phase is extremely aggressive in splitting and then downreplicating overreplicated ranges. For example, r100 might have descriptor [n1,n2,n3,n4] and will rapidly be split (and its LHS and RHS split again, multiple times) while, say, n4 is removed. I think that in this kind of situation, needing one Raft snapshot quickly implies needing ~O(splits) Raft snapshots: when you split a range on which one replica requires a Raft snapshot, you end up with two ranges that do. The implication is that we don't want to need Raft snapshots (and perhaps also: we want to go easy on splitting ranges for which one replica already needs a snapshot).

On a recent "successful" run of tpccbench/nodes=11/cpus=32, I observed a spike in pending snapshots from zero to 5k (resolved within minutes). A run of import/tpch/nodes=8 typically shows a rapid increase from zero to ~1k which only dissipates after the import returns. This variation may be random, or it may indicate that the import test is a lot more aggressive for some reason. I have to look into the details, but the following script results in a number of Raft snapshots (dozens). This may already be fixed by other PRs such as cockroachdb#31875, though. Easy to verify.

----

An upreplication begins by sending a preemptive snapshot, followed by a transaction which "officially" adds the new member to the Raft group. This leaves a (typically small) window during which the replicaGC queue could pick up the preemptive snapshot and delete it. This is unfortunate as it leaves the range in a fragile state, with one follower requiring a Raft snapshot to catch up. This commit introduces a heuristic that holds off on GC'ing replicas that look like preemptive snapshots until they've been around for a while.

Release note: None
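A sketch of what such a heuristic could look like. The grace period, the field names, and the wall-clock check (rather than the ticks-based one mentioned above) are assumptions for illustration, not the WIP itself:

```go
package main

import (
	"fmt"
	"time"
)

// replicaMeta is a hypothetical view of what the replica GC queue can see.
type replicaMeta struct {
	replicaID int       // 0 means "not yet officially part of the raft group"
	createdAt time.Time // when the data arrived on this store
}

// preemptiveSnapshotGracePeriod is an assumed value; the real heuristic in
// the WIP referenced above is ticks-based rather than wall-clock-based.
const preemptiveSnapshotGracePeriod = 5 * time.Minute

// shouldGC holds off on destroying replicas that look like preemptive
// snapshots (no replica ID yet) until they have been around long enough that
// the upreplication transaction should have either committed or failed.
func shouldGC(m replicaMeta, now time.Time) bool {
	if m.replicaID == 0 && now.Sub(m.createdAt) < preemptiveSnapshotGracePeriod {
		return false // likely a preemptive snapshot mid-upreplication; keep it
	}
	return true
}

func main() {
	now := time.Now()
	fresh := replicaMeta{replicaID: 0, createdAt: now.Add(-time.Minute)}
	stale := replicaMeta{replicaID: 0, createdAt: now.Add(-time.Hour)}
	fmt.Println(shouldGC(fresh, now), shouldGC(stale, now)) // false true
}
```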
32137: storage: refactor log truncation index computation r=petermattis a=tschottdorf

This makes it a lot easier to log descriptive debug messages indicating how a truncation decision was arrived at, and in particular allows pointing the finger at truncations that lead to Raft snapshots, which is relevant in the context of #32046.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
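The gist of the refactor, as described above, is to gather the inputs to the truncation decision so that the chosen index can be explained (and blamed) in a single log line. A hypothetical sketch of that idea; the field names and output format are made up, not the code in #32137:

```go
package main

import "fmt"

// truncateDecision bundles the inputs and the outcome of a log truncation
// decision so it can be logged in one descriptive line. Illustrative only.
type truncateDecision struct {
	quorumIndex     uint64 // highest index committed by a quorum
	firstIndex      uint64 // current first index of the raft log
	lastIndex       uint64 // current last index of the raft log
	slowestFollower uint64 // lowest match index across followers
	chosenIndex     uint64 // index we will truncate to
}

// cutsOffFollower reports whether applying the decision forces at least one
// follower onto a Raft snapshot.
func (d truncateDecision) cutsOffFollower() bool {
	return d.chosenIndex > d.slowestFollower+1
}

func (d truncateDecision) String() string {
	return fmt.Sprintf(
		"truncate to %d [first=%d, last=%d, quorum=%d, slowest follower=%d, snapshot forced=%t]",
		d.chosenIndex, d.firstIndex, d.lastIndex, d.quorumIndex,
		d.slowestFollower, d.cutsOffFollower(),
	)
}

func main() {
	d := truncateDecision{
		quorumIndex: 120, firstIndex: 10, lastIndex: 125,
		slowestFollower: 90, chosenIndex: 120,
	}
	fmt.Println(d)
}
```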
The log size truncation is definitely still janky. We know it can undercount, but the behavior here is rather the opposite: we set the "max raft log size" to the size of the replica, which is really small (74KiB). The Raft log size includes the overhead of the proposal, etc., which the replica size doesn't eat. So we see an endless stream of premature truncation proposals. I don't think any of these really cause a snapshot (because the latest log entries have likely already been sent to the follower at this point), but it seems silly, especially since this is a condition I want to be able to keep logging.
The writes to that range are the node statuses, which are large (tens of KB) and inline (i.e. each write replaces a previous write). So the replica size is mostly constant, whereas the Raft log grows fairly quickly. Under the configured behavior, it's doing the right thing, but I would argue that it shouldn't be truncating all the way up to the quorum index quite as aggressively. Truncating to the joint commit index (taking into account live nodes only) would fare better. I added a bullet to that effect in the initial list.
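A sketch of the "joint commit index over live nodes only" idea: truncate to the highest index that every live follower already has, ignoring dead followers. The `progress` type and the liveness flag are illustrative stand-ins, not raft's actual progress tracking:

```go
package main

import "fmt"

// progress is a minimal stand-in for per-replica progress tracking.
type progress struct {
	match uint64 // highest log index known to be replicated to this replica
	live  bool   // whether the node is considered live (assumed known here)
}

// jointLiveIndex returns the most aggressive truncation point that does not
// force any live follower onto a Raft snapshot: the minimum match index over
// live followers, capped at the quorum index. Dead followers are ignored,
// matching the suggestion above; this is a sketch, not the shipped heuristic.
func jointLiveIndex(quorumIndex uint64, followers []progress) uint64 {
	idx := quorumIndex
	for _, p := range followers {
		if p.live && p.match < idx {
			idx = p.match
		}
	}
	return idx
}

func main() {
	followers := []progress{
		{match: 95, live: true},
		{match: 40, live: false}, // down node: don't hold the log for it
		{match: 100, live: true},
	}
	fmt.Println(jointLiveIndex(100, followers)) // 95: not 40, and not 100
}
```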
Another bad-looking thing:
Added a bullet for this as well.
32437: storage: truncate aggressively only after 4mb of logs r=nvanbenschoten,petermattis a=tbg

cc @nvanbenschoten. I'm going to run some kv95 experiments in which I vary the 64kb threshold in both directions to see if there are any effects on performance in doing so.

----

Whenever the "max raft log size" is exceeded, log truncations become more aggressive in that they aim at the quorum commit index, potentially cutting off followers (which then need Raft snapshots). The effective threshold log size is 4mb for replicas larger than 4mb and the replica size otherwise. This latter case can be problematic since replicas can be persistently small despite having steady log progress (for example, range 4 receives node status updates, which are large inline puts). If in such a range a follower falls behind just slightly, it'll need a snapshot. This isn't in itself the biggest deal since the snapshot is fairly rare (the required log entries are usually already in transit to the follower) and would be small, but it's not ideal. Always use a 4mb threshold instead.

Note that we also truncate the log to the minimum replicated index if the log size is above 64kb. This is similarly aggressive but respects followers (until they fall behind by 4mb or more). My expectation is that this will not functionally change anything. It might leave behind a little bit more Raft log on quiescent ranges, but I think the solution here is performing "one last truncation" for ranges that are quiescent to make sure they shed the remainder of their Raft log.

Touches #32046.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
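To make the before/after of the threshold rule concrete, here is a sketch of the two checks described above. The function names and the 74KiB/200KiB example numbers are illustrative, not the actual code:

```go
package main

import "fmt"

const (
	kb = 1 << 10
	mb = 1 << 20

	raftLogMaxSize = 4 * mb  // aggressive (quorum-index) truncation threshold
	raftLogMinSize = 64 * kb // follower-respecting truncation threshold
)

// aggressiveTruncationOld sketches the pre-PR behavior: the effective
// threshold was min(4MB, replica size), so persistently small but busy
// ranges (e.g. the node-status range) hit the aggressive path early.
func aggressiveTruncationOld(raftLogSize, replicaSize int64) bool {
	threshold := int64(raftLogMaxSize)
	if replicaSize < threshold {
		threshold = replicaSize
	}
	return raftLogSize > threshold
}

// aggressiveTruncationNew sketches the post-PR behavior: a flat 4MB threshold.
func aggressiveTruncationNew(raftLogSize int64) bool {
	return raftLogSize > raftLogMaxSize
}

func main() {
	// A small range (74KiB of data) with a 200KiB Raft log: the old rule
	// truncates to the quorum index, the new rule does not.
	raftLogSize, replicaSize := int64(200*kb), int64(74*kb)
	fmt.Println(aggressiveTruncationOld(raftLogSize, replicaSize)) // true
	fmt.Println(aggressiveTruncationNew(raftLogSize))              // false
	// Either way, logs above 64KiB are still truncated up to the minimum
	// replicated index, which never cuts off a follower.
	fmt.Println(raftLogSize > raftLogMinSize) // true
}
```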
33011: roachtest: don't fail tests based on slow health checker r=petermattis a=tbg

We know there can be a backlog of Raft snapshots at the beginning of the test. This isn't ideal, but we know about it and have #32046 tracking it.

Closes #32859.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
I have too many investigations across the various roachtest import/restore failures. This is an authoritative list of problems that I don't want to lose track of.
From #30261 (comment)
From #31409 (comment) and surroundings:
Higher up in the thread: #31409 (comment)
we have a mechanism that refuses Raft snapshots based on log size, which is a recipe for disaster as reducing log size needs a truncation which needs quorum. We should never ever refuse Raft snapshots. It only aborts preemptive snapshots, but the mechanism should be updated so that snapshot aborted == raft log queue would truncate, so that we can just add to the queue reactively.

assorted:
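A sketch of the suggested coupling: only decline a preemptive snapshot if the raft log queue itself would truncate this range, and in that case add the range to the queue reactively so the oversized log actually gets dealt with. The types, the `wouldTruncate` predicate, and the `Add` method are hypothetical, not the actual queue API:

```go
package main

import "fmt"

// queue is a minimal stand-in for the raft log queue's add-to-queue call.
type queue struct{ pending []int64 }

func (q *queue) Add(rangeID int64) { q.pending = append(q.pending, rangeID) }

// maybeDeclinePreemptiveSnapshot declines only when the raft log queue would
// truncate this range anyway, and queues the range so the reason for the
// decline goes away instead of lingering.
func maybeDeclinePreemptiveSnapshot(rangeID int64, wouldTruncate bool, raftLogQueue *queue) bool {
	if !wouldTruncate {
		return false // accept the snapshot; the log is in good shape
	}
	raftLogQueue.Add(rangeID) // reactively schedule the truncation
	return true
}

func main() {
	var q queue
	fmt.Println(maybeDeclinePreemptiveSnapshot(100, true, &q), q.pending)  // true [100]
	fmt.Println(maybeDeclinePreemptiveSnapshot(101, false, &q), q.pending) // false [100]
}
```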
Likely there are some rough edges around the anomalous case of SST ingestion where one entry already blows the raft log max size. Example
from #32046 (comment):
#32437 via @petermattis