storage: snapshot/log truncation/replica GC badness tracking issue #32046

Closed
1 of 17 tasks
tbg opened this issue Oct 30, 2018 · 10 comments

@tbg
Member

tbg commented Oct 30, 2018

I have too many investigations across the various roachtest import/restore failures. This is an authoritative list of problems that I don't want to lose track of.

From #30261 (comment)

  • preemptive snapshots can be removed (gc'ed) before the upreplication has completed (semi-tracked in storage: avoid errant Raft snapshots during splits #31875)
  • the Raft snapshot queue is mostly LIFO
  • a three-member Raft group with one replica requiring a snapshot decided that it was a good idea to add a fourth replica (at which point it loses quorum).
  • the Raft snapshot semaphore on the receiver can back up all too easily when multiple nodes are sending snapshots to it, slowing down the Raft log queues on the senders

From #31409 (comment) and surroundings:

  • leaseholder transfers right after a split seem to be a thing, and in particular it looks like the lease was transferred to a replica requiring a snapshot
  • range with applied index = 10, truncated index = 10 needs a snapshot even though leader has truncated index = 10 too (so it could just append 11 instead)
  • no good way to grab a Raft status for a range (this is actually probably possible via the raw endpoint)

Higher up in the thread: #31409 (comment)

  • took 13 min for rebalance queue to pick up a replica (maybe fallout from other problems at fault)
  • we have a mechanism that refuses Raft snapshots based on log size, which is a recipe for disaster as reducing log size needs a truncation which needs quorum. We should never ever refuse Raft snapshots. It only aborts preemptive snapshots, but the mechanism should be updated so that snapshot aborted == raft log queue would truncate, so that we can just add to the queue reactively (see the sketch below).
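A minimal sketch of what "snapshot aborted == raft log queue would truncate" could look like, with hypothetical names (raftLogQueue, shouldAcceptPreemptiveSnapshot); this is not the actual CockroachDB code path, just the shape of the coupling:

```go
// Hypothetical sketch only; real CockroachDB code paths differ.
package main

import "fmt"

// raftLogQueue stands in for the queue that performs log truncations.
type raftLogQueue struct{ pending []int64 }

func (q *raftLogQueue) maybeAdd(rangeID int64) { q.pending = append(q.pending, rangeID) }

// shouldAcceptPreemptiveSnapshot couples the two decisions: a preemptive
// snapshot is declined only when the raft log queue would truncate anyway,
// and in that case the range is enqueued reactively instead of being
// silently refused.
func shouldAcceptPreemptiveSnapshot(
	q *raftLogQueue, rangeID int64, raftLogBytes, truncationThreshold int64,
) bool {
	if raftLogBytes > truncationThreshold {
		q.maybeAdd(rangeID) // react: schedule a truncation
		return false
	}
	return true
}

func main() {
	q := &raftLogQueue{}
	fmt.Println(shouldAcceptPreemptiveSnapshot(q, 42, 8<<20, 4<<20)) // false, range queued
	fmt.Println(q.pending)                                           // [42]
}
```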

assorted:

  • no problem ranges reported even though this is in the alerts:
    node_id store_id        category        description     value
    1       1       metrics requests.slow.raft      1
    1       1       metrics queue.raftsnapshot.process.failure      1
    8       8       metrics queue.raftsnapshot.process.failure      10
    
    Many of the problems stem from ranges for which a replica is blocked on a Raft snapshot to get a split trigger. These ranges should be detected as "underreplicated" or "follower needs snapshot" or the like, not simply ignored (example). Even worse, straight-up unavailable ranges are also omitted.
  • large numbers of "raft log too large" on problem ranges page (and in practice)
    Likely there are some rough edges around the anomalous case of SST ingestion, where one entry alone already blows the raft log max size (example).
  • replicas needing snapshots are not marked as problem ranges, and it's not obvious from the range status page that they need one

from #32046 (comment):

  • size-based truncation shouldn't simply abandon followers if they aren't far behind. Seen frequently on the NodeLivenessMax-Tsd range which catches inline puts to the node status.
  • size-based truncation appears too aggressive on small replicas.
  • raft log stats shouldn't compare apples and oranges: timeseries and ssts are difficult to track correctly and the deltas can compound over time. Instead, reset the raft log size to an absolute, known value with each truncation (see the sketch after this list). This isn't trivial since we don't want to recompute downstream of Raft.
  • max size doesn't take into account that proposals can be in the log but not part of any snapshot (i.e. because they're uncommitted). Not sure if that's even a problem but it leads to empty ranges sometimes getting log truncated because of a nonempty raft log
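
For the "reset the raft log size to an absolute, known value" bullet above, here is a minimal sketch of the accounting idea, with made-up types and deliberately ignoring the downstream-of-Raft concern the bullet raises; it only illustrates how a recompute-at-truncation step erases accumulated delta drift:

```go
// Made-up types; illustrates only the size-reset idea.
package main

import "fmt"

type raftLog struct {
	entries     map[int64]int64 // log index -> entry size in bytes
	trackedSize int64           // maintained via deltas; can drift over time
}

func (l *raftLog) appendEntry(index, size int64) {
	l.entries[index] = size
	l.trackedSize += size // delta-based accounting: errors compound here
}

// truncate drops entries below firstIndex and resets the tracked size to the
// exact size of what remains, discarding any accumulated drift.
func (l *raftLog) truncate(firstIndex int64) {
	var exact int64
	for idx, size := range l.entries {
		if idx < firstIndex {
			delete(l.entries, idx)
			continue
		}
		exact += size
	}
	l.trackedSize = exact // absolute, known value
}

func main() {
	l := &raftLog{entries: map[int64]int64{}}
	l.appendEntry(1, 100)
	l.appendEntry(2, 200)
	l.trackedSize += 50 // simulate drift from hard-to-track entries (ssts, timeseries)
	l.truncate(2)
	fmt.Println(l.trackedSize) // 200: the drift is gone after the reset
}
```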

#32437 via @petermattis

  • Rework the Raft snapshot queue to use a size-based quota system for concurrent snapshot application: instead of limiting the number of concurrent snapshots by count, limit by bytes, so that a large number of tiny snapshots can be applied concurrently (see the sketch below).
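
A minimal sketch of the byte-based admission idea, assuming a hypothetical byteQuota type and a 256 MiB budget; the real snapshot reservation code is different, this just shows how a byte budget lets many small snapshots proceed while one huge snapshot takes most of the capacity:

```go
// Hypothetical type; not the CockroachDB implementation.
package main

import (
	"fmt"
	"sync"
)

// byteQuota admits work based on bytes rather than a count of operations.
type byteQuota struct {
	mu    sync.Mutex
	cond  *sync.Cond
	avail int64
}

func newByteQuota(capacity int64) *byteQuota {
	q := &byteQuota{avail: capacity}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// acquire blocks until `bytes` of quota are available. A real implementation
// would also respect cancellation and reject requests larger than the capacity.
func (q *byteQuota) acquire(bytes int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for q.avail < bytes {
		q.cond.Wait()
	}
	q.avail -= bytes
}

func (q *byteQuota) release(bytes int64) {
	q.mu.Lock()
	q.avail += bytes
	q.cond.Broadcast()
	q.mu.Unlock()
}

func main() {
	// With a 256 MiB budget, many small snapshots can be applied concurrently,
	// while a single large snapshot consumes most of the budget by itself.
	q := newByteQuota(256 << 20)
	q.acquire(32 << 20)
	fmt.Println("applying a 32 MiB snapshot")
	q.release(32 << 20)
}
```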
@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting A-kv-distribution Relating to rebalancing and leasing. labels Oct 30, 2018
@tbg tbg added this to the 2.2 milestone Oct 30, 2018
@tbg tbg self-assigned this Oct 30, 2018
@petermattis
Collaborator

we have a mechanism that refuses Raft snapshots based on log size, which is a recipe for disaster as reducing log size needs a truncation which needs quorum. We should never ever refuse Raft snapshots.

This was a relatively recent addition: dda4bc9. It looked reasonable at the time. Cc @bdarnell.

@petermattis
Collaborator

the Raft snapshot semaphore on the receiver can back up all too easily when multiple nodes are sending snapshots to it, slowing down the Raft log queues on the senders

This seems like a symptom rather than a root cause. Allowing more concurrency here can lead to death spirals when we overwhelm the receiver. You're much closer to this now than I am. Perhaps you have some interesting thoughts about something that can be changed.

@tbg
Member Author

tbg commented Oct 31, 2018

This seems like a symptom rather than a root cause. Allowing more concurrency here can lead to death spirals when we overwhelm the receiver. You're much closer to this now than I am. Perhaps you have some interesting thoughts about something that can be changed.

The problem I see here is not that there's throttling at the receiver, but that the throttling will back up the sender, who could spend the time sending snapshots to other nodes that need them.

I don't fully understand why this semaphore gets that backed up (lack of introspection is one part of the problem), but if we sent 64mb snapshots at 8mb/s we'd clock in at 8s/snapshot. Being backed up behind 3-4 of them would easily do it, and how much you can be backed up by scales with the number of nodes, which is typically 8-11 in my experiments.

@tbg
Member Author

tbg commented Oct 31, 2018

This was a relatively recent addition: dda4bc9. It looked reasonable at the time. Cc @bdarnell.

Ah, I missed that we're only refusing preemptive snapshots. That defuses the situation, though the check doesn't need to compute the Raft log size in the first place (assuming we trust our computed number, which ... well we don't).

@benesch
Contributor

benesch commented Oct 31, 2018

a three-member Raft group with one replica requiring a snapshot decided that it was a good idea to add a fourth replica (at which point it loses quorum).

This could be merge related. I don't think the merge queue has any protection against causing this situation.

@tbg
Member Author

tbg commented Oct 31, 2018

Merge queue is off, but hold on to that thought. Also, if you want to get rid of a follower, our strategy now is to upreplicate and then downreplicate. This means that upreplicating while there's a lame duck in the group is necessary. We just have to make sure that we're not losing the preemptive snapshot, which brings us back to reining in the overly aggressive replicaGC. I have a (ticks-based) WIP in my latest round of experiments; a sketch of the idea follows.
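
For reference, a sketch of the ticks-based grace-period idea, with made-up names and thresholds (replicaState, preemptiveSnapshotGCGraceTicks); it only illustrates the heuristic of not GC'ing a descriptor-less replica until it has been around for a while:

```go
// Hypothetical names and thresholds; illustrates only the grace-period heuristic.
package main

import "fmt"

type replicaState struct {
	inDescriptor     bool // is this store listed in the range descriptor?
	ticksSinceCreate int  // Raft ticks since the replica's data appeared on this store
}

const preemptiveSnapshotGCGraceTicks = 60 // made-up grace period

// shouldReplicaGC reports whether the replica GC queue may destroy the replica.
func shouldReplicaGC(s replicaState) bool {
	if s.inDescriptor {
		return false // still a member of the range, not GC'able
	}
	// A replica that isn't in the descriptor but appeared very recently is
	// likely a preemptive snapshot waiting for the upreplication txn: hold off.
	if s.ticksSinceCreate < preemptiveSnapshotGCGraceTicks {
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldReplicaGC(replicaState{ticksSinceCreate: 5}))   // false: too young, likely preemptive
	fmt.Println(shouldReplicaGC(replicaState{ticksSinceCreate: 120})) // true: stale, collect it
}
```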

@bdarnell
Contributor

The problem I see here is not that there's throttling at the receiver, but that the throttling will back up the sender, who could spend the time sending snapshots to other nodes that need them.

This could be addressed with either a deadline (to unblock the thread and allow it to try a different range/replica) or more parallelism on the sender (run the queue with a higher concurrency, but then limit it back down with a sender-side semaphore once the recipient has accepted the snapshot reservation).

There's also #14768: Preemptive snapshots are limited to 2MiB/sec while raft snapshots get 8MiB/sec. However, we don't queue them separately, so a higher-priority raft snapshot may be blocked until a throttled preemptive snapshot has completed.
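
A sketch of the deadline variant, with hypothetical names (reserveSnapshot, sendSnapshotWithDeadline) and a made-up 2s cap; it shows only the idea of the sender bounding its wait for the receiver's reservation so the queue thread can move on to another range:

```go
// Hypothetical names; shows only the bounded wait for the reservation.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errReservationTimeout = errors.New("snapshot reservation not granted in time")

// reserveSnapshot stands in for the RPC that asks the receiver to admit a
// snapshot; here the receiver is simply busy for the given duration.
func reserveSnapshot(ctx context.Context, receiverBusyFor time.Duration) error {
	select {
	case <-time.After(receiverBusyFor):
		return nil // reservation granted
	case <-ctx.Done():
		return errReservationTimeout
	}
}

func sendSnapshotWithDeadline(parent context.Context, receiverBusyFor time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, 2*time.Second) // made-up cap on the wait
	defer cancel()
	if err := reserveSnapshot(ctx, receiverBusyFor); err != nil {
		// Give up on this target for now; the caller can requeue the range or
		// try a different replica rather than stalling the whole queue.
		return err
	}
	// ... stream the snapshot data here ...
	return nil
}

func main() {
	fmt.Println(sendSnapshotWithDeadline(context.Background(), 10*time.Second))      // reservation timeout
	fmt.Println(sendSnapshotWithDeadline(context.Background(), 10*time.Millisecond)) // <nil>
}
```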

tbg added a commit to tbg/cockroach that referenced this issue Nov 3, 2018
This probably isn't going to be close to the code that will eventually
get checked in, but I wanted to get the conversation started.

I don't have concrete evidence that this problem is a root cause of
cockroachdb#32046, but I want to address it (at the very least for cockroachdb#31875).

I have to dig in more, but what I'm seeing in various import/tpch flavor
tests is that the split-scatter phase is extremely aggressive in
splitting and then downreplicating overreplicated ranges.

For example, r100 might have descriptor [n1,n2,n3,n4] and will rapidly
be split (and its LHS and RHS split again, multiple times) while, say,
n4 is removed. I think that in this kind of situation, needing one
Raft snapshot quickly implies needing ~O(splits) Raft snapshots.
This is because when you split a range on which one replica requires a
Raft snapshot, you end up with two ranges that do.

The implication is that we don't want to need Raft snapshots (and
perhaps also: we want to go easy on splitting ranges for which one
replica already needs a snapshot).

On a recent "successful" run of tpccbench/nodes=11/cpus=32, a spike in
pending snapshots from zero to 5k (resolved within minutes) was
observed. A run of import/tpch/nodes=8 typically shows a rapid increase
from zero to ~1k which only dissipates after the import returns.
This variation may be random, or it may indicate that the import test is
a lot more aggressive for some reason.

I have to look into the details, but the following script results in
a number of Raft snapshots (dozens). This may already be fixed by other
PRs such as cockroachdb#31875, though. Easy to verify.

----

An upreplication begins by sending a preemptive snapshot, followed by
a transaction which "officially" adds the new member to the Raft group.

This leaves a (typically small) window during which the replicaGC queue
could pick up the preemptive snapshot and delete it. This is unfortunate
as it leaves the range in a fragile state, with one follower requiring a
Raft snapshot to catch up.

This commit introduces a heuristic that holds off on GC'ing replicas
that look like preemptive snapshots until they've been around for a
while.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Nov 6, 2018
This makes it a lot easier to log descriptive debug messages indicating
how a truncation decision was arrived at, and in particular allows
pointing the finger at truncations that lead to Raft snapshots, which
is relevant in the context of cockroachdb#32046.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Nov 12, 2018
This makes it a lot easier to log descriptive debug messages indicating
how a truncation decision was arrived at, and in particular allows
pointing the finger at truncations that lead to Raft snapshots, which
is relevant in the context of cockroachdb#32046.

Release note: None
craig bot pushed a commit that referenced this issue Nov 12, 2018
32137: storage: refactor log truncation index computation r=petermattis a=tschottdorf

This makes it a lot easier to log descriptive debug messages indicating
how a truncation decision was arrived at, and in particular allows
pointing the finger at truncations that lead to Raft snapshots, which
is relevant in the context of #32046.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@tbg
Member Author

tbg commented Nov 15, 2018

The log size truncation is definitely still janky. We know it can undercount, but the behavior here is rather the opposite: we set the "max raft log size" to the size of the replica, which is really small (74KiB). The Raft log size includes the overhead of the proposals, etc., which the replica size doesn't, so we see an endless stream of premature truncations. I don't think any of these really cause a snapshot (because the latest log entries have likely been sent to the followers at this point), but it seems silly, especially since this is a condition I want to be able to keep logging.

logs/cockroach.log:I181115 11:14:10.668506 67875 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 4 entries to first index 5274 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:14:50.657125 68083 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5295 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:15:30.630221 68241 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 3 entries to first index 5313 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:15:40.650090 68348 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 7 entries to first index 5320 (chosen via: quorum); log too large (127 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:16:10.675992 68501 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5335 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:16:50.650678 68697 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 7 entries to first index 5355 (chosen via: quorum); log too large (127 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:17:10.685852 68769 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 4 entries to first index 5364 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:17:30.656791 68874 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5375 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:17:50.640727 68984 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 6 entries to first index 5385 (chosen via: quorum); log too large (109 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:18:00.642628 69017 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 4 entries to first index 5389 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:18:30.665122 69135 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5405 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:18:50.663119 69218 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5415 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:19:10.656308 69295 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 7 entries to first index 5425 (chosen via: quorum); log too large (127 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:19:20.711898 69398 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5430 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:19:50.695048 69503 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 7 entries to first index 5446 (chosen via: quorum); log too large (127 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:20:10.637142 69610 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 7 entries to first index 5456 (chosen via: quorum); log too large (127 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:20:20.638040 69629 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5461 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:20:30.669153 69667 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5466 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:20:50.657441 69715 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 6 entries to first index 5476 (chosen via: quorum); log too large (109 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:21:10.662264 69866 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5486 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:21:30.652276 69959 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 7 entries to first index 5496 (chosen via: quorum); log too large (127 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:21:40.661806 69980 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5501 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:21:50.677477 70048 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 6 entries to first index 5507 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:22:10.680383 70229 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5517 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:22:40.678359 70341 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5532 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:23:10.669672 70450 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5547 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:24:00.672384 70700 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 7 entries to first index 5572 (chosen via: quorum); log too large (127 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:24:20.674592 70755 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5582 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:25:00.683632 70983 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5602 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:25:20.654189 71090 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5612 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:26:10.723236 71279 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 7 entries to first index 5640 (chosen via: quorum); log too large (127 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:26:30.670779 71415 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 7 entries to first index 5650 (chosen via: quorum); log too large (127 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:26:50.648489 71438 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5660 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:27:00.658103 71513 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5665 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:27:10.681703 71594 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5670 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:27:40.652636 71732 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5685 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:28:00.663801 71796 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 5 entries to first index 5695 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:28:10.671749 71886 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 4 entries to first index 5699 (chosen via: quorum); log too large (91 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:29:10.666881 72129 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 6 entries to first index 5732 (chosen via: quorum); log too large (127 KiB > 74 KiB); implies 1 Raft snapshot
logs/cockroach.log:I181115 11:29:20.667809 72171 storage/raft_log_queue.go:356  [n4,raftlog,s4,r4/2:/System/{NodeLive…-tsd}] truncate 6 entries to first index 5738 (chosen via: quorum); log too large (109 KiB > 74 KiB); implies 1 Raft snapshot

@tbg
Member Author

tbg commented Nov 15, 2018

The writes to that range are the node statuses, which are large (tens of KB) and inline (i.e. each write replaces the previous one). So the replica size stays mostly constant, whereas the Raft log grows fairly quickly. Under the configured behavior it's doing the right thing, but I would argue that it shouldn't be truncating all the way up to the quorum index quite so aggressively. Truncating to the joint commit index (taking into account live nodes only) would fare better (see the sketch below). I added a bullet to that effect in the initial list.
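
One way to read "joint commit index over live nodes only" is: truncate up to the lowest match index among live followers instead of the quorum commit index. A minimal sketch of that choice, with made-up types (progress, truncationIndex); not the actual raft_log_queue logic:

```go
// Made-up types; not the actual raft_log_queue logic.
package main

import "fmt"

type progress struct {
	match int64 // highest log index known to be replicated to this follower
	live  bool  // liveness of the follower's node
}

// truncationIndex truncates up to the quorum commit index only when every
// live follower has already caught up that far; otherwise it stops at the
// slowest live follower so that it isn't forced into a Raft snapshot.
func truncationIndex(quorumCommit int64, followers []progress) int64 {
	idx := quorumCommit
	for _, p := range followers {
		if p.live && p.match < idx {
			idx = p.match
		}
	}
	return idx
}

func main() {
	followers := []progress{
		{match: 5274, live: true},
		{match: 5270, live: true},  // slightly behind: keep its entries around
		{match: 4000, live: false}, // dead node: don't hold the log for it
	}
	fmt.Println(truncationIndex(5274, followers)) // 5270
}
```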

@tbg
Member Author

tbg commented Nov 15, 2018

Another bad-looking thing:

I181115 12:25:04.940211 70549 storage/raft_log_queue.go:356  [n1,raftlog,s1,r3440/1:/Table/53/4/90{1444/…-3657/…}] truncate 6 entries to first index 17 (chosen via: quorum); log too large (17 MiB > 0 B)

0B seems to correspond to the replica size. I can only assume that an SST ingestion was proposed but bounced downstream of Raft and the truncation hit the small window before it was reproposed. We shouldn't use a size of zero to justify a truncation. (The "other limit" 4MB would've triggered the same truncation, so this is really a case of doing the right thing for the wrong reason).

Added a bullet for this as well.
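
For context, the effective limit today is the replica size capped at 4 MiB (so a tiny or momentarily-zero replica size becomes the whole truncation budget), and the commit below switches to a fixed 4 MiB budget. A simplified sketch of the two computations, not the actual code:

```go
// Simplified sketch; thresholds and names are illustrative only.
package main

import "fmt"

const maxRaftLogSize = 4 << 20 // 4 MiB

// oldThreshold: the replica size caps the budget, so a tiny (74 KiB) or
// momentarily zero replica size yields a tiny or zero budget and the log is
// truncated prematurely.
func oldThreshold(replicaSizeBytes int64) int64 {
	if replicaSizeBytes > maxRaftLogSize {
		return maxRaftLogSize
	}
	return replicaSizeBytes
}

// newThreshold: a fixed budget, independent of the replica size.
func newThreshold(_ int64) int64 {
	return maxRaftLogSize
}

func main() {
	for _, size := range []int64{0, 74 << 10, 64 << 20} {
		fmt.Printf("replica %8d B: old budget %8d B, new budget %d B\n",
			size, oldThreshold(size), newThreshold(size))
	}
}
```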

tbg added a commit to tbg/cockroach that referenced this issue Nov 16, 2018
cc @nvanbenschoten. I'm going to run some kv95 experiments in which I
vary the 64kb threshold in both directions to see if there are any
effects on performance in doing so.

----

Whenever the "max raft log size" is exceeded, log truncations become
more aggressive in that they aim at the quorum commit index, potentially
cutting off followers (which then need Raft snapshots).

The effective threshold log size is 4mb for replicas larger than 4mb and
the replica size otherwise. This latter case can be problematic since
replicas can be persistently small despite having steady log progress
(for example, range 4 receives node status updates which are large
inline puts). If in such a range a follower falls behind just slightly,
it'll need a snapshot. This isn't in itself the biggest deal since the
snapshot is fairly rare (the required log entries are usually already
in transit to the follower) and would be small, but it's not ideal.

Always use a 4mb threshold instead. Note that we also truncate the log
to the minimum replicated index if the log size is above 64kb. This is
similarly aggressive but respects followers (until they fall behind by
4mb or more).

My expectation is that this will not functionally change anything. It
might leave behind a little bit more Raft log on quiescent ranges, but I
think the solution here is performing "one last truncation" for ranges
that are quiescent to make sure they shed the remainder of their Raft
log.

Touches cockroachdb#32046.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Nov 19, 2018
Whenever the "max raft log size" is exceeded, log truncations become
more aggressive in that they aim at the quorum commit index, potentially
cutting off followers (which then need Raft snapshots).

The effective threshold log size is 4mb for replicas larger than 4mb and
the replica size otherwise. This latter case can be problematic since
replicas can be persistently small despite having steady log progress
(for example, range 4 receives node status updates which are large
inline puts). If in such a range a follower falls behind just slightly,
it'll need a snapshot. This isn't in itself the biggest deal since the
snapshot is fairly rare (the required log entries are usually already
in transit to the follower) and would be small, but it's not ideal.

Always use a 4mb threshold instead. Note that we also truncate the log
to the minimum replicated index if the log size is above 64kb. This is
similarly aggressive but respects followers (until they fall behind by
4mb or more).

My expectation is that this will not functionally change anything. It
might leave behind a little bit more Raft log on quiescent ranges, but I
think the solution here is performing "one last truncation" for ranges
that are quiescent to make sure they shed the remainder of their Raft
log.

Touches cockroachdb#32046.

Release note: None
craig bot pushed a commit that referenced this issue Nov 19, 2018
32437: storage: truncate aggressively only after 4mb of logs r=nvanbenschoten,petermattis a=tbg

cc @nvanbenschoten. I'm going to run some kv95 experiments in which I
vary the 64kb threshold in both directions to see if there are any
effects on performance in doing so.

----

Whenever the "max raft log size" is exceeded, log truncations become
more aggressive in that they aim at the quorum commit index, potentially
cutting off followers (which then need Raft snapshots).

The effective threshold log size is 4mb for replicas larger than 4mb and
the replica size otherwise. This latter case can be problematic since
replicas can be persistently small despite having steady log progress
(for example, range 4 receives node status updates which are large
inline puts). If in such a range a follower falls behind just slightly,
it'll need a snapshot. This isn't in itself the biggest deal since the
snapshot is fairly rare (the required log entries are usually already
in transit to the follower) and would be small, but it's not ideal.

Always use a 4mb threshold instead. Note that we also truncate the log
to the minimum replicated index if the log size is above 64kb. This is
similarly aggressive but respects followers (until they fall behind by
4mb or more).

My expectation is that this will not functionally change anything. It
might leave behind a little bit more Raft log on quiescent ranges, but I
think the solution here is performing "one last truncation" for ranges
that are quiescent to make sure they shed the remainder of their Raft
log.

Touches #32046.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
tbg added a commit to tbg/cockroach that referenced this issue Dec 10, 2018
This makes it a lot easier to log descriptive debug messages indicating
how a truncation decision was arrived at, and in particular allows
pointing the finger at truncations that lead to Raft snapshots, which
is relevant in the context of cockroachdb#32046.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Dec 10, 2018
Whenever the "max raft log size" is exceeded, log truncations become
more aggressive in that they aim at the quorum commit index, potentially
cutting off followers (which then need Raft snapshots).

The effective threshold log size is 4mb for replicas larger than 4mb and
the replica size otherwise. This latter case can be problematic since
replicas can be persistently small despite having steady log progress
(for example, range 4 receives node status updates which are large
inline puts). If in such a range a follower falls behind just slightly,
it'll need a snapshot. This isn't in itself the biggest deal since the
snapshot is fairly rare (the required log entries are usually already
in transit to the follower) and would be small, but it's not ideal.

Always use a 4mb threshold instead. Note that we also truncate the log
to the minimum replicated index if the log size is above 64kb. This is
similarly aggressive but respects followers (until they fall behind by
4mb or more).

My expectation is that this will not functionally change anything. It
might leave behind a little bit more Raft log on quiescent ranges, but I
think the solution here is performing "one last truncation" for ranges
that are quiescent to make sure they shed the remainder of their Raft
log.

Touches cockroachdb#32046.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Dec 11, 2018
We know there can be a backlog of Raft snapshots at the beginning of the
test. This isn't ideal, but we know about it and have cockroachdb#32046 tracking
it.

Closes cockroachdb#32859.

Release note: None
craig bot pushed a commit that referenced this issue Dec 11, 2018
33011: roachtest: don't fail tests based on slow health checker r=petermattis a=tbg

We know there can be a backlog of Raft snapshots at the beginning of the
test. This isn't ideal, but we know about it and have #32046 tracking it.

Closes #32859.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@tbg tbg added S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) and removed S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting labels Jan 8, 2019